Heterogeneous Relational Databases for a Grid-enabled Analysis Environment

合集下载

面向多源异构数据库的SQL_解析与转换方法研究

面向多源异构数据库的SQL_解析与转换方法研究

第 22卷第 12期2023年 12月Vol.22 No.12Dec.2023软件导刊Software Guide面向多源异构数据库的SQL解析与转换方法研究练金栋1,陈志1,岳文静2,赵培1,3,吕伟初1,3(1.南京邮电大学计算机学院; 2.南京邮电大学通信与信息工程学院,江苏南京 210003;3.金篆信科有限责任公司,江苏南京 210012)摘要:传统的单一数据库模式难以适应如今多样化的数据管理需求。

如何将多个异构独立的数据库进行集成,对数据库系统进行整体控制和协同操作成为研究热点。

针对此问题进行面向多源异构数据库的SQL解析与转换方法研究,通过建立通用的中间表示模型,对异构数据库请求进行语法树解析、语义分析与模型转换,实现了不同数据库之间的互操作。

在基于TPC-H基准测试数据集的功能测试中,测试系统对数据类型和语法操作的支持度达到100%。

在性能测试中,测试系统在跨平台的增删改查操作时间上,较官方工具分别快了13.1 ms、8.8 ms、22.5 ms与2.3ms。

实验验证了该方法的正确性与可行性。

关键词:异构数据库;中间表示;语法解析;语法转换DOI:10.11907/rjdk.232028开放科学(资源服务)标识码(OSID):中图分类号:TP391 文献标识码:A 文章编号:1672-7800(2023)012-0124-08Research on SQL Parsing and Transformation Method for Multi-SourceHeterogeneous DatabaseLIAN Jindong1, CHEN Zhi1, YUE Wenjing2, ZHAO Pei1,3, LYU Weichu1,3(1.School of Computer Science, Nanjing University of Posts and Telecommunications;2.School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China;3.JINZHUAN Information Technology Co., Ltd., Nanjing 210012, China)Abstract:Nowadays, the traditional mode with single database has difficulty meeting with the demand of in diversified data management. The integration of multi-mode heterogeneous databases has become a research hotspot for overall control and collaborative operation of the global database system. Aiming at this problem, this paper studies the SQL parsing and transformation methods for heterogeneous databases. And in‐teroperability between different database has been achieved through universal intermediate-representation-model establishing,syntax tree parsing, semantic analysis and model transformation. In the functional test based on the TPC-H benchmark dataset, the frame-based test sys‐tem has 100% support for data types and syntax operations, while the framework has advantages over official tools in terms of operation speed for cross platform addition, deletion, modification, and query, with 13.1 ms, 8.8 ms, 22.5 ms, and 2.3 ms, respectively. The experiment ver‐ifies the correctness and feasibility of the proposed method.Key Words:heterogeneous database; intermediate representation; syntax parsing; grammar transformation0 引言随着数据量的爆发式增长,传统的单一数据库模式愈发难以满足存储和查询的实时性要求。

人工智能领域中英文专有名词汇总

人工智能领域中英文专有名词汇总

名词解释中英文对比<using_information_sources> social networks 社会网络abductive reasoning 溯因推理action recognition(行为识别)active learning(主动学习)adaptive systems 自适应系统adverse drugs reactions(药物不良反应)algorithm design and analysis(算法设计与分析) algorithm(算法)artificial intelligence 人工智能association rule(关联规则)attribute value taxonomy 属性分类规范automomous agent 自动代理automomous systems 自动系统background knowledge 背景知识bayes methods(贝叶斯方法)bayesian inference(贝叶斯推断)bayesian methods(bayes 方法)belief propagation(置信传播)better understanding 内涵理解big data 大数据big data(大数据)biological network(生物网络)biological sciences(生物科学)biomedical domain 生物医学领域biomedical research(生物医学研究)biomedical text(生物医学文本)boltzmann machine(玻尔兹曼机)bootstrapping method 拔靴法case based reasoning 实例推理causual models 因果模型citation matching (引文匹配)classification (分类)classification algorithms(分类算法)clistering algorithms 聚类算法cloud computing(云计算)cluster-based retrieval (聚类检索)clustering (聚类)clustering algorithms(聚类算法)clustering 聚类cognitive science 认知科学collaborative filtering (协同过滤)collaborative filtering(协同过滤)collabrative ontology development 联合本体开发collabrative ontology engineering 联合本体工程commonsense knowledge 常识communication networks(通讯网络)community detection(社区发现)complex data(复杂数据)complex dynamical networks(复杂动态网络)complex network(复杂网络)complex network(复杂网络)computational biology 计算生物学computational biology(计算生物学)computational complexity(计算复杂性) computational intelligence 智能计算computational modeling(计算模型)computer animation(计算机动画)computer networks(计算机网络)computer science 计算机科学concept clustering 概念聚类concept formation 概念形成concept learning 概念学习concept map 概念图concept model 概念模型concept modelling 概念模型conceptual model 概念模型conditional random field(条件随机场模型) conjunctive quries 合取查询constrained least squares (约束最小二乘) convex programming(凸规划)convolutional neural networks(卷积神经网络) customer relationship management(客户关系管理) data analysis(数据分析)data analysis(数据分析)data center(数据中心)data clustering (数据聚类)data compression(数据压缩)data envelopment analysis (数据包络分析)data fusion 数据融合data generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance feedback (相关反馈)resource description framework 资源描述框架restricted boltzmann machines(受限玻尔兹曼机) retrieval models(检索模型)rough set theroy 粗糙集理论rough set 粗糙集rule based system 基于规则系统rule based 基于规则rule induction (规则归纳)rule learning (规则学习)rule learning 规则学习schema mapping 模式映射schema matching 模式匹配scientific domain 科学域search problems(搜索问题)semantic (web) technology 语义技术semantic analysis 语义分析semantic annotation 语义标注semantic computing 语义计算semantic integration 语义集成semantic interpretation 语义解释semantic model 语义模型semantic network 语义网络semantic relatedness 语义相关性semantic relation learning 语义关系学习semantic search 语义检索semantic similarity 语义相似度semantic similarity(语义相似度)semantic web rule language 语义网规则语言semantic web 语义网semantic web(语义网)semantic workflow 语义工作流semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship (相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器social media(社交媒体)social media(社交媒体)social media(社交媒体)social network analysis 社会网络分析social network analysis(社交网络分析)social network(社交网络)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理statistical analysis(统计分析)statistical model 统计模型string matching(串匹配)structural risk minimization (结构风险最小化) structured data 结构化数据subgraph matching 子图匹配subspace clustering(子空间聚类)supervised learning( 有support vector machine 支持向量机support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonmy induction 感应规范temporal logic 时态逻辑temporal reasoning 时序推理text analysis(文本分析)text anaylsis 文本分析text classification (文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining 文本挖掘text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐time frequency analysis(时频分析)time series analysis( 时time series data(时间序列数据)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习triple store 三元组存储uncertainty reasoning 不精确推理undirected graph(无向图)unified modeling language 统一建模语言unsupervisedupper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术) visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网web ontology lanuage 网络本体语言web pages(web 页面)web resource 网络资源web science 万维科学web search (网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络world knowledge 世界知识world wide web 万维网world wide web(万维网)xml database 可扩展标志语言数据库附录 2 Data Mining 知识图谱(共包含二级节点15 个,三级节点93 个)间序列分析)监督学习)领域 二级分类 三级分类。

资料翻译_Heterogeneous (1)

资料翻译_Heterogeneous (1)

HeterogeneousComputer networks typically are heterogeneous. For example, the internal network of a small software company might be made up of multiple computing platforms. There might be a mainframe that handles transactional database access for order entry, UNIX workstations that supply hardware simulation environments and a software development backbone, personal computers that run Windows and provide desktop office automation tools, and other specialized systems such as network computers, telephony systems, routers, and measurement equipment. Small sections of a given network may be homogeneous, but the larger a network is, the more varied and diverse its composition is likely to be.There are several reasons for this heterogeneity. One obvious reason is that technology changes over time. Because networks tend to evolve rather than being built all at once, the best technologies from different time periods end up coexisting on the network. In this context, "best" may refer to qualities such as the lowest cost, the highest performance, the least expensive mass storage, the most transactions per minute, the tightest security, the flashiest graphics, or other qualities deemed important at the time of purchase. Another reason for network heterogeneity is that one size does not fit all. Any given combination of computer, operating system, and networking platform will work best for only a subset of the computing activities performed within a network. Still another reason is that diversity within a network can make it more resilient because any problems in a given machine type, operating system, or application are unlikely to affect other networked systems running different operating systems and applications.The factors that lead to heterogeneous computer networks are largely inevitable; thus, developers of practical distributed systems, whether they like it or not, must cope with heterogeneity. Whereas developing software for any distributed system is difficult, developing software for a heterogeneous distributed system sometimes borders on the impossible. Such software must deal with all the problems normally encountered in distributed systems programming, such as the failure of some of the systems in the network, partitioning of the network, problems associated with resource contention and sharing, and security-related risks. If you add heterogeneity to the picture, some of these problems become more acute, and new ones crop up.For example, problems you encounter while porting a networked application for use on a new platform in the network may result in two or more versions of the same application. If you make any changes to any version of the application, you must go back and modify all the other versions appropriately and then test them individually and in their various combinations to make sure they all work properly. The degree of difficulty presented by this situation increases dramatically as the number of different platforms in the network rises.Keep in mind that heterogeneity in this context does not refer only to computing hardware and operating systems. Writing a robust distributed application from top to bottom!afor example, from a custom graphicl user interface all the way down to the network transports and protocols!ais tremedously difficult for almost any real-world application because of the overwhelming complexity and the number of details involved.As a result, developers of distributed applications tend to make heavy use of tools and libraries. This means that distributed applications are themselves heterogeneous, often glued together from a number of layered applications and libraries. Unfortunately, in many cases, as the distributed system grows, the chance decreases dramatically that all the applications and libraries that compose it were actually designed to work together.At a very general level, you can tackle the problem of developing applications for heterogeneous distributed systems by following two key rules.Find platform-independent models and abstractions that you can apply to help solve a wide variety of problems.Hide as much low-level complexity as possible without sacrificing too much performance. These rules are general enough to be used to develop any portable application whether or not it is distributed. However, the additional complexities introduced by distribution make each rule carry more weight. Using the right abstractions and models can essentially provide a new homogeneous application development layer over the top of all the distributed heterogeneous complexity. This layer hides low-level details and allows application developers to solve their immediate problems without having to first solve the low-level networking details for all the diverse computing platforms used by their applications.异构计算机网络是典型的异构体系。

数据库系统英文文献

数据库系统英文文献

Database Systems1. Fundamental Concepts of DatabaseDatabase and database technology are having a major impact on the growing use of computers. It is fair to say that database will play a critical role in almost all areas where computers are used, including business, engineering, medicine, law, education, and library science, to name a few. The word "database" is in such common use that we must begin by defining what a database is. Our initial definition is quit general.A database is a collection of related data. By data, we mean known facts that can be recorded and that have implicit meaning. For example, consider the names, telephone numbers, and addresses of all the people you know. Y ou may have recorded this data in an indexed address book, or you may have stored it on a diskette using a personal computer and software such as DBASE III or Lotus 1-2-3. This is a collection of related data with an implic it meaning and hence is a database.The above definition of database is quite general; for example, we may consider the collection of words that make up thispage of text to be related data and hence a database. However, the common use of the term database is usually more restricted.A database has the following implicit properties:.A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot bereferred to as a database..A database is designed, built, and populated with data for a specific purpose. It has an intended group of users and somepreconceived applications in which these users are interested..A database represents some aspect of the real world, sometimes called the mini world. Changes to the mini world are reflected in the database.In other words, a database has some source from which data are derived, some degree of interaction with events in the real world, and an audience that is actively interested in the contents of the database.A database can be of any size and of varying complexity. For example, the list of names and addresses referred to earlier may have only a couple of hundred records in it, each with asimple structure. On the other hand, the card catalog of a large library may contain half a million cards stored under different categories-by primary author’s last name, by subject, by book title, and the like-with each category organized in alphabetic order. A database of even greater size and complexity may be that maintained by the Internal Revenue Service to keep track of the tax forms filed by taxpayers of the United States. If we assume that there are 100million taxpayers and each taxpayer files an average of five forms with approximately 200 characters of information per form, we would get a database of 100*(106)*200*5 characters(bytes) of information. Assuming the IRS keeps the past three returns for each taxpayer in addition to the current return, we would get a database of 4*(1011) bytes. This huge amount of information must somehow be organized and managed so that users can search for, retrieve, and update the data as needed.A database may be generated and maintained manually or by machine. Of course, in this we are mainly interested in computerized database. The library card catalog is an example of a database that may be manually created and maintained. A computerized database may be created and maintained either by a group of application programs written specifically for that task or by a database management system.A data base management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications. Defining a database involves specifying the types of data to be stored in the database, along with a detailed description of each type of data. Constructing the database is the process of storing the data itself on some storage medium that is controlled by the DBMS. Manipulating a database includes such functions as querying the database to retrieve specific data, updating the database to reflect changes in the mini world, and generating reports from the data.Note that it is not necessary to use general-purpose DBMS software for implementing a computerized database. We could write our own set of programs to create and maintain the database, in effect creating our own special-purpose DBMS software. In either case-whether we use a general-purpose DBMS or not-we usually have a considerable amount of software to manipulate the database in addition to the database itself. The database and software are together called a database system.2. Data ModelsOne of the fundamental characteristics of the database approach is that it provides some level of data abstraction by hiding details of data storage that are not needed by most database users. A data model is the main tool for providing this abstraction. A data is a set of concepts that can beused to describe the structure of a database. By structure of a database, we mean the data types, relationships, and constraints that should hold on the data. Most data models also include a set of operations for specifying retrievals and updates on the database.Categories of Data ModelsMany data models have been proposed. We can categorize data models based on the types of concepts they provide to describe the database structure. High-level or conceptual data models provide concepts that are close to the way many users perceive data, whereas low-level or physical data models provide concepts that describe the details of how data is stored in the computer. Concepts provided by low-level data models are generally meant for computer specialists, not for typical end users. Between these two extremes is a class of implementation data models, which provide concepts that may be understood by end users but that are not too far removed from the way data is organized within the computer. Implementation data models hide some details of data storage but can be implemented on a computer system in a direct way.High-level data models use concepts such as entities, attributes, and relationships. An entity is an object that is represented in the database. An attribute is a property that describes some aspect of an object. Relationships among objects are easily represented in high-level data models, which are sometimes called object-based models because they mainly describe objects and their interrelationships.Implementation data models are the ones used most frequently in current commerc ial DBMSs and include the three most widely used data models-relational, network, and hierarchical. They represent data using record structures and hence are sometimes called record-based data modes.Physical data models describe how data is stored in the computer by representing information such as record formats, record orderings, and access paths. An access path is a structure that makes the search for particular database records much faster.3. Classification of Database Management SystemsThe main criterion used to classify DBMSs is the data model on which the DBMS is based. The data models used most often in current commercial DBMSs are the relational, network, and hierarchical models. Some recent DBMSs are based on conceptual or object-oriented models. We will categorize DBMSs as relational, hierarchical, and others.Another criterion used to classify DBMSs is the number of users supported by the DBMS. Single-user systems support only one user at a time and are mostly used with personal computer. Multiuser systems include the majority of DBMSs and support many users concurrently.A third criterion is the number of sites over which the database is distributed. Most DBMSs are centralized, meaning that their data is stored at a single computer site. A centralized DBMS can support multiple users, but the DBMS and database themselves reside totally at a single computer site. A distributed DBMS (DDBMS) can have the actual database and DBMS software distributed over many sites connected by a computer network. Homogeneous DDBMSs use the same DBMS software at multiple sites. A recent trend is to develop software to access several autonomous preexisting database stored under heterogeneous DBMSs. This leads to a federated DBMS (or multidatabase system),, where the participating DBMSs are loosely coupled and have a degree of local autonomy.We can also classify a DBMS on the basis of the types of access paty options available for storing files. One well-known family of DBMSs is based on inverted file structures. Finally, a DBMS can be general purpose of special purpose. When performance is a prime consideration, a special-purpose DBMS can be designed and built for a specific application and cannot be used for other applications, Many airline reservations and telephone directory systems are special-purpose DBMSs.Let us briefly discuss the main criterion for classifying DBMSs: the data mode. The relational data model represents a database as a collection of tables, which look like files. Mos t relational databases have high-level query languages and support a limited form of user views.The network model represents data as record types and also represents a limited type of 1:N relationship, called a set type. The network model, also known as the CODASYL DBTG model, has an associated record-at-a-time language that must be embedded in a host programming language.The hierarchical model represents data as hierarchical tree structures. Each hierarchy represents a number of related records. There is no standard language for the hierarchical model, although most hierarchical DBMSs have record-at-a-time languages.4. Client-Server ArchitectureMany varieties of modern software use a client-server architecture, in which requests by one process (the client) are sent to another process (the server) for execution. Database systems are no exception. In the simplest client/server architecture, the entire DBMS is a server, except for the query interfaces that interact with the user and send queries or other commands across to the server. For example, relational systems generally use the SQL language for representing requests from the client to the server. The database server then sends the answer, in the form of a table or relation, back to the client. The relationship between client and server can get more work in theclient, since the server will e a bottleneck if there are many simultaneous database users.。

heterogeneous data orchestration conceptual model

heterogeneous data orchestration conceptual model

heterogeneous data orchestrationconceptual modelHeterogeneous data orchestration conceptual model refers to a framework or approach for managing and整合 diverse and disparate data sources in a coordinated manner. This model involves several key components and concepts.One of the main components of the heterogeneous data orchestration conceptual model is the identification and understanding of various data sources. These sources can include structured data from relational databases, unstructured data from documents and files, semi-structured data from XML or JSON, and data from external sources such as the Internet of Things (IoT) devices or social media.Another important concept in this model is data integration and transformation. Since different data sources may have different formats, structures, or schemas, the model includes processes and techniques for整合 and transforming the data into a common or unified format. This can involve tasks such as data cleaning, normalization, and mapping to ensure compatibility and consistency across the data.The model also emphasizes the importance of metadata management. Metadata provides information about the data sources, their characteristics, and the relationships between them. Effective metadata management有助于更好地理解和管理数据,包括数据分类、标注和搜索等方面。

Data Mining - Concepts and Techniques CH01

Data Mining - Concepts and Techniques CH01

We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing Miing interesting knowledge (rules, regularities, patterns,
3
Chapter 1. Introduction
Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Are all the patterns interesting? Classification of data mining systems Major issues in data mining
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
DNA and bio-data analysis
September 14, 2019
Data Mining: Concepts and Techniques
multidimensional summary reports
statistical summary information (data central tendency and variation)

Combining three multi-agent based generalisation models AGENT CartACom and GAEL

Combining three multi-agent based generalisation models AGENT CartACom and GAEL

Combining Three Multi-agent Based Generalisation Models: AGENT, C ART AC OMand GAELCécile Duchêne, Julien GaffuriIGN, COGIT Laboratory, 2-4 avenue Pasteur, 94165 Saint-Mandé cedex, France.email: {cecile.duchene,julien.gaffuri}@ign.frAbstractThis paper is concerned with the automated generalisation of vector geo-graphic databases. It studies the possible synergies between three existing, complementary models of generalisation, all based on the multi-agent paradigm. These models are respectively well adapted for the generalisa-tion of urban spaces (AGENT model), rural spaces (C ART AC OM model) and background themes (GAEL model). In these models, the geographic objects are modelled as agents that apply generalisation algorithms to themselves, guided by cartographic constraints to satisfy. The differences between them particularly lie in their constraint modelling and their agent coordination model. Three complementary ways of combining these mod-els are proposed: separate use on separate zones, “interlaced” sequential use on the same zone, and shared use of data internal to the models. The last one is further investigated and a partial re-engineering of the models is proposed.Keywords: Automated generalisation, Multi-agent-systems, Generalisa-tion models, Models combination.278 C. Duchêne and J. Gaffuri1. IntroductionIn this paper, we deal with automated cartographic generalisation of topog-raphic vector databases. Cartographic generalisation aims at decreasing the level of detail of a vector database in order to make it suitable for a given display scale and a given set of symbols, while preserving the main charac-teristics of the data. It is often referred to as the derivation of a Digital Car-tographic Model (DCM) from a Digital Landscape Model (DLM) (Meyer 1986). In the DCM, the objects have to satisfy a set of constraints that rep-resent the specifications of the expected cartographic product (Beard 1991; Weibel and Dutton 1998). A constraint can be related to one object (build-ing minimum size, global shape preservation), several objects (minimum distance, spatial distribution preservation), or a part of object (road coales-cence, local shape preservation). Different approaches to automate gener-alisation handle the constraints expression in different ways. For instance, in approaches based on optimisation techniques (Sester 2000; Højholt 2000; Bader 2001), the constraints are translated into equations on the point coordinates.The work presented in this paper relies on an approach of generalisation that is step by step, local (Brassel and Weibel 1988; McMaster and Shea 1988), and explicitly constraint driven (Beard 1991). More precisely, our work is concerned with three complementary models based on this ap-proach, which also rely on the multi-agent paradigm. These three models are respectively dedicated to the generalisation of dense, well-structured data (AGENT model), low density, heterogeneous zones (C ART AC OM model), and to the management of background themes during generalisa-tion (GAEL model). The purpose of this paper is to investigate the possi-ble synergies between the three models.The next section of the paper presents in a comparative way the major aspects of the AGENT, C ART AC OM and GAEL models. In section 3, three complementary scenarios for a combined use of these models are pro-posed, and the underlying technical requirements are identified. One of them is further investigated in section 4, where a partial re-engineering of the models is proposed. Finally, section 5 concludes and draws some per-spectives for on-going work.Combining Three Multi-agent Based Generalisation Models 2792. Comparative presentation of AGENT, C ART AC OM and GAEL2.1. The AGENT modelThe AGENT generalisation model has first been proposed by Ruas (1998,Two levels of agents are considered. A micro agent is a single geographic object (e.g. road segment, building). A meso agent is composed of micro or meso agents that need to be considered together for generalisation (e.g. a group of aligned buildings, a urban block). This results in a pyramidal hierarchical structure where agents of one level are disjoints. Cartographic constraints can be defined for each agent (Figure 1). If a cartographic con-straint concerns several agents it is translated into a constraint on the meso agent they are part of, thus a constraint is always internal to an agent.Fig. 1. The AGENT model: agents and constraintsThe constraints are modelled as objects. A constraint object can be thought of as an entity, part of the “brain” of an agent, in charge of managing one of its cartographic constraints. In terms of data schema (cf Figure 2a), a generic Constraint class is defined, linked to the generic Agent class. The attributes defined on the Constraint class are as follows:• current_value : result of a measure of the constrained property (e.g. area, for the building size constraint). It is computed by the compute_current_value method,• goal_value : what the current value should be,一个constraint 对象可以看成是agent 的大脑的一部分280 C. Duchêne and J. Gaffuri• satisfaction : how satisfied the constraint is, i.e. how close the current value is from the goal value. It is computed by the compute_satisfaction method,• importance : how important it is according the specifications that this constraint is satisfied, on an absolute scale shared by all the constraints, • priority : how urgent it is for the agent to try and satisfy this constraint, compared to its other constraints. It is computed by the compute_priority method depending on the satisfactionTwo additional methods are defined:• compute_proposals : computes a list a possible plans (generalisation algorithms) that might help to better satisfy the constraint, and• re -evaluate : after a transformation assesses if the constraint has changed in a right way (if it has been enough improved, or at least if it has not been too much damaged) ConstraintAgent 1*BuildingSizeConstraint BuildingGranuConstraint RoadCoalescenceConstraint Constraint compute_current_value()compute_satisfaction()priority : integer compute_priority()compute_proposals()reevaluate()current_value : real goal_value : real satisfaction : integer importance : integer *MesoAgent compute_activation_order()BuildingAgent RoadAgent MesoAgent MicroAgent MicroAgent (a) The generic agent and constraint classes (b) Specialisation of the agent and constraint classes Agent1status : integerhappiness : reallife-cycle()trigger_plan()compute_happiness()choose_plan()trigger_agent()give_order()change_goal_value()Fig. 2. AGENT static model : data schemaThe generic Constraint class is specialised into several specific con-straints classes, one for every kind of cartographic constraint (cf Figure 2b). One agent is linked to one constraint object of every specific con-straint class that is relevant to its geographic nature (e.g. for a building, BuildingSizeConstraint, BuildingShapeConstraint, etc.).When a geographic agent is activated, it performs a life-cycle where it successively chooses one plan among those proposed by its constraints, tries it, validates its new state according to the constraints re-evaluation, and so on. The interactions between agents are hierarchical: a meso agent triggers its components, gives them orders or changes the goal values of their constraints (Ruas 2000).The AGENT model has been successfully applied to the generalisation of hierarchically structured data like topographical urban data (Lecordix et al 2007) and categorical land use data (Galanda 2003).Combining Three Multi-agent Based Generalisation Models 281 2.2. The C ART AC OM modelThe C ART AC OM model has been proposed by Duchêne (2004) to go be-yond the identified limits of the AGENT pyramidal model. It is intendedfor data where no obvious pyramidal organisation of the space is present, like topographical data of rural areas. In this kind of situation, it is difficult to identify pertinent disjoint groups of objects that should be generalisedtogether, and constraints shared by two objects are difficult to express as an internal constraint of a meso object.In C ART AC OM, only the micro level of agents is considered, and agentshave direct transversal interactions between each other. C ART AC OM fo-cuses on the management of constraints shared by two micro agents, thatwe call relational constraints. Examples of relational constraints are, for a building and a road, constraints of non overlapping, relative position, rela-tive orientation.The object representation of the constraints proposed in the AGENT model has been adapted to the relational constraints, which are shared by two agents instead of being internal to a single agent (Fig. 3). Two classesinstead of one are used to represent the constraints: Relation and Con-straint. The representation of a relational constraint is split into two parts:•the first part is relative to the objective description of the state of the constrained relation, which is identical from the point of view of bothagents and can thus be shared by them. This description is supported by a Relation object linked to both agents,•the second part is relative to the analysis and management of the constraint, which is different for each agent and should thus be separately described for each of them. This part is described by two Constraint objects: one for each agent sharing the relational constraint.Fig. 3. C ART AC OM static model: agents and constraints282 C. Duchêne and J. GaffuriIn order to improve the state of a relational constraint, in C ART AC OM an agent can use two kinds of “plans”: either apply to itself a generalisation algorithm, like in AGENT, or ask the other agent sharing the constraint to apply an algorithm to itself.When activated, an agent performs a life-cycle similar to the AGENT life-cycle. If AGENT internal constraints have been defined on the agent on top of its C ART AC OM relational constraints, the agent can perform its internal generalisation through a call to the AGENT life-cycle, which is then seen as a black box. In the case where the agent asks another agent to perform an action, it ends its life-cycle with a “waiting” status, and re-sumes action at the same point when it is next activated. The agents are ac-tivated in turn by a scheduler. Sending a message to another agent places it on the top of the scheduler’s stack, i.e. the agents trigger each others by sending messages.The C ART AC OM model has been successfully applied to low density, rural zones of topographical data, where the density is such that few con-textual elimination is necessary (Duchêne 2004).2.3. The GAEL modelThe GAEL model has been proposed by Gaffuri (2007). Its is intended for the management of the background themes like relief or land use, during an agent generalisation of “foreground” topographic themes by means of the AGENT or C ART AC OM model. The background themes differ from the foreground themes in that they are continuous (defined everywhere in the space) instead of being discrete and, from a generalisation point of view, they are more flexible than the foreground themes (thus they can ab-sorb most of the transformations of the foreground themes). Two types of cartographic constraints are considered in the GAEL model: constraints of shape preservation internal to a field theme, and constraints that aim to preserve a relation between a foreground object and a part of a background field (object-field constraint). An example of an object-field constraint is, for a river and the relief, the fact that the river has to remain in its drainage channel.In the GAEL model, a field theme is decomposed into subparts by means of a constrained Delaunay triangulation, like in (Højholt 2000). The field’s shape preservation constraints are expressed as constraints on sub-parts of the triangulations called sub-micro objects: segments, triangles, points (Figure 4a). The object-field constraints are expressed as relational constraints between a field agent and a micro agent of the AGENT or C ART AC OM model (Figure 4b, not represented in the class diagram ofCombining Three Multi-agent Based Generalisation Models 283 Figure 4a), and translated into constraints on sub-micro objects. The points that compose the triangulation are modelled as agents. The sub-micro ob-jects are thus groups of point agents. Each internal or object-field con-straint that concerns a sub-micro object is translated into forces on the point agents that compose it.Fig. 4. GAEL static model : sub-micro level, point agents, sub-micro and object-field constraintsWhen a point agent is activated, it computes and applies to itself a small displacement in the direction that would enable it to reach a balance be-tween the forces resulting from its constraints. Interactions between agents can be hierarchical or transversal. Field agents can trigger their point agents (hierarchical interaction), and point agents can directly trigger their neighbours (transversal interactions). This results in a progressive defor-mation of the field in answer to the deformations of the foreground themes. The GAEL model has been successfully applied (Gaffuri 2007) to the preservation of relations between buildings and relief (elevation) and hy-drographic network and relief (overland flow).2.4. Areas of applications of AGENT, C ART AC OM and GAEL: schematic summaryFigure 5 summarizes the main characteristics of the AGENT, C ART AC OM and GAEL models. AGENT is based on hierarchical interactions between agents that represent single geographic objects or groups of objects. The considered constraints are described as internal to a single agent and man-aged by this agent. This model is best suited for generalising dense areas where density and non-overlapping constraints are prevalent and strong contextual elimination is required. C ART AC OM is based on transversal in-teractions between agents that represent single geographic objects. The considered constraints are described as shared by two agents and managed by both concerned agents. This model is best suited for generalising low284 C. Duchêne and J. Gaffuridensity areas where more subtile relational constraints like relative orienta-tion are manageable. GAEL is based on transversal interactions between agents that represent points of geographic objects connected by a triangu-lation, and hierarchical interactions between these agents and agents that represent field geographical objects. The considered constraints are de-scribed either as shared by a field agent and a micro agent, or as internal to groups of connected point agents, and handled by these point agents. This model is best suited for the management of side-effects of generalisation on the background themes.Fig. 5. AGENT, C ART AC OM and GAEL model: target areas of application and levels at which constraints are described (GAEL object_field constraints are not represented)The three models are best suited for different kinds of situations that are all present on any complete topographic map. Thus they will have to be used together in a complete generalisation process. In the next section, scenarios are proposed for the combined use of the three models.3 Proposed scenarios to combine AGENT, C ART AC OM and GAELIn the subsections 3.1, 3.2 and 3.3, three complementary scenarios for the combined use of the models are studied, in which the synergy takes place at different levels. For each scenario, the underlying technical and research issues are identified.Combining Three Multi-agent Based Generalisation Models 2853.1. Scenario 1: separate use of AGENT, GAEL andC ART AC OM on a spatially and/or thematically partitioned datasetThis first scenario concerns the generalisation of a complete topographical dataset. Such a dataset contains both foreground and background themes (everywhere), and both rural and urban zones. In this scenario, we propose to split the space as shown in figure 5, both spatially and thematically, in order to apply each of the three models where it is a priori best suited: •urban foreground partitions are generalised using AGENT,•rural foreground partitions are generalised using C ART AC OM, •background partitions follow using GAEL.Let us notice that this scenario does not cover the complete generalisa-tion process but only a part of it. It is intended to be included in a larger generalisation process or Global Master Plan (Ruas and Plazanet 1996) that also includes steps for overall network pruning, road displacement us-ing e.g. the beams model (Bader 2001), and generalisation of background themes (on top of letting them follow the foreground themes). Actually, these additional steps would also be applied on either thematically or spa-tially split portions of the space.This scenario first requires adapted methods to partition the data in a pertinent way (here into foreground and background themes, into urban and rural zones). Then, whatever the partitioning, the resulting space por-tions are not independent because strong constraints exist between objects situated on each side of the borders: continuity of roads and other networks at spatial borders, inter-theme topological relations, etc. This interdepend-ence requires mechanisms for the management of side-effects on the data, i.e. to ensure that no spatial inconsistency is created with other portions of the space when applying one model on portion of the space. It also re-quires pertinent heuristics for the orchestration of the process: when to ap-ply which model on which partition.These issues are not new: they have already been discussed by (McMas-ter and Shea 1988; Brassel and Weibel 1988; Ruas and Plazanet 1996) re-garding the design of generalisation process composed of elementary algo-rithms. We are just a step forward here, since now we consider the combination of several generalisation processes based on different models.286 C. Duchêne and J. Gaffuri3.2. Scenario 2: “interlaced” sequential use of AGENT,C ART AC OM and GAEL on a set of objectsThis second scenario concerns the generalisation of a set of objects in-cluded in a single partition of the previous scenario i.e. a portion of either urban foreground space, rural foreground space, or background space. In this scenario, we propose to enable the “interlaced sequential use” of the models, i.e. calling successively two or more of the models on the same objects, possibly several times (e.g. AGENT then C ART AC OM then AGENT again).Indeed, experiments performed with the AGENT and C ART AC OM mod-els show that the previous scenario is not sufficient. The limit between a rural space that should a priori be generalised by C ART AC OM and a urban space that should a priori be generalised by AGENT is not so clear. In some zones of medium density, C ART AC OM enables to solve most of the conflicts while tackling also more subtile constraints like relative orienta-tion, but can locally encounter over-constrained situations. In this second scenario, such locally over-constrained situations can be solved by a dy-namic call to an AGENT hierarchical resolution. Conversely, not all the constraints shared by two objects in an urban zone can easily be expressed as an internal constraint of a group (meso agent) and solved at the group level. Thus, in scenario 1, some of them are given up, e.g. the constraint of relative orientation. Scenario 2 enables punctual use of C ART AC OM inside a urban zone, which could help in solving such subtile relational con-straints for which no group treatment is available. Regarding the thematic split between foreground and background, it seems that this distinction is not so well defined either. This is why in this scenario, some objects can be considered as foreground at one time of the process and background at other times. For instance, buildings are foreground when handling there re-lational constraints with the roads thanks to a C ART AC OM activation; but they are rather background when handling the overlapping constraints be-tween roads, as their behaviour at this time should just be to follow the other feature classes in order to prevent topological inconsistencies.To summarize, in scenario 2 the geographic objects of a dataset are able to play several roles during a generalisation process: an object can be mod-elled as an AGENT, C ART AC OM and GAEL agent at the same time and be successively triggered with an AGENT, C ART AC OM or GAEL behaviour (life-cycle). To be more precise, a same object of the micro level can be modelled and triggered both as an AGENT and C ART AC OM agent, and the points that compose it modelled and triggered as GAEL agents (as the GAEL model operates at the points level).To enable this, some mechanisms are required to detect the need to dy-namically switch to another model. This means, a mechanism is needed detect that the currently used model is unable to solve the situation, and identify the pertinent set of objects that should temporarily be activated with another model. Then, some consistency preservation mechanisms are required, not from a spatial point of view (this has already been tackled in scenario 1), but regarding the data in which an agent stores its representa-tion of the world. For instance, if a C ART AC OM activation is interrupted and an AGENT activation is performed that eliminates some agents, the neighbours of the eliminated agents should be warned when the C ART AC OM activation resumes, so that they can update their “mental state”. Otherwise, they could enter in an inconsistent state, with pending conversations and relational constraints with agents that do no longer exist.3.3. Scenario 3: simultaneous use of AGENT andC ART AC OM data on one objectThis third scenario concerns the generalisation decisions taken by an agent of the micro level (single geographic object) that is both modelled as an AGENT and as a C ART AC OM agent as proposed in scenario 2. Only the AGENT and C ART AC OM models are considered here since only these models operate at a common level (micro level).An agent that is both modelled as an AGENT and as a C ART AC OM agent has knowledge both of its internal constraints and of relational con-straints shared with other agents. But so far, including in scenario 2 above, only the internal constraints are taken into account when it behaves as an AGENT agent, and only the relational constraints are taken into account when it behaves as a C ART AC OM agent (during its C ART AC OM life-cycle, it can perform internal generalisation thanks to a call to the AGENT life-cycle as explained in 2.2, but the AGENT life-cycle is then seen as a “black box”). In this third scenario, an agent is able to consider both kinds of constraints at the same time when making a generalisation decision, be it in a C ART AC OM or in an AGENT activation. This means that, when choosing the next action to try, the agent takes into account both the pro-posals made by its internal and relational constraints (with the restriction that an agent activated by AGENT does not try an action consisting in ask-ing another agent to do something). And, to validate the action it has just tried, the agent takes into account the satisfaction improvement of both its internal and relational constraints. This scenario is not intended to intro-duce more relational constraints in urban zones than in scenario 2. It just proposes that, when such constraints have been defined (like the relativeorientation constraint), they can be taken into account at the same time asthe internal constraints. Provided relational constraints are parsimoniously added, and the relative importances and the relaxation rules of the internal and relational constraints are well defined, this scenario should not resultin over-constrained situations anywhere. And it has multiple advantages: •The aim of an agent activated by C ART AC OM (e.g. a rural building) is still to satisfy both internal and relational constraints, but it can satisfyall of them by trying the actions they suggest, while remaining in its C ART AC OM life-cycle. This is less computationally heavy than calling the AGENT life-cycle as a “black box”.•The aim of an agent activated by AGENT (e.g. a urban building) is still first to satisfy as well as possible its internal constraints. But, if it has relational constraints defined, they can prevent it from applying an internal algorithm that would decrease their satisfaction too much. For example, algorithms that square the angles of a building, or that transform it into a rectangle, can easily break relations of local parallelism between the building (or one of its walls) and another building or a road. (Steiniger, 2007, p. II-C-13) proposes to prevent this by forbidding the use of these algorithms in the parts of urban space classified as “inner city”, because this problem frequently occurs in this kind of area. This scenario 3 enables to avoid this kind of problem in a more adaptive way (only when it really occurs).•An agent activated by AGENT can also try internal actions specifically in order to improve the satisfaction of one of its relational constraints (like a small rotation in order to achieve parallelism with a neighbouring road). This is far less heavy than having to stop the AGENT activation and start a C ART AC OM activation on the whole urban block containing the building.•If micro-agents activated with AGENT cannot cope with some relational constraints because of “domino effects”, another way of solving these constraints can also be that the meso agent above seeks for a global solution by analysing the relational constraints of its components (e.g., in the above case the urban block identifies the buildings that should rotate).To enable this scenario 3, it is necessary to re-engineer the parts of theAGENT and C ART AC OM static models related to constraint representation so that internal (AGENT) and relational (C ART AC OM) constraints can both be handled by an agent within the same methods. Hence, the methods of the “Agent” class that use the constraints have to be modified, both in theAGENT and in the C ART AC OM model, in order to take into account the presence of both internal and relational constraints.4. How to put the proposed scenarios into practice4.1. Technical requirements underlying scenarios 1, 2 and 3: summaryIn sections 3.1, 3.2 and 3.3 we have presented three scenarios where the AGENT, C ART AC OM and GAEL models are used with an increasing de-gree of combination: separate use on separate zones (scenario 1), “inter-laced” sequential use on the same zone (scenario 2), shared use of data in-ternal to the models (scenario 3). The three scenarios are complementary and we intend to put all the three into practice in a medium term. The iden-tified underlying issues are summarized hereafter, starting from the most external elements of the models, to the most internal:1.Define methods to split the map space into relevant partitions, onspatial and/or thematic criteria [scenario 1]2.Define a strategy to know which model to apply when on whichportion of space [scenario 1]3.Define mechanisms to manage the side-effects at borders, whengeneralising one partition with one model [scenario 1]4.Define mechanisms to dynamically identify a set of geographicalobjects that require a temporary activation of another model than the currently active one [scenario 2]5.Define mechanisms to preserve the consistency of data internal to onemodel, when another model is running [scenario 2]6.Re-engineer the representation and management of constraints inAGENT and C ART AC OM so that internal and relational constraints can be handled together [scenario 3]The current status of the issues (1) to (5) is briefly described in the next section. The issue (6) is tackled more in deep in section 4.3.4.2. Status of the technical issues underlying scenarios 1 and 2The issues underlying the scenarios 1 and 2 (issues 1 to 5) in the list above) are part of a research that is currently beginning. However, for some of them we already have some elements of answer. Regarding the。

present, and future

present, and future

1 Motivation
Since its rst introduction in 1996, XML has been steadily progressing as the \format of choice" for data that has mostly textual content. Starting from nancial data, business transactions to data obtained from satellites - most of the data in today's web-driven world are being converted to XML. The two leading web application development platforms (.NET 9] and J2EE 16]) use XML web services - a standard mechanism for communication between applications. Given its huge success in the data and application domains, it is somewhat puzzling to see so little has been done in terms of conceptual and formal design areas involving XML. Literature shows di erent areas of application of design principles that apply to XML. Since XML has been around for a while, and only recently there has been an e ort towards formalizing and conceptualizing the model behind XML, such modeling techniques are still playing \catch-up" with the XML standard. The World-Wide-Web Consortium (W3C) has started an e ort towards a formal model for XML, which is termed DOM (Document Object Model) - a graph-based formal 1

IBM Informix软件产品系列:版本11.7说明书

IBM Informix软件产品系列:版本11.7说明书

IBM SoftwareIBM Informix software product family: Version 11.7 Discover the next decade of IBM Informix2 IBM Informix software product family: Version 11.7IBM® Informix® database software has been voted number one in customer satisfaction for two consecutive years, 2008 and 2009.1 Clients choose Informix because it is reliable, low-cost and hassle-free. Solution providers choose Informix for its best-of-breed embeddability. And it’s available on Microsoft® Windows®, Linux®, UNIX® and Mac OS X rmix 11.7 ushers in the next decade for Informix. As the engine that powers online transaction processing (OL TP) and decision support applications for businesses and partners of all sizes, it will continue to be a force in helping them to deliver smarter solutions. T ake a look at Informix 11.7 and you’ll discover unprecedented advantages: advanced performance, availability and efficiency, minimal complexity and lower computing costs. With the introduction of Informix Flexible Grid, it’s easy to perform routine maintenance and upgrades at all your geographic locations from a single site—and you can do it with zero downtime.ReliableInformix offers one of the most comprehensive sets ofhigh-availability options available from databases that can run on low-cost hardware, and Informix 11.7 takes that to new heights. With Informix Flexible Grid, you can distribute workloads to maximize your hardware investments while keeping data synchronized at all times, whether the servers are in the same room or on the other side of the world. In the case of a system failure, another server automatically takes over with no impact to your business. Further, Informix Flexible Grid lets you keep data available during planned outages without the need for temporary hardware or for cloning databases or applications. You can also useheterogeneous hardware and operating systems within the same grid. With Informix Flexible Grid, you can scalecapacity when and where needed, bring data on- or off-line, and change your data schema throughout your grid—all from a single location, all on demand and all with no downtime.Highlights• Reliable: Provides one of the industry’s widest sets ofoptions for keeping data available at all times, including zero downtime for maintenance• Low-cost: Cuts development costs by nearly eliminating up-front licensing costs typically incurred during the development phase• Hassle-free: Runs almost unattended with self-configuring, self-managing and self-healing capabilities• Best-of-breed embeddability: Provides a provenembedded data management platform for ISVs and OEMs to deliver integrated, world-class solutions, enabling platform independenceSoftware 3 Low-costInformix offers unparalleled cost-saving benefits by reducingthe up-front licensing costs typically incurred during thedevelopment phase. Businesses also benefit from the ongoingsavings in operational expenses, since Informix autonomicfeatures help eliminate outages and require less performancetuning. The autonomic capabilities can handle many of themanual tasks normally performed by a hands-onadministrator with other relational database management systems (RDBMS).Informix also offers data compression technology that can produce up to 80 percent savings on disk space and I/O improvements of up to 20 percent.2 With less data volumeto process, businesses can achieve faster search times, more efficient use of their hardware resources, and reduced backup and recovery times.Hassle-freeInformix is designed from the ground up for easy, scalable administration and includes features that allow Informix to “disappear” within an application. Thanks to the manyself-monitoring and self-healing features built into Informix, such as automated storage and memory management, the database software can be customized to proactively track and react to business thresholds—and resolve them before they become an issue.Small companies love Informix because they don’t needa large technical staff to run their data processing environment. As companies grow, Informix has them covered: customers with large deployments are able to manage an average of 2,000 Informix databases with a single DBA.3 Informix enables growing businesses to scale and manage their databases no matter where they are, on demand and with no downtime.Best-of-breed embeddabilityMore than 2,500 loyal business partners have established millions of Informix deployments around the world, most of which are never seen by the actual customers. For example, Informix is in thousands of worldwide retailers, 20 of the top 25 U.S. supermarkets and 95 percent of telecom service delivery providers.3 Market innovators such as China Mobile, Verizon, Lazare Kaplan International, DHL Express andT rafficmaster rely on Informix to deliver seamless and superior service to their customers.Informix is available on multiple versions of Windows, Linux, UNIX and Mac operating systems, so OEMs, ISVs and VARs are not locked into a single platform and can pursue more market opportunities. Informix is standards-based and supports a wide variety of integrated development environments (IDEs) and application development languages. The same installer can be used on all platforms and can be easily customized and hidden within an embedded solution. The software’s small footprint and self-managing capabilities make it an ideal choice for embedded data management solutions. Plus, Informix can be up and running in minutes, not days, and requires virtually zero administration.Informix is also widely deployed in virtualized environments, with a profile and feature set that lends itself well to virtualization. Customers can reduce costs through server consolidation and can prepare for moving to a true cloud-computing environment, where additional economies of scale are available when supporting customers in a Service Oriented Architecture (SOA). Want to learn more about how companies of all sizes are taking advantage of Informix? Read these success stories: /software/success/cssdb.nsf/softwareL2VW? OpenView&Count=30&RestrictToCategory=dmmain_Informix DynamicServer4 IBM Informix software product family: Version 11.7Informix: A complete data management solutionWhile Informix is low-cost and hassle-free, it still offers a comprehensive range of data management capabilities. In addition to those already described, Informix offers many other capabilities, such as:• Warehouse and business analytics: Provides strong performance and value as a single platform to address both OL TP and analytical requirements; higher performance for all types of workloads means fewer servers—and less space, power and cooling• Application development options and tools: Enables developers to choose their favorite language or IDE; now available are new or updated drivers for PHP, Ruby and Microsoft .NET, plus other compatibility features that make it easy to port applications to Informix• OpenAdmin Tool: Provides a graphical administration tool to monitor and administer all of your Informix environments from a single console• Native support for advanced data types:Provides enhanced data process functionality, enabling businesses to create richer applications with spatial, time series and geodetic data, as well as other capabilitiesInformix EditionsInformix offers many editions, including no-cost editions and for-purchase editions, to meet a wide variety of enterprise, ISV, VAR and OEM needs.Free EditionsDeveloper Edition: This edition provides all features of Informix for individual application development, testing and proof-of-concepts. Available on a wide variety of platforms, Developer Edition has some processing and storage limitations. Community support is available.Innovator-C Edition: Designed for low-cost embeddedor workgroup computing, the Innovator-C Edition allows small-to-midsize businesses to develop and deploy widely used database functionality, including replication and some clustering capabilities. The Innovator-C Edition is available on a wide variety of platforms and is limited to four cores over a maximum of one socket and 2 GB of memory. Selective Support at Elite Level is available as an optional purchase.For-purchase EditionsChoice Edition for Windows: This edition gives businesses, ISVs and OEMs with Windows-based environments the ability to develop and deploy near enterprise-class functionality at low cost. The Choice Edition is limited to eight cores over a maximum of two sockets and 8 GB of memory.Choice Edition for Mac OS X: This edition has the same functionality as the Choice Edition for Windows, but is available on Mac OS X.Growth Edition: This edition provides powerful data management capabilities for departmental solutions or small-to-midsize businesses, including unlimited replication cluster nodes to send or receive data updates from other nodes in the cluster. This edition supports up to two high-availability cluster nodes of any type. It is limited to 16 cores over a maximum of four sockets and 16 GB of memory. Growth Edition is available on all supported platforms.Ultimate Edition: This edition includes all Informix features on all supported platforms for development, deployment and distribution with unlimited scalability. It includes full grid and replication capabilities. Storage compression is available as an optional feature.Informix supports a vast number of platforms and versionsof operating systems. A complete list can be found online at /support/docview.wss?rs=630&uid=swg27013343&S_CMP=rnavSoftware 5Developer Edition Linux/UNIX/ Windows/Mac Free for one developerlicense; no distributionCommunity Publicdownload• 1 GB memory• 1 CPU• 8 GB storage limitStorage Optimization Featurenot availableInnovator-C Edition Linux/UNIX/ Windows/Mac Free to develop anddeploy (distributionrequires commerciallicense)Optional: SelectedSupport chargedper install/yearPublicdownload• 2 GB memory• 1 socket, total4 coresFeatures unavailable: StorageOptimization Feature, SDS, parallelfeatures, partitioning, AdvancedAccess Control, Informix WarehouseFeature (SQL Warehousing T ool),multiple secondary serversPartially available: 2-node grid;1 primary, 1 secondary server (HDR),ER (2 root nodes)Choice Editionfor Windowsand Choice Edition for Mac OS X Licensing via LU Socket,AUIncludes Supportand Subscriptionin first yearPassportAdvantage• 8 GB memory• 2 sockets, total8 coresFeatures unavailable: StorageOptimization Feature, SDS, parallelfeatures, partitioning, AdvancedAccess Control, Informix WarehouseFeature (SQL Warehousing Tool),multiple secondary serversPartially available: 2-node grid;1 primary, 1 secondary server (HDR,RSS), ER (2 root nodes)Growth Edition Linux/UNIX/ Windows/Mac OS X Licensing via LU Socket,AU, CS, PVUIncludes Supportand Subscriptionin first yearPassportAdvantage• 16 GB memory• 4 sockets, total16 coresFeatures unavailable: StorageOptimization Feature, parallel features,partitioningPartially available: 3-node grid;1 primary, any2 secondary servers(HDR, RSS, SDS), unlimited ER rootand leaf nodesUltimate Edition Linux/UNIX/ Windows/Mac Licensing via AU, CS,PVUIncludes Supportand Subscriptionin first yearPassportAdvantage• Unlimited For-charge feature: StorageOptimization Feature; everythingelse includedNote: AU: Authorized User Single Install; CAF: Continuous Availability Feature; CS: Concurrent Session; ER: Enterprise Replication; HDR: high-availability data replication; LU: Limited Use; PVU: Processor Value Unit; SDS: Shared Disk SecondaryInformix Editions summary table© Copyright IBM Corporation 2010IBM Corporation Software Group Route 100Somers, NY 10589 U.S.A.Produced in the United States of America October 2010All Rights ReservedIBM, the IBM logo, and Informix are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published.Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the web at“Copyright and trademark information” at /legal/copytrade.shtml Linux is a registered trademark of Linus T orvalds in the United States, other countries or both.Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.UNIX is a registered trademark of The Open Group in the United States and other countries.Other company, product or service names may be trademarks or service marks of others.1 The VendorRate 2009 Y ear End Report. /discoverinformix2 “Deep Compression with IBM Informix” /ibmdl/pub/software/data/sw-library/demos/informix/CompressionDemo.swf3 E mbedding Informix: Real World Experiences. /software/data/informix/embed/experience.htmlPlease RecycleIMB14078-USEN-02For more informationT o learn more about Informix and its capabilities, pleasecontact your IBM representative or IBM Business Partner or visit the following website: /software/data/informix。

数据库毕业设计外文翻译--图像系统简介

数据库毕业设计外文翻译--图像系统简介

附录1 外文原文COLOR SYSTEM OVERVIEWIn the age of office automation and electronic imaging, office documents are being processed, transported, and displayed in a variety of ways. The scope of document processing is enormous; it encompasses page layout, document length, collation, simplex/duplex, color, image quality, finishing, and binding. If the office system is networked, then another dimension of network-related issues-protocol, file format, page description language, compression/decompression, job management, error handling, user interface, and device driver-has to be addressed. Digital color-imaging systems process electronic information from various sources; images may come from a local-area network, a remote-sensing device, different color workstations, or a local scanner. After processing, a document is usually compressed and transmitted to several places via a computer network for viewing, editing, or printing. Moreover, the trend in the industry is moving toward an open environment. This means that various devices such as scanners, computers, workstations, modems, and printers from multiple vendors are assembled into one system. Implementations should be based on public-domain technology rather than proprietary standards. This will allow vendors equal access to the market for system components and give users the widest choice in selecting components. It is a vastly large task to enable the communication of all system components regardless of differences in the operating system, file format, page description language, and information content. Ideally, the exchange should not cause information loss or alteration. A closer look at a document may reveal that it consists of different types of images, primarily text, graphs, and pictorial images. These all have different image characteristics and representations such as ASCII (American Standard Code for Information Interchange) for text, vector for graphs, and raster for pictorial images. Each type of image and its associated attributes like the font, font size, halftone, gray level, resolution, and color have to be dealt with differently. In such a complex environment, there is no doubt that many compatibility problems occur when an image is acquired, transmitted, displayed, and rendered. ?With the fast development of Internet technology, large volumes of data in the form of electronic documents from the Web. For the purposes of data integration and data exchange, more and moreexisting sources, such as relational databases, support public XML export, and increasing amount of public and private data is described in a semi-structured way. A number of issues need to be addressed when we integrate data from different sources, including heterogeneous and duplicate data, multiple divisions and partners, and changes.? Data heterogeneity results from the use of different information management systems to store data and each system has its own data structure and access methods. Relational database management systems benefit from the universal acceptance of Structured Query Language (SQL) as the primary means of getting answers whilst document and email repositories are generally accessed using text search engines with varying interfaces and capabilities. Because these systems were not designed with interoperability in mind, each must generally be accessed using source-specific applications or application programming interfaces (APIs).? Another difficulty in data integration is data duplication-different systems represent the same piece of data in different ways. For example, customers may be identified by name in one database, but by account number in a second repository, may identify the same customer by email address. Frequently a required piece of information is derived from multiple data points. Data integration is further complicated when customers do business with multiple divisions within a large company, or with other partners. Similarly, answering questions about the state of a company's supply chain requires access to vendor and distributor information sources. Doing business electronically across the firewall gives rise to security and data ownership issues. Finally, data integration has to deal with different types of changes; change in business requirements and strategies, in IT systems, mergers and acquisitions, and new product launches. This demands that a data integration solution be sufficiently flexible and adaptable.One possible solution for the data integration problems mentioned above is to provide an XML Web services break down the barriers between different computing platforms, development environments and communications networks, allowing organizations to work together electronically without the expense and delay of agreeing on semantics, schema, interfaces, and other application integration. XML provides the flexibility for handling data with differing structures. As XML is becoming the principal medium for data exchange over the Web and for information integration in general,increasing amounts of public and private data are described in XML. XML data is usually defined in a tree or graph based, self-describing object instance model (Boncz and Kersten, 1999). However, semi-structured data is incompatible with the flat structure of relational database tables, and so the growth of XML data requires new and complex query optimization techniques.Creating XML files with a text editor would be a lot easier if you didn't have to close all those HTML tags. First you have to add the XML declaration and the root opening and closing HTML tags. Next, you start adding element opening and closing tags one at a time. Of course, once you have the initial sequence completed you can just copy and paste to repeat the required elements. After doing this hundreds of times you'll be looking for a faster way to create XML files.Some XML editors will automatically add the closing tag after you have finished typing the opening tag but, you still have to type the brackets around the opening tag. I kept thinking this process should be easier. So, I came up with a solution that allows you to create XML files without using HTML tags.This console application will create an XML file based on user input. Just enter the file name, how many element fields you want, and the name of each field. Optionally, you can include a data type separated by a comma after the field name. You can just enter the field name because the data type is not required. The structure of the XML file that is created will be compatible with the .NET Dataset and can be easily added to a database.In addition to creating the XML file, an XSL file and HTML file are also created. The HTML file uses client side JavaScript to transform the XML file using the XSL file. This provides an easy way to view the new XML file by displaying it in a table layout.The download includes both the source code and the already compiled application. You can start using the executable right away or customize it to meet your needs. All you will need is the .NET Framework and a text editor, like Notepad, to build this application.Improving ASP Performance with Data CachingOne of the nicest features of is the ability to cache page content. This can be used to substantially reduce load on a website's database - which is an obvious attraction if the site uses Microsoft's Access to store data rather than SQL Server. Unfortunately there is no built in cachingsystem in classic ASP, but it is easy to build one by using the Application object to store data.When to use ASP Caching. Caching is most useful for data that changes - but not too often. For example an e-commerce store could display a list of popular products, or an information site could display a list of press releases.Don't forget that it is also possible to build functionality into the admin part of the site so that the cache would be flushed if new content is added to the database. That way the website administrator would not have to wait until the cache timed out in order for new content to appear on the website. Remember that data stored in Application variables is visible by all the users of the website。

英语词汇学术语翻译

英语词汇学术语翻译

英语词汇学术语翻译 Ting Bao was revised on January 6, 20021T e r m i n o l o g y T r a n s l a t i o n s o n l e x i c o l o g y英语词汇学术语翻译Aacronym 首字母拼音词acronymy 首字母拼音法addition 增词adjective compound 复合形容词affective meaning 感情意义affix 词缀affixation 词缀法Albanian 阿尔巴尼亚语(族)aliens 非同化词alliteration 头韵(法)allomorph 词素(形位)变体ambiguity 歧义amelioration of meamng 词义的升华analogy 类推analytic language 分析性语言antithsis 对偶antonym 反义词antonymy 反义关系appreciative term 褒义词archaic word 古词archaism 古词语argot 隐语(黑话)Armenian 亚美尼亚语(族)Associated transfer 联想转移association 联想associative meanings 关联意义Bback-formation 逆生法back clipping 词尾截短Balto-Slavic 波罗斯拉夫语(族)bilinguall 双语的basic word stock 基本词汇blend 拼缀词blending 拼缀法borrowed word 借词bound form粘着形式bound morpheme 粘着语素(形位)bound root 粘着词根Ccasual style 随便文体catchPhrase 时髦语Celtic 凯尔特语(族)central meaning 中心意义Clipping 截短法collocability 搭配能力collocation 搭配collocative meaning 搭配意义colloquialism 口语词(口语体)complete synonym 完全同义词complex word 复杂词composition 复合法compound 复合词compounding 复合法concatenation 连锁型concept 概念conceptual meaning 概念意义connotation 内涵connotative meanins 内涵意义constituent 要素.成分consultative style 交谈体(咨询体)content word 实义词context 语境contradictory term矛盾反义词contrary terms 对立反义词conversion 转类法couplet 成对词Dde-adjective 由形容词转化的de-adjectival 由形容词转化的degradation of meaning 词义的降格deletion 减词denizen 同化同denominal 由名词转化的denotation外延denotative meaning 外延意义derivation 派生法derivational affiX 派生词缀derivative 派生词derived meaning 派生意义derogatory sense 贬义desk dictionary 案头词典deverbal noun 由动词转化的名词deverbal suffix 加于动词的后缀diachronic approach 历时角度dialectal word 方言词discipline 学科dismembering 肢解distribution分布doublet 成对词duplication of synonyms 同义词并举Fformal 正式的free form 自由形式free morpheme 自由语素(形位)free root 自由词根frontclipping 首部截短front and back clipping 首尾部截短frozen style 拘谨体full conversion 完全转换functional shift 功能转换Ggeneralisation of meaning 词义的扩大Germanic 日耳曼语族grammatical meaning 语法意义gradable adjective 等级形容词grammatical context 语法语境grammatical feature 语法特征graphology 书写法;图解法HHellenic 希腊语族heterogeneous 多质的highly-inflected 高度屈折化的homograph 同形异义词homonym 同形同音异义词homonymy 同形同音异义关系homphone 同音异义词hyperonym 上义(位)词hyponym 下义(位)词hyponymy 上下义(位)关系Iidiom 习语idiomatic expression 习惯表达idiomaticity 习语程度Indo-European Language Family 印欧语系Indo-Iranian 印伊语族inflection 屈折变化inflectional affix 屈折词缀intensity of meaning 意义强度initialism 首字母缩略词intermediate member 中间成分intimate style 亲昵语体Italic 意大利语族Jj uxtaposition of antonyms反义词并置L1exical context 词汇语境lexical item 词汇项目lexicography 词典学lexicology 词汇学lexis 词汇linguistic context 语言语境literary 书面的loan word 借词lexical meaning 词汇意义Mmarked term有标记项metaphor 暗喻metonymy 换喻monolingual 单语的morph 形素monomorphemic 单语素的monosemic 单义的morpheme 词素(形位)morphological structure 形态结构morphology 形态学motivation 理据motivated 有理据的Nnative word 本族语词neoclassical 新古典词的neologism 新词语notional word 实义词Oobjective meaning 客观意义obsolete 废弃词onomatopoeic motivation 拟声理据Orthographic feature 拼写特征PPartial conversion 部分转化Pejoration 贬义化Perpect homonym 同形同音异义词phonetic feature 语音特征phono1ogical 音位学的phonology 音位学phrasal verb 短语动词phrase clipping 短语截短pocket dictionary 袖珍词典polysemic 多义的polysemous 多义的polysemant 多义词polysemantic 多义的polysemy 多义关系pormanteau word 拼级词positionshifting 移位prefix 前缀prefixation 前缀法primary meaning 原始意义productivity 多产性pun 双关语Rradiation 辐射range of meaning 词义范围reduplication 重叠referent 所指物reference 所指关系referential meaning 所指意义regional variety 地域变体register 语域reiteration(意义)重复。

数据库设计外文翻译

数据库设计外文翻译

英文摘要Data Transformation ServicesDTS facilitates the import, export, and transformation of heterogeneous data. It supports transformations between source and target data using an OLE DB-based architecture. This allows you to move and transform data between the following data sources:∙Native OLE DB providers such as SQL Server, Microsoft Excel, Microsoft Works, Microsoft Access, and Oracle.∙ODBC data sources such as Sybase and Informix using the OLE DB Provider for ODBC.∙ASCII fixed-field length text files and ASCII delimited text files.For example, consider a training company with four regional offices, each responsible for a predefined geographical region. The company is using a central SQL Server to store sales data. At the beginning of each quarter, each regional manager populates an Excel spreadsheet with sales targets for each salesperson. These spreadsheets are imported to the central database using the DTS Import Wizard. At the end of each quarter, the DTS Export Wizard is used to create a regional spreadsheet that contains target versus actual sales figures for each region.DTS also can move data from a variety of data sources into data marts or data warehouses. Currently, data warehouse products are high-end, complex add-ons. As companies move toward more data warehousing and decision processing systems, the low cost and ease of configuration of SQL Server 7.0 will make it an attractive choice. For many, the fact that much of the legacy data to be analyzed may be housed in an Oracle system will focus their attention on finding the most cost-effective way to get at that data. With DTS, moving and massaging the data from Oracle to SQL Server is less complex and can be completely automated.DTS introduces the concept of a package, which is a series of tasks that are performed as a part of a transformation. DTS has its own in-process component object model (COM) server engine that can be used independent of SQL Server and that supports scripting for each column using Visual Basic® and JScript® development software. Each transformation can include data quality checks and validation, aggregation, and duplicate elimination. You can also combine multiple columns into a single column, or build multiple rows from a single input.Using the DTS Wizard, you can:∙Specify any custom settings used by the OLD DB provider to connect to the data source or destination.∙Copy an entire table, or the results of an SQL query, such as those involving joins of multiple tables or distributed queries. DTS also can copy schema and data between relational databases. However, DTS does not copy indexes, stored procedures, or referential integrity constraints.∙Build a query using the DTS Query Builder Wizard. This allows users inexperienced with the SQL language to build queries interactively.∙Change the name, data type, size, precision, scale, and nullability of a column when copying from the source to the destination, where a valid data-type conversion applies.∙Specify transformation rules that govern how data is copied between columns of different data types, sizes, precisions, scales, and nullabilities.∙Execute an ActiveX script (Visual Basic or JScript) that can modify (transform) the data when copied from the source to the destination. Or you can perform any operation supported by Visual Basic or JScript development software.∙Save the DTS package to the SQL Server MSDB database, Microsoft Repository, or a COM-structured storage file.Schedule the DTS package for later execution.Once the package is executed, DTS checks to see if the destination table already exists, then gives you the option of dropping and recreating the destination table. If the DTS Wizard does not properly create the destination table, verify that the column mappings are correct, select a different data type mapping, or create the table manually and then copy the data.Each database defines its own data types and column and object naming conventions. DTS attempts to define the best possible data-type matches between a source and a destination. However, you can override DTS mappings and specify a different destination data-type, size, precision, and scale properties in the Transform dialog box.Each source and destination may have binary large object (BLOB) limitations. For example, if the destination is ODBC, then a destination table can contain only one BLOB column and it must have a unique index before data can be imported. For more information, see the OLE DB for ODBC driver documentation.Note DTS functionality may be limited by the capabilities of specificdatabase management system (DBMS) or OLE DB drivers.DTS uses the source object’s name as a default. However, you can also add double quote marks (“ “) or square brackets ([ ])around multiword table and column names if this is supported by your DBMS.Data Warehousing and OLAPDTS can function independent of SQL Server and can be used as a stand-alone tool to transfer data from Oracle to any other ODBC- or OLE DB-compliant database. Accordingly, DTS can extract data from operational databases for inclusion in a data warehouse or data mart for query and analysis.Figure 4. DTS and data warehousingIn the previous diagram, the transaction data resides on an IBM DB2 transaction server. A package is created using DTS to transfer and clean the data from the DB2 transaction server and to move it into the data warehouse or data mart. In this example, the relational database server is SQL Server 7.0, and the data warehouse uses OLAP Services to provide analytical capabilities. Client programs (such as Excel) access the OLAP Services server using the OLE DB for OLAP interface, which is exposed through a client-side component called Microsoft PivotTable® Service. Client programs using PivotTable Service can manipulate data in the OLAP server and even change individual cells.SQL Server OLAP Services is a flexible, scalable OLAP solution, providing high-performance access to information in the data warehouse. OLAP Services supports all implementations of OLAP equally well: multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and a hybrid (HOLAP). OLAP Services addresses the most significant challenges in scalability through partial preaggregation, smart client/server caching, virtual cubes, and partitioning.DTS and OLAP Services offer an attractive and cost-effective solution. Data warehousing and OLAP solutions using DTS and OLAP Services are developed with point-and-click graphical tools that are tightly integrated and easy to use. Furthermore, because the PivotTable Service client is using OLE DB, the interface is more open to access by a variety of client applications.Issues for Oracle versions 7.3 and 8.0Oracle does not support more than one BLOB data type per table. This prevents copying SQL Server tables that contain multiple text and image data types with modification. You may want to map one or more BLOBs to the varchar data type and allow truncation, or split a source table into multiple tables. Oracle returns numeric data types such as precision = 38 and scale =0, even when there are digits to the right of the decimal point. If you copy this information, it will be truncated to integer values. If mapped to SQL Server, the precision is reduced to a maximum of 28 digits.The Oracle ODBC driver does not work with DTS and is not supported by Microsoft. Use the Microsoft Oracle ODBC driver that comes with SQL Server. When exporting BLOB data to Oracle using ODBC, the destination table must have an existing unique primary key.Heterogeneous Distributed QueriesDistributed queries access not only data currently stored in SQL Server (homogeneous data), but also access data traditionally stored in a data store other than SQL Server (heterogeneous data). Distributed queries behave as if all data were stored in SQL Server. SQL Server 7.0 will support distributed queries by taking advantage of the UDA architecture (OLE DB) to access heterogeneous data sources, as illustrated in the following diagram.Figure 5. Accessing heterogeneous data sources with UDA翻译DTS 使进口,出口和不同的数据的转变变得容易。

chapter04-数据挖掘概念与技术PPT课件

chapter04-数据挖掘概念与技术PPT课件
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
5
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
1
Chapter 4: Data Warehousing and On-line Analytical Processing
Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse Design and Usage Data Warehouse Implementation Data Generalization by Attribute-Oriented

大模型知识库数据格式

大模型知识库数据格式

大模型知识库数据格式The format of a large-scale knowledge base is crucial for the effective retrieval and utilization of information. 大规模知识库的数据格式对于信息的有效检索和利用至关重要。

The data format determines how information is stored, organized, and accessed, which ultimately impacts the efficiency and accuracy of knowledge retrieval. 数据格式决定了信息的存储、组织和访问方式,最终影响到知识检索的效率和准确性。

One of the most common data formats for large-scale knowledge bases is the graph-based structure, which represents information as nodes and edges. 对于大规模知识库来说,最常见的数据格式之一是基于图形的结构,将信息表示为节点和边。

This format is well-suited for representing complex relationships between entities and allows for efficient traversal and querying of interconnected data. 这种格式非常适合表示实体之间的复杂关系,并允许对相互连接的数据进行高效的遍历和查询。

Another popular data format is the table-based structure, where information is stored in a tabular form with rows and columns. 另一种常见的数据格式是基于表格的结构,其中信息以行和列的形式存储。

计算机英语单词词

计算机英语单词词

计算机英语单词词部门: xxx时间: xxx整理范文,仅供参考,可下载自行编辑计算机英语词汇<1)1. artificial intelligence 人工智能2. paper-tape reader 纸空阅读机3. optical computer 光学计算机4. neural network 神经网络5. instruction set 指令集6. parallel processing 平行处理7. difference engine 差分机8. versatile logical element 通用逻辑器件9. silicon substrate 硅基10.vacuum tube 真空管<电子管)11. the storage and handling of data 数据的存储与处理12.very large-scale integrated circuit 超大规模集成电路13. central processing unit 中央处理器14.personal computer 个人计算机15. analogue computer 模拟计算机16.digital computer 数字计算机17. general-purpose computer 通用计算机18. processor chip 处理器芯片19. operating instructions 操作指令20. input device 输入设备21. circuit board 电路板22. beta testing β测试23. thin-client computer 瘦客户机电脑24. cell phone 蜂窝电话<移动电话)25. digital video 数码摄像机,数码影视26. Pentium processor 奔腾处理器27. virtual screen 虚拟屏幕28. desktop computer specifications 台式计算机规格29. radio frequency 射频30. wearable computer 可佩带式计算机31. Windows Registry 视窗注册表32. swap file 交换文件33. TMP file 临时文件34. power plug 电源插头35. free disk space 可用磁盘空间36. Control Panel 控制面板37. Start Menu 开始菜单38. Add/Remove Programs option 添加∕删除程序选项1. information retrieval 信息检索2. voice recognition module 语音识别模块3. touch-sensitive region 触感区,触摸区4. address bus 地址总线5. flatbed scanner 平板扫描仪6. dot-matrix printer 点阵打印机<针式打印机)7. parallel connection 并行连接8. cathode ray tube 阴极射线管9. video game 电子游戏<港台亦称电玩)10. audio signal 音频信号11. operating system 操作系统12. LCD (liquid crystal display>液晶显示<器)b5E2RGbCAP13. inkjet printer 喷墨打印机14. data bus 数据总线15. serial connection 串行连接16. volatile memory 易失性存储器17. laser printer 激光打印机18. disk drive 磁盘驱动器19. BIOS (Basic Input Output Sys tem> 基本输入输出系统p1EanqFDPw20. video display 视频显示器21. ISA slot ISA总线槽22. configuration register 配置寄存器23. still camera 静物照相机24. token packet 令牌包25. expansion hub 扩展集线器26. USB<Universal Serial Bus)通用串行总线27. root hub 根集线器28. I/O device 输入输出设备29. control frame 控制帧30. PCI (Peripheral Component In terconnect> 外部设备互连DXDiTa9E3d31. video tape 录像带32. aspect ratio <电视、电影图像的)高宽比,纵横比33. CD-RW 可擦写光驱34. laser diode 激光二极管35. reflective layer反射层36. optical disk光盘37. high resolution高分辨率38. floppy disk 软盘1. data set 数据集2. pointing device 指点设备3. graphical user interface 图形化用户界面4. time-slice multitasking 分时多任务处理5. object-oriented programming 面向对象编程6. click on an icon 点击图标7. context switching 上下文转换8. distributed system 分布式系统9. pull-down lists of commands 命令的下拉列表10. simultaneous access 同时访问11. command-line interface 命令行界面12. multitasking environment 多任务化环境13. spreadsheet program 电子制表程序14.main memory 主存15. storage media 存储介质16. disk file 磁盘文件17. command interpreter 命令解释器18. network connection 网络连接19.DOS (disk operating system> 磁盘操作系统20. copy a data file 拷贝数据文件21. serial port 串行端口22. configuration utility 配置工具23. ISDN 综合业务数字网24. token ring 令牌环25. fast Ethernet 快速以太网26. virtual memory 虚拟内存27. source code 源代码28. swap space 交换空间29. Internet protocol 因特网协议30.SVGA (Super Video Graphics Array> 超级视频图形阵列31. network throughput 网络吞吐量32. registry access 注册表存取33. scalable file server 规模可变的文件服务34. static Web page 静态网页35. physical memory 物理内存36. Plug and Play 即插即用37. network adapter 网络适配器38.SMP (symmetric multiprocessing> 对称多任务处理1.storage register 存储寄存器2.function statement 函数语句3.program statement 程序语句4.object-oriented language 面向对象语言5.assembly language 汇编语言6.intermediate language 中间语言,中级语言7.relational language 关系<型)语言8.artificial language 人造语言9.data declaration 数据声明10. SQL 结构化查询语言11. executable program 可执行程序12. program module 程序模块13.conditional statement 条件语句14. assignment statemen t赋值语句15.logic language 逻辑语言16. machine language 机器语言17.procedural language 过程语言18. programming language 程序设计语言19. run a computer program 运行计算机程序20. computer programme r 计算机程序设计员1.function call 函数调用2.event-driven programming 事件驱动编程3.click on a push button 点击按钮4.application window 应用程序窗口5.class hierarchy 类继承6.child window 子窗口7.application development environment 应用程序开发环境8.pull-down menu 下拉菜单9.dialog box 对话框10.scroll bar 滚动条1.native code 本机代码2.header file 头文件3.multithreaded program 多线程编程4.Java-enabled browser 支持Java的浏览器5.machine code 机器码6.assembly code 汇编码7.Trojan horse 特洛伊木马程序8.software package 软件包1.inference engine 推理机2.system call 系统调用3.compiled language 编译语言4.parallel computing 平行计算5.pattern matching 模式匹配6.free memory 空闲内存7.interpreter program 解释程序8.library routine 库程序9.intermediate program 中间程序,过渡程序10.source file 源文件11.interpreted language 解释<性)语言12.device driver 设备驱动程序13.source program 源程序14.debugging program 调试程序15.object code 目标代码16.application program 应用程序17.utility program 实用程序18.logic program 逻辑程序19. ink cartridge 墨盒20.program storage and execution 程序的存储与执行1.Windows socket Windows套接字接口2.Winsock interface Winsock接口3.file repository 文件属性4.client-side application 客户端应用程序5.HTML tag HTML标记6.Web browser 万维网浏览器7.hardware platform 硬件平台8.custom control 定制控件9.OLE (object linking and embedding> 对象链接和嵌入10.WAN (wide area network> 广域网1.search path 搜索路径2.dynamic library 动态链接库3.code set 代码集4.ancestor menu 祖辈菜单5.end user 最终用户6.menu item 菜单项7.cross-platform application 跨平台应用程序8.character set 字符集1.procedure call 过程调用2.structured message protocol 结构化消息协议3.secure protocol 安全协议4.networking protocol 网络协议5.processing node 处理节点6.homogeneous system 同构系统7.cost effectiveness 成本效益8.message encryption 信息加密<术)9.message format 信息格式10.component code 组件编码11.sequential program 顺序程序12.multicast protocol 多址通信协议13.routing algorithm 路由算法14.open system 开放式系统15.heterogeneous environment 异构型环境16.distributed processing 分布式处理17.resource sharing 资源共享18.structured message passing 结构化信息传送19.communication(s> link 通信链路20.development tool 开发工具1.logical entity 逻辑实体2.client-server architecture 客户机-服务器结构3.CPU cycle CPU周期4.graphics acceleration 图形加速5.software licensing 软件许可6.word-processing application 字处理应用程序7.load balancing 负载平衡8.remote procedure call 远程过程调用9.hardware configuration 硬件配置10.peer-to-peer network 对等网络1.font server 字体服务器2.data management logic 数据管理逻辑规则3.disk space 磁盘空间4.conceptual model 概念模型5.client-server model 客户–服务器模型6.graphics display 图形显示7.general-purpose hardware 通用硬件8.system expandability 系统可扩展性(3>RTCrpUDGiT1. language precompiler 程序语言预编译器2. business logic implementation 业务逻辑实现3. query processor 查询处理器4. data modeling 数据建模5. storage engine 存储引擎6. tiered architecture 分层结构7. database manager 数据库管理员8. data presentation layer 数据表现层9. logical database design 逻辑上的数据库设计10. entity relationship diagram 实体关系图11. query language 查询语言12. host language 主机语言13.Data Modification Language (DML> 数据修改语言14. data redundancy 数据冗余15. relational database 关系数据库16. relational data model 关系数据模型17. database management system (DBMS> 数据库管理系统18.data element 数据元素19. data access 数据存取20. query optimization 查询优化1. global temporary table 全局临时表2. partitioned data 分区的数据3. virtual table 虚拟<临时)表4. permanent table 永久<固定)表5. log out of a system 退出登录的系统6. primary key 主键7. foreign key 外键8. database object 数据库对象9. clustered index 簇索引10. local temporary table 本地临时表1. data module 数据模块2. object repository 对象库3. local database 本地<机)数据库4. client dataset 客户端数据集5. remote database server 远程数据库服务器6. flat file 平面文件7. data source 数据源8. Distributed Component Object Model (DCOM> 分布式组件对象模型5PCzVD7HxA1. microwave radio 微波无线电2. digital television 数字电视3. DSL 数字用户线路4. analog transmission 模拟传输5. on-screen pointer 屏幕<触摸屏)上的指示<器)6. computer terminal 计算机终端7. radio telephone 无线电话8. cellular telephone 蜂窝电话<移动电话)9. decentralized network 分散的网络10. wire-based internal network 基于普通网线的内部网络11.fiber-optic cable 光缆12. fax machine 传真机13. wireless communications 无线通信14. point-to-point communications 点对点通信15. modulated electrical impulse 调制电脉冲16. communication(s> satellite 通信卫星17. telegraph key 电报电键17. transmission medium 传输媒体19. cordless telephone 无绳电话20.metal conductor 金属导体1. error recovery 错误恢复2. parity function 奇偶函数3. video on demand 视频点播4. collision detection 冲突检测5. protocol layering 协议层6. architectural model 体系结构模型7. packet switching 包交换8. enterprise network 企业网9. protocol suite 协议组commercial backbone 商用骨干网1. high-definition TV 高清晰度电视2. frame relay 帧中继3. data rate 数据传输率4. metropolitan area network 城域网5. set-top box 机顶盒6. multi-mode fiber 多模光纤7. protocol stack 协议堆栈8. VPI (virtual path identifier> 虚拟路径标识符1. coaxial cable 同轴电缆2. computer networking 计算机网络3. multiple-access network 多路访问网络4. management software 管理软件5. broadband connection 宽带连接6. confidential information 机密信息7. monolithic system 单片机系统8. star network 星型网络9. bus network 总线型网络10. ring network 环形网络11. network resources 网络资源12.public key system 公钥体制13.public telephone network 公用电话网14. data encryption system 数据加密系统15. information superhighway 信息高速公路16. information age 信息时代17. computer security 计算机安全18. data network 数据网19. data link 数据链路20. access protocol 存取协议1. switched internetwork 交换式内部网2. routing protocol 路由协议3. carrier sense 载波侦听4. spanning tree 生成树5. hierarchical network 分层网络6. dynamic routing 动态路由选择7. VLAN (virtual local area network> 虚拟局域网8. UNI (user network interface> 用户网络接口9. campus network 校园网10. modular model 模块模型1. diskless workstation 无盘工作站2. group scheduling 成组调度3. remote node 远程节点4. printer port 打印口5. remote access 远程访问6. DUN (Dial-Up Networking> 拨号联网7. parallel port 并行端口NOS (network operating system> 网络操作系统(4>jLBHrnAILg1. network layout 网络布局xHAQX7 4J0X2. physical topology 物理拓扑结构3. logical topology 逻辑拓扑结构4. star configuration 星型结构5. physical network connection 物理网络连接6. high-end active hub 高端主动式集线器7. passive hub 被动式集线器8. network node 网络节点9. electrical ground 电气接地10. data flow 数据流11.wiring closet 布线室12. multistation access unit 多站访问单元13.star topology 星形拓扑结构14.bus topology 总线拓扑结构15. ring topology 环形拓扑结构16. network topology 网络拓扑结构17. centralized network management 集中式网络管理18. intelligent hub 智能集线器19. network hub 网络集线器20.physical network 物理网络1. heterogeneous network 异构网络2. packet delivery 包发送3. IBM compatible IBM兼容的4. IP datagram IP数据报5. DOS box DOS箱<机)6. HTTP (Hypertext Transfer Protocol> 超文本传送协议7. NNTP (Network News Transfer Protocol> 网络新闻传送协议8. SMTP (Simple Mail Transfer Protocol> 简单邮件传送协议9. security hole 安全漏洞10. system crash 系统崩溃1. physical address 物理地址2. data transfer 数据迁移3. header checksum 报头校验4. stream delivery <数据)流发送5. virtual circuit 虚电路6. network layer 网络层7. full-duplex transmission 全双工传输ARP (Address Resolution Protocol> 地址解释协议1. list server 列表服务器2. transmission scheme 传输模式3. data packet 数据包4. Mbps 每秒兆字节5. hypermedia document 超媒体文档6. FTP 文件传输协议7. host network 主机网络8. dedicated access 专线访问9. storage format 存储格式10. mail server 邮件服务器11. multimedia file 多媒体文件12.dial-up access 拨号访问13. LAN (local area network> 局域网14. retrieve files 检索文件15. ISP (Internet Service Provider> 因特网服务供应商16. WWW (World Wide Web> 万维网17. URL (Uniform Resource Locator> 统一资源定位符18.TCP (Transmission Control Protocol> 传输控制协议19. data stream 数据流20. log on 登录1. plain text 纯文本2. destination address3. mail-user agent 邮件用户代理4. message transfer agent 消息传送代理5. graphics-based file6. analog signal 模拟信号LDAYtRyK fE7. domain name 域名8. text file 文本文件9. text editor 文本编辑器10. e-mail address 电子邮件地址1. sound card 声卡2. Web page 网页3. video camera 摄像机,摄像头4. plug-in software 嵌入软件5. input/output port 输入∕输出端口6. home page 主页7. video capture card 视频捕获卡8. chat room 聊天室1. electric motor 电动机2. desktop publishing 桌面出版系统<台式出版系统)3. information-related services 信息相关服务4. information-based occupation 基于信息的职业5. information processor 信息处理6. textual data 文本的数据Zzz6ZB 2Ltk7. numerical data 数字的数据8. audio data 音频数据9. fibre optics 纤维光学10.digital thermometer 数字温度计11.information revolution 信息革命12.technological revolution 技术革命13.global market 全球市场dvzfvkw MI114. IT (information technology> 信息技术15. multimedia product 多媒体产品16.information specialist 信息专家17.database management 数据库管理18.video data 视频数据19. information-processing system 信息处理系统20.telephone helpline 电话服务热线1. tabular data 表格数据2. raster image 光栅图像3. vector model 矢量模型4. statistical analysis system 统计分析系统5. model atmospheric circulation 模拟大气循环6. computer-based tool 基于计算机的工具7. geographic information system 地理信息系统8. database operation 数据库操作9. grid cell 网格单元10.closed loop 闭环1. domain-specific tag 特定<指定)域标记2. handheld terminal 手持终端设备3. life cycle 生命周期<生存周期)4. mobile agent toolkit 移动代理工具包5. XML (eXtensible Markup Language> 扩展标签语言6. data mining 数据挖掘7. game theory 博弈论8. keyword-based text search(ing> 基于关键字的搜索(5>rqyn14ZNXI1.user authentication 用户认证2.electronic purse 电子钱包3.information filter 信息过滤4.data integrity 数据完整性5.smart card 智能卡6.HTML 超文本标记语言7.symmetric key cryptosystem 对称密钥密码系统8.message authentication code 信息鉴定码9.unauthorized access control 未授权访问控制10.electronic catalog 电子目录11.electronic money (或cash> 电子货币12.search engine 搜索引擎13.digital signature 数字签名14.user interface 用户界面15. EFT (Electronic Funds Transfer> 电子资金转帐16.public key cryptosystem 公钥密码系统17.PDA (personal digital assistant> 个人数字助理18.hypertext link 超文本链接19.3D image 三维图像20.credit card 信用卡1.vendor-centric model 客户中心模式2.Web site 网站3.Web surfing 网上冲浪4.middleware server 中间件服务5.back-end platform 后端平台6.e-Business strategy 电子商务策略7.binary format 二进制格式8.customer-oriented e-Business system 面向客户的电子商务系统9.ISV (independent software vendor> 独立软件推销商10.information infrastructure 信息基础结构设施1.Web storefront 网上店面2.electronic press kit 电子版发行包3.online retail 在线零售4.multimedia demo 多媒体演示5.online access 联机访问6.value-added services 增值业务7.product promotion 产品推销8.communication medium 通信媒体1.encryption program 加密程序2.deletion command 删除命令3.authorized user 授权的用户。

家庭物联网中多源异构数据存储方案选择及建模

家庭物联网中多源异构数据存储方案选择及建模

家庭物联网中多源异构数据存储方案选择及建模查改琴;褚伟【期刊名称】《现代计算机(普及版)》【年(卷),期】2015(000)005【摘要】随着家庭物联网的不断发展,如何高效存储海量的多源异构家庭物联网数据越来越重要。

由于传统的关系型数据库不能满足家庭物联网海量的多源异构数据的存储需求,家庭物联网多源异构数据存储方案宜采用新兴的面向文档的非关系型数据库MongoDB。

结合MongoDB的应用特点,提出一种适合于MongoDB 的从E-R模型向逻辑模型转换的转换原则,阐述MongoDB数据库逻辑模型的建模过程,并在此基础上构建家庭物联网应用的多源异构数据在MongoDB数据库中的逻辑模型。

%With the development of household Internet of things, how to efficiently store huge amounts of multi-source heterogeneous data becomes more and more important. Since the traditional relational database can not meet the storage needs of multi-source heterogeneous data of household Internet of things, and the document oriented non relational database MongoDB is suitable. By combining with the characteris-tics of MongoDB, puts forward a processing method of transformation from E-R model to logical model ,which is suitable for MongoDB, establishes a logical model of MongoDB database, and builds a logic model of multi-source heterogeneous data storage application of household Internet of things.【总页数】6页(P22-27)【作者】查改琴;褚伟【作者单位】合肥工业大学计算机网络系统研究所,合肥 230009;合肥工业大学计算机网络系统研究所,合肥 230009【正文语种】中文【相关文献】1.多源异构的智能配用电数据存储处理技术 [J], 葛磊蛟;王守相;王尧;郭乃网2.多源异构数据存储模型在业务监控中的研究及应用 [J], 张冠豫;汤吕3.面向运维阶段的多源异构BIM数据存储方法研究 [J], 徐嘉懿; 邓雪原4.多源异构数据存储模型在业务监控中的研究及应用 [J], 张冠豫;汤吕5.海量多源异构地震监测数据存储和共享服务系统 [J], 吕作勇;黄文辉;康英;苏柱金;刘军;欧阳龙斌因版权原因,仅展示原文概要,查看原文内容请购买。

fundamentals of data engineering pdf

fundamentals of data engineering pdf

fundamentals of data engineering"Fundamentals of Data Engineering"refers to the foundational principles and concepts related to the field of data engineering.Data engineering involves the design,development,and management of data architecture,infrastructure,and systems to ensure efficient and reliable data processing.Key topics within the fundamentals of data engineering may include:Data Modeling:Understanding how to structure and represent data in a way that meets the needs of an organization.This involves designing databases,defining tables,and establishing relationships between different data entities.Database Management Systems(DBMS):Knowledge of various types of database systems and how to manage them.This includes relational databases(like MySQL,PostgreSQL),NoSQL databases(like MongoDB,Cassandra),and other data storage technologies.Data Processing:Techniques for processing and transforming data. This includes Extract,Transform,Load(ETL)processes,data cleaning,and data integration methods.Data Warehousing:Designing and managing data warehouses, which are large,centralized repositories of integrated data from various sources.Data warehouses support reporting and business intelligence activities.Big Data Technologies:Understanding and working with technologies that handle large volumes of data,such as Apache Hadoop,Apache Spark,and distributed computing frameworks.Data Quality and Governance:Ensuring the accuracy,completeness, and reliability of data.Implementing governance practices to maintain data integrity and security.Data Pipelines:Building and managing data pipelines for the efficient flow of data from source to destination.This involves orchestrating various data processing tasks.Cloud Data Services:Leveraging cloud platforms for data storage, processing,and analytics.Familiarity with cloud services like AWS,Azure, or Google Cloud Platform.Data Security and Privacy:Implementing measures to protect data from unauthorized access and ensuring compliance with data privacy regulations.Data Analytics and Visualization:Using data for analysis and creating visualizations to communicate insights effectively.Familiarity with tools like Tableau,Power BI,or programming languages like Python and R.Understanding the fundamentals of data engineering is crucial for professionals working in data-related roles,including data engineers, database administrators,and data scientists.It provides the groundwork for effective data management and utilization within an organization.。

计算机毕业设计外文翻译---数据仓库

计算机毕业设计外文翻译---数据仓库

DATA WAREHOUSEData warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today's competitive, fast evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with competition mounting in every industry, data warehousing is the latest must-have marketing weapon —— a way to keep customers by learning more about their needs.“So", you may ask, full of intrigue, “what exactly is a data warehouse?"Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated, historical data for analysis.According to W. H. Inmon, a leading architect in the construction of data warehouse systems, “a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process." This short, but comprehensive definition presents the major features of a data warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems. Let's take a closer look at each of these key features.(1)Subject-oriented: A data warehouse is organized around major subjects, such as customer, vendor, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.(2)Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on..(3)Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.(4)Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data..In sum, a data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making.“OK", you now ask, “what, then, is data warehousing?"Based on the above, we view data warehousing as the process of constructing and using data warehouses. The construction of a data warehouse requires data integration, data cleaning, and data consolidation. The utilization of a data warehouse often necessitates a collection of decision support technologies. This allows “knowledge workers" (e.g., managers, analysts, and executives) to use the warehouse to quickly and conveniently obtain an overview of the data, and to make sound decisionsbased on information in the warehouse. Some authors use the term “data warehousing" to refer only to the process of data warehouse construction, while the term warehouse DBMS is used to refer to the management and utilization of data warehouses. We will not make this distinction here.“How are organizations using the information from data warehouses?" Many organizations are using this information to support business decision making activities, including:(1) increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending).(2) repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, and by geographic regions, in order to fine-tune production strategies.(3) analyzing operations and looking for sources of profit.(4) managing the customer relationships, making environmental corrections, and managing the cost of corporate assets.Data warehousing is also very useful from the point of view of heterogeneous database integration. Many organizations typically collect diverse kinds of data and maintain large databases from multiple, heterogeneous, autonomous, and distributed information sources. To integrate such data, and provide easy and efficient access to it is highly desirable, yet challenging. Much effort has been spent in the database industry and research community towards achieving this goal.The traditional database approach to heterogeneous database integration is to build wrappers and integrators (or mediators) on top of multiple, heterogeneous databases. A variety of data joiner and data blade products belong to this category. When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set. This query-driven approach requires complex information filtering and integration processes, and competes for resources with processing at local sources. It is inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.Data warehousing provides an interesting alternative to the traditional approach of heterogeneous database integration described above. Rather than using a query-driven approach, data warehousing employs an update-driven approach in which information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike on-line transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store. Furthermore, query processing in data warehouses does not interfere with the processing at local sources. Moreover, data warehouses can store and integrate historical information and support complex multidimensional queries. As a result, data warehousing has become very popular in industry.1.Differences between operational database systems and data warehousesSince most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by comparing these two kinds of systems.The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as, purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or “knowledge workers" in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems.The major distinguishing features between OLTP and OLAP are summarized as follows.(1)Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.(2)Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier for use in informed decision making.(3)Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application -oriented database design. An OLAP system typically adopts either a star or snowflake model, and a subject-oriented database design.(4)View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.(5). Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations (since most data warehouses store historical rather than up-to-date information), although many could be complex queries.Other features which distinguish between OLTP and OLAP systems include database size, frequency of operations, and performance metrics and so on.2.But, why have a separate data warehouse?“Since operational databases store huge amounts of data", you observe, “why not perform on-line analytical processing directly on such databases instead of spending additional time and resources to construct a separate data warehouse?"A major reason for such a separation is to help promote the high performance of both systems. An operational database is designed and tuned from known tasks and workloads, such as indexing and hashing using primary keys, searching for particular records, and optimizing “canned" queries. On the other hand, data warehouse queries are often complex. They involve the computation of large groups of data at summarized levels, and may require the use of special data organization, access, and implementation methods based on multidimensional views. Processing OLAP queries in operational databases would substantially degrade the performance of operational tasks.Moreover, an operational database supports the concurrent processing of several transactions. Concurrency control and recovery mechanisms, such as locking and logging, are required to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access of data records for summarization and aggregation. Concurrency control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus substantially reduce the throughput of an OLTP system.Finally, the separation of operational databases from data warehouses is based on the different structures, contents, and uses of the data in these two systems. Decision support requires historical data, whereas operational databases do not typically maintain historical data. In this context, the data in operational databases, though abundant, is usually far from complete for decision making. Decision support requires consolidation (such as aggregation and summarization) of data from heterogeneous sources, resulting in high quality, cleansed and integrated data. In contrast, operational databases contain only detailed raw data, such as transactions, which need to be consolidated before analysis. Since the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain separate databases.数据仓库数据仓库为商务运作提供了组织结构和工具,以便系统地组织、理解和使用数据进行决策。

多域异构数据融合方法

多域异构数据融合方法

多域异构数据融合方法Data fusion is the process of integrating multiple datasets from different sources or domains to create a unified, comprehensive view of the overall data. 数据融合是将来自不同来源或领域的多个数据集集成在一起,以创建统一的、全面的数据视图的过程。

One perspective to consider in the discussion of data fusion methods is the technical aspect. From a technical standpoint, data fusion involves the development of algorithms and techniques for integrating disparate data sources, such as relational databases, sensor networks, and data streams. Technical challenges in data fusion include data formatting, cleaning, and integration, as well as the management of uncertainty and inconsistency in the data. 从技术角度来看,数据融合涉及开发算法和技术,用于集成不同的数据源,例如关系数据库、传感器网络和数据流。

数据融合的技术挑战包括数据格式化、清洗和集成,以及管理数据中的不确定性和不一致性。

Another important perspective to consider is the application of data fusion methods in real-world scenarios. In fields such as healthcare, finance, and environmental monitoring, the integration of data frommultiple sources can provide valuable insights and support decision-making processes. For example, in healthcare, data fusion can be used to integrate electronic health records, medical imaging data, and genetic information to enable personalized medicine and predictive analytics. 在现实世界场景中应用数据融合方法是另一个需要考虑的重要视角。

实施和优化android上的加密文件系统毕业论文外文翻译

实施和优化android上的加密文件系统毕业论文外文翻译

外文原文Implementing and Optimizing an Encryption on AndroidZhaohui Wang, Rahul Murmuria, Angelos StavrouDepartment of Computer ScienceGeorge Mason UniversityFairfax, V A 22030, USA, ,Abstract—The recent surge in popularity of smart handheld devices, including smart-phones and tablets, has given rise to new challenges in protection of Personal Identifiable Information (PII). Indeed, modern mobile devices store PII for applications that span from email to SMS and from social media to location-based services increasing the concerns of the end user’s privacy. Therefore, there is a clear need and expectation for PII data to be protected in the case of loss, theft, or capture of the portable device. In this paper, we present a novel FUSE ( in USErspace) encryption to protect the removable and persistent storage on heterogeneous smart gadget devices running the Android platform. The proposed leverages NIST certified cryptographic algorithms to encrypt the data- at-rest. We present an analysis of the security and performance trade-offs in a wide-range of usage and load scenarios. Using existing known micro benchmarks in devices using encryption without any optimization, we show that encrypted operations can incur negligible overhead for read operations and up to twenty (20) times overhead for write operations for I/Ointensive programs. In addition, we quantified the database transaction performance and we observed a 50% operation time slowdown on average when using encryption. We further explore generic and device specific optimizations and gain 10% to 60% performance for different operations reducing the initial cost of encryption. Finally, we show that our approach is easy to install and configure acrossall Android platforms including mobile phones, tablets, and small notebooks without any user perceivable delay for most of the regular Android applications.Keywords-Smart handheld devices, Full disk encryption, Encrypted , I/O performance.I. BACKGROUND & THREAT MODELA.BackgroundGoogle’s Android is a comprehensive software framework for mobile devices (i.e., smart phones, PDAs), tablet computers and set-top-boxes. The Android operating system includes the system library files, middle-ware, and a set of standard applications for telephony, personal information management, and Internet browsing. The device resources, like the camera, GPS, radio, and Wi-Fi are all controlled through the operating system. Android kernel is based on an enhanced Linux kernel to better address the needs of mobile platforms with improvements on power management, better handling of limited system resources and a special IPC mechanism to isolate the processes. Some of the system libraries included are: a custom C standard library (Bionic), cryptographic (OpenSSL) library, and libraries for media and 2D/3D graphics. The functionality of these libraries are exposed to applications by the Android Application Framework. Many libraries are inherited from open source projects such as WebKit and SQLite. The Android runtime comprises of the Dalvik, a register-based Java virtual machine. Dalvik runs Java code compiled into a dex format, which is optimized for low memory footprint. Everything that runs within the Dalvik environment is considered as an application, which is written in Java. For improved performance, applications can mix native code written in the C language through Java Native Interface (JNI). Both Dalvik and native applications run within the same security environment, contained within the ‘Application Sandbox’. However, native code does not benefit from the Java abstractions (type checking, automated memory management, garbage collection). Table I lists the hardware modules of Nexus S, which is a typical Google branded Android device.Android’s security model differs significantly from the traditional desktopsecurity model [2]. Android applications are treated as mutually distrusting principals; they are isolated from each other and do not have access to each others’ private dat a. Each application runs within their own distinct system identity (Linux user ID and group ID). Therefore, standard Linux kernel facilities for user management is leveraged for enforcing security between applications. Since the Application Sandbox is in the kernel, this security model extends to native code. For applications to use the protected device resources like the GPS, they must request for special permissions for each action in their Manifest file, which is an agreement approved during installation time.Android has adopted SQLite [12] database to store structured data in a private database. SQLite supports standard relational database features and requires only little memory at runtime. SQLite is an Open Source database software library that implements a self-contained, server-less, zeroconfiguration, transactional SQL database engine. Android provides full support for SQLite databases. Any databases you create will be accessible by name to any java class in the application, but not outside the application. The Android SDK includes a sqlite3 database tool that allows you to browse table contents, run SQL commands, and perform other useful functions on SQLite databases. Applications written by 3rd party vendors tend to use these database features extensively in order to store data on internal memory. The databases are stored as single files in the and carry the permissions for only the application that created the be able to access it. Working with databases in Android, however, can be slow due to the necessary I/O.EncFS is a FUSE-based offering encryption on traditional desktop operating systems. FUSE is the supportive library to implement a fully functional in a userspace program [5]. EncFS uses the FUSE library and FUSE kernel module to provide the interface and runs without any special permissions. EncFS runs over an existing base (for example,ext4,yaffs2,vfat) and offers the encrypted . OpenSSL is integrated in EncFS for offering cryptographic primitives. Any data that is written to t he encrypted is encrypted transparently from the user’s perspective and stored onto the base . Reading operations will decrypt the data transparently from thebase and then load it into memory.B.Threat ModelHandheld devices are being manufactured all over the world and millions of devices are being sold every month to the consumer market with increasing expectation for growth and device diversity. The price for each unit ranges from free to eight hundred dollars with or without cellular services. In addition, new smartphone devices are constantly released to the market which results the precipitation of the old models within months of their launch. With the rich set of sensors integrated with these devices, the data collected and generated are extraordin arily sensitive to user’s privacy. Smartphones are therefore data-centric model, where the cheap price of the hardware and the significance of the data stored on the device challenge the traditional security provisions. Due to high churn of new devices it is compelling to create new security solutions that are hardware-agnostic.While the Application Sandbox protects applicationspecific data from other applications on the phone, sensitive data may be leaked accidentally due to improper placement, resale or disposal of the device and its storage media (e.g. removable sdcard). It also can be intentionally exfiltrated by malicious programs via one of the communication channels such as USB, WiFi, Bluetooth, NFC, cellular network etc.Figure 1. Abstraction of Encryption on AndroidFor example, an attacker can compromise a smartphone and gain full control of it by connecting another computing device to it using the USB physical link [33]. Moreover, by simply capturing the smartphones physically, adversaries have access to confidential or even classified data if the owners are the government officials ormilitary personnels. Considering the cheap price of the hardware, the data on the devices are more critical and can cause devastating consequences if not well protected. To protect the secrecy of the data of its entire lifetime, we must have robust techniques to store and delete data while keeping confidentiality.In our threat model, we assume that an adversary is already in control of the device or the bare storage media. The memory-borne attacks and defences are out of the scope of this paper and addressed by related researches in Section II. A robust data encryption infrastructure provided by the operating system can help preserve the confidentiality of all data on the smartphone, given that the adversary cannot obtain the cryptographic key. Furthermore, by destroying the cryptographic key on the smartphone we can make the data practically irrecoverable. Having established a threat model and listed our assumptions, we detail the steps to build encryption on Android in the following sections.V. PERFORMANCEA. ExperimentalSetup For our experiments, we use the Google’s Nexus S smartphone device with Android version 2.3 (codename Gingerbread). The bootloader of the device is unlocked and the device is rooted. The persistent storage on Nexus S smartphones is a 507MB MTD (Memory Technology Device). MTD is neither a block device not a character device, and was designed for flash memory to behave like block devices. In addition to the MTD device, Nexus S has a dedicated MMC (MultiMediaCard, which is also a NAND flash storage technique) device dedicated to system and userdata partition, which is 512MB and 1024MB respectively. Table II provides the MTD device and MMC device partition layout.In order to evaluate this setup for performance, we installed two different types of benchmarking tools. We used the SQLite benchmarking application created by RedLicense Labs - RL Benchmark Sqlite. To better understand finegrained low level operations under different I/O patterns, we use IOzone [7], which is a popular open source micro benchmarking tool. It is to be noted that these tools are both a very good case study for ‘real-use’ as well. RL Benchmark Sqlite behaves as anyapplication that is database-heavy would behave. IOzone uses the direct intensively just like any application would, if it was reading or writing files to the persistant storage. All other applications which run in memory and use the CPU, graphics, GPS or other device drivers are irrelevant for our storage media tests and the presence of encrypted will not affect their performance.IOzone is a benchmark tool [7]. The benchmark generates and measures a variety of and has been widely used in research work for benchmarking various on different platforms. The benchmark tests performance for the generic , such as Read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread ,mmap, aio read, aio write.IOzone has been ported to many platforms and runs under various operating systems. Here in our paper, we use ARM-Linux version (Android compatible) of latest IOzone available and focus on the encryption overhead. The cache effect is eliminated by cold rebooting the device for each run of IOzone and RL Benchmark Sqlite. The device is fully charged and connected to external USB power while in experiments. We collect the data and plot the average results of the 5 runs in the figures in all the following experiments.A.ThroughputPerformance of EncFS In this section, we present the IOzone performance results for random read and write operations on userdata partition. The benchmark is run for different and for each , with different record lengths. The maximumTable IIISQLITE PERFORMANCE ON GOOGLE NEXUS Sis selected as 4MB due to the observation that 95% of the user data files are smaller than 4MB on a typical Android system.Fig 3 compares the throughput for four typical operations, namely read, random read, write and random write. The IOzone experiments are run on the original ext4 and EncFS with different AES key lengths. Fig 3 shows for read operation, EncFS performs the same with original ext4. However, for random read, write, random write, EncFS only gives 3%, 5%, 4% of the original throughput respectively. Our analysis shows the encryption/decryption contributes the overhead and is the expected trade-off between security and performance. The buffered read in EncFS makes the read operation only incur marginal overhead. However, for random read, the need for the data blocks alignment during decryption results in slower throughput. For different key length, the 256-bits key only incurs additional 10% overhead comparing to 128-bits key for better security. In particular, AES-256 runs 12866KB/s,8915KB/s, 9804KB/s at peak for random read,write and random write respectively while AES-128 runs 14378KB/s, 9808KB/s, 10922KB/s. The performance loss of a longer key length trading better security properties is only marginal to the performance loss of the encryption scheme. Optimizations can compensate such key-length overhead as illustrated in Section V-D. Based on this observation, AES-256 is recommended and used as default in the following subsection unless otherwise mentioned explicitly.Similarly, sdcard partition gives the identical pattern with slightly different value. Due to the fact that the sdcard partition shares the same underlying physical MMC device with userdata partition as listed in Table II, our experiment results demonstrates the original vfat performs 16% faster than ext4 for read and random read operation while ext4 outperforms vfat 80% and 5% for write and random write operations respectively. However, comparing different is out of our focus in this paper. We observed different throughput values and overhead patterns on other devices such as Nexus One, HTC Desire and Dell Streak which use a removable sdcard as separate physical medium to internal NAND device. Both AES-128 and AES-256 throughput on sdcard are statistically identical to the ones on userdata partition given a 95% confidence interval. Such results show that the scheme of encryption in EncFS(e.g. internal data block size, key length) and its FUSE IO primitives are the bottleneck of the performance regardless of the underlying . We suggest corresponding optimizations in Section V-D.In addition to the basic I/O operations, we look at the read operation in detail under different record size before and after encryption. In particular, we plot the 3D surface view and contour view. In the 3D surface graph, the x-axis is the record size, the y-axis is the throughput in Kilobytes per second, and the z-axis is the . The contour view presents the distribution of the throughput across different record sizes and . In a sense, this is a top-view of the 3D surface graph. Figure 4 and 5 show the throughput when IOzone read partial of the the beginning. Figure 4 shows the default ext4 in Android 2.3 favors bigger record size and for better throughput. The performance peak centers in the top-right corner in the contour view of the 3-D graph. However, after placing EncFS, the performance spike shifts to the diagonal where the record size equals to . This is an interesting yet expected result because of the internal alignment of the in decryption.To better understand the performance of our encryption under Android’s SQLite IO access pattern, we present the database transactions benchmark in the next subsection, which is more related to the users’ experiences.C.SQLite Performance BenchmarkingIn addition to the IOzone micro benchmark results in last subsection, we measure the time for various typical database transactions using the RL Benchmark SQLite Performance Application in the Android market [11]. Table III groups the read and write operations and lists the results in detail.We consider that random read and write is a fair representation of database I/O operations in our scenario. This is due to the fact that for SQLite, the database of one or more pages. All reads from and writes to the database at a page boundary and all reads/writes are an integer number of pages in size. Since the exact page is managed by the database engine, only observe random I/O operations.After incorporating the encryption , the database-transactions-intensive apps slows down from 81.68 seconds to 128.66 seconds for the list of operations as described in the Table III. The read operations reflected by select database transactions shows the consistent results with IOzone result: the EncFS buffers help the performance. However, any write operations resulting from insert, update, or drop database transactions will incur 3% to 401% overhead. The overall overhead is 58%. This is the trade-off between security and performance.中文翻译实施和优化android上的加密文件系统王朝晖,拉胡尔Murmuria,安吉罗斯Stavrou计算机科学系乔治·梅森大学费尔法克斯,VA 22030,USA,,摘要:比来激增的智能手持设备,包孕智能手机和平板电脑的普及,已经引起了在庇护个人身份信息(PII),新的挑战。

相关主题
  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

Heterogeneous Relational Databases for a Grid-enabled Analysis Environment Arshad Ali1, Ashiq Anjum1,4, Tahir Azim1, Julian Bunn2, Saima Iqbal2,4, Richard McClatchey4, Harvey Newman2, S. Yousaf Shah1, Tony Solomonides4, Conrad Steenberg2, Michael Thomas2,Frank van Lingen2, Ian Willers31National University of Sciences & Technology, Rawalpindi, PakistanEmail: {arshad.ali, ashiq.anjum, tahir, yousaf.shah}@.pk2California Institute of Technology, Pasadena, USAEmail: {Julian.Bunn, fvlingen}@, {newman, conrad, thomas}@ 3European Organization for Nuclear Research, Geneva, SwitzerlandEmail: {Ian.Willers, Saima.Iqbal}@cern.ch4University of the West of England, Bristol, UKEmail: {Richard.McClatchey, Tony.Solomonides}@AbstractGrid based systems require a database access mechanism that can provide seamless homogeneous access to the requested data through a virtual data access system, i.e. a system which can take care of tracking the data that is stored in geographically distributed heterogeneous databases. This system should provide an integrated view of the data that is stored in the different repositories by using a virtual data access mechanism, i.e. a mechanism which can hide the heterogeneity of the backend databases from the client applications.This paper focuses on accessing data stored in disparate relational databases through a web service interface, and exploits the features of a Data Warehouse and Data Marts. We present a middleware that enables applications to access data stored in geographically distributed relational databases without being aware of their physical locations and underlying schema. A web service interface is provided to enable applications to access this middleware in a language and platform independent way. A prototype implementation was created based on Clarens [4], Unity [7] and POOL [8]. This ability to access the data stored in the distributed relational databases transparently is likely to be a very powerful one for Grid users, especially the scientific community wishing to collate and analyze data distributed over the Grid. 1. IntroductionIn a geographically distributed environment like the Grid, database resources can be very diverse because they are developed by different vendors, run on different operating systems, support different query languages, possess different database schemas and use different technologies to store the same type of data. Furthermore, this data is accessed by applications developed for different platforms in varying development environments. Currently our Grid environment is using two major formats for the storage of data: file-based data and relational databases. Sophisticated systems are in place to track and manage files containing data and replicated at multiple storage sites. These systems generally rely on cataloging services to map file names with their physical storage locations and access mechanism. By finding out the physical location of a file and its access protocol from the catalog, the user can easily gain access to the data stored in the file. Examples of cataloging services in use for this purpose are the Replica Location Service (European DataGrid (EDG) [1] and Globus [2]), and the POOL File Catalog [3].However, besides file-based data, significant amounts of data are also stored in relational databases. This is especially true in the case of life sciences data, astronomy data, and to a lesser extent, high-energy physics data. Therefore, a system that provides access to multiple databases can greatly facilitate scientists and users in utilizing such data resources.Software for providing virtualized access to the data, in a similar way to files, is only beginning to appear. Most data grids currently do not incorporate much support for data stored in multiple, distributed databases. As a result, users have to send queries for data to each of the databases individually, and then manually integrate the returned data. This integration of data is essential in order to obtain a consistent result of a query submitted against the databases.In this paper, we present a system that has been developed to provide Grid users efficient access to globally distributed, relational databases. In our prototype, the crucial issue of integration of data is addressed in three stages: initially data is integrated at the data warehouse level where data is extracted from the normalized schema of the databases, and loaded into the denormalized schema of the data warehouse. Secondly, materialized views of the data are replicated from the data warehouse and stored in data marts. Finally, data generated from the distributed queries, which run over the data marts, is integrated and these integrated results are presented to the clients. A Clarens [4] based web service interface is provided in order to enable all kinds of (simple and) complex clients to use this prototype over the web conveniently. Furthermore, in order to avoid the performance issue of centralized registration of data marts and their respective schema information, the Replica Location Service (RLS) is used.The paper is organized as follows. In Section 2, we briefly describe the background of the issue addressed, and describe the requirements in detail. In Section 3, we give a brief overview of previous related work in this direction, before plunging into a full description of the architecture and design of the system in Section 4. Performance statistics are presented in Section 5, and the current status of the work and possible extensions for the future are mentioned in Section 6. We finally conclude our discussion in Section 7.2. BackgroundThe Large Hadron Collider (LHC) [5], being constructed at the European Organization for Nuclear Research (CERN) [6], is scheduled to go online in 2007. In order to cater for the large amounts of data to be generated by this enormous accelerator, a Grid-based architecture has been proposed which aims to distribute the generated data to storage and processing sites. The data generated is stored both in the form of files (event data) and relational databases (non-event data). Non-event data includes data such as a detector’s calibration data and conditions data.While sophisticated systems are already in place for accessing the data stored in distributed files, software for managing databases in a similar way is only beginning to be developed. In this paper, we propose a web service based middleware for locating and accessing data that is stored in data marts. A prototype was developed based on this proposed middleware. This prototype provides an integrated view of the data stored in distributed heterogeneous relational databases through the Online Transaction Processing (OLTP) system of a data warehouse. Furthermore, this prototype provides the facility to distribute an SQL query through a data abstraction layer into multiple sub-queries aimed at the data marts containing the requested tables, and to combine the outcome of the individual sub-queries into a single consistent result.The non-event data from the LHC will be generated at CERN and, like event data, will be distributed in multiple locations at sites around the world. Most of the data will be stored at the Tier-0 site at CERN, and at the seven Tier-1 sites. Smaller subsets of this data will be replicated to Tier-2 and Tier-3 sites when requested by scientists for analysis. Moreover, the database technologies used at the different tiers are also different. Oracle, for instance, is the most popular RDBMS system used at the Tier-0 and Tier-1 sites. On the other hand, MySQL and Microsoft SQL Server is the more common technology used at Tier-2 and Tier-3 sites. SQLite is the database favored by users who wish to do analysis while remaining disconnected over long periods of time (laptop users, for instance).In order to manage these replicated sets of data, a system is required that can track the locations of the various databases, and provide efficient, transparent access to these datasets when queries are submitted to it by end users. In addition, with the growing linkages of Grid computing and Web service technologies, it is desirable to provide a Web service interface to the system so that client applications at every tier can access these services conveniently over the Web.3. Related WorkThe Grid is considered as a data-intensive infrastructure. Users expect the Grid to provide efficient and transparent access to enormous quantities of data scattered in globally distributed, heterogeneous databases. For this reason, integration of data retrieved from these databases becomes a major research issue.Efforts have been made to solve this issue but performance bottlenecks still exist.One of the projects targeting database integration is the Unity project [7] at the University of Iowa research labs. This project provides a JDBC driver, which uses XML specification files for integrating the databases. Then using the metadata information from the XML specification files, connections are established to the appropriate databases. The data is thus accessed without previous knowledge of the physical location of the data. Unity, however, does not do any load distribution, which causes some delays in query processing. As a result, if there is a lot of data to be fetched for a query, the memory becomes overloaded. In addition, it does not handle joins that span tables in multiple databases. In our work, we have used the Unity driver as the baseline for development. For our prototype, we have enhanced the driver with several features that are described in detail in Section 4.Another project called the POOL Relational Abstraction Layer (POOL-RAL) [8], being pursued at CERN, provides a relational abstraction layer for relational databases and follows a vendor-neutral approach to database access. However, POOL provides access to tables within one database at a time, which puts a limit on the query and does not allow parallel execution of a query on multiple databases.OGSA Distributed Query Processing (DQP) [9] is another project for distributed query processing on Grid-based databases. It distributes join operations on multiple nodes within a grid to take full advantage of the grid’s distributed processing capabilities. However, OGSA-DQP is strongly dependent on the Globus Toolkit 3.2, which limits it to Globus only and makes it platform dependent.IBM’s Discovery Link [10] is another project aimed at carrying out integration of relational databases and other types of data sources for life sciences, genetics and bio-chemical data sources. However, due to the domain specific nature of this project, it cannot be used directly for HEP databases. ALDAP and SkyQuery are two other similar projects with the same basic objectives, but aimed at Astronomy databases.4. System Architecture and DesignThe distributed architecture of our system consists of the basic components described in the following discussion. Figure 1 shows an architectural diagram of the developed prototype.The architecture consists of two main parts: The first part (the lower half of the diagram) retrieves data from various underlying databases, integrates it into a data warehouse, and replicates data from the data warehouse to the data marts, which are locally accessible by the client applications through the web-service interface. The second part (the upper half of the diagram) provides lightweight Clarens clients web service-based access to the data stored in the data marts.4.1.Data SourcesThe developed prototype supports Oracle and MySQL relational source databases. A normalized schema was developed for these source databases to store HBOOK [11] Ntuples data. The following example can help to understand the meaning of the Ntuples. Suppose that a dataset contains 10000 events and each event consists of many variables (say NVAR=200), then an Ntuple is like a table where these 200 variables are the columns and each event isa row. Furthermore, these databases were distributed over a Tier-1 center at CERN and a Tier-2 center at CALTECH.4.2. Data WarehouseThe currently available approaches, as mentioned in section 3, provide different implementations to access data from distributed heterogeneous relational databases. Each approach provides a different implementation based on different driver requirements, connection requirements and database schema requirements of the databases. It means that for ‘N’ number of database technologies with ‘S’ number of database schemas, current approaches require ‘NxS’ number of implementations to be provided, in order to access data from globally distributed, heterogeneous databases. Furthermore, ‘NxS’ implementations also create a performance issue because each time access to data from these databases is requested, all the related meta-data information i.e. database schema, connection and vendor specific information, has to be parsed in orderto return a reliable and consistent result of the query.In order to resolve this performance issue, we propose the use of a data warehouse.Figure 1. Architectural Diagram.A data warehouse is a repository, which stores data that is integrated from heterogeneous data sources, for efficient querying, analysis and decision-making. Data warehouse technology is very successfully implemented in various commercial projects and is highly supported by vendors like ORACLE.For the developed prototype, a denormalized star schema was developed in the ORACLE database. An Extraction, Transformation, Transportation and Loading (ETL) process was used to populate the data warehouse. In this ETL process, data was initially extracted from the distributed relational data sources, then integrated and transformed according to the denormalized database schema of the data warehouse. In this prototype, data streaming technology was used to perform the ETL process. Finally, this transformed data is loaded into the warehouse. In this prototype, we created views on the data stored in the warehouse to provide read-only access for scientific analysis.4.3. Data MartsA remote centralized data warehouse cannot be considered a good solution for an environment like the Grid, which is seen as a pool of distributed resources. In the context of databases, efficient accessibility of distributed databases can be achieved by making the required data available locally to the applications.Thus, in order to utilize the features of the data warehouse successfully in a Grid environment without creating a centralized performance bottleneck, views are created on the integrated data of the data warehouse, and materialized on a new set of databases, which are made available locally to the applications. These databases are termed as data marts. Data marts are databases that store subsets of replicated data from the centralized data warehouse.For the developed prototype, we create data marts, which are supported by MySQL, MS-SQL, ORACLE and SQLite. These databases are accessed using either through the POOL-RAL interface or using JDBC drivers, depending on whether or not they are supported by the POOL-RAL libraries.4.4. XSpec FilesXSpec stands for “XML Specifications” files. These files are generated from the data sources using tools provided by the Unity project. Each database has its own XSpec file, which contains information about the schema of the database, including the tables, columns and relationships within the database. These logical names form a kind of data dictionary for the database, and this data dictionary is used for determining which database to access to fulfill a client’s request. The client does not need to know the exact name of a database, tables in the database or names of the columns in the table. The client is provided this data dictionary of logical names, and he uses these logical names without any knowledge of the physical location of the data and their actual names. The query processing mechanism automatically maps logical names to physical names and divides the query to be executed among the individual databases.These XSpec files are of two types:4.4.1. Lower Level XSpec. The Lower Level XSpec refers to each individual database’s XSpec file, which is generated from the original data source and contains the schema and all the other information mentioned above.4.4.2. Upper Level XSpec. The Upper Level XSpec file is generated manually using the Lower Level XSpec files. This file just contains the URL for each database, the driver that each database is using and the name of the Lower Level XSpec for each database. There is only one Upper-Level XSpec file, whereas the number of lower-Level XSpec depends on the number of data sources.4.5. Data Access LayerThis layer processes the queries for data sent by the clients containing joins of different tables from different databases (data marts), and divides them into sub-queries, which are then distributed on to the underlying databases.The data access layer looks for the tables from which data is requested by the client. If the tables are locally registered with the JClarens server, the data access layer decides which of the two modules (POOL-RAL module or Unity driver) to forward the query to by finding out which databases are to be queried. If a database is supported by the POOL-RAL, the query is forwarded to the POOL RAL layer; otherwise, the query is forwarded to the JDBC driver. If the tables requested are not registered with the JClarens server, the Replica Location Service (RLS) is used to lookup the physical locations (hosting servers) of the tables. The RLS server provides the URL of the remote JClarens server with which the tables are registered. The queries are then forwarded to the remote servers, which perform the query processing, and send the retrieved data back to the original server, where the queries were submitted.4.6. Unity DriverAs mentioned in Section 2 (Related Work) above, the Unity driver enables access to and integration of data from multiple databases. We have further enhanced this driver to be able to apply joins on rows extracted from multiple databases.While accessing the underlying databases, the sub-queries meant for unsupported databases are accessed using the Unity driver, whereas the sub-queries concerned with POOL-supported databases are processed through the POOL RAL. The data retrievedthrough each of the sub-queries is finally merged intoa single 2-D vector, and returned to the client.4.7. POOL RAL WrapperThe databases not supported by the POOL-RAL are handled by the JDBC driver. On the other hand, queries to databases supported by the POOL-RAL are forwarded through a wrapper layer to the POOL RAL libraries for execution.The POOL RAL is implemented in C++ whereas JClarens and its services are implemented in Java. Therefore, to make the POOL libraries work with the JClarens based service, a JNI (Java Native Interface) wrapper was developed which exposes two methods:1. One method initializes a service handler for a new database using a connection string, a username and a password and adds it to a list of previously initialized handles.2. The other method takes as input a connection string, an array of select fields, an array of table names, and a ‘where’ clause string, and returns a 2D array containing the results of the query execution on the database represented by the connection string.4.8. Replica Location ModuleThe Replica Location module is included in the project to distribute the load and reduce the query processing time, by enabling multiple instances of the database service to host smaller subsets from the entire collection of databases, and then collaborating with each other to provide access to one or more of those databases. In this way, load can be distributed over as many servers as required, instead of putting it entirely on just one server registering all the databases. This can also potentially enable us to achieve a hierarchical database hosting service in parallel with the tieredRLSThis module uses a central RLS Server that contains the mapping of table names with replica servers’ URLs. Each service instance publishes information about the databases and the tables it is hosting to the central RLS server. This central RLS server is contacted when the data access layer does not find a locally registered table.4.9. Tracking Changes in SchemaThe system is also able to track changes made to the schema of any database in the system. This feature enables the system to update itself according to changes in the schema of any of the databases.The algorithm works as follows. After a fixed interval of time, a thread is run against the back-end databases to generate a new XSpec for each database. The size of the newly created XSpec is compared against the size of the older XSpec file. If the sizes are equal, the files are compared using their md5 sums. If there is any change in the size or md5 sum of the file, the older version of the XSpec is replaced by the new one. The JClarens server then uses the new XSpec file to update the schema it is using for that database.4.10. Plug-in DatabasesThis feature enables databases to be added at runtime to the system. The server is provided the URL of the databases’ XSpec file, the database driver name, and the database location. The server then downloads the file, parses it, and retrieves the metadata about the database. Using this metadata, the server establishes a connection with the database using the appropriate JDBC driver. When the connection is established, the server updates itself with the information about the tables contained in that database.5.Performance ResultsThe developed prototype was tested in three stages: Stage 1: Data is extracted from the source databases, transformed according to the denormalized schema requirements of the data warehouse, and then streamed into the data warehouse.Stage 2: Data is extracted from the views, which were created on the data stored in the data warehouse, and materialized from the views (through data streaming) into the local databases i.e. data marts.Stage 3: Response time of the distributed query, which runs through a JClarens based web interface, is measured against the locally available data marts.5.1.Stage 1 and 2 results:Figure 4: Data extracted from sourcedatabases and loaded into the dataFigure 5. Views extracted from the data warehouse and materialized into data marts.Stage 1 and 2 tests were carried out by streaming data of different sizes from data sources to the data warehouse and from data warehouse to the data marts respectively. The tests were carried out over a 100 Mbps Ethernet LAN. The respective data transfer time was plotted against the size of the transferred data. These plots, shown in figure 4 and 5, show the average data transfer time i.e. average of observations taken on different days at different time to measure the data transfer rate with different network traffic load. This time includes the time taken by a class to connect with the respective databases and, to open and close the stream for the respective SQL statements.Each of the graphs shown in figure 4 and 5 are comprised of two plots, because in our prototype every time data was retrieved from a database it was first placed into a temporary file (data extraction) and then from this temporary file, data was stored into the other databases (data loading). In Figure 4, the lower line of the graph was plotted for the data extracted from the normalized data sources, transformed according to the denormalized schema of the data warehouse, and imported into a temporary file. The upper line shows the time taken to transfer data from the above generated temporary files to the data warehouse. Similarly, in Figure 5 the lower line of the graph shows the data retrieved from views, which were created on the data warehouse. The upper line shows the time taken to transfer the data from the generated temporary files and materialized into the data marts. Of course, the use of the temporary staging file during the process is a performance bottleneck, and we are working on a cleaner way of loading the warehouse directly from the normalized databases.5.2. Stage 3 results:We present here two aspects of the performance of the service. First, we measure the time the system takes to respond to a set of queries, each of which requires the involvement of a different number of Clarens servers and different number of databases. Secondly, we determine how the system throughput changes with different numbers of requested rows.The tests were carried out on a 100 Mbps Ethernet LAN over two single-processor Intel Pentium IV (1.8 and 2.4 GHz) machines, with 512 MB and 1 GB of RAM respectively. The operating system on each machine was Redhat Linux 7.3. A Clarens server (with the data access service installed) was installed on each of the machines. The two servers were configured to host a total of 6 databases, with a total of nearly 80,000 rows and 1700 tables. The databases were equally shared between a Microsoft SQL Server on Windows 2000, and a MySQL database server.Table 1: Query Response TimeNumber ofClarensserversaccessedQueryDistributed(Yes/No)ResponseTimeNumberof tablesaccessed1 No 38ms11 Yes487.5ms22 Yes 594ms4 Table 1 (Query response time) shows the time in which the system responds to the client’s queries. The column “Number of Clarens servers” shows the number of Clarens servers that had to be accessed in order to retrieve the requested rows of data. The “Query Distributed (Yes/No)” columns shows whether or not the query had to fetch data from multiple databases. The “Number of tables accessed” field represents the number of tables that were requested in that query. Although the response time for queries executing over multiple databases and servers are more than 10 times slower, it is inevitable because it involves determining which server to connect to using RLS, connecting and authenticating with several databases or servers, and integrating the results.Figure 6. Response time versus number ofrows requestedWe also collected performance statistics to determine how the system scales with increasing number of rows requested by clients. For this purpose, we selected a number of queries to be run against the ntuple data in our databases, each of which was to return different number of rows. We determined the number of rows returned for each query, and measured the response time for each query to execute. The graph depicting the response time of the queries versus the number of rows returned is shown in Figure 6.The graph shows that there is a linear increase in the response time of the system as a result of an increase in the number of requested rows. Increasing the number of rows from 21 to 2551 only increases the response time from about 300 to 700 ms. This shows that the system is scalable to support large queries. In addition, it is comparable to the performance reported by the OGSA-DAI project [12]. However, we are working on further improving the algorithms and implementation, to enable even better performance for very large queries.6. Current Status and Future WorkA prototype has been developed, which is installable as an RPM on Redhat 7.3-based systems. The prototype possesses all of the features described above. However, some of the features such as joins spanning multiple databases have not been tested yet for all possible scenarios. Unit tests have been written for the system to check the integrity of the system. A plug-in for the Java Analysis Studio (JAS) [13] was also developed to submit queries for accessing the data and visualizing the results as histogramsFuture directions will be to ensure the efficiency of the system and enhance the performance. In addition, we will be testing the system for query distribution on geographically distributed databases in order to measure its performance over wide area networks. We are also working on the design of a system that could decide the closest available database (in terms of network connectivity) from a set of replicated databases. Another interesting extension to the project could be the study of how tables from databases can be integrated with respect to their semantic similarity.7. ConclusionWe have presented a system that enables heterogeneous databases distributed over the N-tiered architecture of the LHC experiment to present a single, simplified view to the user. With a single query, users can request and retrieve data from a number of databases simultaneously. This makes the (potentially) large number of databases at the backend transparent to the user while continuing to give them satisfactory performance.8. References[1] DataGrid Project. http://eu-datagrid.web.cern.ch/eu-datagrid/ss[2] Globus Replica Location Service. /rls[3] D. Düllmann, M. Frank, G. Govi, I. Papadopoulos, S. Roiser. “The POOL Data Storage, Cache and Conversion Mechanism”. Computing in High Energy and Nuclear Physics 2003, San Diego, paper MOKT008.[4] C. Steenberg, H.Newman, F. van Lingen et. al. “Clarens Client and Server Applications”. Computing in High Energyand Nuclear Physics 2003, San Diego, paper TUCT005.[5] The Large Hadron Collider Homepage. http://lhc-new-homepage.web.cern.ch/lhc-new-homepage/[6] European Organization for Nuclear Research (CERN), Switzerland, http://public.web.cern.ch/Public/Welcome.html[7] Ramon Lawrence, Ken Barker. “Unity - A Database Integration Tool”. TRLabs Emerging Technology Bulletin, December, 2000.[8] POOL Persistency Framework. http://pool.cern.ch.[9] M. N. Alpdemir, A. Mukherjee, N.W. Paton, et. al. “Service Based Distributed Query Processing on the Grid”. Proceedings of the First International Conference on Service Oriented Computing, pages 467-482. Springer, 15-18 December 2003.[10] IBM Discovery Link Project: /journal/sj/402/haas.html[11] R Brun, M Goossens. “HBOOK-Statistical Analysisand Histogramming”. CERN Program Library Long Write-Ups Y250, CERN Geneva, Switzerland.[12] M. Jackson, M. Antonioletti, N.C. Hong, et. al. “Performance Analysis of the OGSA-DAI software”. OGSA-DAI mini-workshop, UK e-Science All Hands Meeting, Nottingham, September 2004.[13] Java Analysis Studio. 。

相关文档
最新文档