ArchitectureforBigDataOrca一个处理大数据的模块化的

  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

GPDB architecture
GPDB architecture工作原理
储存和处理如此大量的数据是通过将负载分布到几个独立的服务器或主 机,创建一系列个体数据库,一起协作,提供一个单一的数据库映像。
主管服务器(master severs)是GPDB的入口Baidu Nhomakorabea也就是用户连接数据库提 交SQL语句的地方。
database system to register a metadata provider (MD Provider) that is responsible for serializing metadata into DXL before being sent to Orca.
The database system needs to include translators that consume/emit data in DXL format. Query2DXL translator converts a query parse tree into a DXL query, while DXL2Plan translator converts a DXL plan into an executable plan.
Memo是一个由group的集合。每一个group包含一类逻辑等价式。
Memo groups capture the different sub-goals of a query (e.g., a filter on a table, or a join of two tables). 将查询拆解成子目标,由group获得。
Transformations
Plan alternatives are generated by applying transformation rules that can produce either equivalent logical expressions (e.g., InnerJoin(A,B) →InnerJoin(B,A)), or physical implementations of existing expressions (e.g., Join(A,B) → HashJoin(A,B)).
The implementation of such translators is done completely outside Orca, which allows multiple to use Orca by providing the appropriate translators.
Orca architecture
Interaction of Orca with database system
Figure 2 shows the interaction between Orca and an external database system. The input to Orca is a DXL query.
Search and Job Scheduler
Orca uses a search mechanism to navigate through the space of possible plan alternatives and identify the plan with the least estimated cost.
Orca: A Modular Query Optimizer Architecture for Big Data Orca: 一个处理大数据的模块化 的查询优化体系
Sigmod 2014
报告人 万丽蓉
研究背景
In this paper we present the architecture of Orca, the new query optimizer for all Pivotal ( Pivotal 公司) data management products,including Pivotal Greenplum Database (GPDB) and Pivotal HAWQ.
Property Enforcement
Orca includes an extensible framework for describing query requirements and plan characteristics based on formal property specifications. Properties have different types including logical properties (e.g., output columns), physical properties (e.g., sort order and data distribution), and scalar properties (e.g., columns used in join conditions).
Group members, called group expressions, achieve the group goal in different logical ways (e.g., different join orders).
Each group expression is an operator that has other groups as its children.
The output of Orca is a DXL plan. During optimization, the database system can be
queried for metadata (e.g., table definitions). Orca abstracts metadata access details by allowing
efficient multi-core aware scheduler that distributes individual negrained optimization subtasks across multiple cores for speed-up of the optimization process.
DXL:查询优化器的解耦需要建立一个沟通机 制与数据库系统通信来处理查询。Orca体系包 括一个框架,用于数据库系统和优化器之间的 数据交换,这个框架叫做Data eXchange Language (DXL)。
Interaction of Orca with database system
translator
Metadata Cache
Since metadata (e.g., table definitions) changes infrequently, shipping it with every query incurs an overhead. Orca caches metadata on the optimizer side and only retrieves pieces of it from the catalog if something is unavailable in the cache, or has changed since the last time it was loaded in the cache.
HAWQ 在内核中使用Orca体系设计高效的查询 方案,降低Hadoop集群的数据访问成本。
HAWQ把最新的基于成本的优化和 Hadoop的可扩展性容错性结合了起来, 在PB级规模上实现了数据的交互处理。
Interaction of Orca with database system
Orca is the new query optimizer for Pivotal data management products, including GPDB and HAWQ.
The search mechanism由专门的Job Scheduler来创建 独立或者并行的work units来完成,大概分为三个步骤: exploration where equivalent logical expressions are generated, implementation where physical plans are generated, and optimization, where required physical properties (e.g., sort order) are enforced and plan alternatives are costed.
During query optimization, each operator may request specific properties from its children. An optimized child plan may either satisfy the required properties on its own (e.g., an IndexScan plan delivers sorted data), or an enforcer (e.g., a Sort operator) needs to be plugged in the plan to deliver the required property. The framework allows each operator to control enforcers placement based on child plans' properties and operator's local behavior.
Metadata cache also abstracts the database system details from the optimizer, which is particularly useful during testing and debugging.
Orca特点
Modularity(模块化) Extensibility(可扩展性) Multi-core ready:Orca deploys a highly
主管服务器负责协调与其他数据库实例共同工作,处理和存储数据。每 一个数据库实例成为一个segment。
当一个查询被提交给主管服务器,查询将被优化并分解为更加细化的任 务,分布到每一个segment进行处理,共同提交最终结果。
网络层负责segments之间的进程进行通讯。
HAWQ
HAWQ , a massively parallel SQL-compliant engine on top of HDFS(Hadoop分布式系 统).
The results of applying transformation rules are copied-in to the Memo, which may result in creating new groups and/or adding new group expressions to existing groups. Each transformation rule is a selfcontained component that can be explicitly activated/deactivated in Orca configurations.
Memo.
The space of plan alternatives generated by the optimizer is encoded in a compact in-memory data structure called the Memo .
The Memo structure consists of a set of containers called groups, where each group contains logically equivalent expressions.
相关文档
最新文档