Kettle Usage Notes


Transformation Example 2: Hive Table Operations
• Supports Hive table operations; combined with the Hadoop File Output step, data can be imported from a relational database into a Hive table
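A minimal sketch of the second leg of this flow, assuming the transformation has already written tab-delimited files into HDFS via Hadoop File Output (the table name and HDFS path are illustrative):

    # Attach the files produced by Hadoop File Output to a Hive table.
    # LOAD DATA INPATH moves the files into the table's warehouse location.
    hive -e "
      CREATE TABLE IF NOT EXISTS demo_orders (id INT, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
      LOAD DATA INPATH '/user/kettle/output/orders' INTO TABLE demo_orders;
    "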
Transformation Example 3: Data Synchronization
Hyperbase external table
Adjust the table's column order and types
• Supports data update and synchronization
▫ The column order and data formats of the two tables must match exactly
▫ Note that Hyperbase ids sort lexicographically, while RDB ids do not necessarily
Kettle: Brief Usage Notes
2014/10/06
Setup and Pitfalls [1]
• You must edit the plugin.properties file in the pentaho-big-data-plugin directory
▫ Set the value of active.hadoop.configuration to hadp20
• MySQL connections may fail out of the box; copy mysql-connector-java-***-bin.jar into the lib directory
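A minimal shell sketch of both fixes (the install root and connector version are assumptions):

    # Assumed Kettle install root; adjust to your environment.
    KETTLE_HOME=/opt/data-integration

    # 1. Point the big-data plugin at the hadp20 Hadoop configuration.
    sed -i 's/^active\.hadoop\.configuration.*/active.hadoop.configuration=hadp20/' \
      "$KETTLE_HOME/plugins/pentaho-big-data-plugin/plugin.properties"

    # 2. Put the MySQL JDBC driver where Kettle can load it.
    cp mysql-connector-java-5.1.34-bin.jar "$KETTLE_HOME/lib/"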
Transformation Example 4: Incremental Data Synchronization
Set hyperbase.reader = true
Get maxid
Fetch the data stream based on maxid
Update rows by id
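The same pattern expressed as plain SQL from the shell (database, table, and column names are illustrative; in Kettle these queries sit in the corresponding Table Input and update steps):

    # Read the high-water mark from the target table...
    MAXID=$(mysql -N -e "SELECT COALESCE(MAX(id), 0) FROM target_db.orders")
    # ...then pull only the newer rows from the source.
    mysql -e "SELECT * FROM source_db.orders WHERE id > $MAXID"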
Database Connection Setup: MySQL
Database Connection Setup: Inceptor
Kettle Cluster: Remote Execution
Add a slave server, then execute remotely
Start the Carte service on the remote machine
• Carte is Kettle's cluster-related service
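A minimal sketch of bringing up Carte (host and port are examples):

    # On the remote machine, from the Kettle install directory:
    # listen on all interfaces, port 8081.
    sh carte.sh 0.0.0.0 8081

In Spoon, register this host and port as a slave server, then select it when launching a remote execution.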
Appendix
Database Synchronization: CDC (Change Data Capture)
• Source Data-Based CDC
  ▫ Use a time stamp or sequenced IDs to identify the last loaded rows, and store this information in a status table. This can even be combined with transactions: the status table holds, for all jobs/transformations, every table that needs to be in a consistent state, and for each table the last processed keys (source/target) and the status are saved. Some tables may need compound keys, depending on the ER design. This approach can also be combined with Kettle's own transformation log tables and the Dates and Dependencies functionality. There is an extended example in the Pentaho Data Integration for Database Developers (PDI2000C) course, module ETL patterns (Patterns: Batching, Transaction V - Status Table).
• Snapshot-Based CDC
  ▫ When no suitable time stamps or IDs are available, or when some records might have been updated in place, you need the snapshot-based approach: store a copy of the loaded data (in a table or even a file) and compare record by record. You can write a SQL statement that queries the delta, or use a transformation; Kettle supports this very comfortably with the Merge rows (diff) step. There is an extended example in the PDI2000C course, module ETL patterns (Pattern: Change Data Capture).
• Trigger-Based CDC
  ▫ Kettle does not create triggers in a database system, and many people dislike trigger-based CDC because it introduces a further layer of complexity into another system that is hard to maintain and keep in sync with the overall architecture. In the end it depends on the use case, and it may be needed in some projects. There are two main options:
  ▫ Create a trigger that writes the changed data to a separate table. That table carries a time stamp or sequenced ID which can be used to select the changed rows.
  ▫ Create a trigger that calls Kettle directly via the Kettle API. This may be needed for real-time CDC, where a Kettle transformation is called straight from the trigger. Some databases support Java calls from a trigger (e.g. PL/Java for PostgreSQL or Oracle). If you use Hibernate to talk to the database, Hibernate event listeners can act as the triggers; that works with every database as long as you use standard SQL or HQL queries.
• Database Log-Based CDC
  ▫ Some databases provide their own CDC logs that can be analyzed.
• Real-time CDC
  ▫ If you need real-time CDC, several of the above solutions apply; choose the best fit based on how real-time (or near-time) your requirements are. The trigger-based call of Kettle is the most real-time option. Any of the above can also be combined with a continuously executed transformation (e.g. every 15 minutes) that collects the changed data.
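For the source-data-based variant, Sqoop's incremental import is one concrete command-line realization (the connection string, table, columns, and paths below are assumptions):

    # Pull only rows changed since the stored high-water mark.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/source_db \
      --username etl --password-file /user/etl/.pw \
      --table orders \
      --incremental lastmodified \
      --check-column updated_at \
      --last-value "2014-10-01 00:00:00" \
      --merge-key id \
      --target-dir /user/kettle/orders_cdc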
The Kettle Family
• Spoon: design ETL transformations and jobs through a graphical interface
• Pan: run transformations designed in Spoon in batch mode (e.g. under a scheduler); it is a command-line back-end program
• Kitchen: run jobs designed in Spoon in batch mode (e.g. under a scheduler); also a command-line back-end program
• Carte: runs on the server side and can be called remotely by Spoon to execute work
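A sketch of the two batch runners (file paths are examples):

    # Run a transformation with Pan.
    sh pan.sh -file=/path/to/etl.ktr -level=Basic

    # Run a job with Kitchen.
    sh kitchen.sh -file=/path/to/nightly.kjb -level=Basic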
Job Example: Sqoop Import
• Note: the target directory must not already exist in HDFS
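A command-line sketch of the same import, including the cleanup this note implies (connection details and paths are assumptions):

    # Sqoop import fails if the target directory already exists in HDFS,
    # so remove any leftover output first.
    hadoop fs -rm -r -f /user/kettle/orders_import
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/source_db \
      --username etl --password-file /user/etl/.pw \
      --table orders \
      --target-dir /user/kettle/orders_import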
Job Example: Sqoop Export
• The data in the RDB table and the Hive table must not overlap (no duplicate rows between them)
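A command-line sketch of the corresponding export (names and paths are assumptions):

    # Export the files backing a Hive table into MySQL. By default sqoop
    # export only inserts, so rows already present in the RDB table would
    # be duplicated; hence the no-overlap rule above.
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/target_db \
      --username etl --password-file /user/etl/.pw \
      --table orders \
      --export-dir /user/hive/warehouse/orders \
      --input-fields-terminated-by '\001'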
Transformation Example 1: hadoop2mysql
Hadoop settings / MySQL settings
• Supports loading data from Hadoop directly into MySQL
• JAVA_HOME must be configured
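A minimal sketch (the JDK path is an example):

    # Export JAVA_HOME where Kettle's launcher scripts can see it,
    # e.g. in ~/.bashrc or at the top of spoon.sh.
    export JAVA_HOME=/usr/java/jdk1.7.0_67
    export PATH="$JAVA_HOME/bin:$PATH"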
Setup and Pitfalls [2]
• yarn-site must be configured
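One common way to do this is to copy the cluster's yarn-site.xml into the plugin's active Hadoop configuration; a sketch assuming the paths used earlier:

    # Make the cluster's YARN settings visible to the hadp20 shim
    # selected in plugin.properties.
    cp /etc/hadoop/conf/yarn-site.xml \
      /opt/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/hadp20/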
Introduction to Kettle
• Kettle is an open-source ETL tool written in pure Java. It runs on Windows, Linux, and Unix, requires no installation, and performs data extraction efficiently and reliably.
• The name means "water kettle": Matt, the project's lead programmer, wanted to pour all kinds of data into one kettle and have it flow out in a specified format.
• Kettle has two kinds of script files: transformations and jobs. A transformation performs the basic data-level work, such as file format conversion, data cleansing, and table joins; a job controls the overall workflow, and the objects it controls can be transformations or other jobs.