Sqoop官方中文手册

合集下载

Sqoop安装配置及演示

Sqoop 是一个用来将Hadoop （Hive 、HBase ）和关系型数据库中的数据相互转移的工具，可以将一个关系型数据库（例如：MySQL ,Oracle ,Postgres 等）中的数据导入到Hadoop 的HDFS 中，也可以将HDFS 的数据导入到关系型数据库中。

Sqoop 目前已经是Apache 的顶级项目了，目前版本是1.4.5 和 Sqoop2 1.99.4，本文以1.4.5的版本为例讲解基本的安装配置和简单应用的演示。

安装配置准备测试数据导入数据到HDFS导入数据到Hive导入数据到HBase[一]、安装配置选择Sqoop 1.4.5 版本：sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz1.1、下载后解压配置：1.2、环境变量配置 vi ~/.bash_profile ：1.3、配置Sqoop 参数：复制<SQOOP_HOME>/conf/sqoop-env-template.sh 一份重命名为：<SQOOP_HOME>/conf/sqoop-env.sh vi <SQOOP_HOME>/conf/sqoop-env.sh补充：因为我当前用户的默认环境变量中已经配置了相关变量，故该配置文件无需再修改：1.4、驱动jar 包下面测试演示以MySQL 为例，则需要把mysql 对应的驱动lib 文件copy 到 <SQOOP_HOME>/lib 目录下。

1tar -zxvf sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar .gz /hadoop/2cd /hadoop/3ln -s sqoop-1.4.5.bin__hadoop-2.0.4-alpha sqoop1#Sqoop add by 2export SQOOP_HOME=/hadoop/sqoop 3export PATH=$SQOOP_HOME/bin:$PATH1# 指定各环境变量的实际配置2# Set Hadoop-specific environment variables here.3 4#Set path to where bin/hadoop is available 5#export HADOOP_COMMON_HOME=6 7#Set path to where hadoop-*-core.jar is available 8#export HADOOP_MAPRED_HOME=9 10#set the path to where bin/hbase is available 11#export HBASE_HOME=12 13#Set the path to where bin/hive is available 14#export HIVE_HOME=1# Hadoop 2export HADOOP_PREFIX="/hadoop/hadoop-2.6.0" 3export HADOOP_HOME=${HADOOP_PREFIX} 4export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin 5export HADOOP_COMMON_HOME=${HADOOP_PREFIX} 6export HADOOP_HDFS_HOME=${HADOOP_PREFIX} 7export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}8export HADOOP_YARN_HOME=${HADOOP_PREFIX} 9# Native Path 10export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native 11export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"12# Hadoop end 13 14#Hive 15export HIVE_HOME=/hadoop/apache-hive-0.14.0-bin 16export PATH=$HIVE_HOME/bin:$PATH 17 18#HBase 19export HBASE_HOME=/hadoop/hbase-0.98.8-hadoop220export PATH=$HBASE 21 22#add by 以MySQL 为例：192.168.6.77（hostname:Master.Hadoop ）database: test用户：root 密码：micmiu准备两张测试表一个有主键表demo_blog ，一个无主键表 demo_log ：插入测试数据：[三]、导入数据到HDFS3.1、导入有主键的表比如我需要把表 demo_blog (含主键) 的数据导入到HDFS 中，执行如下命令：执行过程如下：1CREATE TABLE `demo_blog` (2 `id` int (11) NOT NULL AUTO_INCREMENT,3 `blog` varchar (100) NOT NULL,4 PRIMARY KEY (`id`)5) ENGINE=MyISAM DEFAULT CHARSET=utf8;1CREATE TABLE `demo_log` (2 `operator` varchar (16) NOT NULL,3 `log` varchar (100) NOT NULL 4) ENGINE=MyISAM DEFAULT CHARSET=utf8;1insert into demo_blog (id, blog) values (1, "");2insert into demo_blog (id, blog) values (2, "");3insert into demo_blog (id, blog) values (3, "");4 5insert into demo_log (operator, log) values ("micmiu", "create");6insert into demo_log (operator, log) values ("micmiu", "update");7insert into demo_log (operator, log) values ("michael", "edit");8insert into demo_log (operator, log) values ("michael", "delete");1sqoop import --connect jdbc:mysql://192.168.6.77/test --username root --password micmiu --table demo_blog验证导入到hdfs 上的数据：ps ：默认设置下导入到hdfs 上的路径是： /user/username/tablename/(files)，比如我的当前用户是hadoop ，那么实际路径即： /user/hadoop/demo_blog/(files)。

Sqoop-1.4.6安装部署及详细使用介绍

Sqoop-1.4.6安装部署及详细使⽤介绍之所以选择Sqoop1是因为Sqoop2⽬前问题太多。

⽆法正常使⽤，综合⽐较后选择Sqoop1。

Sqoop1安装配置⽐较简单⼀、安装部署（1）、下载安装包解压到/home/duanxz/sqooptar -zxvf sqoop-1.4.6-cdh5.5.2.tar.gz（2）、拷贝mysql的jdbc驱动包mysql-connector-java-5.1.31-bin.jar到sqoop/lib⽬录下。

duanxz@three:~/sqoop/sqoop-1.4.6-cdh5.5.2/lib$ ll mysql-connector-java-5.1.31.jar-rw------- 1 duanxz duanxz 964879 Jun 19 08:22 mysql-connector-java-5.1.31.jarduanxz@three:~/sqoop/sqoop-1.4.6-cdh5.5.2/lib$（3）、配置环境变量#sqoopexport SQOOP_HOME=/home/duanxz/sqoop/sqoop-1.4.6-cdh5.5.2export PATH="$PATH:$JAVA_HOME/bin:$HIVE_HOME/bin:$HIVE_HOME/conf:$SQOOP_HOME/bin"（4）、复制sqoop/conf/sqoop-env-template.sh为sqoop-env.sh添加相关的配置#Set path to where bin/hadoop is availableexport HADOOP_COMMON_HOME=/usr/local/hadoop-2.7.6#Set path to where hadoop-*-core.jar is availableexport HADOOP_MAPRED_HOME=/usr/local/hadoop-2.7.6#set the path to where bin/hbase is available#export HBASE_HOME=#Set the path to where bin/hive is availableexport HIVE_HOME=/home/duanxz/hive/apache-hive-2.1.1-bin#Set the path for where zookeper config dir is#export ZOOCFGDIR=（5）、测试Sqoopsqoop help结果：duanxz@ubuntu:~/sqoop/sqoop-1.4.6-cdh5.5.2/bin$ sqoop helpWarning: /home/duanxz/sqoop/sqoop-1.4.6-cdh5.5.2/../hbase does not exist! HBase imports will fail.Please set $HBASE_HOME to the root of your HBase installation.Warning: /home/duanxz/sqoop/sqoop-1.4.6-cdh5.5.2/../hcatalog does not exist! HCatalog jobs will fail.Please set $HCAT_HOME to the root of your HCatalog installation.Warning: /home/duanxz/sqoop/sqoop-1.4.6-cdh5.5.2/../accumulo does not exist! Accumulo imports will fail.Please set $ACCUMULO_HOME to the root of your Accumulo installation.Warning: /home/duanxz/sqoop/sqoop-1.4.6-cdh5.5.2/../zookeeper does not exist! Accumulo imports will fail.Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.18/06/1918:20:23 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.2usage: sqoop COMMAND [ARGS]Available commands:codegen Generate code to interact with database recordscreate-hive-table Import a table definition into Hiveeval Evaluate a SQL statement and display the resultsexport Export an HDFS directory to a database tablehelp List available commandsimport Import a table from a database to HDFSimport-all-tables Import tables from a database to HDFSimport-mainframe Import datasets from a mainframe server to HDFSjob Work with saved jobslist-databases List available databases on a serverlist-tables List available tables in a databasemerge Merge results of incremental importsmetastore Run a standalone Sqoop metastoreversion Display version informationSee 'sqoop help COMMAND'for information on a specific command.duanxz@ubuntu:~/sqoop/sqoop-1.4.6-cdh5.5.2/bin$说明：因为我们没有基于hadoop安装HBase，所以HBase相关的命令不能⽤，但是操作hadoop分布式⽂件系统的命令是可以⽤的。

sqoop知识点

sqoop知识点（实用版）目录1.Sqoop 简介2.Sqoop 的核心概念3.Sqoop 的安装与配置4.Sqoop 的基本操作5.Sqoop 的高级特性6.Sqoop 的优缺点7.Sqoop 的应用场景正文1.Sqoop 简介Sqoop 是一个用于在 Hadoop（Hive）与关系型数据库之间传输数据的工具。

它可以将一个关系型数据库中的数据导出到 Hadoop 分布式文件系统（HDFS）上，也可以将 HDFS 上的数据导入到关系型数据库中。

Sqoop 旨在实现大数据处理与传统数据库之间的数据交互，降低了数据迁移的难度。

2.Sqoop 的核心概念Sqoop 主要包括以下几个核心概念：- 数据源（Data Source）：数据源定义了要从中导入或导出数据的数据库。

- 目标（Target）：目标定义了要将数据导入到哪个 Hadoop 数据存储系统（如 HDFS、Hive 等）。

- 映射（Mapping）：映射定义了如何将数据源中的记录映射到目标中3.Sqoop 的安装与配置Sqoop 是 Apache Hadoop 的一个子项目，可以通过 Maven 或直接下载源代码进行安装。

配置方面，主要需要配置数据源、目标以及映射等相关参数。

4.Sqoop 的基本操作Sqoop 的基本操作包括导入和导出数据。

其中，导入数据可以使用`sqoop import`命令，导出数据可以使用`sqoop export`命令。

通过命令行参数可以指定数据源、目标、映射等具体参数。

5.Sqoop 的高级特性Sqoop 还支持许多高级特性，如：- 数据分片：可以将大量数据分成多个小块并行导入或导出，提高数据处理效率。

- 增量导入：可以只导入源数据库中自上次导入以来发生变化的数据，避免重复导入。

- 数据验证：可以对导入到目标的数据进行校验，确保数据质量。

6.Sqoop 的优缺点Sqoop 的优点包括：- 易于使用：提供了简单的命令行接口，方便用户进行数据导入和导出。

sqoop安装手册

1、sqoop环境准备1）Apache官方网站上下载支持hadoop2.0.0版本的sqoop安装包，此次下载的安装包是sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz。

把sqoop安装包通过ftp传到服务器上上。

a)scp -p sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz xxxx@x.x.x.x:/home/ocdc/bin/appb)ssh xxxx@x.x.x.x ###该服务器必须能连上数据库，故Sqoop安装在该台服务器上c)cd /home/ocdc/bin/appd) tar –xzvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gze)mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha sqoop2）安装的sqoop的服务器必须有hadoop、hive、jdk环境，如无则安装3）检查服务器的防火墙是否关闭，如防火墙开启则关闭[root@localhost ~]# service iptables status 查看iptables状态[root@ localhost ~]# service iptables stop iptables服务禁用4）设置环境变量包括hadoop、hive、jdk、sqoop等，vi ~/.bashrc或者sqoop-env.sh 文件，此次直接在vi ~/.bashrc文件中设置环境变量。

5）注释相关配置configure-sqoop文件中无用的hbase、zookeeper等应用注释掉vi /home/ocdc/bin/app/sqoop/bin/configure-sqoop 如6）$sqoop help ##测试环境是否准备好2、连接的数控环境设置1）查询关系型数据库版本2）下载关系型数据库JDBC驱动包3）驱动包放到$SQOOP_HOME/lib目录下3、测试sqoop环境是否联通sqoop list-tables --connect jdbc:mysql://192.168.0.1:3306/hive --username 用户名--password密码sqoop list-tables --connect jdbc:db2:// 192.168.0.1:50000/数据库名--username 用户名--password密码。

sqoop基本操作

sqoop基本操作介绍Sqoop是一个用于在Apache Hadoop和结构化数据存储之间传输数据的工具。

它支持从关系型数据库（如MySQL，Oracle，PostgreSQL等）中导入数据到Hadoop分布式文件系统（HDFS），或将数据从HDFS导出到关系型数据库中。

Sqoop基本操作包括以下几个方面：安装和配置Sqoop、导入数据、导出数据、增量导入和导出、并行导入和导出、使用自定义查询、使用Sqoop Connectors等。

安装和配置Sqoop要使用Sqoop，首先需要安装并配置它。

安装Sqoop的步骤如下：1. 下载最新版本的Sqoop二进制文件，并解压缩到本地目录中。

2. 配置环境变量，使得可以在任何位置运行Sqoop命令。

3. 配置sqoop-site.xml文件，指定相关参数，如数据库连接信息、Hadoop集群信息等。

可以在$SQOOP_HOME/conf目录下找到该文件。

4. 配置hadoop-env.sh文件，指定相关参数，如JAVA_HOME等。

该文件位于$SQOOP_HOME/conf目录下。

5. 测试安装是否成功。

可以通过执行sqoop version命令来测试是否成功安装了Sqoop。

导入数据使用Sqoop可以将关系型数据库中的表数据导入到HDFS中。

以下是使用Sqoop进行简单的数据导入操作的步骤：1. 使用命令sqoop import指定相关参数，如数据库连接信息、表名、HDFS目录等。

2. Sqoop会自动将表中的数据导入到HDFS指定的目录中。

3. 可以使用Hadoop命令来查看导入的数据是否成功。

导出数据使用Sqoop可以将HDFS中的数据导出到关系型数据库中。

以下是使用Sqoop进行简单的数据导出操作的步骤：1. 使用命令sqoop export指定相关参数，如数据库连接信息、表名、HDFS目录等。

2. Sqoop会自动将HDFS目录中的数据导出到关系型数据库指定的表中。

sqoop使用手册

Author ：路帅1.Sqoop介绍概述Hadoop的数据传输工具sqoop是Apache顶级项目，主要用来在Hadoop和关系数据库、数据仓库、NoSql系统中传递数据。

通过sqoop，我们可以方便的将数据从关系数据库导入到HDFS、Hbase、Hive，或者将数据从HDFS导出到关系数据库。

sqoop架构非常简单，其整合了Hive、Hbase和Oozie，通过map-reduce任务来传输数据，从而提供并发特性和容错。

Sqoop集成了工作流程协调的Apache Oozie，定义安排和自动导入/导出任务。

sqoop主要通过JDBC和关系数据库进行交互。

理论上支持JDBC的database都可以使用sqoop和hdfs 进行数据交互。

但是，只有一小部分经过sqoop官方测试，如下：Database version --direct support connect stringHSQLDB 1.8.0+ No jdbc:hsqldb:*//MySQL 5.0+ Yes jdbc:mysql://Oracle 10.2.0+ No jdbc:oracle:*//PostgreSQL 8.3+ Yes jdbc:postgresql://较老的版本有可能也被支持，但未经过测试。

出于性能考虑，sqoop提供不同于JDBC的快速存取数据的机制，可以通过--direct使用。

版本介绍2012年3月，sqoop从Apache的孵化器中毕业。

成为Apache的Top-Level Project。

下图提供了sqoop从诞生到目前的简要概述：，sqoop2 的最新版本是sqoop 1.99.3。

Sqoop1和sqoop2的区别1.3.1.工作模式sqoop1 基于客户端模式，用户使用客户端模式，需要在客户端节点安装sqoop和连接器/驱动器sqoop2 基于服务的模式，是sqoop1的下一代版本，服务模式主要分为 sqoop2 server 和 client，用户使用服务的模式，需要在sqoop2 server安装连接器/驱动器，所有配置信息都在sqoop2 server进行配置。

Sqoop1工具import和export使用详解

Sqoop1工具import和export使用详解Sqoop工具import和export使用详解问题导读：1、Sqoop如何在异构平台之间进行数据迁移？2、Sqoop是怎样保证高可靠性的？Sqoop可以在HDFS/Hive和关系型数据库之间进行数据的导入导出，其中主要使用了import和export这两个工具。

这两个工具非常强大，提供了很多选项帮助我们完成数据的迁移和同步。

比如，下面两个潜在的需求：1.业务数据存放在关系数据库中，如果数据量达到一定规模后需要对其进行分析或同统计，单纯使用关系数据库可能会成为瓶颈，这时可以将数据从业务数据库数据导入（import）到Hadoop平台进行离线分析。

2.对大规模的数据在Hadoop平台上进行分析以后，可能需要将结果同步到关系数据库中作为业务的辅助数据，这时候需要将Hadoop平台分析后的数据导出（export）到关系数据库。

这里，我们介绍Sqoop完成上述基本应用场景所使用的import 和export工具，通过一些简单的例子来说明这两个工具是如何做到的。

工具通用选项import和export工具有些通用的选项，如下表所示：选项含义说明--connect 指定JDBC连接字符串--connection-manager 指定要使用的连接管理器类--driver 指定要使用的JDBC驱动类--hadoop-mapred-home指定$HADOOP_MAPRED_HOME路径--help 打印用法帮助信息--password-file 设置用于存放认证的密码信息文件的路径-P 从控制台读取输入的密码--password 设置认证密码--username 设置认证用户名--verbose 打印详细的运行信息--connection-param-file 可选，指定存储数据库连接参数的属性文件数据导入工具importimport工具，是将HDFS平台外部的结构化存储系统中的数据导入到Hadoop平台，便于后续分析。

samtools 官方手册

Workflows•WES Mapping to Variant Calls - Version 1.0•Using CRAM within SamtoolsWGS/WES Mapping to Variant Calls - Version 1.0The standard workflow for working with DNA sequence data consists of three major steps:•Mapping•Improvement•Variant CallingMappingFor reads from 70bp up to a few megabases we recommend using BWA MEM(/)to map the data to a given reference genome. The reference you use will differ depending on the species your data came from and the resources you want to use with it. For example for a new research project consisting of Human data you would probably use the Genome Reference Consortium’s build 38 analysis set(ftp:///genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38//seqs_for_alignment_pipelines"). Note that with BWA 0.7.10, mapping to alternative haplotypes has been deemed unready for production use, so you will probably wish to use the analysis set that does not contain them.To prepare the reference for mapping you must first index it by typing the following command where <ref.fa>is the path to your reference file:bwa index <ref.fa>This may take several hours as it prepares the Burrows Wheeler Transform index for the reference, allowing the aligner to locate where your reads map within that reference.Once you have finished preparing your indexed reference you can map your reads to the reference:bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' <ref.fa> <read1.fa> <read1.fa> > lane.samTypically your reads will be supplied to you in two files written in the FASTQ format. It is particularly important to ensure that the @RG information here is correct as this information is used by later tools. The SM field must be set to the name of the sample being processed, and LB field to the library. The resulting mapped reads will be delivered to you in a mapping format known as SAM(http://samtools.github.io/hts-specs/).Because BWA can sometimes leave unusual FLAG information on SAM records, it is helpful when working with many tools to first clean up read pairing information and flags: samtools fixmate -O bam <lane.sam> <lane_fixmate.bam>To sort them from name order into coordinate order:samtools sort -O bam -o <lane_sorted.bam> -T </tmp/lane_temp> <lane_fixmate.sam>ImprovementIn order to reduce the number of miscalls of INDELs in your data it is helpful to realign your raw gapped alignment with the Broad’s GATK(https:///gatk/) Realigner.java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R <ref.fa> -I <lane.bam> -o <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf>java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R <ref.fa> -I <lane.bam> -targetIntervals <lane.intervals> --known <bundle/b38/Mills1000G.b38.vcf> -o <lane_realigned.bam>BQSR from the Broad’s GATK allows you to reduce the effects of analysis artefacts produced by your sequencing machines. It does this in two steps, the first analyses your data to detect covariates and the second compensates for those covariates by adjusting quality scores.java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R <ref.fa> -knownSites >bundle/b38/dbsnp_142.b38.vcf> -I <lane.bam> -o <lane_recal.table>java -Xmx2g -jar GenomeAnalysisTK.jar -T PrintReads -R <ref.fa> -I <lane.bam> --BSQR <lane_recal.table> -o <lane_recal.bam>It is helpful at this point to compile all of the reads from each library together into one BAM, which can be done at the same time as marking PCR and optical duplicates. To identify duplicates we currently recommend the use of either the Picard(/command-line-overview.shtml#MarkDuplicates)or biobambam’s(https:///gt1/biobambam)mark duplicates tool.java -Xmx2g -jar MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=<lane_1.bam> INPUT=<lane_2.bam> INPUT=<lane_3.bam> OUTPUT=<library.bam>Once this is done you can perform another merge step to produce your sample BAM files.samtools merge <sample.bam> <library1.bam> <library2.bam> <library3.bam>samtools index <sample.bam>If you have the computational time and resources available it is helpful to realign your INDELS again:java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R <ref.fa> -I <sample.bam> -o <sample.intervals> --known >bundle/b38/Mills1000G.b38.vcf> java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R <ref.fa> -I <sample.bam> -targetIntervals <sample.intervals> --known >bundle/b38/Mills1000G.b38.v cf> -o <sample_realigned.bam>Lastly we index our BAM using samtools:samtools index <sample_realigned.bam>Variant CallingTo convert your BAM file into genomic positions we first use mpileup to produce a BCF file that contains all of the locations in the genome. We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call.You can do this using a pipe as shown here:samtools mpileup -ugf <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | bcftools call -vmO z -o <study.vcf.gz>Alternatively if you need to see why a specific site was not called by examining the BCF, or wish to spread the load slightly you can break it down into two steps as follows:samtools mpileup -go <study.bcf> -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> bcftools call -vmO z -o <study.vcf.gz> <study.bcf>To prepare our VCF for querying we next index it using tabix:tabix -p vcf <study.vcf.gz>Additionally you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:这次没有找到变异位点，试一下这个思路，分开跑一下，看看是什么原因；bcftools stats -F <ref.fa> -s - <study.vcf.gz> > <study.vcf.gz.stats>mkdir plotsplot-vcfstats -p plots/ <study.vcf.gz.stats>Finally you will probably need to filter your data using commands such as:bcftools filter -O z -o <study_filtered..vcf.gz> -s LOWQUAL -i'%QUAL>10' <study.vcf.gz>Variant filtration is a subject worthy of an article in itself and the exact filters you will need to use will depend on the purpose of your study and quality and depth of the data used to call the variants.References•The 1000 Genomes Project Consortium - An Integrated map of genetic variation from 1092 human genomes Nature 491, 56–65 (01 November 2012) doi:10.1038/nature11632 (/10.1038/nature11632)•GATK Best Practices(/gatk/guide/best-practices)Using CRAM within SamtoolsCRAM is primarily a reference-based compressed format, meaning that only differences between the stored sequences and the reference are stored.For a workflow this has a few fundamental effects:1.Alignments should be kept in chromosome/position sort order.2.The reference must be available at all times. Losing it may be equivalent to losing all your read sequences.Technically CRAM can work with other orders but it can become inefficient due to a large amount of random access across the reference genome. The current implementation of CRAM in htslib 1.0 is also inefficient in size for unsorted data, although this will be rectified in upcoming releases.In CRAM format the reference sequence is linked to by the md5sum (M5 auxiliary tag) in the CRAM header (@SQ tags). This is mandatory and part of the CRAM specification. In SAM/BAM format, these M5 tags are optional. Therefore converting from SAM/BAM to CRAM requires some additional overhead to link the CRAM to the correct reference sequence.A Worked ExampleObtain some public dataWe will use the first 100,000 read-pairs from a yeast data set.curl ftp:///vol1/fastq/SRR507/SRR507778/SRR507778_1.fastq.gz|gzip -d | head -100000 > y1.fastqcurl ftp:///vol1/fastq/SRR507/SRR507778/SRR507778_2.fastq.gz|gzip -d | head -100000 > y2.fastqcurl ftp:///pub/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz > yeast.fastaPrepare the BWA indicesWe need to ensure there exists a .fai fasta index and also indices for whichever aligner we are using (Bwa-mem in this example).samtools faidx yeast.fastabwa index yeast.fastaProduce the alignmentsThe aligner is likely to output SAM in the same order or similar order to the input fastq files. It won’t be outputting in chromosome position order, so the output is typically not well suited to CRAM.bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' yeast.fasta y1.fastq y2.fastq > yeast.samThe -R option adds a read-group line and applies that read-group to all aligned sequence records. It is not necessary, but a recommended practice.Sort into chromosome/positon orderIdeally at this point we would be outputting CRAM directly, but at present samtools 1.0 does not have a way to indicate the reference on the command line. We can output to BAM instead and convert (below), or modify the SAM @SQ header to include MD5 sums in the M5: field.samtools sort -O bam -T /tmp -l 0 -o yeast.bam yeast.samThe “-l 0” indicates to use no compression in the BAM file, as it is transitory and will be replaced by CRAM soon. We may wish to use -l 1 if disk space is short and we wish to reduce temporary file size.Convert to CRAM formatsamtools view -T yeast.fasta -C -o yeast.cram yeast.bamNote that since the BAM file did not have M5 tags for the reference sequences, they are computed by Samtools and added to the CRAM. In a production environment, this step can be avoided by ensuring that the M5 tags are already in the SAM/BAM header.The last 3 steps can be combined into a pipeline to reduce disk I/O:bwa mem yeast.fasta y1.fastq y2.fastq | \samtools sort -O bam -l 0 -T /tmp - | \samtools view -T yeast.fasta -C -o yeast.cram -Viewing in alignment and pileup formatSee the variant calling workflow for more advanced examples.samtools view yeast.cramsamtools mpileup -f yeast.fasta yeast.cramThe REF_PATH and REF_CACHEOne of the key concepts in CRAM is that it is uses reference based compression. This means that Samtools needs the reference genome sequence in order to decode a CRAM file. Samtools uses the MD5 sum of the each reference sequence as the key to link a CRAM file to the reference genome used to generate it. By default Samtools checks the reference MD5 sums (@SQ “M5” auxiliary tag) in the directory pointed to by $REF_PATH environment variable (if it exists), falling back to querying the European Bioinformatics Institute (EBI) reference genome server, and further falling back to the @SQ “UR” field if these are not found.While the EBI have an MD5 reference server for downloading reference sequences over http, we recommend use of a local MD5 cache. We have provided with Samtools a basic script (misc/seq_cache_populate.pl) to convert your local yeast.fasta to a directory tree of reference sequence MD5 sums:<samtools_src_dir>/misc/seq_cache_populate.pl -root /some_dir/cache yeast.fastaexport REF_PATH=/some_dir/cache/%2s/%2s/%s:/ena/cram/md5/%sexport REF_CACHE=/some_dir/cache/%2s/%2s/%sREF_PATH is a colon separated list of directories in which to search for files named after the sequence M5 field. The : in http:// is not considered to be a separator. Hence using the above setting, any CRAM files that are not cached locally may still be looked up remotely.In this example “%2s/%2s/%s” means the first two digits of the M5 field followed by slash, the next two digits and slash, and then the remaining 28 digits. This helps to avoid one large directory with thousands of files in it.The REF_CACHE environment variable is used to indicate that any downloaded reference sequences should be stored locally in this directory in order to avoid subsequent downloads. This should normally be set to the same location as the first directory in REF_PATH.Copyright © 2014 Genome Research Limited (reg no. 2742969) is a charity registered in England with number 1021457. Terms and conditions(/terms).。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

Sqoop中文手册
1. 概述
本文档主要对SQOOP的使用进行了说明，参考内容主要来自于Cloudera SQOOP的官方文档。

为了用中文更清楚明白地描述各参数的使用含义，本文档几乎所有参数使用说明都经过了我的实际验证而得到。

2. codegen
将关系数据库表映射为一个java文件、java class类、以及相关的jar包，
1、将数据库表映射为一个Java文件，在该Java文件中对应有表的各个字段。

2、生成的Jar和class文件在metastore功能使用时会用到。

基础语句：
sqoop codegen –connect jdbc:mysql://localhost:3306/hive –username root
–password 123456 –table TBLS2
3. create-hive-table
生成与关系数据库表的表结构对应的HIVE表
基础语句：
sqoop create-hive-table –connect jdbc:mysql://localhost:3306/hive -username root -password 123456 –table TBLS –hive-table h_tbls2
4. eval
可以快速地使用SQL语句对关系数据库进行操作，这可以使得在使用import这种工具进行数据导入的时候，可以预先了解相关的SQL语句是否正确，并能将结果显示在控制台。

查询示例：
sqoop eval –connect jdbc:mysql://localhost:3306/hive -username root -password 123456 -query ―SELECT * FROM tbls LIMIT 10″
数据插入示例：
sqoop eval –connect jdbc:mysql://localhost:3306/hive -username root -password 123456 -e ―INSERT INTO TBLS2
VALUES(100,1375170308,1,0,‘hadoop‘,0,1,‘guest‘,‘MANAGED_TABLE‘,‘abc‘,‘ddd‘)‖
-e、-query这两个参数经过测试，比如后面分别接查询和插入SQL语句，皆可运行无误，如上。

5. export
从hdfs中导数据到关系数据库中
sqoop export –connect jdbc:mysql://localhost:3306/hive –username root
–password
123456 –table TBLS2 –export-dir sqoop/test
6. import
将数据库表的数据导入到hive中，如果在hive中没有对应的表，则自动生成与数据库表名相同的表。

sqoop import –connect jdbc:mysql://localhost:3306/hive –username root
–password
123456 –table user –split-by id –hive-import
–split-by指定数据库表中的主键字段名，在这里为id。

增量导入
对incremental参数，如果是以日期作为追加导入的依据，则使用lastmodified，否则就使用append值。

7. import-all-tables
将数据库里的所有表导入到HDFS中，每个表在hdfs中都对应一个独立的目录。

sqoop import-all-tables –connect jdbc:mysql://localhost:3306/test
sqoop import-all-tables –connect jdbc:mysql://localhost:3306/test –hive-import
8. job
用来生成一个sqoop的任务，生成后，该任务并不执行，除非使用命令执行该任务。

sqoop job
9. list-databases
打印出关系数据库所有的数据库名
sqoop list-databases –connect jdbc:mysql://localhost:3306/ -username root
-password 123456
10.list-tables
打印出关系数据库某一数据库的所有表名
sqoop list-tables –connect jdbc:mysql://localhost:3306/zihou -username root
-password 123456
11. merge
将HDFS中不同目录下面的数据合在一起，并存放在指定的目录中，示例如：
sqoop merge –new-data /test/p1/person –onto /test/p2/person –target-dir
/test/merged –jar-file /opt/data/sqoop/person/Person.jar –class-name Person
–merge-key id
其中，–class-name所指定的class名是对应于Person.jar中的Person类，而Person.jar 是通过Codegen生成的
12. metastore
记录sqoop job的元数据信息，如果不启动metastore实例，则默认的元数据存储目录为：~/.sqoop，如果要更改存储目录，可以在配置文件sqoop-site.xml中进行更改。

metastore实例启动：sqoop metastore
13. version
显示sqoop版本信息语句：sqoop version 14. help
打印sqoop帮助信息语句：sqoop help 15.公共参数Hive参数
数据库连接参数
文件输出参数用于import场景。

示例如：
sqoop import –connect jdbc:mysql://localhost:3306/test –username root –P –table person –split-by id –check-column id –incremental append –last-value 1
–enclosed-by ‗\‖‗
–escaped-by \# –fields-terminated-by .
文件输入参数
对数据格式的解析，用于export场景，与文件输出参数相对应。

示例如：
sqoop export –connect jdbc:mysql://localhost:3306/test –username root
–password
123456 –table person2 –export-dir /user/hadoop/person –staging-table person3 –clear-staging-table –input-fields-terminated-by ‗,‘
在hdfs中存在某一格式的数据，在将这样的数据导入到关系数据库中时，必须要按照该格式来解析出相应的字段值，比如在hdfs中有这样格式的数据：
3,jimsss,dd@,1,2013-08-07 16:00:48.0,‖hehe‖,
上面的各字段是以逗号分隔的，那么在解析时，必须要以逗号来解析出各字段值，如：
–input-fields-terminated-by ‗,‘。