[Original] Big Data Fundamentals: Hive (1) HiveSQL Execution — Code Flow

This article is based on Hive 2.1.
Hive executes SQL in two ways:
running the hive command, which further splits into hive -e, hive -f, and the interactive shell;
running the beeline command, where beeline connects to a remote thrift server.
Below we look at how SQL is executed in each of these scenarios:
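For reference, each mode is typically invoked like this (the query and file path are illustrative):

$HIVE_HOME/bin/hive -e "show databases;"                  # hive -e: execute a SQL string
$HIVE_HOME/bin/hive -f /path/to/query.sql                 # hive -f: execute a SQL file
$HIVE_HOME/bin/hive                                       # interactive shell
$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000    # beeline against the thrift server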
1 The hive command
Startup command
To start the Hive CLI:
$HIVE_HOME/bin/hive
which is equivalent to
$HIVE_HOME/bin/hive --service cli
This invokes
$HIVE_HOME/bin/ext/cli.sh
and the class actually started is org.apache.hadoop.hive.cli.CliDriver.
Code walkthrough
org.apache.hadoop.hive.cli.CliDriver
public static void main(String[] args) throws Exception {
  int ret = new CliDriver().run(args);
  System.exit(ret);
}

public int run(String[] args) throws Exception {
  ...
  // execute cli driver work
  try {
    return executeDriver(ss, conf, oproc);
  } finally {
    ss.resetThreadName();
    ss.close();
  }
  ...

private int executeDriver(CliSessionState ss, HiveConf conf, OptionsProcessor oproc)
    throws Exception {
  ...
  if (ss.execString != null) {
    int cmdProcessStatus = cli.processLine(ss.execString);
    return cmdProcessStatus;
  }
  ...
  try {
    if (ss.fileName != null) {
      return cli.processFile(ss.fileName);
    }
  } catch (FileNotFoundException e) {
    System.err.println("Could not open input file for reading. (" + e.getMessage() + ")");
    return 3;
  }
  ...
  while ((line = reader.readLine(curPrompt + "> ")) != null) {
    if (!prefix.equals("")) {
      prefix += '\n';
    }
    if (line.trim().startsWith("--")) {
      continue;
    }
    if (line.trim().endsWith(";") && !line.trim().endsWith("\\;")) {
      line = prefix + line;
      ret = cli.processLine(line, true);
  ...

public int processFile(String fileName) throws IOException {
  ...
  rc = processReader(bufferReader);
  ...

public int processReader(BufferedReader r) throws IOException {
  String line;
  StringBuilder qsb = new StringBuilder();
  while ((line = r.readLine()) != null) {
    // Skipping through comments
    if (! line.startsWith("--")) {
      qsb.append(line + "\n");
    }
  }
  return (processLine(qsb.toString()));
}

public int processLine(String line, boolean allowInterrupting) {
  ...
  ret = processCmd(command);
  ...

public int processCmd(String cmd) {
  ...
  CommandProcessor proc = CommandProcessorFactory.get(tokens, (HiveConf) conf);
  ret = processLocalCmd(cmd, proc, ss);
  ...

int processLocalCmd(String cmd, CommandProcessor proc, CliSessionState ss) {
  int tryCount = 0;
  boolean needRetry;
  int ret = 0;
  do {
    try {
      needRetry = false;
      if (proc != null) {
        if (proc instanceof Driver) {
          Driver qp = (Driver) proc;
          PrintStream out = ss.out;
          long start = System.currentTimeMillis();
          if (ss.getIsVerbose()) {
            out.println(cmd);
          }
          qp.setTryCount(tryCount);
          ret = qp.run(cmd).getResponseCode();
          ...
          while (qp.getResults(res)) {
            for (String r : res) {
              out.println(r);
            }
          ...
CliDriver.main calls run, and run calls executeDriver, which handles the three cases mentioned above:
hive -e executes a SQL string: ss.execString is non-null, and the process exits after execution;
hive -f executes a SQL file: ss.fileName is non-null, and the process exits after execution;
interactive mode: lines are read from reader.readLine in a loop, then executed and their results printed.
All three cases eventually call processLine, which goes through processCmd to processLocalCmd. processLocalCmd first calls Driver.run to execute the SQL, and then calls Driver.getResults to print the results. These are also the two most important interfaces of Driver; the Driver implementation is examined later.
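Stripped of session and prompt handling, the dispatch logic above reduces to a small template. The following is an illustrative sketch, not Hive source; processLine and processFile are hypothetical stand-ins for the CliDriver methods:

import java.io.BufferedReader;
import java.io.IOException;

public class DispatchSketch {
  // Stand-ins for CliDriver.processLine / processFile (hypothetical stubs).
  static int processLine(String line) { System.out.println("exec: " + line); return 0; }
  static int processFile(String fileName) { System.out.println("exec file: " + fileName); return 0; }

  // The three entry points of executeDriver, distilled.
  static int executeDriver(String execString, String fileName, BufferedReader reader)
      throws IOException {
    if (execString != null) {          // hive -e: run the string, then exit
      return processLine(execString);
    }
    if (fileName != null) {            // hive -f: run the file, then exit
      return processFile(fileName);
    }
    String line, prefix = "";          // interactive: accumulate lines until ';'
    while ((line = reader.readLine()) != null) {
      if (line.trim().startsWith("--")) continue;   // skip comment lines
      if (line.trim().endsWith(";")) {
        processLine(prefix + line);
        prefix = "";
      } else {
        prefix += line + "\n";
      }
    }
    return 0;
  }
}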
2 The beeline command
beeline needs to connect to the Hive thrift server, so first let's see how the thrift server is started:
hive thrift server
Startup command
To start the Hive thrift server:
$HIVE_HOME/bin/hiveserver2
which is equivalent to
$HIVE_HOME/bin/hive --service hiveserver2
This invokes
$HIVE_HOME/bin/ext/hiveserver2.sh
and the class actually started is org.apache.hive.service.server.HiveServer2.
Startup sequence
HiveServer2.main
  startHiveServer2
    init
      addService: CLIService, ThriftBinaryCLIService
    start
      Service.start
        CLIService.start
        ThriftBinaryCLIService.start
          TThreadPoolServer.serve
Class hierarchy [interface or parent class -> subclass]:
TServer -> TThreadPoolServer
TProcessorFactory -> SQLPlainProcessorFactory
TProcessor -> TSetIpAddressProcessor
ThriftCLIService -> ThriftBinaryCLIService
CLIService
HiveSession
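To make the Thrift side of this hierarchy concrete, here is a minimal sketch of standing up a TThreadPoolServer with the standard libthrift API. This is simplified: the real ThriftBinaryCLIService also wires authentication, SSL, and its own executor, and the processor argument here is a placeholder for Hive's generated TCLIService processor:

import org.apache.thrift.TProcessor;
import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TThreadPoolServer;
import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TTransportException;

public class ThriftServerSketch {
  public static void serve(TProcessor processor, int port) throws TTransportException {
    TServerSocket transport = new TServerSocket(port);      // listening socket
    TThreadPoolServer.Args args = new TThreadPoolServer.Args(transport)
        .processor(processor)       // dispatches each RPC to a handler method
        .minWorkerThreads(5)
        .maxWorkerThreads(500);     // one worker thread per connection
    TServer server = new TThreadPoolServer(args);           // TServer -> TThreadPoolServer
    server.serve();                 // blocks, like TThreadPoolServer.serve above
  }
}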
Code walkthrough
org.apache.hive.service.cli.thrift.ThriftBinaryCLIService
public ThriftBinaryCLIService(CLIService cliService, Runnable oomHook) {
  super(cliService, ThriftBinaryCLIService.class.getSimpleName());
  this.oomHook = oomHook;
}
ThriftBinaryCLIService is a core class: it actually starts the thrift server while wrapping a CLIService, and every request is ultimately handed to the underlying CLIService. Let's look at the CLIService code:
org.apache.hive.service.cli.CLIService
@Override
public OperationHandle executeStatement(SessionHandle sessionHandle, String statement,
    Map<String, String> confOverlay) throws HiveSQLException {
  OperationHandle opHandle =
      sessionManager.getSession(sessionHandle).executeStatement(statement, confOverlay);
  LOG.debug(sessionHandle + ": executeStatement()");
  return opHandle;
}

@Override
public RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
    long maxRows, FetchType fetchType) throws HiveSQLException {
  RowSet rowSet = sessionManager.getOperationManager().getOperation(opHandle)
      .getParentSession().fetchResults(opHandle, orientation, maxRows, fetchType);
  LOG.debug(opHandle + ": fetchResults()");
  return rowSet;
}
CLIService's two most important interfaces are executeStatement and fetchResults; both are forwarded to the HiveSession for handling. Now look at the HiveSession implementation class:
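Seen from the client side, these two interfaces are exactly what a Beeline or JDBC session exercises: Statement.execute maps onto executeStatement, and iterating a ResultSet maps onto fetchResults. A minimal JDBC sketch follows; the URL, user, and query are placeholders, and the hive-jdbc driver must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Hs2ClientSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement st = conn.createStatement();
         // executeQuery -> ThriftCLIService -> CLIService.executeStatement
         ResultSet rs = st.executeQuery("show databases")) {
      while (rs.next()) {            // each fetch -> CLIService.fetchResults
        System.out.println(rs.getString(1));
      }
    }
  }
}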
org.apache.hive.service.cli.session.HiveSessionImpl
@Override
public OperationHandle executeStatement(String statement, Map<String, String> confOverlay) throws HiveSQLException {
  return executeStatementInternal(statement, confOverlay, false, 0);
}

private OperationHandle executeStatementInternal(String statement,
    Map<String, String> confOverlay, boolean runAsync, long queryTimeout) throws HiveSQLException {
  acquire(true, true);
  ExecuteStatementOperation operation = null;
  OperationHandle opHandle = null;
  try {
    operation = getOperationManager().newExecuteStatementOperation(getSession(), statement,
        confOverlay, runAsync, queryTimeout);
    opHandle = operation.getHandle();
    operation.run();
  ...

@Override
public RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
    long maxRows, FetchType fetchType) throws HiveSQLException {
  acquire(true, false);
  try {
    if (fetchType == FetchType.QUERY_OUTPUT) {
      return operationManager.getOperationNextRowSet(opHandle, orientation, maxRows);
    }
    return operationManager.getOperationLogRowSet(opHandle, orientation, maxRows, sessionConf);
  } finally {
    release(true, false);
  }
}
As we can see:
HiveSessionImpl.executeStatement calls ExecuteStatementOperation.run (ExecuteStatementOperation is a kind of Operation);
HiveSessionImpl.fetchResults calls OperationManager.getOperationNextRowSet, which then calls through to Operation.getNextRowSet.
org.apache.hive.service.cli.operation.OperationManager
public RowSet getOperationNextRowSet(OperationHandle opHandle,
    FetchOrientation orientation, long maxRows)
    throws HiveSQLException {
  return getOperation(opHandle).getNextRowSet(orientation, maxRows);
}
Now let's look at Operation's run and getNextRowSet in detail:
org.apache.hive.service.cli.operation.Operation
public void run() throws HiveSQLException {
  beforeRun();
  try {
    Metrics metrics = MetricsFactory.getInstance();
    if (metrics != null) {
      try {
        metrics.incrementCounter(MetricsConstant.OPEN_OPERATIONS);
      } catch (Exception e) {
        LOG.warn("Error Reporting open operation to Metrics system", e);
      }
    }
    runInternal();
  } finally {
    afterRun();
  }
}

public RowSet getNextRowSet() throws HiveSQLException {
  return getNextRowSet(FetchOrientation.FETCH_NEXT, DEFAULT_FETCH_MAX_ROWS);
}
Operation is an abstract class:
run calls the abstract method runInternal;
the no-argument getNextRowSet calls the abstract overload getNextRowSet(FetchOrientation, long).
Below we will see how these two abstract methods are implemented in subclasses; ultimately they rely on Driver's run and getResults. The overall shape is sketched right after this.
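The structure of Operation.run is the classic template method pattern: the lifecycle is fixed in the base class, and subclasses supply only the work. A distilled sketch (illustrative, not Hive source):

public abstract class OperationSketch {
  // run() fixes the lifecycle; subclasses only supply runInternal.
  public final void run() {
    beforeRun();                 // setup: operation log, thread state
    try {
      runInternal();             // SQLOperation / HiveCommandOperation fill this in
    } finally {
      afterRun();                // teardown, even on failure
    }
  }
  protected void beforeRun() {}
  protected void afterRun() {}
  protected abstract void runInternal();
}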
1) First, runInternal as implemented in the subclass HiveCommandOperation:
org.apache.hive.service.cli.operation.HiveCommandOperation
@Override
public void runInternal() throws HiveSQLException {
  setState(OperationState.RUNNING);
  try {
    String command = getStatement().trim();
    String[] tokens = statement.split("\\s");
    String commandArgs = command.substring(tokens[0].length()).trim();
    CommandProcessorResponse response = commandProcessor.run(commandArgs);
    ...
Here CommandProcessor.run is called, which in practice ends up in Driver.run (Driver is an implementation of CommandProcessor). Note that for SQL statements the operation created is a SQLOperation, which likewise drives a Driver; HiveCommandOperation covers the non-SQL Hive commands.
2) Next, getNextRowSet as implemented in the subclass SQLOperation:
org.apache.hive.service.cli.operation.SQLOperation
public RowSet getNextRowSet(FetchOrientation orientation, long maxRows)
    throws HiveSQLException {
  ...
  driver.setMaxRows((int) maxRows);
  if (driver.getResults(convey)) {
    return decode(convey, rowSet);
  }
  ...
This calls Driver.getResults.
3 Driver
The code analysis above shows that whether SQL is executed through the hive command line or through beeline connected to the thrift server, it all ultimately relies on Driver. Driver's two core interfaces are:
run
getResults
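Because everything funnels into these two calls, Hive can also be driven embedded. A hedged sketch, assuming hive-exec and a valid Hive configuration are on the classpath; the query is a placeholder:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class DriverSketch {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    SessionState.start(conf);                 // Driver expects a started SessionState
    Driver driver = new Driver(conf);
    // run = compileInternal (SQL -> QueryPlan) + execute (run all tasks)
    if (driver.run("show databases").getResponseCode() == 0) {
      List<String> batch = new ArrayList<>();
      while (driver.getResults(batch)) {      // getResults -> FetchTask.fetch
        for (String row : batch) {
          System.out.println(row);
        }
        batch.clear();                        // getResults appends; clear between batches
      }
    }
    driver.close();
  }
}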
Code walkthrough
org.apache.hadoop.hive.ql.Driver
@Override
public CommandProcessorResponse run(String command)
    throws CommandNeedRetryException {
  return run(command, false);
}

public CommandProcessorResponse run(String command, boolean alreadyCompiled)
    throws CommandNeedRetryException {
  CommandProcessorResponse cpr = runInternal(command, alreadyCompiled);
  ...

private CommandProcessorResponse runInternal(String command, boolean alreadyCompiled)
    throws CommandNeedRetryException {
  ...
  ret = compileInternal(command, true);
  ...
  ret = execute(true);
  ...

private int compileInternal(String command, boolean deferClose) {
  ...
  ret = compile(command, true, deferClose);
  ...

public int compile(String command, boolean resetTaskIds, boolean deferClose) {
  ...
  plan = new QueryPlan(queryStr, sem, perfLogger.getStartTime(PerfLogger.DRIVER_RUN), queryId,
      queryState.getHiveOperation(), schema);
  ...

public int execute(boolean deferClose) throws CommandNeedRetryException {
  ...
  // Add root Tasks to runnable
  for (Task<? extends Serializable> tsk : plan.getRootTasks()) {
    // This should never happen, if it does, it's a bug with the potential to produce
    // incorrect results.
    assert tsk.getParentTasks() == null || tsk.getParentTasks().isEmpty();
    driverCxt.addToRunnable(tsk);
  }
  ...
  // Loop while you either have tasks running, or tasks queued up
  while (driverCxt.isRunning()) {
    // Launch upto maxthreads tasks
    Task<? extends Serializable> task;
    while ((task = driverCxt.getRunnable(maxthreads)) != null) {
      TaskRunner runner = launchTask(task, queryId, noName, jobname, jobs, driverCxt);
      if (!runner.isRunning()) {
        break;
      }
    }
    // poll the Tasks to see which one completed
    TaskRunner tskRun = driverCxt.pollFinished();
    if (tskRun == null) {
      continue;
    }
    hookContext.addCompleteTask(tskRun);
    queryDisplay.setTaskResult(tskRun.getTask().getId(), tskRun.getTaskResult());
    Task<? extends Serializable> tsk = tskRun.getTask();
    TaskResult result = tskRun.getTaskResult();
    ...
    if (tsk.getChildTasks() != null) {
      for (Task<? extends Serializable> child : tsk.getChildTasks()) {
        if (DriverContext.isLaunchable(child)) {
          driverCxt.addToRunnable(child);
        }
      }
    }
  }

public boolean getResults(List res) throws IOException, CommandNeedRetryException {
  if (driverState == DriverState.DESTROYED || driverState == DriverState.CLOSED) {
    throw new IOException("FAILED: query has been cancelled, closed, or destroyed.");
  }
  if (isFetchingTable()) {
    /**
     * If resultset serialization to thrift object is enabled, and if the destination table is
     * indeed written using ThriftJDBCBinarySerDe, read one row from the output sequence file,
     * since it is a blob of row batches.
     */
    if (fetchTask.getWork().isUsingThriftJDBCBinarySerDe()) {
      maxRows = 1;
    }
    fetchTask.setMaxRows(maxRows);
    return fetchTask.fetch(res);
  }
  ...
Driver.run calls runInternal, which first calls compileInternal to compile the SQL and build a QueryPlan, and then calls execute to run all of the tasks in the QueryPlan;
Driver.getResults calls FetchTask's fetch to retrieve the results.
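The scheduling loop inside execute is, in essence, a generic DAG work queue: launch every task whose parents are done, wait for a completion, then promote its children. A distilled sketch, illustrative rather than Hive source (pendingParents is assumed to be set to each task's parent count when the DAG is built):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TaskDagSketch {
  static class Task {
    final String id;
    final List<Task> children = new ArrayList<>();
    int pendingParents;                       // cf. DriverContext.isLaunchable
    Task(String id) { this.id = id; }
  }

  static void execute(List<Task> rootTasks) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);   // cf. maxthreads
    CompletionService<Task> completed = new ExecutorCompletionService<>(pool);
    Deque<Task> runnable = new ArrayDeque<>(rootTasks);       // cf. addToRunnable
    int inFlight = 0;
    while (!runnable.isEmpty() || inFlight > 0) {             // cf. driverCxt.isRunning
      Task t;
      while ((t = runnable.poll()) != null) {                 // cf. launchTask
        final Task task = t;
        completed.submit(() -> { System.out.println("run " + task.id); return task; });
        inFlight++;
      }
      Task finished = completed.take().get();                 // cf. pollFinished
      inFlight--;
      for (Task child : finished.children) {                  // promote ready children
        if (--child.pendingParents == 0) {
          runnable.add(child);
        }
      }
    }
    pool.shutdown();
  }
}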
For the details of the Hive SQL parsing process, see:
