Hive执行计划之什么是hiveSQL向量化模式及优化详解

2023-06-09 20:14 由鲁边发表于 #数据库

Hive开启向量化模式也是hiveSQL优化方法中的一种，可以提升hive查询速率，也叫hive矢量化。

问题1：那么什么是hive向量化模式呢？

问题2：hive向量化什么情况下可以被使用，或者说它有哪些使用场景呢？

问题3：如何查看hive向量化使用的相关信息？

1.什么是hive向量化模式

hive向量化模式是hive的一个特性，也叫hive矢量化，在没有引入向量化的执行模式之前，一般的查询操作一次只处理一行数据，在向量化查询执行时一次处理1024行的块来简化系统底层的操作，提高了数据处理的性能。

在底层，hive提供的向量模式，并不是重写了Mapper函数，而是通过实现inputformat接口，创建了VectorizedParquetInputFormat类，来构建一个批量输入的数组。

向量化模式开启的方式如下：

-- 开启hive向量化模式
set hive.vectorized.execution.enabled = true;

2.Hive向量化模式支持的使用场景

Hive向量化模式并不是可以直接使用，它对使用的计算引擎，使用数据的数据类型，以及使用的SQL函数都有一定的要求。

2.1 hive向量化模式使用前置条件

不同的计算引擎支持程度不一样：MR计算引擎仅支持Map阶段的向量化，Tez和Spark计算引擎可以支持Map阶段和Reduce阶段的向量化。
hive文件存储类型必须为ORC或者Parquet等列存储文件类型。

2.2 向量模式支持的数据类型

tinyint
smallint
int
bigint
boolean
float
double
decimal
date
timestamp
string

以上数据类型为向量化模式支持的数据类型，如果使用其他数据类型，例如array和map等，开启了向量化模式查询，查询操作将使用标准的方式单行执行，但不会报错。

2.3 向量化模式支持的函数

算数表达式： +, -, *, /, %
逻辑关系：AND, OR, NOT
比较关系（过滤器）： <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters
使用 AND, OR, NOT, <, >, <=, >=, =, != 等布尔值表达式（非过滤器）
空值校验：IS [NOT] NULL
所有的数学函数，例如 SIN, LOG等
字符串函数： SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH
类型转换：cast
Hive UDF函数, 包括标准和通用的UDF函数
日期函数：YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP
IF条件表达式

以上函数表达式在运行时支持使用向量化模式。

3.如何查看hiveSQL向量化运行信息

查看hive向量化信息是前置的，可以通过执行计划命令explain vectorization查看向量化描述信息。当然，执行中，也可以通过日志了解向量化执行信息，但相对筛选关键信息比较复杂。

explain vectorization是在hive2.3.0版本之后发布的功能，可以查看map阶段和reduce阶段为什么没有开启矢量化模式，类似调试功能。

explain vectorization支持的语法：explain vectorization [only] [summary|operator|expression|detail]

explain vectorization：不带后置参数，显示执行计划的向量化信息（启用向量化）以及 Map 和 Reduce 阶段的摘要。
only：这个命令只显示向量化模式相关的描述信息，这个参数和后面的其他参数是可以一起使用的，与它相对的是explain vectorization。
summary：这是个默认参数，任何命令后面默认有该参数。
operator：补充显示运算符的向量化信息。例如数据过滤向量化。还包括summary的所有信息。
expression：补充显示表达式的向量化信息。例如谓词表达式。还包括 summary 和 operator 的所有信息。
detail：显示最详细级别的向量化信息。它包括summary、operator、expression的所有信息。

接下来我们通过实例来查看以上命令的展示内容：

3.1 explain vectorization only只查询向量化描述信息内容

例1 关闭向量化模式的情况下，使用explain vectorization only。

-- 关闭向量化模式
set hive.vectorized.execution.enabled = false;
explain vectorization only
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

执行结果：

PLAN VECTORIZATION:
  enabled: false		#标识向量化模式没有开启
  enabledConditionsNotMet: [hive.vectorized.execution.enabled IS false]  #未开启原因

如上，如果关闭向量化模式，输出结果中PLAN VECTORIZATION 这里可以看到该模式没有被开启，原因是没有满足enabledConditionsNotMet 指代的条件。

例2 开启向量化模式的情况下，使用explain vectorization only。

-- 开启向量化模式
set hive.vectorized.execution.enabled = true;
explain vectorization only
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

执行结果：

PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Execution mode: vectorized
      Map Vectorization:
          enabled: true
          enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
          groupByVectorOutput: true
          inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          allNative: false
          usesVectorUDFAdaptor: false
          vectorized: true
      Reduce Vectorization:
          enabled: false
          enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
          enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false

  Stage: Stage-0
    Fetch Operator

执行结果有三部分内容：

PLAN VECTORIZATION
STAGE DEPENDENCIES
STAGE PLANS

其中STAGE PLANS打印的并不是explain中map和reduce阶段的运行信息，而是这两个阶段使用向量化模式的信息。

对以上案例内容进行关键词解读：

Execution mode：当前的执行模式，vectorized表示当前模式是向量化的模式。
Map Vectorization：当前是map阶段的向量化执行模式信息。
enabled：是否开启该阶段向量化模式，true表示开启，false表示关闭。在上面案例中Map Vectorization阶段是开启，Reduce Vectorization阶段是关闭。
enabledConditionsMet：表示当前阶段，开启向量化模式已经满足的条件。
enableConditionsNotMet：表示当前阶段，开启向量化模式未满足的条件。
groupByVectorOutput：标识该阶段分组聚合操作是否开启向量化模式。
inputFileFormats：当前阶段，输入的文件格式。
allNative：是否都是本地化操作，false表示不是。
usesVectorUDFAdaptor：值为true时，表示至少有一个向量化表达式在使用VectorUDFAdaptor（向量化udf适配器）
vectorized：向量化模式执行是否成功，true为是向量化执行，false为不是向量化执行。
Reduce Vectorization：reduce阶段向量化模式执行信息。

以上整个过程在map阶段执行了向量化模式，在reduce阶段没有执行向量化模式，是因为上文提到的reduce阶段mr计算引擎不支持，需要tez或spark计算引擎。

3.2 explain vectorization 查看hive向量化模式执行信息

可以执行以下命令：

-- 开启向量化模式
set hive.vectorized.execution.enabled = true;
explain vectorization only summary
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

会发现explain vectorization only命令和explain vectorization only summary命令执行结果完全一致。

后续其他命令也类似，explain vectorization等同于explain vectorization summary，summary参数是一个默认参数，可以忽略。

例3 使用explain vectorization命令查看hive向量化模式执行信息。

-- 开启向量化模式
set hive.vectorized.execution.enabled = true;
explain vectorization
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

其执行结果是explain和explain vectorization only两者相加执行结果：

PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: user_info_all
            Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((age < 30) and (nick like '%小%')) (type: boolean)
              Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
              Select Operator ... 	#省略部分
      # 向量化模式描述信息
      Execution mode: vectorized
      Map Vectorization:
          enabled: true
          enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
          groupByVectorOutput: true
          inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          allNative: false
          usesVectorUDFAdaptor: false
          vectorized: true
      Reduce Vectorization:
          enabled: false
          enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
          enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          ...  	#省略部分

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

... 为省略了一部分信息。

3.3 使用operator查看运算符的向量化信息

使用explain vectorization operator可以查看显示执行计划过程中运算符的向量化信息和explain运行阶段信息。

简化版为explain vectorization only operator，加only相对前者少的部分为explain运行阶段信息，下同。explain运行阶段信息我们就不查询了，感兴趣小伙伴可以自行查询查看。

例4 简化版为explain vectorization only operator查看hiveSQL矢量化描述信息。

set hive.vectorized.execution.enabled = true;
explain vectorization only operator
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

执行结果：

PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
      			# 表扫描的向量化信息
            TableScan Vectorization:
            		# 读表采用本地的向量化模式扫描
                native: true
              # 过滤操作的向量化信息
              Filter Vectorization:
              		# 过滤操作的类
                  className: VectorFilterOperator
                  # 过滤采用本地的向量化模式
                  native: true
                # 列筛选的向量化信息
                Select Vectorization:
                    className: VectorSelectOperator
                    native: true
                  # 聚合操作的向量化信息
                  Group By Vectorization:
                      className: VectorGroupByOperator
                      # 输出采用向量化输出
                      vectorOutput: true
                      #非本地操作
                      native: false
                    # reduce output向量化信息
                    Reduce Sink Vectorization:
                        className: VectorReduceSinkOperator
                        native: false
                        # 已满足的Reduce Sink向量化条件
                        nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                        # 不满足的Reduce Sink向量化条件
                        nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false
      # 向量化描述信息，同explain vectorization only，不作标注了。
      Execution mode: vectorized
      Map Vectorization:
          enabled: true
          enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
          groupByVectorOutput: true
          inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          allNative: false
          usesVectorUDFAdaptor: false
          vectorized: true
      Reduce Vectorization:
          enabled: false
          enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
          enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
      Reduce Operator Tree:
          Group By Vectorization:
              vectorOutput: false
              native: false

  Stage: Stage-0
    Fetch Operator

以上内容关键词在代码块有行注释标注，可以看到explain vectorization only operator命令多了在explain执行计划过程中增加了具体每一个运算符（operator）步骤的是否向量化及具体信息。如果不满足向量化步骤，哪些条件满足，哪些条件不满足，也做了标注。

3.4 使用expression显示字段粒度的向量化信息

expression：补充显示表达式的向量化信息，例如谓词表达式。还包括 summary 和 operator 的所有信息。

例5 简化版explain vectorization only expression命令查看hiveSQL执行计划表达式的向量化信息。

set hive.vectorized.execution.enabled = true;
explain vectorization only expression
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

执行结果：

# 同explain vectorization
PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

# 同explain vectorization
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
            TableScan Vectorization:
                native: true
                # 表示表扫描后有25列。
                projectedOutputColumns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
              Filter Vectorization:
                  className: VectorFilterOperator
                  native: true
                  # 表示谓词过滤少选有两列，以及过滤条件的内容。
                  predicateExpression: FilterExprAndExpr(children: FilterLongColLessLongScalar(col 11, val 30), FilterStringColLikeStringScalar(col 7, pattern %小%))
                Select Vectorization:
                    className: VectorSelectOperator
                    native: true
                    # 表示进行列筛选的具体列，这里是第12列，数组下标为11.如果为空[],则表示任何一个列。
                    projectedOutputColumns: [11]
                  Group By Vectorization:
                  		# 表示使用VectorUDAFCount的方法进行count计数统计以及输出类型。
                      aggregators: VectorUDAFCount(ConstantVectorExpression(val 0) -> 25:int) -> bigint
                      className: VectorGroupByOperator
                      vectorOutput: true
                      # 聚合列
                      keyExpressions: col 11
                      native: false
                      # 输出为一个新的数组，只有一列
                      projectedOutputColumns: [0]
                    Reduce Sink Vectorization:
                        className: VectorReduceSinkOperator
                        native: false
                        nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                        nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false
      # 向量化描述信息，同explain vectorization only，不作标注了。
      Execution mode: vectorized
      Map Vectorization:
          enabled: true
          enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
          groupByVectorOutput: true
          inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          allNative: false
          usesVectorUDFAdaptor: false
          vectorized: true
      Reduce Vectorization:
          enabled: false
          enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
          enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
      Reduce Operator Tree:
          Group By Vectorization:
              vectorOutput: false
              native: false
              projectedOutputColumns: null

  Stage: Stage-0
    Fetch Operator

以上打印信息内容可以看出 explain vectorization only expression命令相对打印的信息是更细粒度到字段级别的信息了。基本上将操作的每一列是否使用向量化处理都打印了出来，这样我们可以很好的判断哪些字段类型是不支持向量化模式的。

3.5 使用detail查看最详细级别的向量化信息

explain vectorization only detail 查看最详细级别的向量化信息。它包括summary、operator、expression的所有信息。

例6 explain vectorization only detail 查看最详细级别的向量化信息。

set hive.vectorized.execution.enabled = true;
explain vectorization only detail
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

执行结果：

PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

# 同explain vectorization only expression
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
            TableScan Vectorization:
                native: true
                projectedOutputColumns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
              Filter Vectorization:
                  className: VectorFilterOperator
                  native: true
                  predicateExpression: FilterExprAndExpr(children: FilterLongColLessLongScalar(col 11, val 30), FilterStringColLikeStringScalar(col 7, pattern %小%))
                Select Vectorization:
                    className: VectorSelectOperator
                    native: true
                    projectedOutputColumns: [11]
                  Group By Vectorization:
                      aggregators: VectorUDAFCount(ConstantVectorExpression(val 0) -> 25:int) -> bigint
                      className: VectorGroupByOperator
                      vectorOutput: true
                      keyExpressions: col 11
                      native: false
                      projectedOutputColumns: [0]
                    Reduce Sink Vectorization:
                        className: VectorReduceSinkOperator
                        native: false
                        nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                        nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false
      # 向量化描述信息这里做了更详细的描述
      Execution mode: vectorized
      Map Vectorization:
          enabled: true
          enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
          groupByVectorOutput: true
          inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          allNative: false
          usesVectorUDFAdaptor: false
          vectorized: true
          rowBatchContext:
              dataColumnCount: 24
              includeColumns: [7, 11]
              dataColumns: uid:bigint, reg_time:string, cc:string, client:string, if_new:int, last_location:string, platform_reg:string, nick:string, gender:int, birthday:string, constellation:string, age:bigint, description:string, is_realname:int, realname_date:string, last_active_day:string, is_active:int, user_status:int, user_ua:string, vst_cnt:bigint, vst_dur:bigint, is_vip:int, chat_uv:bigint, chat_cnt:bigint
              partitionColumnCount: 1
              partitionColumns: ymd:string
              scratchColumnTypeNames: bigint
      Reduce Vectorization:
          enabled: false
          enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
          enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
      Reduce Operator Tree:
          Group By Vectorization:
              vectorOutput: false
              native: false
              projectedOutputColumns: null

  Stage: Stage-0
    Fetch Operator

通过以上内容可以看出 explain vectorization only detail打印的信息其中执行计划部分内容和explain vectorization only expression粒度一致，在向量化描述信息部分做了更细粒度的描述，到字段级别。

以上就是hive向量化explain vectorization相关参数的使用，其命令在我们使用向量化模式中进行验证支持的函数和数据类型逐步递进，可以根据需要使用。

而hive向量化模式可以极大程度的优化hive执行速度。

4.hive向量化模式优化执行比对

例7 执行优化速度比对。

-- 代码1 开启向量化模式
set hive.vectorized.execution.enabled = true;
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

-- 代码2 关闭向量化模式
set hive.vectorized.execution.enabled = false;
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;

执行结果：

# 代码1执行结果开启向量化模式
MapReduce Total cumulative CPU time: 1 minutes 1 seconds 740 msec
Ended Job = job_1675664438694_13647623
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 6  Reduce: 5   Cumulative CPU: 61.74 sec   HDFS Read: 367242142 HDFS Write: 1272 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 1 seconds 740 msec
OK
15      23
... # 省略数据
29      81849
Time taken: 41.322 seconds, Fetched: 31 row(s)

# 代码2执行结果关闭向量化模式
MapReduce Total cumulative CPU time: 1 minutes 39 seconds 190 msec
Ended Job = job_1675664438694_13647754
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 6  Reduce: 5   Cumulative CPU: 99.19 sec   HDFS Read: 367226626 HDFS Write: 1272 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 39 seconds 190 msec
OK
15      23
... # 省略数据
29      81849
Time taken: 50.724 seconds, Fetched: 31 row(s)

以上结果可以看出，开启向量化模式执行结果查询耗时减少，虽然减少的不多，但在CPU使用上少了三分之一的资源。可见开启向量化模式不仅可以提高查询速度，还可以节省查询资源。

以上开启向量化模式为mr引擎测试结果，tez和spark还具有更优的执行表现。

下一期：Hive执行计划之只有map阶段SQL性能分析和解读

按例，欢迎点击此处关注我的个人公众号，交流更多知识。

后台回复关键字 hive，随机赠送一本鲁边备注版珍藏大数据书籍。