Analyze 如何收集统计信息

  使用配置, 性能相关
内容纲要

概要描述


Analyze,分析表(也称为计算统计信息)是一种内置的Hive操作,可以执行该操作来收集表上的元数据信息。这可以极大的改善表上的查询时间,因为它收集构成表中数据的行计数,文件计数和文件大小(字节),并在执行之前将其提供给查询计划程序。

表支持情况:

  • txt
  • csv
  • orc
  • torc
  • hyperbase
  • hyperdrive
  • es
  • holodesk

详细说明


  • 环境准备:TDH6.2.1

1. 对非分区表的统计

> ANALYZE TABLE people_torc_7_buckets COMPUTE STATISTICS;

file

表统计信息更新入TABLE_PARAMS表:

> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t 
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_torc_7_buckets');
> ANALYZE TABLE people_torc_7_buckets COMPUTE STATISTICS FOR COLUMNS borndate,nation;

列统计信息更新入:

> select * from metastore_inceptor1.TAB_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_torc_7_buckets';

2. 对分区表的统计

2.1 对单值分区表的统计

注意:目前单值partition表只支持string类型partition字段的分区表。
2.1.1 单个partition字段
2.1.1.1 全表分析
> ANALYZE TABLE people_single_partition COMPUTE STATISTICS;

file

> ANALYZE TABLE people_single_partition COMPUTE STATISTICS FOR COLUMNS;

表统计信息更新入TABLE_PARAMS表:

> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t 
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_single_partition');

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_single_partition';
2.1.1.2 如果指定了某个partition列,则分析该列下所有的partition
> ANALYZE TABLE people_single_partition PARTITION(borndate) COMPUTE STATISTICS;

file

ANALYZE TABLE people_single_partition PARTITION(borndate) COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_single_partition';
2.1.1.3 对于特定分区的analyze,对应的partition value 在partition spec中指定,示例如下
> ANALYZE TABLE people_single_partition PARTITION(borndate=19950101) COMPUTE STATISTICS;

file

> ANALYZE TABLE people_single_partition PARTITION(borndate=19950101) COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_single_partition';
2.1.2 多个partition字段
2.1.2.1 全表分析:
> ANALYZE TABLE people_double_partition COMPUTE STATISTICS;

file

> ANALYZE TABLE people_double_partition COMPUTE STATISTICS FOR COLUMNS;

表统计信息更新入TABLE_PARAMS表:

> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t 
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_double_partition');

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_double_partition';
2.1.2.2 可以不指定partition value,分析全列
> ANALYZE TABLE people_double_partition PARTITION(borndate,sex) COMPUTE STATISTICS;

file

> ANALYZE TABLE people_double_partition PARTITION(borndate,sex) COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_double_partition';
2.1.2.3 对于特定分区的analyze,对应的partition value 在partition spec中指定,示例如下
> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex='女') COMPUTE STATISTICS;

file

> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex='女') COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_double_partition';
2.1.2.4 也可以支持以下格式,表示match第一个partition column的所有的partition
> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex) COMPUTE STATISTICS;

file

> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex) COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_double_partition';

2.2 对范围分区表的统计

Range partition 表ANALYZE 没有分区字段类型限制。支持常用的date, string, int等分区类型。

2.2.1 单个partition字段
2.2.1.1 全表分析:
> ANALYZE TABLE people_range_partition COMPUTE STATISTICS;

file
表统计信息更新入TABLE_PARAMS表:

> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t 
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_range_partition');
> ANALYZE TABLE people_range_partition COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_range_partition';
2.2.2.2 对于特定分区的analyze,相应的partition name在partition spec中指定,示例如下

<5.0版本:

> ANALYZE TABLE people_range_partition PARTITION(borndate='before2010') COMPUTE STATISTICS;

>=5.0版本:

> ANALYZE TABLE people_range_partition PARTITION before2010 COMPUTE STATISTICS;

file
<5.0版本:

> ANALYZE TABLE people_range_partition PARTITION(borndate='before2010') COMPUTE STATISTICS FOR COLUMNS;

>=5.0版本:

> ANALYZE TABLE people_range_partition PARTITION before2010 COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_range_partition';
2.2.2.3 多个partition字段

–分析之后判断可以根据已有的borndate,和phonenumber作为多range partition 分区字段
<5.0版本:

> ANALYZE  TABLE people_double_range_partition PARTITION(borndate='2000-12-31',phonenumber='160000002322') COMPUTE STATISTICS;

>=5.0版本:

ANALYZE TABLE people_double_range_partition PARTITION before2000_5_6 COMPUTE STATISTICS;

file
<5.0版本:

ANALYZE  TABLE people_double_range_partition PARTITION(borndate='2000-12-31',phonenumber='160000002322') COMPUTE STATISTICS FOR COLUMNS;

>=5.0版本:

> ANALYZE TABLE people_double_range_partition PARTITION before2000_5_6 COMPUTE STATISTICS FOR COLUMNS;

列统计信息更新入part_col_stats(实际没有查到????):

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_double_range_partition';

2.3 对二级分区表的统计

> ANALYZE TABLE PEOPLE_DOUBLE_PARTITION  COMPUTE STATISTICS;

file
表统计信息更新入TABLE_PARAMS表:

> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_double_partition');
> ANALYZE TABLE PEOPLE_DOUBLE_PARTITION PARTITION (borndate='19900101',sex='女') COMPUTE STATISTICS FOR COLUMNS ;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and upper(TABLE_NAME)='PEOPLE_DOUBLE_PARTITION';

2.4 对interval分区表的统计

> ANALYZE TABLE people_interval_partition  COMPUTE STATISTICS;

file
表统计信息更新入TABLE_PARAMS表:

> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t 
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_interval_partition');
> ANALYZE TABLE people_interval_partition PARTITION p2 COMPUTE STATISTICS;

file

> ANALYZE TABLE people_interval_partition PARTITION p3 COMPUTE STATISTICS FOR COLUMNS ;

列统计信息更新入part_col_stats:

> select * from metastore_inceptor1.PART_COL_STATS 
> where DB_NAME='default' and TABLE_NAME='people_interval_partition';

这篇文章对您有帮助吗?

平均评分 0 / 5. 次数: 0

尚无评价,您可以第一个评哦!

非常抱歉,这篇文章对您没有帮助.

烦请您告诉我们您的建议与意见,以便我们改进,谢谢您。