内容纲要
概要描述
Analyze,分析表(也称为计算统计信息)是一种内置的Hive操作,可以执行该操作来收集表上的元数据信息。这可以极大的改善表上的查询时间,因为它收集构成表中数据的行计数,文件计数和文件大小(字节),并在执行之前将其提供给查询计划程序。
表支持情况:
- txt
- csv
- orc
- torc
hyperbasehyperdriveesholodesk
详细说明
- 环境准备:TDH6.2.1
1. 对非分区表的统计
> ANALYZE TABLE people_torc_7_buckets COMPUTE STATISTICS;
表统计信息更新入TABLE_PARAMS表:
> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_torc_7_buckets');
> ANALYZE TABLE people_torc_7_buckets COMPUTE STATISTICS FOR COLUMNS borndate,nation;
列统计信息更新入:
> select * from metastore_inceptor1.TAB_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_torc_7_buckets';
2. 对分区表的统计
2.1 对单值分区表的统计
注意:目前单值partition表只支持string类型partition字段的分区表。
2.1.1 单个partition字段
2.1.1.1 全表分析
> ANALYZE TABLE people_single_partition COMPUTE STATISTICS;
> ANALYZE TABLE people_single_partition COMPUTE STATISTICS FOR COLUMNS;
表统计信息更新入TABLE_PARAMS表:
select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS@system_dblink t
where t.TBL_ID in (select TBLS.TBL_ID from TBLS@system_dblink where TBLS.TBL_NAME='people_single_partition');
select * from partition_PARAMS@system_dblink t WHERE PART_ID
IN
(SELECT partition_id FROM system.partitions_v
WHERE table_name='people_single_partition');
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_single_partition';
2.1.1.2 如果指定了某个partition列,则分析该列下所有的partition
> ANALYZE TABLE people_single_partition PARTITION(borndate) COMPUTE STATISTICS;
ANALYZE TABLE people_single_partition PARTITION(borndate) COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_single_partition';
2.1.1.3 对于特定分区的analyze,对应的partition value 在partition spec中指定,示例如下
> ANALYZE TABLE people_single_partition PARTITION(borndate=19950101) COMPUTE STATISTICS;
> ANALYZE TABLE people_single_partition PARTITION(borndate=19950101) COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_single_partition';
2.1.2 多个partition字段
2.1.2.1 全表分析:
> ANALYZE TABLE people_double_partition COMPUTE STATISTICS;
> ANALYZE TABLE people_double_partition COMPUTE STATISTICS FOR COLUMNS;
表统计信息更新入TABLE_PARAMS表:
> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_double_partition');
select * from partition_PARAMS@system_dblink t WHERE PART_ID
IN
(SELECT partition_id FROM system.partitions_v
WHERE table_name='people_double_partition');
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_double_partition';
2.1.2.2 可以不指定partition value,分析全列
> ANALYZE TABLE people_double_partition PARTITION(borndate,sex) COMPUTE STATISTICS;
> ANALYZE TABLE people_double_partition PARTITION(borndate,sex) COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_double_partition';
2.1.2.3 对于特定分区的analyze,对应的partition value 在partition spec中指定,示例如下
> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex='女') COMPUTE STATISTICS;
> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex='女') COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_double_partition';
2.1.2.4 也可以支持以下格式,表示match第一个partition column的所有的partition
> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex) COMPUTE STATISTICS;
> ANALYZE TABLE people_double_partition PARTITION(borndate='19950101',sex) COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_double_partition';
2.2 对范围分区表的统计
Range partition 表ANALYZE 没有分区字段类型限制。支持常用的date, string, int等分区类型。
2.2.1 单个partition字段
2.2.1.1 全表分析:
> ANALYZE TABLE people_range_partition COMPUTE STATISTICS;
表统计信息更新入TABLE_PARAMS表:
> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_range_partition');
select * from partition_PARAMS@system_dblink t WHERE PART_ID
IN
(SELECT partition_id FROM system.partitions_v
WHERE table_name='people_double_partition');
> ANALYZE TABLE people_range_partition COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS where DB_NAME='default' and TABLE_NAME='people_range_partition';
2.2.2.2 对于特定分区的analyze,相应的partition name在partition spec中指定,示例如下
<5.0版本:
> ANALYZE TABLE people_range_partition PARTITION(borndate='before2010') COMPUTE STATISTICS;
>=5.0版本:
> ANALYZE TABLE people_range_partition PARTITION before2010 COMPUTE STATISTICS;
<5.0版本:
> ANALYZE TABLE people_range_partition PARTITION(borndate='before2010') COMPUTE STATISTICS FOR COLUMNS;
>=5.0版本:
> ANALYZE TABLE people_range_partition PARTITION before2010 COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_range_partition';
2.2.2.3 多个partition字段
–分析之后判断可以根据已有的borndate,和phonenumber作为多range partition 分区字段
<5.0版本:
> ANALYZE TABLE people_double_range_partition PARTITION(borndate='2000-12-31',phonenumber='160000002322') COMPUTE STATISTICS;
>=5.0版本:
ANALYZE TABLE people_double_range_partition PARTITION before2000_5_6 COMPUTE STATISTICS;
<5.0版本:
ANALYZE TABLE people_double_range_partition PARTITION(borndate='2000-12-31',phonenumber='160000002322') COMPUTE STATISTICS FOR COLUMNS;
>=5.0版本:
> ANALYZE TABLE people_double_range_partition PARTITION before2000_5_6 COMPUTE STATISTICS FOR COLUMNS;
列统计信息更新入part_col_stats(实际没有查到????):
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_double_range_partition';
2.3 对二级分区表的统计
> ANALYZE TABLE PEOPLE_DOUBLE_PARTITION COMPUTE STATISTICS;
表统计信息更新入TABLE_PARAMS表:
> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_double_partition');
> ANALYZE TABLE PEOPLE_DOUBLE_PARTITION PARTITION (borndate='19900101',sex='女') COMPUTE STATISTICS FOR COLUMNS ;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and upper(TABLE_NAME)='PEOPLE_DOUBLE_PARTITION';
2.4 对interval分区表的统计
> ANALYZE TABLE people_interval_partition COMPUTE STATISTICS;
表统计信息更新入TABLE_PARAMS表:
> select t.PARAM_KEY,t.PARAM_VALUE from TABLE_PARAMS t
> where t.TBL_ID in (select TBLS.TBL_ID from TBLS where TBLS.TBL_NAME='people_interval_partition');
> ANALYZE TABLE people_interval_partition PARTITION p2 COMPUTE STATISTICS;
> ANALYZE TABLE people_interval_partition PARTITION p3 COMPUTE STATISTICS FOR COLUMNS ;
列统计信息更新入part_col_stats:
> select * from metastore_inceptor1.PART_COL_STATS
> where DB_NAME='default' and TABLE_NAME='people_interval_partition';