关联字段 null 值造成部分 task 过慢

  性能相关
内容纲要

概要描述


本文主要介绍在多表关联的场景下,如何避免因为主表 null 值占比过多而导致的一个 reduce 任务过慢的问题

详细说明


Inceptor基本优化思路中,我们在做 join 关联操作的时候,尽量避免关联字段有 null 值和空字符串,会导致部分 task 执行过慢。
下面这条sql运行时间较长,需要6000s,是不正常的

create table temp.day_user_play_uuid_split_20180226_test1 as
select 
t.datess,
t.device_id,
t.version_id,
t.province_id,
t.city_id,
t.telecom_id,
t.terminal_id,
t.ip,
case
when play_type_id='02' then 
case when t.uuid is not null and t.uuid !='' and t.uuid !='0' then t.uuid when t2.uuid is not null then t2.uuid when t3.uuid is not null then t3.uuid
else 'unknow' end
when play_type_id='03' then
case when t.uuid is not null and t.uuid !='' and t.uuid !='0' then t.uuid when t1.premiere_channel_uuid is not null then t1.premiere_channel_uuid else 'unknow' end
else t.uuid end,
t.play_type_id,
t.play_type,
t.program_id,
t.program_serise_id,
t.client_time,
t.server_time,
t.open_time,
t.play_time,
t.buffer_count,
t.buffer_aver_time,
t.url_first,
t.url,
t.terminal_type
from
    (select * from dm_bas.day_user_play_uuid where play_type<>'vod_replay') t
left join 
         dim.cms_program t1 on t.program_id=t1.program_id 
left join 
         dim.cms_playbill_channel t2 on t.program_id=t2.channel_id 
left join 
         dim.cms_history_playbill_channel t3 on t.program_id=t3.playbill_id ;

排查思路


查看主表关联字段分布,可以看出空值占总量的66.6%的百分比,这导致大部分计算分配到一个reduce导致整个任务计算缓慢,体现为reduce进度长时间处在99%。

--总数据量
> select count(1) as num from dm_bas.day_user_play_uuid where  play_type<>'vod_replay';
390270887
--关联字段分布情况
> select program_id,count(1) as num from dm_bas.day_user_play_uuid where  play_type<>'vod_replay' group by program_id order by num desc limit 6;
            259816668
0           46610174
1           1284378
70564740    357546
70564743    251696
70564694    212326

解决方案


关联前将左表为空关联字段设置为一个随机数,再去关联右表,把倾斜的数据分到不同的reduce上,由于null值关联不上,处理后并不影响最终结果

create table temp.day_user_play_uuid_split_20180226_test3 as
select 
t.datess,
t.device_id,
t.version_id,
t.province_id,
t.city_id,
t.telecom_id,
t.terminal_id,
t.ip,
case
when play_type_id='02' then 
case when t.uuid is not null and t.uuid !='' and t.uuid !='0' then t.uuid when t2.uuid is not null then t2.uuid when t3.uuid is not null then t3.uuid
else 'unknow' end
when play_type_id='03' then
case when t.uuid is not null and t.uuid !='' and t.uuid !='0' then t.uuid when t1.premiere_channel_uuid is not null then t1.premiere_channel_uuid else 'unknow' end
else t.uuid end,
t.play_type_id,
t.play_type,
t.program_id,
t.program_serise_id,
t.client_time,
t.server_time,
t.open_time,
t.play_time,
t.buffer_count,
t.buffer_aver_time,
t.url_first,
t.url,
t.terminal_type 
from 
   (select 
   datess,device_id,version_id,province_id,city_id,telecom_id,terminal_id,ip,uuid,play_type_id,play_type,
   case when program_id is not null and trim(program_id) != '' then program_id else cast(RAND()*100000 as int) end as program_id,
   program_serise_id,client_time,server_time,open_time,play_time,buffer_count,
   buffer_aver_time,url_first,url,terminal_type
   from dm_bas.day_user_play_uuid where play_type<>'vod_replay') t 
left join 
        dim.cms_program t1 on t.program_id=t1.program_id 
left join 
        dim.cms_playbill_channel t2 on t.program_id=t2.channel_id 
left join 
        dim.cms_history_playbill_channel t3 on t.program_id=t3.playbill_id ;

case when program_id is not null and trim(program_id) != '' then program_id else cast(RAND()*100000 as int) end as program_id,
把program_id为null和空字符串的转换为随机数即可

这篇文章对您有帮助吗?

平均评分 0 / 5. 次数: 0

尚无评价,您可以第一个评哦!

非常抱歉,这篇文章对您没有帮助.

烦请您告诉我们您的建议与意见,以便我们改进,谢谢您。