• 热门专题

Etl之HiveSql调优(left join)

作者:李秋  发布日期:2015-10-13 20:28:13
Tag标签:Etl之HiveSql调优(left  join)  
  • 一、前言

    公司实用Hadoop构建数据仓库,期间不可避免的实用HiveSql,在Etl过程中,速度成了避无可避的问题。本人有过几个数据表关联跑1个小时的经历,你可能觉得无所谓,可是多次Etl就要多个小时,非常浪费时间,所以HiveSql优化不可避免。

    注:本文只是从sql层面介绍一下日常需要注意的点,不涉及Hadoop、MapReduce等层面,关于Hive的编译过程,请参考文章:http://tech.meituan.com/hive-sql-to-mapreduce.html

    二、准备数据

    假设咱们有两张数据表。

    景区表:sight,12W条记录,数据表结构:

    hive> desc sight;
    OK
    area                    string                  None                
    city                    string                  None                
    country                 string                  None                
    county                  string                  None                
    id                      string                  None                
    name                    string                  None                
    region                  string                  None     

    景区订单明细表:order_sight,1040W条记录,数据表结构:

    hive> desc order_sight;    
    OK
    create_time             string                  None                
    id                      string                  None                
    order_id                string                  None                
    sight_id                bigint                  None  

     三、分析

    3.1 where条件

    那么咱们希望看见景区id是9718,日期是2015-10-10的所有订单id,那么sql需要如下书写:

    hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';         
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1434099279301_3562174, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562174/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3562174
    Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
    2015-10-12 22:58:22,706 Stage-1 map = 0%,  reduce = 0%
    2015-10-12 22:58:29,882 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.73 sec
    2015-10-12 22:58:30,907 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.73 sec
    2015-10-12 22:58:31,933 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:32,968 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:33,995 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:35,020 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:36,046 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:37,070 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:38,096 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:39,121 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:40,153 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
    2015-10-12 22:58:41,182 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
    2015-10-12 22:58:42,209 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
    2015-10-12 22:58:43,236 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
    2015-10-12 22:58:44,263 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
    2015-10-12 22:58:45,289 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
    2015-10-12 22:58:46,316 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
    2015-10-12 22:58:47,344 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
    2015-10-12 22:58:48,370 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
    2015-10-12 22:58:49,397 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
    2015-10-12 22:58:50,424 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
    2015-10-12 22:58:51,452 Stage-1 map = 83%,  reduce = 17%, Cumulative CPU 37.62 sec
    2015-10-12 22:58:52,478 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 38.06 sec
    2015-10-12 22:58:53,506 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 38.06 sec
    2015-10-12 22:58:54,534 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
    2015-10-12 22:58:55,560 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
    2015-10-12 22:58:56,587 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
    2015-10-12 22:58:57,615 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
    2015-10-12 22:58:58,642 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
    2015-10-12 22:58:59,674 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
    2015-10-12 22:59:00,708 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
    2015-10-12 22:59:01,736 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
    2015-10-12 22:59:02,763 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
    2015-10-12 22:59:03,791 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.41 sec
    2015-10-12 22:59:04,817 Stage-1 map = 96%,  reduce = 29%, Cumulative CPU 49.13 sec
    2015-10-12 22:59:05,843 Stage-1 map = 100%,  reduce = 29%, Cumulative CPU 49.59 sec
    2015-10-12 22:59:06,870 Stage-1 map = 100%,  reduce = 41%, Cumulative CPU 49.76 sec
    2015-10-12 22:59:07,897 Stage-1 map = 100%,  reduce = 41%, Cumulative CPU 49.76 sec
    2015-10-12 22:59:08,922 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 52.79 sec
    2015-10-12 22:59:09,947 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 52.79 sec
    MapReduce Total cumulative CPU time: 52 seconds 790 msec
    Ended Job = job_1434099279301_3562174
    MapReduce Jobs Launched: 
    Job 0: Map: 8  Reduce: 1   Cumulative CPU: 52.79 sec   HDFS Read: 371210469 HDFS Write: 330 SUCCESS
    Total MapReduce CPU Time Spent: 52 seconds 790 msec
    OK
    9718    210294949
    9718    210294421
    9718    210296438
    9718    210295344
    9718    210297567
    9718    210296076
    9718    210295525
    9718    210298219
    9718    210295840
    9718    210301363
    9718    210297733
    9718    210298066
    9718    210295239
    9718    210298328
    9718    210298008
    9718    210299712
    9718    210295586
    9718    210295050
    9718    210295566
    9718    210299105
    9718    210296318
    9718    210295277
    Time taken: 52.068 seconds, Fetched: 22 row(s)

     可见需要的时间是52秒,如果咱们换一个sql的书写方式:

    hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1434099279301_3562218, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562218/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3562218
    Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
    2015-10-12 23:03:54,926 Stage-1 map = 0%,  reduce = 0%
    2015-10-12 23:04:01,075 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
    2015-10-12 23:04:02,101 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
    2015-10-12 23:04:03,126 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
    2015-10-12 23:04:04,151 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 5.53 sec
    2015-10-12 23:04:05,176 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 5.53 sec
    2015-10-12 23:04:06,201 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.62 sec
    2015-10-12 23:04:07,226 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
    2015-10-12 23:04:08,250 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
    2015-10-12 23:04:09,275 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
    2015-10-12 23:04:10,300 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
    2015-10-12 23:04:11,324 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
    2015-10-12 23:04:12,356 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
    2015-10-12 23:04:13,384 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
    2015-10-12 23:04:14,410 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
    2015-10-12 23:04:15,437 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
    2015-10-12 23:04:16,463 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
    2015-10-12 23:04:17,487 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
    2015-10-12 23:04:18,514 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
    2015-10-12 23:04:19,538 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
    2015-10-12 23:04:20,569 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
    2015-10-12 23:04:21,595 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
    2015-10-12 23:04:22,620 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
    2015-10-12 23:04:23,646 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
    2015-10-12 23:04:24,673 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
    2015-10-12 23:04:25,698 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
    2015-10-12 23:04:26,723 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
    2015-10-12 23:04:27,748 Stage-1 map = 75%,  reduce = 21%, Cumulative CPU 23.32 sec
    2015-10-12 23:04:28,774 Stage-1 map = 88%,  reduce = 21%, Cumulative CPU 27.27 sec
    2015-10-12 23:04:29,799 Stage-1 map = 100%,  reduce = 21%, Cumulative CPU 32.82 sec
    2015-10-12 23:04:30,823 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 34.35 sec
    2015-10-12 23:04:31,846 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 34.35 sec
    MapReduce Total cumulative CPU time: 34 seconds 350 msec
    Ended Job = job_1434099279301_3562218
    MapReduce Jobs Launched: 
    Job 0: Map: 8  Reduce: 1   Cumulative CPU: 34.35 sec   HDFS Read: 371210469 HDFS Write: 330 SUCCESS
    Total MapReduce CPU Time Spent: 34 seconds 350 msec
    OK
    9718    210297733
    9718    210298066
    9718    210295239
    9718    210298328
    9718    210298008
    9718    210299712
    9718    210297567
    9718    210296076
    9718    210295525
    9718    210298219
    9718    210295840
    9718    210301363
    9718    210295586
    9718    210295050
    9718    210295566
    9718    210299105
    9718    210296318
    9718    210295277
    9718    210294949
    9718    210294421
    9718    210296438
    9718    210295344
    Time taken: 43.709 seconds, Fetched: 22 row(s)

    实用43秒,快了一些。当然咱们并不是仅仅分析说快了20%(我还多次测试,这次的差距最小),而是分析原因!

    单从两个sql的写法上看的出来,特别是第二条的红色部分,我将left的条件写到里面了。那么执行的结果随之不一样,第二条的Reduce时间明显小于第一条的Reduce时间。

    原因是这两个sql都分解成8个Map任务和1个Reduce任务,如果left的条件写在后面,那么这些关联操作会放在Reduce阶段,1个Reduce操作的时间必然大于8个Map的执行时间,造成执行时间超长。

    结论:当使用外关联时,如果将副表的过滤条件写在Where后面,那么就会先全表关联,之后再过滤

延伸阅读:

About IT165 - 广告服务 - 隐私声明 - 版权申明 - 免责条款 - 网站地图 - 网友投稿 - 联系方式
本站内容来自于互联网,仅供用于网络技术学习,学习中请遵循相关法律法规