ITPub博客

首页 > 数据库 > PostgreSQL > PostgreSQL DBA(34) - (HashAggregate vs GroupAggregate)

PostgreSQL DBA(34) - (HashAggregate vs GroupAggregate)

原创 PostgreSQL 作者:husthxd 时间:2019-04-24 14:54:05 0 删除 编辑

PostgreSQL数据库中,聚合函数如max/min/count等有两种实现方式:HashAggregate与GroupAggregate.本节简单介绍这两者的区别.
案例一

首先我们看一个案例:
测试表:


drop table  if exists t_agg;
create table t_agg(bh varchar(20),c1 int,c2 int,c3 int,c4 int,c5 int,c6 int);
insert into t_agg select 'GZ01',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ02',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ03',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ04',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ05',col,col,col,col,col,col from generate_series(1,100000) as col;

执行查询:


testdb=# -- 禁用并行
testdb=# set max_parallel_workers_per_gather=0;
SET
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22427.00..22427.05 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5, c6
(5 rows)

PG的优化器选择了HashAggregate.
下面禁用HashAggregate,优化器只能选择GroupAggregate.可以看到两者的总成本比较:22427.05 vs 82968.97


testdb=# set enable_hashagg = off;
SET
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=67968.92..82968.97 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Sort  (cost=67968.92..69218.92 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5
         Sort Key: t_agg.bh
         ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
               Output: bh, c1, c2, c3, c4, c5
(8 rows)

案例二
下面用一个宽表来进行测试:分组键值很少,但聚合列很多


drop table  if exists t_agg_width;
create table t_agg_width
(bh varchar(20)
,c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int
,c11 int,c12 int,c13 int,c14 int,c15 int,c16 int,c17 int,c18 int,c19 int
,c21 int,c22 int,c23 int,c24 int,c25 int,c26 int,c27 int,c28 int,c29 int
,c31 int,c32 int,c33 int,c34 int,c35 int,c36 int,c37 int,c38 int,c39 int);
insert into t_agg_width 
select 'GZ01'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ02'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ03'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ04'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
-- 禁用hashagg
set enable_hashagg = off;
-- 禁用并行
set max_parallel_workers_per_gather=0;
select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;

在这种情况下,优化器仍会选择Hash


testdb=# explain verbose select bh
testdb-# ,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
testdb-# ,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
testdb-# ,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
testdb-# ,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
testdb-# from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=49889.00..49889.04 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(5 rows)
testdb=# set enable_hashagg = off;
SET
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=110266.28..148266.32 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=110266.28..111266.28 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c2
4, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

下面增大分组键值的分布,同时提高c1等列的选择率,再次测试:


testdb=# insert into t_agg_width 
testdb-# select 'GZ'||col
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# from generate_series(1,1000000) as col;
INSERT 0 1000000
testdb=# set enable_hashagg = on;
SET
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=440012.46..586553.52 rows=7414 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=440012.46..443866.86 rows=1541757 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..49681.57 rows=1541757 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c2
4, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

这一次选择的是GroupAggregate.

HashAggregate
HashAggregate,数据库会根据group by字段后面的值算出hash值,并在内存中维护对应的Hash表,比如select有n个聚合函数,那么在内存中就会维护n个Hash表.这种方式使用的内存比GroupAggregate要大,内存的使用与group by COLUMN中的COLUMN的唯一键值以及聚合列的多少成正比.

GroupAggregate
GroupAggregate,数据库先将表中的数据按group by的字段进行排序,然后对排好序的数据进行一次扫描,计算得到聚合的结果.这种方式需要先执行一次排序,计算复杂度上面要比HashAggregate要高,但这种方法的好处是与group by COLUMN中的COLUMN的唯一键值多寡/聚合列多寡无关,分组键值很多而且聚合列很多且列数据选择很高的情况下,会优于HashAggregate.

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/6906/viewspace-2642384/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论
长期从事政务、金融等行业产品研发和架构设计工作,ITPUB数据库版块资深版主,对Oracle、PostgreSQL以及大数据等相关技术有深入研究。现就职于广州云图数据技术有限公司,系统架构师。

注册时间:2007-12-28

  • 博文量
    1250
  • 访问量
    3724347