ITPub博客

首页 > Linux操作系统 > Linux操作系统 > 使用exists(Semi-Join)优化distinct语句

使用exists(Semi-Join)优化distinct语句

原创 Linux操作系统 作者:chncaesar 时间:2013-09-09 04:21:27 0 删除 编辑
在Oracle 官方文档,semi-join是这么解释的:

A semijoin returns rows that match an EXISTS subquery without duplicating rows
from the left side of the predicate when multiple rows on the right side satisfy the
criteria of the subquery.

Semijoin and antijoin transformation cannot be done if the subquery is on an OR
branch of the WHERE clause.

由上可见,semi-join特别适用于如下场景:
表A 和B有1:N关系,需要根据B表某个条件,通过semi-join查询A表上某个列(属性)。因为semi-join已经包含一个去重过程,不需要在A表上加distinct。比内连接然后去重方式性能更好。

假设有两个表Customer , orders。当然,每个客户可以有多个订单。现在需要找出订单金额超过10000的客户数量。本文例子在12cR1上实现,11g上应该是同样结果。

show rel;
release 1201000100

desc customer
Name        Null     Type          
----------- -------- ------------- 
CUSTOMER_ID NOT NULL NUMBER(12)    
NAME                 VARCHAR2(100) 
STATUS               VARCHAR2(10)  

desc orders
Name        Null Type         
----------- ---- ------------ 
ORDER_ID         NUMBER(12)   
CUSTOMER_ID      NUMBER(12)   
CREATE_DATE      DATE         
ORDER_PRICE      NUMBER(38,6) 

select count(*) from  orders;
count(*)
--------------
1000002

select count(*) from customer;
count(*)
--------------
100000

orders表有100万行,customer表有10万行。

下面来比较下inner join和semi-join两种SQL写法的性能。
Select /*HE7*/ count(distinct cus.name)
from customer cus, orders ord
where cus.customer_id=ord.customer_id
and ord.order_price > 10000;

由于做了内连接(inner join),最终结果集里将会出现所有匹配行,包含一部分重复customer(用户)。著名Oracle布道者Tom Kyte在2006年一篇文章里就写到:

“In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!

In general, you use a join when you need data from more than one table in the ultimate SELECT list.”

文章出处:

http://www.oracle.com/technetwork/issue-archive/2006/06-may/o36asktom-099107.html

在这里,我们只需要customer表的信息,orders表是用来做检索(filtering)的。

只将customer表放在from处,更加自然。用exists(semi-join)改写后:


Select /*HE8*/ count(name)
from customer
where exists(
  select 1 from orders
  where customer.customer_id=orders.customer_id
  and orders.order_price > 10000
);

由于semi-join,这里我们不需要distinct,从customer能直接得到去重后的结果。查看下执行计划和最终的cost
SQL_ID  3kfvh06bqrn5h, child number 0
-------------------------------------
select /*HE7*/ count(distinct cus.name) from customer cus, orders ord 
where cus.customer_id=ord.customer_id and ord.order_price > 10000
 
Plan hash value: 3082118893
 
------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |       |       |  3212 (100)|          |
|   1 |  SORT AGGREGATE       |          |     1 |    52 |       |            |          |
|   2 |   VIEW                | VW_DAG_0 | 52570 |  2669K|       |  3212   (1)| 00:00:01 |
|   3 |    HASH GROUP BY      |          | 52570 |  2104K|  2688K|  3212   (1)| 00:00:01 |
|*  4 |     HASH JOIN SEMI    |          | 52570 |  2104K|  2640K|  2651   (1)| 00:00:01 |
|   5 |      TABLE ACCESS FULL| CUSTOMER |   100K|  1464K|       |   102   (0)| 00:00:01 |
|*  6 |      TABLE ACCESS FULL| ORDERS   |   753K|    18M|       |  1063   (1)| 00:00:01 |
------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("CUS"."CUSTOMER_ID"="ORD"."CUSTOMER_ID")
   6 - filter("ORD"."ORDER_PRICE">10000)
 
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

SQL_ID  ggs8p2c27nzmu, child number 0
-------------------------------------
select /*HE8*/ count(name) from customer where exists(   select 1 from 
orders   where customer.customer_id=orders.customer_id   and 
orders.order_price > 10000 )
 
Plan hash value: 2841769433
 
----------------------------------------------------------------------------------------
| Id  | Operation           | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |          |       |       |       |  2651 (100)|          |
|   1 |  SORT AGGREGATE     |          |     1 |    41 |       |            |          |
|*  2 |   HASH JOIN SEMI    |          | 52570 |  2104K|  2640K|  2651   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| CUSTOMER |   100K|  1464K|       |   102   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| ORDERS   |   753K|    18M|       |  1063   (1)| 00:00:01 |
----------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("CUSTOMER"."CUSTOMER_ID"="ORDERS"."CUSTOMER_ID")
   4 - filter("ORDERS"."ORDER_PRICE">10000)
 
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

很明显,第二个执行计划优于第一个。


来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/638844/viewspace-772370/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论

注册时间:2013-07-30

  • 博文量
    102
  • 访问量
    899702