9/2/12

SEMI/ANTI JOINS


SEMI JOIN

A “semi-join” between two tables returns rows from the first table where one or more matches
are found in the second table.

Semi-joins are written using the EXISTS or IN constructs.

a semi-join can be performed using the nested loops, hash join, or merge join algorithms

Oracle provides the NL_SJ, HASH_SJ, and MERGE_SJ hints in order for you to manipulate the semi-join process if you need to. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself.

if the MAIN query is more selective i.e returns less number of rows then its better to use EXIST
if the subquery is more selective then its better to use "IN" operator.

ANTI JOIN:

An “anti-join” between two tables returns rows from the first table where no matches are found in the second table. Anti-joins are written using the NOT EXISTS or NOT IN constructs.

if the subquery of "NOT IN" results at least one NULL value then entire NOT IN will be false
and we will not get any results, But "NOT EXIST" will consider NULL as value and returns the value.



Restrictions on SEMI/ANTI joins:

1.      If there is a DISTINCT clause in the query then oracle can’t use semi joins (there is one alternative for DISTINCT claue, see below example i.e using NO_MERGE hint)
2.      The EXIST/IN is part of OR operation  then oracle can’t use semi joins
3.       If the query contains UNION set operator then oracle can’t use semi joins


If there is a DISTINCT clause in the query or the EXIST/IN is part of OR operation then
we can't use SEMI/ANTI joins.

SQL> select DISTINCT username from tab1 where exist 
    (select 1 from tab2 where tabb2.col1 = tab1.col1 );
 
-- The above query will not use SEMI JOIN  because we have the DISTINCT operator in the query

Workaround:

SQL> select /*+ no_merge(tab) */
            DISTINCT username
       from (
              select username from tab1 where exist 
              (select 1 from tab2 where tabb2.col1 = tab1.col1 )
            ) tab;

Example:


Create CUSTOMER table:

SQL> create table test_table_cust_123
     (name varchar2(20), state varchar2(20), cust_id number);
Table created.


begin
for i in 1 .. 100
loop
insert into test_table_cust_123 values
(dbms_random.string('U', 10), dbms_random.string('U', 2),
 round(dbms_random.value(0, 100)) );
end loop
commit;
end;
/

delete duplicate cust_id's:

SQL> delete from test_table_cust_123 where cust_id in
     (select cust_id from (select cust_id, count(*) from test_table_cust_123
     group by cust_id having count(*) > 2));

Create ORDER table:

SQL> create table test_table_order_123
     (item varchar2(20), price number, order_date date, cust_id number);
Table created.

begin
for i in 1 .. 100
loop
insert into test_table_order_123 values (
dbms_random.string('U', 10), round(dbms_random.value(1000, 100000)),
sysdate - round(dbms_random.value(0, 100)), round(dbms_random.value(0, 100)) );
end loop
commit;
end;
/


query:

SELECT  C.name,
        C.cust_id
  FROM  test_table_cust_123 C
 WHERE  C.state = 'CA'
AND EXISTS
          (
            SELECT 1
              FROM test_table_order_123 O
             WHERE O.cust_id = C.cust_id
               AND O.order_date > SYSDATE - 3
          )
ORDER BY C.name;

The above query is using the HASH SEMI JOIN, the execution plan was shown below:

---------------------------------------------------------------------------------
| Id  | Operation           | Name                 | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                      |     1 |    59 |     6  (34)|
|   1 |  SORT ORDER BY      |                      |     1 |    59 |     6  (34)|
|*  2 |   HASH JOIN SEMI    |                      |     1 |    59 |     5  (20)|
|*  3 |    TABLE ACCESS FULL| TEST_TABLE_CUST_123  |     2 |    74 |     2   (0)|
|*  4 |    TABLE ACCESS FULL| TEST_TABLE_ORDER_123 |     1 |    22 |     2   (0)|
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("O"."CUST_ID"="C"."CUST_ID")
   3 - filter("C"."STATE"='CA')
   4 - filter("O"."ORDER_DATE">SYSDATE@!-3)

We can force the query to use Nested Loop SEMI Join  by using the hint  NL_SJ  in the EXIST subquery block, The execution plan was shown below:
 

---------------------------------------------------------------------------------
| Id  | Operation           | Name                 | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                      |     1 |    59 |     5  (20)|
|   1 |  SORT ORDER BY      |                      |     1 |    59 |     5  (20)|
|   2 |   NESTED LOOPS SEMI |                      |     1 |    59 |     4   (0)|
|*  3 |    TABLE ACCESS FULL| TEST_TABLE_CUST_123  |     2 |    74 |     2   (0)|
|*  4 |    TABLE ACCESS FULL| TEST_TABLE_ORDER_123 |     1 |    22 |     1   (0)|
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("C"."STATE"='CA')
   4 - filter("O"."ORDER_DATE">SYSDATE@!-3 AND "O"."CUST_ID"="C"."CUST_ID")


We can force the query to not use the SEMI join and to use FILTER operation followd by a SORT, This can be achived with NO_UNNEST hint in the EXIT subquery. Execution plan was shown below:

---------------------------------------------------------------------------------
| Id  | Operation           | Name                 | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                      |     1 |    37 |     5  (20)|
|   1 |  SORT ORDER BY      |                      |     1 |    37 |     5  (20)|
|*  2 |   FILTER            |                      |       |       |            |        
|*  3 |    TABLE ACCESS FULL| TEST_TABLE_CUST_123  |     2 |    74 |     2   (0)|
|*  4 |    TABLE ACCESS FULL| TEST_TABLE_ORDER_123 |     1 |    22 |     2   (0)|
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "TEST_TABLE_ORDER_123" "O"
              WHERE "O"."CUST_ID"=:B1 AND "O"."ORDER_DATE">SYSDATE@!-3))
   3 - filter("C"."STATE"='CA')
   4 - filter("O"."CUST_ID"=:B1 AND "O"."ORDER_DATE">SYSDATE@!-3

8/18/12

Index Join Vs Index Combine


We have following query, which taking longtime to return the results.

SELECT  object_id
FROM    object_table
WHERE   create_time  > :B1
    AND new_object   = :B2
    AND object_class  IN (:B3, :B4)   
ORDER BY new_object ASC;


Table Details:
-------------
Table: object_table
No of records: 900,000
Table Size: 2 Gb

Indexe Details: (four different single column indexes)
------------------------------------------------------
CREATETIME_IX  on (create_time)
NEW_OBJECT_IDX  on  (new_object)
OBJECT_CLASS_IDX  on (object_class))
OBJECTID_IDX  on (object_id)

The above query was running long and using the full tablescan. The execution plan was shown below:



----------------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |       |       | 10885 (100)|         
|   1 |  SORT ORDER BY     |                  |     3 |    54 | 10885   (1)|
|   2 |   TABLE ACCESS FULL| OBJECT_TABLE     |     3 |    54 | 10884   (1)|
----------------------------------------------------------------------------


The query will always return very less number of rows (10-20 rows), So scanning whole table(900k rows) to get 10-20 rows was not a optimal access method.

How to eliminate the full table scan for above query? 

Ans:  INDEX JOIN  or  INDX COMBINE(Bitmap Conversion)

Index Join:

The SELECT statement involved in four different columns, and all the columns have separate indexes.
So If we join all the indexes we can eliminate the full tablescan. This can be done by using the INDEX_JOIN hint.

Index Joins will be used to avoid the expensive table scans by joining two or more indexes belonging to the same table.

Oracle will join the indexes only if all the columns referenced in the query must be covered in the indexes itself, So it can avoid the table visits.

Index joins will be used only with HASH joins but can use any available index access method.

I forced the query to use index join by using the hint INDEX_JOIN(table_name) , Execution plan was shown below:

Since by using index join, oracle joining the four indexes using the HASH joins.


--------------------------------------------------------------------------------------
| Id  | Operation               | Name                  | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                       |       |       | 10694 (100)|        
|   1 |  SORT ORDER BY          |                       |     3 |    54 | 10694   (1)|
|   2 |   VIEW                  | index$_join$_001      |     3 |    54 | 10693   (1)|
|   3 |    HASH JOIN            |                       |       |       |            |         
|   4 |     HASH JOIN           |                       |       |       |            |         
|   5 |      HASH JOIN          |                       |       |       |            |         
|   6 |       INDEX RANGE SCAN  | NEW_OBJECT_IDX        |     3 |    54 |   688   (1)|
|   7 |       INDEX RANGE SCAN  | CREATETIME_IX         |     3 |    54 |  1610   (1)|
|   8 |      INLIST ITERATOR    |                       |       |       |            |       
|   9 |       INDEX RANGE SCAN  | OBJECT_CLASS_IDX      |     3 |    54 |  2670   (1)|
|  10 |     INDEX FAST FULL SCAN| OBJECTID_IDX          |     3 |    54 |  6670   (1)|
--------------------------------------------------------------------------------------



The query with index_join hint was running faster than the original one, but response time was still inacceptable to application.

So I have one more option to try, i.e BITMAP CONVERSION OF BTREE INDEXES

Bitmap Conversion for the B-tree Indexes:

Now look at the following two indexes:

CREATETIME_IX  on  (NextRunTime)
NEW_OBJECT_IDX  on (SI_RUNNABLE_OBJECT)

So we have two single column indexes on the two of the columns referenced in the  WHERE claue, So oracle can combine those indexes and make use of it.
Since we have another column was referenced in the table, it needs a visit to table if we use this method.

We can use following hint to force BITMAP CONVERSION plan of BTREE index:  INDEX_COMBINE(table_name index_name1 index_name2)
.
It builds in-memory BITMAP IDEX by using the those two BTREE indexes and based on the rowid it will access the table and applies the filter condition.

The execution plan was shown below:


----------------------------------------------------------------------------------------
| Id  | Operation                         | Name                 | Rows  | Bytes | Cost|   
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                       |       |     | 12945|         
|   1 |  SORT ORDER BY                    |                       |     2 |  36 | 12945|
|   2 |   TABLE ACCESS BY INDEX ROWID     | CMS_INFOOBJECTS6      |     2 |  36 | 12944|
|   3 |    BITMAP CONVERSION TO ROWIDS    |                       |       |     |      |         
|   4 |     BITMAP AND                    |                       |       |     |      |         
|   5 |      BITMAP CONVERSION FROM ROWIDS|                       |       |     |      |         
|   6 |       INDEX RANGE SCAN            | NEW_OBJECT_IDX        |       | 814 |      |
|   7 |      BITMAP CONVERSION FROM ROWIDS|                       |       |     |      |         
|   8 |       SORT ORDER BY               |                       |       |     |      |         
|   9 |        INDEX RANGE SCAN           | CREATETIME_IX         |       |     | 736  |
----------------------------------------------------------------------------------------




Even though it shows more cost than other two execution plan, The response time was very low compared to other two methods and it solved the response
time problem.

So it looks like a bug in the optimizer while estimating the cost of BITMAP CONVERSION of BTREE INDEXES.