Wednesday, February 20, 2013

ROWS versus default RANGE in analytic window clause

I have talked at KScope about the difference between ROWS and RANGE in the analytic window clause, but haven't yet blogged about it. Recently while working out a quiz for the PL/SQL Challenge I discovered yet another reason for remembering to primarily use ROWS and only use RANGE when the actual problem requires it.



From my KScope presentation examples here is a simple case of a rolling sum of salaries:

select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
       ) sum_sal
  from scott.emp
order by deptno
       , sal
/

    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000      10875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       3450
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.

Scott and Ford get the same sum and similarly for Martin and Ward. Why? Because when the window clause is omitted, it defaults to RANGE BETWEEN:

select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
          range between unbounded preceding and current row
       ) sum_sal
  from scott.emp
order by deptno
       , sal
/

    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000      10875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       3450
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means, that for Scott it takes the sum of the salary of all employees in department 20 that have the same or smaller salary value (column of the ORDER BY) as Scott. So it will be the sum of all in deptno 20 with salary <= 3000, which includes Ford! When calculating SUM_SAL for Ford, it will be the same result.

We can get the rolling sum we probably expect by using ROWS BETWEEN instead:

select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
order by deptno
       , sal
/

    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000       7875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       2200
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.

The effect of ROWS BETWEEN rather than the default RANGE BETWEEN is that the sum does not take into account the value of the colum in the ORDER BY clause, but rather just takes the sum of salaries from Scott and the previous rows. Note: Since we have a non-unique ordering, it is indeterminate which row comes first of Scott or Ford. You might run this another day and see Ford first having SUM_SAL=7875 and Scott right after with SUM_XAL=10875. Generally I recommend adding something unique (for example order by sal, empno) to make it determinate giving same output on the same data.

But what if we are ordering by EMPNO rather than SAL so we have something unique and we are certain that RANGE BETWEEN never encounters duplicate values? Can't we just allow ourselves to save some typing and just write:

select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by empno
       ) sum_sal
  from scott.emp
order by deptno
       , empno
/

    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 CLARK            2450       2450
        10 KING             5000       7450
        10 MILLER           1300       8750
        20 SMITH             800        800
        20 JONES            2975       3775
        20 SCOTT            3000       6775
        20 ADAMS            1100       7875
        20 FORD             3000      10875
        30 ALLEN            1600       1600
        30 WARD             1250       2850
        30 MARTIN           1250       4100
        30 BLAKE            2850       6950
        30 TURNER           1500       8450
        30 JAMES             950       9400

14 rows selected.

Since we have a unique ordering, the default RANGE BETWEEN in this case gives the same result as if we add the ROWS BETWEEN clause:

select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
order by deptno
       , empno
/

    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 CLARK            2450       2450
        10 KING             5000       7450
        10 MILLER           1300       8750
        20 SMITH             800        800
        20 JONES            2975       3775
        20 SCOTT            3000       6775
        20 ADAMS            1100       7875
        20 FORD             3000      10875
        30 ALLEN            1600       1600
        30 WARD             1250       2850
        30 MARTIN           1250       4100
        30 BLAKE            2850       6950
        30 TURNER           1500       8450
        30 JAMES             950       9400

14 rows selected.

So why should we bother to type that long ROWS BETWEEN clause when we have something unique?

Well, consider this example of using two analytic functions - one within an inline view. (This example is a bit contrived on the EMP table, but I use similar constructs many times in my daily work.) The inline view creates a rolling sum, the outer select filters on the result and makes a new rolling sum on the leftover rows:

select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
       ) sum_sal_2
  from (
   select deptno
        , empno
        , ename
        , sal
        , sum(sal) over (
             partition by deptno
             order by empno
          ) sum_sal
     from scott.emp
       ) s
 where sum_sal > 5000
 order by deptno
        , empno
/

    DEPTNO      EMPNO ENAME             SAL    SUM_SAL  SUM_SAL_2
---------- ---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000       7450       5000
        10       7934 MILLER           1300       8750       6300
        20       7788 SCOTT            3000       6775       3000
        20       7876 ADAMS            1100       7875       4100
        20       7902 FORD             3000      10875       7100
        30       7698 BLAKE            2850       6950       2850
        30       7844 TURNER           1500       8450       4350
        30       7900 JAMES             950       9400       5300

8 rows selected.

The interesting part comes when examining autotrace output for that SQL:

Execution Plan
----------------------------------------------------------
Plan hash value: 3842450409

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    14 |   826 |     4  (25)| 00:00:01 |
|   1 |  WINDOW BUFFER       |      |    14 |   826 |     4  (25)| 00:00:01 |
|*  2 |   VIEW               |      |    14 |   826 |     4  (25)| 00:00:01 |
|   3 |    WINDOW SORT       |      |    14 |   238 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   238 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SUM_SAL">5000)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        583  bytes sent via SQL*Net to client
        364  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
          8  rows processed

The inner SUM() uses a WINDOW SORT operation and the outer SUM() uses WINDOW BUFFER. Two memory sorts are used.

Let us add the ROWS BETWEEN clause:

select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal_2
  from (
   select deptno
        , empno
        , ename
        , sal
        , sum(sal) over (
             partition by deptno
             order by empno
             rows between unbounded preceding and current row
          ) sum_sal
     from scott.emp
       ) s
 where sum_sal > 5000
 order by deptno
        , empno
/

Giving us the exact same output:

    DEPTNO      EMPNO ENAME             SAL    SUM_SAL  SUM_SAL_2
---------- ---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000       7450       5000
        10       7934 MILLER           1300       8750       6300
        20       7788 SCOTT            3000       6775       3000
        20       7876 ADAMS            1100       7875       4100
        20       7902 FORD             3000      10875       7100
        30       7698 BLAKE            2850       6950       2850
        30       7844 TURNER           1500       8450       4350
        30       7900 JAMES             950       9400       5300

8 rows selected.

But a couple of interesting difference in the autotrace output:

Execution Plan
----------------------------------------------------------
Plan hash value: 1485958224

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    14 |   826 |     4  (25)| 00:00:01 |
|   1 |  WINDOW NOSORT       |      |    14 |   826 |     4  (25)| 00:00:01 |
|*  2 |   VIEW               |      |    14 |   826 |     4  (25)| 00:00:01 |
|   3 |    WINDOW SORT       |      |    14 |   238 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   238 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SUM_SAL">5000)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        587  bytes sent via SQL*Net to client
        364  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          8  rows processed

The inner SUM() uses a WINDOW SORT operation like before, but the outer SUM() uses WINDOW NOSORT this time! And just one memory sort is used.

So when we are lazy and save on typing by not writing a window specification, we will get the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. When doing RANGE, the optimizer cannot save an extra sorting operation in this example (probably because RANGE needs to "look ahead" in the row source), but if we explicitly use the ROWS BETWEEN clause, the optimizer knows it can use the sorting it has already created and save some work.

The same is true if the example had used diffent analytic functions in the inline view and the outer select (for example COUNT() as the inner analytic function and SUM() as the outer analytic function.) What matters is that they use the same PARTITION and ORDER BY. In real life I find it is often different functions I use in constructs like this, but the WINDOW NOSORT still applies.

So this is yet another argument for my rule of thumb when I use analytic functions:

If I am doing analytic on the entire partition (or entire resultset), then I skip the window clause (that is: I do not write ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):

sum(sal) over (
   partition by deptno
)

But once I have an ORDER BY, I generally do not rely on default window clause, but rather explicitly writes either ROWS or RANGE BETWEEN:

sum(sal) over (
   partition by deptno
   order by empno
   rows between unbounded preceding and current row
)

Analytic functions that only support the order-by clause and not a windowing clause - like for example ROW_NUMBER() - are of course excempt from this rule of thumb. But if a function supports both order-by clause and windowing clause, then I do not write an order-by clause without adding the windowing clause.

I use this rule of thumb because in my experience 95% of the time I really want ROWS BETWEEN - only rarely have I a true case for using RANGE BETWEEN. In 80% of the cases I have a unique ORDER BY clause which actually make RANGE BETWEEN give the same result, but as we see above at a cost of potentially doing much more work sorting the data. Also explicitly writing ROWS or RANGE make it clear to other developers what the intention is - if the windowing clause is omitted I do not really know if it is on purpose meant to be RANGE, or if it is a lazy programmer who really meant ROWS :-)

So please be explicit in your windowing clauses. If you need to save typing, then create an auto-replace in your TOAD or whatever IDE you use :-)

1 comment:

  1. Thanks! I've been reading the training kit for 70-461 but it didn't make this clear. You blog here lays it our perfectly.

    ReplyDelete