Monday, April 9, 2012

Top selling items

A classic task given to a programmer is to make a TOP-N report of some data. Often a TOP within each group of some defined grouping (department, country, product type, etc.) And many times the report should also include the percentage of the total - even though you only display the TOP records.

Many ways can be devised for this and many ways has been used and blogged about and given lectures on. This is nothing new, but it is a technique we often use in many practical cases. One typical case is a top list of best selling items within product groups.



Many people have written and shown this technique, you can google it and see... But you will hear me speak on this at KSCope12 (you are going to be there, aren't you?!) so I might as well put this demo on the blog for your convenience ;-) The demo script for this can be downloaded here.

We create a couple of tables first. A table of items with a column grp to identify the product group the item belongs to, and a table of sales history. For this demo the sales table stores sales history by the month, but the technique would be identical if it were weekly or daily data or even if it were individual order lines.

SQL> create table items(
  2   item  varchar2(10) primary key,
  3   grp   varchar2(10),
  4   name  varchar2(20)
  5  )
  6  /

Table created.

SQL> create table sales (
  2   item  varchar2(10) references items (item),
  3   mth   date,
  4   qty   number
  5  )
  6  /

Table created.

We populate the tables with 10 typical items for our shop - 5 spare parts for cars, 5 accessories for mobile phones.

SQL> begin
  2    insert into items values ('101010','AUTO','Brake disc');
  3    insert into items values ('102020','AUTO','Snow chain');
  4    insert into items values ('103030','AUTO','Sparc plug');
  5    insert into items values ('104040','AUTO','Oil filter');
  6    insert into items values ('105050','AUTO','Light bulb');
  7  
  8    insert into items values ('201010','MOBILE','Handsfree');
  9    insert into items values ('202020','MOBILE','Charger');
 10    insert into items values ('203030','MOBILE','iGloves');
 11    insert into items values ('204040','MOBILE','Headset');
 12    insert into items values ('205050','MOBILE','Cover');
 13  
 14    insert into sales values ('101010',date '2011-04-01',10);
 15    insert into sales values ('101010',date '2011-05-01',11);
 16    insert into sales values ('101010',date '2011-06-01',12);
 17    insert into sales values ('102020',date '2011-03-01', 7);
 18    insert into sales values ('102020',date '2011-07-01', 8);
 19    insert into sales values ('103030',date '2011-01-01', 6);
 20    insert into sales values ('103030',date '2011-02-01', 9);
 21    insert into sales values ('103030',date '2011-11-01', 4);
 22    insert into sales values ('103030',date '2011-12-01',14);
 23    insert into sales values ('104040',date '2011-08-01',22);
 24    insert into sales values ('105050',date '2011-09-01',13);
 25    insert into sales values ('105050',date '2011-10-01',15);
 26  
 27    insert into sales values ('201010',date '2011-04-01', 5);
 28    insert into sales values ('201010',date '2011-05-01', 6);
 29    insert into sales values ('201010',date '2011-06-01', 7);
 30    insert into sales values ('202020',date '2011-03-01',21);
 31    insert into sales values ('202020',date '2011-07-01',23);
 32    insert into sales values ('203030',date '2011-01-01', 7);
 33    insert into sales values ('203030',date '2011-02-01', 7);
 34    insert into sales values ('203030',date '2011-11-01', 6);
 35    insert into sales values ('203030',date '2011-12-01', 8);
 36    insert into sales values ('204040',date '2011-08-01',35);
 37    insert into sales values ('205050',date '2011-09-01',13);
 38    insert into sales values ('205050',date '2011-10-01',15);
 39  
 40    commit;
 41  end;
 42  /

PL/SQL procedure successfully completed.

Now let’s see the result of an ordinary aggregate group by for sales data for the year 2011:

SQL> select i.grp
  2     , i.item
  3     , max(i.name) name
  4     , sum(s.qty) qty
  5    from items i
  6    join sales s
  7    on s.item = i.item
  8   where s.mth between date '2011-01-01' and date '2011-12-01'
  9   group by i.grp, i.item
 10   order by i.grp, sum(s.qty) desc, i.item
 11  /

GRP        ITEM       NAME                        QTY
---------- ---------- -------------------- ----------
AUTO       101010     Brake disc                   33
AUTO       103030     Sparc plug                   33
AUTO       105050     Light bulb                   28
AUTO       104040     Oil filter                   22
AUTO       102020     Snow chain                   15
MOBILE     202020     Charger                      44
MOBILE     204040     Headset                      35
MOBILE     203030     iGloves                      28
MOBILE     205050     Cover                        28
MOBILE     201010     Handsfree                    18

10 rows selected.

Ordering by the sum shows us which items are the top three selling items in each group. Let us see if we cannot do the same with analytic functions :-)

First let’s try using the aggregate query as source for three different analytic functions - DENSE_RANK(), RANK() and ROW_NUMBER(). Each of them we do PARTITION BY in order to get a ranking within each produkt group and we use ORDER BY to specify the ranking order:

SQL> select g.grp
  2     , g.item
  3     , g.name
  4     , g.qty
  5     , dense_rank() over (partition by g.grp order by g.qty desc) drnk
  6     , rank() over (partition by g.grp order by g.qty desc) rnk
  7     , row_number() over (partition by g.grp order by g.qty desc, g.item) rnum
  8    from (
  9   select i.grp
 10        , i.item
 11        , max(i.name) name
 12        , sum(s.qty) qty
 13     from items i
 14     join sales s
 15       on s.item = i.item
 16    where s.mth between date '2011-01-01' and date '2011-12-01'
 17    group by i.grp, i.item
 18  ) g
 19   order by g.grp, g.qty desc, g.item
 20  /

GRP        ITEM       NAME                 QTY       DRNK        RNK       RNUM
---------- ---------- ------------- ---------- ---------- ---------- ----------
AUTO       101010     Brake disc            33          1          1          1
AUTO       103030     Sparc plug            33          1          1          2
AUTO       105050     Light bulb            28          2          3          3
AUTO       104040     Oil filter            22          3          4          4
AUTO       102020     Snow chain            15          4          5          5
MOBILE     202020     Charger               44          1          1          1
MOBILE     204040     Headset               35          2          2          2
MOBILE     203030     iGloves               28          3          3          3
MOBILE     205050     Cover                 28          3          3          4
MOBILE     201010     Handsfree             18          4          5          5

10 rows selected.

Notice the differences in the three ranking functions:
  • DENSE_RANK gives the same rank to items having the same value - brake disc and sparc plug both get 1, light bulb gets 2.
  • RANK does the same, but it works like olympic medals: if there are two gold medals then there is no silver medal, the next one gets bronze - here light bulb gets 3.
  • ROW_NUMBER never gives the same rank, it will always give 1, 2, 3, 4 … If two items have the same value it will be “random” which one gets to be first - here brake disc gets 1 and sparc plug gets 2, but that is because we gave the “ORDER BY g.qty desc, g.item”, this tells Oracle that if there is a tie for g.qty, then use item id for ordering and breaking the tie. For ROW_NUMBER it is good practice to always use an ORDER BY clause that ensures consistent ordering - otherwise you might get different results in different runs of the query.

You can get the same result as above without using an inline view. It is possible simply to use the analytic functions directly on the aggregations:

SQL> select i.grp
  2     , i.item
  3     , max(i.name) name
  4     , sum(s.qty) qty
  5     , dense_rank() over (partition by i.grp order by sum(s.qty) desc) drnk
  6     , rank() over (partition by i.grp order by sum(s.qty) desc) rnk
  7     , row_number() over (partition by i.grp order by sum(s.qty) desc, i.item) rnum
  8    from items i
  9    join sales s
 10    on s.item = i.item
 11   where s.mth between date '2011-01-01' and date '2011-12-01'
 12   group by i.grp, i.item
 13   order by i.grp, sum(s.qty) desc, i.item
 14  /

GRP        ITEM       NAME                 QTY       DRNK        RNK       RNUM
---------- ---------- ------------- ---------- ---------- ---------- ----------
AUTO       101010     Brake disc            33          1          1          1
AUTO       103030     Sparc plug            33          1          1          2
AUTO       105050     Light bulb            28          2          3          3
AUTO       104040     Oil filter            22          3          4          4
AUTO       102020     Snow chain            15          4          5          5
MOBILE     202020     Charger               44          1          1          1
MOBILE     204040     Headset               35          2          2          2
MOBILE     203030     iGloves               28          3          3          3
MOBILE     205050     Cover                 28          3          3          4
MOBILE     201010     Handsfree             18          4          5          5

10 rows selected.

The effect is the same - the aggregation is done first and then the analytic functions are applied. Sometimes it can be nice to do the inline view anyway for readability and to make it clear what happens, but for simpler cases like this I tend to prefer the compact method. One of the reasons is that often you need to put the entire thing in another inline view anyway in order to filter the result, and then you can get confused by too many inline views inside inline views :-)

The analytic functions cannot be used directly in a where clause, so when we wish to filter on the ranking function in order to get our TOP-3 report, we need an inline view:

SQL> select g.grp
  2     , g.item
  3     , g.name
  4     , g.qty
  5     , g.rnk
  6    from (
  7    select i.grp
  8         , i.item
  9         , max(i.name) name
 10         , sum(s.qty) qty
 11         , rank() over (partition by i.grp order by sum(s.qty) desc) rnk
 12      from items i
 13      join sales s
 14        on s.item = i.item
 15     where s.mth between date '2011-01-01' and date '2011-12-01'
 16     group by i.grp, i.item
 17  ) g
 18   where g.rnk <= 3
 19   order by g.grp, g.rnk, g.item
 20  /

GRP        ITEM       NAME                        QTY        RNK
---------- ---------- -------------------- ---------- ----------
AUTO       101010     Brake disc                   33          1
AUTO       103030     Sparc plug                   33          1
AUTO       105050     Light bulb                   28          3
MOBILE     202020     Charger                      44          1
MOBILE     204040     Headset                      35          2
MOBILE     203030     iGloves                      28          3
MOBILE     205050     Cover                        28          3

7 rows selected.

Here we used RANK and thus we got three AUTO items and four MOBILE items. Let us try DENSE_RANK:

SQL> select g.grp
  2     , g.item
  3     , g.name
  4     , g.qty
  5     , g.rnk
  6    from (
  7    select i.grp
  8         , i.item
  9         , max(i.name) name
 10         , sum(s.qty) qty
 11         , dense_rank() over (partition by i.grp order by sum(s.qty) desc) rnk
 12      from items i
 13      join sales s
 14        on s.item = i.item
 15     where s.mth between date '2011-01-01' and date '2011-12-01'
 16     group by i.grp, i.item
 17  ) g
 18   where g.rnk <= 3
 19   order by g.grp, g.rnk, g.item
 20  /

GRP        ITEM       NAME                        QTY        RNK
---------- ---------- -------------------- ---------- ----------
AUTO       101010     Brake disc                   33          1
AUTO       103030     Sparc plug                   33          1
AUTO       105050     Light bulb                   28          2
AUTO       104040     Oil filter                   22          3
MOBILE     202020     Charger                      44          1
MOBILE     204040     Headset                      35          2
MOBILE     203030     iGloves                      28          3
MOBILE     205050     Cover                        28          3

8 rows selected.

That gave us four items in each group. So how about ROW_NUMBER:

SQL> select g.grp
  2     , g.item
  3     , g.name
  4     , g.qty
  5     , g.rnk
  6    from (
  7    select i.grp
  8         , i.item
  9         , max(i.name) name
 10         , sum(s.qty) qty
 11         , row_number() over (partition by i.grp order by sum(s.qty) desc, i.item) rnk
 12      from items i
 13      join sales s
 14        on s.item = i.item
 15     where s.mth between date '2011-01-01' and date '2011-12-01'
 16     group by i.grp, i.item
 17  ) g
 18   where g.rnk <= 3
 19   order by g.grp, g.rnk, g.item
 20  /

GRP        ITEM       NAME                        QTY        RNK
---------- ---------- -------------------- ---------- ----------
AUTO       101010     Brake disc                   33          1
AUTO       103030     Sparc plug                   33          2
AUTO       105050     Light bulb                   28          3
MOBILE     202020     Charger                      44          1
MOBILE     204040     Headset                      35          2
MOBILE     203030     iGloves                      28          3

6 rows selected.

That gave us exactly three items in each group. Notice here that even though both iGloves and Cover sold 28 pieces and ties for third place in group MOBILE, we only get iGloves here. That is because we told Oracle to resolve ties by ordering by item id and 203030 is less than 205050. If we had omitted the “, i.item” part of the ORDER BY for ROW_NUMBER(), then it would have been random whether we got iGloves or Cover.

Along with the quantity sold and the rank, we would also like to know how big a percentage of the total sales each item had. And we would like that information both within the product groups, but at the same time also how big percentage of the grand total sold.

For that we use RATIO_TO_REPORT which returns the ratio of the particular row to the total (then we multiply the ratio by 100 to get percent.)

If we use the PARTITION BY clause, then we can get a percentage within the group. If we omit the PARTITION BY, then we get a percentage of the grand total:

SQL> select g.grp
  2     , g.item
  3     , g.name
  4     , g.qty
  5     , g.rnk
  6     , round(g.g_pct,1) g_pct
  7     , round(g.t_pct,1) t_pct
  8    from (
  9    select i.grp
 10         , i.item
 11         , max(i.name) name
 12         , sum(s.qty) qty
 13         , rank() over (partition by i.grp order by sum(s.qty) desc) rnk
 14         , 100 * ratio_to_report(sum(s.qty)) over (partition by i.grp) g_pct
 15         , 100 * ratio_to_report(sum(s.qty)) over () t_pct
 16      from items i
 17      join sales s
 18        on s.item = i.item
 19     where s.mth between date '2011-01-01' and date '2011-12-01'
 20     group by i.grp, i.item
 21  ) g
 22   where g.rnk <= 3
 23   order by g.grp, g.rnk, g.item
 24  /

GRP        ITEM       NAME                 QTY        RNK      G_PCT      T_PCT
---------- ---------- ------------- ---------- ---------- ---------- ----------
AUTO       101010     Brake disc            33          1       25.2       11.6
AUTO       103030     Sparc plug            33          1       25.2       11.6
AUTO       105050     Light bulb            28          3       21.4        9.9
MOBILE     202020     Charger               44          1       28.8       15.5
MOBILE     204040     Headset               35          2       22.9       12.3
MOBILE     203030     iGloves               28          3       18.3        9.9
MOBILE     205050     Cover                 28          3       18.3        9.9

7 rows selected.

Here we see that light bulb was 21.4% of the total sales in the AUTO group and 9.9% of the grand total sales.

This is technique we use often in many of our reports. Sometimes (as this case) we combine aggregation with analytics, where the aggregation creates the set of data that the analytic functions then work on. Sometimes the data is ready for use by the analytic functions and no aggregation is needed. That depends on the actual case whether you need aggregation combined with analytics.

Read more in the Oracle documentation on DENSE_RANK, RANK, ROW_NUMBER and RATIO_TO_REPORT.

1 comment: