Monday, October 8, 2012

Group by Groups

For some time now I have struggled to efficiently "group by groups" on data containing references between our suppliers item numbers and the original equipment manufacturer (OEM) number. I can group those data by supplier and their item number and get a "group of OEM numbers." That group I will call a unique item and I will group the data once more by that group to find which suppliers agree on an item having the same set (group) of OEM numbers.

It took me some time to figure out how to do this efficiently (the data was several million records and the "unique groups" to be found was about 3/4 million groups.) But after several tries, some good help from the OTN forums and some reading in my Tom Kyte books, I finally got a solution that seems to work well ;-)



Let me show what I tried with an example using good old scott.emp:

SQL> create table orig_data as
  2  select distinct job, deptno
  3  from scott.emp e
  4  / 
 
Table created.
 
SQL> select job
  2       , deptno
  3    from orig_data
  4   order by
  5         job
  6       , deptno
  7  / 
 
JOB           DEPTNO
--------- ----------
ANALYST           20
CLERK             10
CLERK             20
CLERK             30
MANAGER           10
MANAGER           20
MANAGER           30
PRESIDENT         10
SALESMAN          30
 
9 rows selected.


First I group by job (I use xmlagg here because I am on version 11.1 and therefore no listagg  ):

SQL> select od.job
  2       , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
  3    from orig_data od
  4   group by od.job
  5  / 
 
JOB       DEPTNOS
--------- ------------------------------
ANALYST   20
CLERK     10,20,30
MANAGER   10,20,30
PRESIDENT 10
SALESMAN  30

I notice here that both job CLERK and MANAGER has the same set of deptnos.

So if I group by deptnos I can get this result:

SQL> select s2.deptnos
  2       , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
  3    from (
  4     select od.job
  5          , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
  6       from orig_data od
  7      group by od.job
  8         ) s2
  9   group by s2.deptnos
 10  / 
 
DEPTNOS                        JOBS
------------------------------ ------------------------------
10                             PRESIDENT
10,20,30                       CLERK,MANAGER
20                             ANALYST
30                             SALESMAN

(If you substitute OEM number for deptnos and supplier items for jobs, this represents my real world case - except the real world finds near 3/4 million such "group by groups.")

So my requirement is to identify all such unique groups of deptnos in my orig_data table, give each such group a surrogate key in a parent table, and then populate two child tables with the deptnos of each group and the jobs that have that group of deptnos:

SQL> create table groups (
  2     groupkey number primary key
  3  )
  4  / 
 
Table created.
 
SQL> create table groups_depts (
  2     groupkey number references groups (groupkey)
  3   , deptno number(2)
  4  )
  5  / 
 
Table created.
 
SQL> create table groups_jobs (
  2     groupkey number references groups (groupkey)
  3   , job varchar2(9)
  4  )
  5  / 
 
Table created.

For the surrogate groupkey I can just use a rownumber on my group by deptnos query:

SQL> select row_number() over (order by s2.deptnos) groupkey
  2       , s2.deptnos
  3       , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
  4    from (
  5     select od.job
  6          , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
  7       from orig_data od
  8      group by od.job
  9         ) s2
 10   group by s2.deptnos
 11  / 
 
  GROUPKEY DEPTNOS                        JOBS
---------- ------------------------------ ------------------------------
         1 10                             PRESIDENT
         2 10,20,30                       CLERK,MANAGER
         3 20                             ANALYST
         4 30                             SALESMAN

That query I can use for a (slow) insert into my three tables in this simple manner:

SQL> begin
  2     for g in (
  3        select row_number() over (order by s2.deptnos) groupkey
  4             , s2.deptnos
  5             , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
  6          from (
  7           select od.job
  8                , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
  9             from orig_data od
 10            group by od.job
 11               ) s2
 12         group by s2.deptnos
 13     ) loop
 14        insert into groups values (g.groupkey);
 15
 16        insert into groups_depts
 17           select g.groupkey
 18                , to_number(regexp_substr(str, '[^,]+', 1, level)) deptno
 19             from (
 20                    select rownum id
 21                         , g.deptnos str
 22                      from dual
 23                  )
 24           connect by instr(str, ',', 1, level-1) > 0
 25                  and id = prior id
 26                  and prior dbms_random.value is not null;
 27
 28        insert into groups_jobs
 29           select g.groupkey
 30                , regexp_substr(str, '[^,]+', 1, level) job
 31             from (
 32                    select rownum id
 33                         , g.jobs str
 34                      from dual
 35                  )
 36           connect by instr(str, ',', 1, level-1) > 0
 37                  and id = prior id
 38                  and prior dbms_random.value is not null;
 39
 40     end loop;
 41  end;
 42  / 
 
PL/SQL procedure successfully completed.

The tables now contain this data:


SQL> select *
  2    from groups
  3   order by groupkey
  4  / 
 
  GROUPKEY
----------
         1
         2
         3
         4
 
SQL> select *
  2    from groups_depts
  3   order by groupkey, deptno
  4  / 
 
  GROUPKEY     DEPTNO
---------- ----------
         1         10
         2         10
         2         20
         2         30
         3         20
         4         30
 
6 rows selected.
 
SQL> select *
  2    from groups_jobs
  3   order by groupkey, job
  4  / 
 
  GROUPKEY JOB
---------- ---------
         1 PRESIDENT
         2 CLERK
         2 MANAGER
         3 ANALYST
         4 SALESMAN

I can now from these data get the same result as before (just to test I have created the desired data):

SQL> select g.groupkey
  2       , d.deptnos
  3       , j.jobs
  4    from groups g
  5    join (
  6           select groupkey
  7                , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
  8             from groups_depts
  9            group by groupkey
 10         ) d
 11         on d.groupkey = g.groupkey
 12    join (
 13           select groupkey
 14                , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs
 15             from groups_jobs
 16            group by groupkey
 17         ) j
 18         on j.groupkey = g.groupkey
 19  / 
 
  GROUPKEY DEPTNOS                        JOBS
---------- ------------------------------ ------------------------------
         1 10                             PRESIDENT
         2 10,20,30                       CLERK,MANAGER
         3 20                             ANALYST
         4 30                             SALESMAN

So far so good. This all works pretty much as desired - except for a couple of things:

The very simple loop insert code will be slow. OK, it is a one-time conversion job (in theory, but very few times at least) so that could probably be acceptable (except for my professional pride. ;-)

But worse is, that I have groups where the string aggregation won't work - the string would have to be about varchar2(10000) for my real world case (some groups would contain something like 1600 OEM numbers) which won't work in SQL in the group by. :-(

So I have tried an attempt using collections. First a collection of deptnos:

SQL> create type deptno_tab_type as table of number(2)
  2  / 
 
Type created.
 
SQL> select od.job
  2       , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos
  3    from orig_data od
  4   group by od.job
  5  / 
 
JOB       DEPTNOS
--------- ------------------------------
ANALYST   DEPTNO_TAB_TYPE(20)
CLERK     DEPTNO_TAB_TYPE(10, 20, 30)
MANAGER   DEPTNO_TAB_TYPE(10, 20, 30)
PRESIDENT DEPTNO_TAB_TYPE(10)
SALESMAN  DEPTNO_TAB_TYPE(30)

All very good - no problems here. But then a collection of jobs:

SQL> create type job_tab_type as table of varchar2(9)
  2  / 
 
Type created.
 
SQL> select s2.deptnos
  2       , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
  3    from (
  4     select od.job
  5          , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos
  6       from orig_data od
  7      group by od.job
  8         ) s2
  9   group by s2.deptnos
 10  / 
 group by s2.deptnos
          *
ERROR at line 9:
ORA-00932: inkonsistente datatyper: forventede -, fik XAL_SUPERVISOR.DEPTNO_TAB_TYPE

Now it fails with an "inconsistent datatype" error - I cannot group by a collection datatype (without doing something more...)

This is the time when I asked for help on the OTN forums, and odie_63 kindly reminded me that an object datatype can be grouped if it has a MAP or an ORDER member method. A MAP function would probably be the most efficient, but it requires being able to return a numerical value that will be unique - not realistic in this case. Odie suggested a simple ORDER method instead:

SQL> create type deptno_container as object (
  2    nt deptno_tab_type
  3  , order member function match (o deptno_container) return integer
  4  );
  5  /

Type created.

SQL>
SQL> create or replace type body deptno_container as
  2    order member function match (o deptno_container) return integer is
  3    begin
  4      return case when nt = o.nt then 0 else 1 end;
  5    end;
  6  end;
  7  /

Type body created.

It relies on the fact that when the nested tables are collections of simple datatypes, they can be compared for equality by default.

That allows me to "group by groups":

SQL> select row_number() over(order by null) groupkey
  2       , s2.deptnos
  3       , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
  4    from (
  5     select od.job
  6          , deptno_container(
  7              cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
  8            ) deptnos
  9       from orig_data od
 10      group by od.job
 11    ) s2
 12   group by s2.deptnos
 13  /

  GROUPKEY DEPTNOS(NT)              JOBS
---------- ------------------------ --------------------------------
         1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN')
           TAB_TYPE(30))

         2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT')
           TAB_TYPE(10))

         3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER')
           TAB_TYPE(10, 20, 30))

         4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST')
           TAB_TYPE(20))

And Odie then also suggested a method to use that query to insert into the three tables:

SQL> insert all
  2     when rn0 = 1 then
  3        into groups (groupkey)
  4        values (groupkey)
  5     when rn1 = 1 then
  6        into groups_jobs (groupkey, job)
  7        values (groupkey, job)
  8     when rn2 = 1 then
  9        into groups_depts (groupkey, deptno)
 10           values(groupkey, deptno)
 11  with all_groups as (
 12     select row_number() over(order by null) groupkey
 13          , s2.deptnos
 14          , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
 15       from (
 16        select od.job
 17             , deptno_container(
 18                 cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
 19               ) deptnos
 20          from orig_data od
 21         group by od.job
 22       ) s2
 23      group by s2.deptnos
 24  )
 25  select groupkey
 26       , value(j) job
 27       , value(d) deptno
 28       , row_number() over (
 29           partition by groupkey
 30           order by null
 31         ) rn0
 32       , row_number() over (
 33           partition by groupkey, value(j)
 34           order by null
 35         ) rn1
 36       , row_number() over (
 37           partition by groupkey, value(d)
 38           order by null
 39         ) rn2
 40    from all_groups t
 41       , table(t.jobs) j
 42       , table(t.deptnos.nt) d
 43  ;

15 rows created.

Let's just check the result again:

SQL> select g.groupkey
  2       , d.deptnos
  3       , j.jobs
  4    from groups g
  5    join (
  6           select groupkey
  7                , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
  8             from groups_depts
  9            group by groupkey
 10         ) d
 11         on d.groupkey = g.groupkey
 12    join (
 13           select groupkey
 14                , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs
 15             from groups_jobs
 16            group by groupkey
 17         ) j
 18         on j.groupkey = g.groupkey
 19  /

  GROUPKEY DEPTNOS                  JOBS
---------- ------------------------ --------------------------------
         1 30                       SALESMAN
         2 10                       PRESIDENT
         3 10,20,30                 CLERK,MANAGER
         4 20                       ANALYST

Yes! This creates the same groupings as I did with the string aggregation (OK, different key, but that is just a pseudo key so no matter ;-). Now I have a method that can "group by groups" using collections rather than string aggregation and thus work on larger groups than can be contained in a VARCHAR2.

But then I used that method on the real large set of data. Bummer :-( It created about 1½ million groups rather than 3/4 millions - there were duplicate groups!

So I guessed it might have something to do with the ORDER member method always returning "1" for un-equal collections, rather than "-1" for "smaller than" and "1" for "greater than". So I made a new body for the container object type:

SQL> create or replace type body deptno_container as
  2    order member function match (o deptno_container) return integer is
  3    begin
  4      for i in 1..nt.count loop
  5        if i > o.nt.count then
  6           return 1;
  7        elsif nt(i) > o.nt(i) then
  8           return 1;
  9        elsif nt(i) < o.nt(i) then
 10           return -1;
 11        end if;
 12      end loop;
 13      if nt.count < o.nt.count then
 14        return -1;
 15      else
 16        return 0;
 17      end if;
 18    end;
 19  end;
 20  /

Type body created.

This one compares the elements of the collections and defines when a collection is "greater than" or "smaller than" another.

Let's test the new type body:

SQL> select row_number() over(order by null) groupkey
  2       , s2.deptnos
  3       , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
  4    from (
  5     select od.job
  6          , deptno_container(
  7              cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
  8            ) deptnos
  9       from orig_data od
 10      group by od.job
 11    ) s2
 12   group by s2.deptnos
 13  /

  GROUPKEY DEPTNOS(NT)              JOBS
---------- ------------------------ --------------------------------
         1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT')
           TAB_TYPE(10))

         2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER')
           TAB_TYPE(10, 20, 30))

         3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST')
           TAB_TYPE(20))

         4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN')
           TAB_TYPE(30))

Hooray - we get the same groups! (OK, different group keys - presumably due to the new order method - but that's OK.)

But I was still troubled that it did not really work as it should. If I just did the above query, it would perfectly correct create about 3/4 million groups. But when I put the same query into the INSERT ALL construct, somewhat over 800.000 groups were created - duplicates still existed :-(

So I went back to my good old trusted copy of Tom Kyte's "Expert One-On-One Oracle" book in the chapter on object relational coding. I decided to try to do this using an O-R view and an instead-of trigger:

First yet another object type:

SQL> create type groups_container as object (
  2     groupkey number
  3   , deptnos  deptno_container
  4   , jobs     job_tab_type
  5  )
  6  /

Type created.

Then an object type view:

SQL> create view groups_view
  2     of groups_container
  3     with object identifier (groupkey)
  4  as
  5  select g.groupkey
  6       , deptno_container(
  7           cast(
  8              multiset(
  9                 select gd.deptno
 10                   from groups_depts gd
 11                  where gd.groupkey = g.groupkey
 12                  order by gd.deptno
 13              )
 14              as deptno_tab_type
 15           )
 16         ) deptnos
 17       , cast(
 18           multiset(
 19              select gj.job
 20                from groups_jobs gj
 21               where gj.groupkey = g.groupkey
 22               order by gj.job
 23           )
 24           as job_tab_type
 25         ) jobs
 26    from groups g
 27  /

View created.

And finally the instead-of trigger on the view:

SQL> create trigger groups_view_ins
  2     instead of insert
  3     on groups_view
  4  begin
  5     insert into groups (groupkey)
  6        values (:new.groupkey);
  7
  8     forall i in 1..:new.deptnos.nt.count
  9        insert into groups_depts (groupkey, deptno)
 10           values (:new.groupkey, :new.deptnos.nt(i));
 11
 12     forall i in 1..:new.jobs.count
 13        insert into groups_jobs (groupkey, job)
 14           values (:new.groupkey, :new.jobs(i));
 15  end;
 16  /

Trigger created.

Now I can insert into my three tables by inserting into the O-R view:

SQL> insert into groups_view (
  2     groupkey, deptnos, jobs
  3  )
  4  select rownum groupkey
  5       , s3.deptnos
  6       , s3.jobs
  7    from (
  8    select s2.deptnos
  9         , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
 10      from (
 11      select od.job
 12           , deptno_container(
 13               cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
 14             ) deptnos
 15        from orig_data od
 16       group by od.job
 17    ) s2
 18   group by s2.deptnos
 19  ) s3
 20  /

4 rows created.

Note how it says it created 4 "rows" - which is 4 "instances of the object", but translates into this much data in the three tables:

SQL> select *
  2    from groups
  3   order by groupkey
  4  /

  GROUPKEY
----------
         1
         2
         3
         4

SQL> select *
  2    from groups_depts
  3   order by groupkey, deptno
  4  /

  GROUPKEY     DEPTNO
---------- ----------
         1         10
         2         10
         2         20
         2         30
         3         20
         4         30

6 rows selected.

SQL> select *
  2    from groups_jobs
  3   order by groupkey, job
  4  /

  GROUPKEY JOB
---------- ---------
         1 PRESIDENT
         2 CLERK
         2 MANAGER
         3 ANALYST
         4 SALESMAN

Or if we take the same control statement I've used before:

SQL> select g.groupkey
  2       , d.deptnos
  3       , j.jobs
  4    from groups g
  5    join (
  6           select groupkey
  7                , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
  8             from groups_depts
  9            group by groupkey
 10         ) d
 11         on d.groupkey = g.groupkey
 12    join (
 13           select groupkey
 14                , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs
 15             from groups_jobs
 16            group by groupkey
 17         ) j
 18         on j.groupkey = g.groupkey
 19  /

  GROUPKEY DEPTNOS                  JOBS
---------- ------------------------ --------------------------------
         1 10                       PRESIDENT
         2 10,20,30                 CLERK,MANAGER
         3 20                       ANALYST
         4 30                       SALESMAN

Yes - we got the same groups of groups again. And we now even have a short and sweet method of querying out the same group objects:

SQL> select *
  2    from groups_view
  3   order by groupkey
  4  /

  GROUPKEY DEPTNOS(NT)              JOBS
---------- ------------------------ --------------------------------
         1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT')
           TAB_TYPE(10))

         2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER')
           TAB_TYPE(10, 20, 30))

         3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST')
           TAB_TYPE(20))

         4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN')
           TAB_TYPE(30))

Same result as we've also seen before.

And finally I am happy with the result. When I did this to my real world data, I got exactly the 3/4 million unique groups. Yay :-D

I had actually expected the instead-of trigger insertion to slow down the process, as I introduced a kind of "row-by-row" processing here. But in reality it turned out to be slightly faster (about 50 minutes compared to an hour) than the method using TABLE to "break out" the collections and then INSERT ALL. There may have been differences in load on the server at the time, so I won't conclude that one method or the other is faster - I'll just note that the instead-of trigger method for this particular use case at least didn't slow things down ;-)

And that's a wrap - now I can start using the generated groups of data :-D

No comments:

Post a Comment