I have a table of relationships between contracts with two variables: ID1 (refinanced contract) and ID2 (refinancing contract). I want to create a variable that groups all rows of N to M refinancings. For example, if we have:

ID1   ID2
A     Z
A     Y
A     X
B     Z
B     Y
C     W
D     W
E     V
E     U
F     T

I want to create a variable such that:

ID1   ID2    Group
A     Z      1
A     Y      1
A     X      1
B     Z      1
B     Y      1
C     W      2
D     W      2
E     V      3
E     U      3
F     T      4

The logic is the following:

  • All rows with the same ID1 should have the same value of Group.
  • All rows with the same ID2 should have the same value of Group.
  • The two above conditions have to be combined. In the example, since A->Z and also B->Z, then all rows with ID1 = A or ID1 = B should have the same value of Group, since they share at least one ID2. Conversely, all rows with ID2 = Y or ID2 = X should have the same value of Group, since they share at least one ID1 (=A).

A particular case which is difficult to treat is the following:

ID1   ID2          
A     X      
B     X      
B     Y      
C     Y      
C     Z     
D     Z      

All these rows should have the same value of Group, because:

  • Since A->X and also B->X, all rows with ID1 = A and ID1 = B should have the same value of Group.
  • Since B->Y and also C->Y, all rows with ID1 = B and ID1 = C should have the same value of Group, which should be the same value as the rows with ID1 = A.
  • Since C->Z and also D->Z, all rows with ID1 = C and ID1 = D should have the same value of Group, which should be the same value as the rows with ID1 = A and ID1 = B.

However, I can't see how to do it without iteratively doing joins that substantially increase the table size when programming. Since my table contains millions of rows, it is not feasible to apply this iterative logic. Could you help me with a more optimized method?

  • You could add some of your code to the question and comment on why it is unsatisfactory for you. This should help to understand your problem better. Have a look at this question, it seems that yours is a duplicate. Commented Nov 13 at 18:56
  • Do you have access to SAS Network Optimization algorithms? This can be formulated as an undirected network, and PROC OPTGRAPH / PROC OPTNETWORK can group this for you in very little code. Otherwise, you're going to be stuck doing a lot of iterative work with hash tables or SQL joins. Commented Nov 13 at 18:59
  • This question is similar to: SAS - grouping pairs. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. Commented Nov 13 at 19:26
  • This question looks different as it does NOT require that the values of the variables come from the same domain. For example ID1 could be character and ID2 could be numeric. Commented Nov 13 at 22:03

3 Answers

If the data is sorted by either ID1 or ID2 (and the number of distinct values of ID1 and ID2 is small enough to store in hash objects), you can do it in one pass using two hash objects: one keyed on the distinct values of ID1 and the other on the distinct values of ID2.

data want;
  set have ;
  if _n_=1 then do;
    declare hash h1 (ordered:'yes');
    h1.definekey('id1');
    h1.definedata('id1','group1');
    h1.definedone();
    declare hash h2 (ordered:'yes');
    h2.definekey('id2');
    h2.definedata('id2','group2');
    h2.definedone();
  end;
  if h1.find() and h2.find() then do;
    next+1;
    group1=next;
    group2=next;
  end;
  else if h1.find() then group1=group2;
  else if h2.find() then group2=group1;
  h1.replace();
  h2.replace();
  keep id1 id2 group1;
  rename group1=group;
run;

But if the data is not sorted, this can create too many groups, which you can see by moving one of the ID1='B' observations:

 ID1    ID2    group

  A      Z       1
  B      Y       2
  A      Y       1
  A      X       1
  B      Z       2
  C      W       3
  D      W       3
  E      V       4
  E      U       4
  F      T       5

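So make sure the data is sorted. Even with sorted data, a later row can link two groups that were already numbered separately (for example rows A-X, B-Y, C-X, C-Y), and this single pass does not renumber the earlier rows.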
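A result like the unsorted one above can be checked against the two rules in the question (one group per ID1, one group per ID2). The following is only a minimal sketch: the dataset names want, bad_id1 and bad_id2 and the renamed column grp are hypothetical, and the input rows are copied from the unsorted output shown above.

```sas
/* Output of the one-pass step on the unsorted data shown above */
data want;
  input id1 $ id2 $ group;
  datalines;
A Z 1
B Y 2
A Y 1
A X 1
B Z 2
C W 3
D W 3
E V 4
E U 4
F T 5
;
run;

proc sql;
  /* ID1 values that ended up in more than one group */
  create table bad_id1 as
    select id1, count(distinct grp) as n_groups
    from want(rename=(group=grp))   /* grp: hypothetical name, avoids the SQL keyword GROUP */
    group by id1
    having count(distinct grp) > 1;

  /* ID2 values that ended up in more than one group */
  create table bad_id2 as
    select id2, count(distinct grp) as n_groups
    from want(rename=(group=grp))
    group by id2
    having count(distinct grp) > 1;
quit;
```

For this input, bad_id1 is empty, while bad_id2 lists Y and Z, because each of them appears in both group 1 and group 2. Two empty tables mean both rules hold.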

If the data is too large for hash objects, you could resort to a brute-force iterative method. For example, if the values of ID1 and ID2 come from the same domain (that is, a value of 'A' in ID1 has the same meaning as a value of 'A' in ID2), you could use something like this %subnet() macro.

%subnet(in=have,out=want,from=id1,to=id2,directed=0,subnet=group);

If you treat this as an undirected network where each group is defined as a connected component, you can solve this with PROC OPTNETWORK

proc optnetwork 
    links=have
    direction=undirected
    outlinks=want(rename=(concomp=group));
    connectedcomponents;
    linksvar from=id1 to=id2;
run;

On SAS 9.4, you will need to use PROC OPTNET instead and do some data manipulation of outnodes afterwards:

proc optnet 
    links=have
    direction=undirected
    out_nodes=outnodes;
    concomp;
    data_links_var from=id1 to=id2;
run;

proc sql;
    create table want as
        select t1.*, t2.concomp as group
        from have as t1
        left join 
             outnodes as t2
        on t1.id1=t2.node;
quit;

Result:

ID1 ID2 group
A   Z   1
A   Y   1
A   X   1
B   Z   1
B   Y   1
C   W   2
D   W   2
E   V   3
E   U   3
F   T   4


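Here is a DATA step solution that uses temporary arrays as the lookup table (capped here at 100 distinct IDs of up to 20 characters). Note that when both IDs have already been seen in different groups, it sets group to missing rather than merging the two groups: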

%macro grouplitize(src=, dest=);
    proc sort data=&src.;
        by ID1 ID2;
    run;

    data &dest.;
        set &src.;
        
        /* Arrays to store observed IDs and their group numbers */
        array id_lookup[100] $20 _temporary_;   /* store both ID1 and ID2 values */
        array group_lookup[100] _temporary_;    /* store group number for each observed ID */
        
        retain counter 0 n_obs;                 /* counter = current group number, n_obs = number of items in lookup */
        
        length group 8;
        
        if _N_ = 1 then n_obs = 0;
        
        /* determine if ID1 and ID2 are already in lookup */
        id1_found = 0; id2_found = 0;
        do i = 1 to n_obs;
            if id_lookup[i] = ID1 then do; id1_found = 1; group1 = group_lookup[i]; end;
            if id_lookup[i] = ID2 then do; id2_found = 1; group2 = group_lookup[i]; end;
        end;
        
        /* Case 1: Neither ID1 nor ID2 seen before */
        if id1_found=0 and id2_found=0 then do;
            counter + 1;
            n_obs + 1; id_lookup[n_obs] = ID1; group_lookup[n_obs] = counter;
            n_obs + 1; id_lookup[n_obs] = ID2; group_lookup[n_obs] = counter;
            group = counter;
        end;
        /* Case 2: ID1 seen, ID2 not seen */
        else if id1_found=1 and id2_found=0 then do;
            n_obs + 1; id_lookup[n_obs] = ID2; group_lookup[n_obs] = group1;
            group = group1;
        end;
        /* Case 3: ID1 not seen, ID2 seen */
        else if id1_found=0 and id2_found=1 then do;
            n_obs + 1; id_lookup[n_obs] = ID1; group_lookup[n_obs] = group2;
            group = group2;
        end;
        /* Case 4: Both ID1 and ID2 seen */
        else do;
            if group1=group2 then do;
                group = group1;
            end;
            else do;
                group = .;
            end;
        end;
        
        drop i id1_found id2_found group1 group2 n_obs counter;
    run;

%mend grouplitize;

Example 1

data example1;
    infile datalines dlm=' ' dsd truncover;
    input ID1 $ ID2 $;
    datalines;
A Z
A Y
A X
B Z
B Y
C W
D W
E V
E U
F T
;
run;

%grouplitize(src=example1, dest=example1_out);

proc print data=example1_out;
run;

Example 2

data example2;
    infile datalines dlm=' ' dsd truncover;
    input ID1 $ ID2 $;
    datalines;
A X
B X
B Y
C Y
C Z
D Z
;
run;

%grouplitize(src=example2, dest=example2_out);

proc print data=example2_out;
run;
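
Case 4 in the macro above leaves group missing when ID1 and ID2 were already assigned to different groups. Merging the two groups at that point is what a disjoint-set (union-find) structure does. The following is a minimal sketch of that different technique, not the method used above. It assumes that all distinct IDs fit in memory, that ID values are at most 40 characters long, and that ID1 and ID2 may come from different domains (hence the '1:' and '2:' prefixes). The names uf, grp, node, parent, ra, rb and n_groups are hypothetical.

```sas
data have;
  input id1 $ id2 $;
  datalines;
A X
B Y
C X
C Y
;
run;

data want (keep=id1 id2 group);
  if 0 then set have;                  /* puts id1 and id2 first in the column order */
  length node parent a b ra rb $42 group 8;
  declare hash uf();                   /* node -> parent pointer */
  uf.definekey('node');
  uf.definedata('node', 'parent');
  uf.definedone();
  declare hash grp();                  /* root node -> sequential group number */
  grp.definekey('ra');
  grp.definedata('group');
  grp.definedone();
  call missing(node, parent, ra, group);

  /* Pass 1: link the two endpoints of every row */
  do until (eof1);
    set have end=eof1;
    a = cats('1:', id1);
    b = cats('2:', id2);

    node = a;                          /* find the root of a, adding a if it is new */
    if uf.find() ne 0 then do; parent = a; uf.add(); end;
    do while (parent ne node); node = parent; rc = uf.find(); end;
    ra = node;

    node = b;                          /* find the root of b, adding b if it is new */
    if uf.find() ne 0 then do; parent = b; uf.add(); end;
    do while (parent ne node); node = parent; rc = uf.find(); end;
    rb = node;

    parent = ra;                       /* merge: ra becomes the root of both trees */
    node = rb; uf.replace();
    node = a;  uf.replace();           /* shorten the paths of a and b */
    node = b;  uf.replace();
  end;

  /* Pass 2: number the roots in order of first appearance */
  do until (eof2);
    set have end=eof2;
    node = cats('1:', id1);
    rc = uf.find();
    do while (parent ne node); node = parent; rc = uf.find(); end;
    ra = node;
    if grp.find() ne 0 then do;
      n_groups + 1;
      group = n_groups;
      grp.add();
    end;
    output;
  end;
  stop;
run;
```

With these four rows, the row C Y links the group of A and X to the group of B and Y, so all rows receive group 1. The input does not need to be sorted, because links are merged as they are found and groups are only numbered in the second pass. The sketch omits union by rank, so very long chains of links could make the root searches slow.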
