Redshift Join on Regex Match

Question

I am trying to join one table (Table a), which has a string field fruit, against another table (Table b), which has a field all_fruits holding a list of comma-separated strings, using a pattern match as the join condition. I want to return the records that do not have a corresponding match.

The key restriction is that I can't use a UDF. When I tried the query below, it failed with this error:

The pattern must be a valid UTF-8 literal character expression

Table b can be quite large, so exploding the field isn't an option.

SELECT 
    a.*,
    b.*
FROM
    a
    LEFT JOIN  b
        ON a.fruit !~ b.all_fruits
;

The operative word is "literal". The regexp has to be a literal string; it can't be a variable or a column, so it can't be dynamic.
– Barmar, Commented May 7 at 19:46
I edited the title to remove 'dynamic' and am trying to see if there are alternatives.
– Jammy, Commented May 7 at 20:08
In general, putting comma-separated strings in SQL tables is a bad idea. You should normalize the schema.
– Barmar, Commented May 7 at 20:17
Unfortunately, REGEXP_INSTR() requires a string literal as well.
– Jammy, Commented May 7 at 20:30
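
As a rough illustration of the normalization suggestion in the comments, here is a minimal sketch using Python's built-in sqlite3 module rather than Redshift. The table b_fruit and the sample rows are hypothetical, and the sketch takes one reading of "no corresponding match": fruits in a that appear in no list at all. Whether storing Table b in this shape is practical at the asker's scale is a separate question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (fruit TEXT);
    -- hypothetical normalized form of b: one row per (list, fruit) pair
    CREATE TABLE b_fruit (b_id INTEGER, fruit TEXT);
    INSERT INTO a VALUES ('apple'), ('kiwi');
    INSERT INTO b_fruit VALUES (1, 'apple'), (1, 'pear'), (2, 'pear');
""")

# Fruits in a that appear in no list at all: a plain anti-join on equality,
# with no pattern matching involved.
unmatched = conn.execute("""
    SELECT a.fruit
    FROM a
    LEFT JOIN b_fruit bf ON bf.fruit = a.fruit
    WHERE bf.fruit IS NULL
""").fetchall()
print(unmatched)
```

With one row per list entry, the join condition is a plain equality, so the literal-pattern restriction never comes into play.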

halfer · Accepted Answer · 2025-11-05 00:22:04Z

This was a little tougher than I thought it would be. I hoped I could run the regexes directly in the database and have it output the result set for me, but I searched and searched and found nothing that worked. First I had to install MySQL and sort out some permissions issues, then get Emacs to run MySQL queries correctly, and then get Perl and MySQL to handle UTF-8 correctly.

Then I looked for a way to run simple regexes against queries from a database and found either nothing or incorrect answers. My first stop was here, but none of those answers worked. My next stop was here, but none of those answers worked either.

I decided that either it couldn't be done in pure SQL or I just couldn't figure it out. The only way I was going to get anything done was to write something myself in Perl, like with the previous two database regex solutions. Using Perl DBI, I manually fetched every row from the first table and compared it to every row from the second table using the regex you asked for.

I created a table called farmers with each individual fruit, a fruit_id number, and the cultivated_region where the fruit was grown. I created a table called grocers with a grocery_store_name, a grocery_store_id, and a list of all_fruits, which I assumed to be something like an inventory list. The script runs the regex we discussed and creates a view of the results called join_on_regex. The view shows, for each farmers.fruit, the grocery stores whose comma-separated grocers.all_fruits list does not contain it. I did some light testing and it appears to be working. If you find anything that doesn't work properly, please respond; I found this question pretty interesting.

Here is the code...

#!/usr/bin/perl -w

use DBI;
use Data::Dumper;
#pass all database information as command line argument
my ($database,$host,$port,$user,$pass) = (shift,shift,shift,shift, shift);
my $dsn = "DBI:mysql:database=$database;host=$host;port=$port";
my $dbh = DBI->connect($dsn,$user,$pass) or die "Connection Error: $DBI::errstr\n";

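#drop tables first. IF EXISTS avoids an error when the tables are not there yet.
#do() returns false on failure (it does not die unless RaiseError is set), so check its return value
$dbh->do("DROP TABLE IF EXISTS farmers") or print "Drop table failed: $DBI::errstr\n";
$dbh->do("DROP TABLE IF EXISTS grocers") or print "Drop table failed: $DBI::errstr\n";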

my $createTableSql = q(
    CREATE TABLE `farmers` (
      `fruit_id` int(16) unsigned NOT NULL AUTO_INCREMENT,
      `fruit` varchar(32) NOT NULL,
      `cultivated_region` varchar(32) NOT NULL,
      PRIMARY KEY (`fruit_id`)
    );
);

my $createTableSql2 = q(
    CREATE TABLE `grocers` (
      `grocery_store_id` int(16) unsigned NOT NULL AUTO_INCREMENT,
      `all_fruits` varchar(128) NOT NULL,
      `grocery_store_name` varchar(32) NOT NULL,
      PRIMARY KEY (`grocery_store_id`)
    );
);

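#each CREATE TABLE gets its own do() call; it didn't work when I tried to create both in the same query
#DBI reports failures through $DBI::errstr ($@ is only set by eval/die)
$dbh->do($createTableSql) or die("Couldn't create table farmers\n$DBI::errstr\n");
$dbh->do($createTableSql2) or die("Couldn't create table grocers\n$DBI::errstr\n");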

#Test Data 1:
#populate farmers table with 10 fruits
for(1..10){
  my $sql = "INSERT INTO farmers(fruit,cultivated_region) VALUES (?,?)";
  my $sth = $dbh->prepare($sql);
  #method calls are not interpolated inside strings, so use $DBI::errstr here
  $sth->execute ("fruit$_" , "region$_") or die "SQL Error: $DBI::errstr\n";
}
    
#Test Data 2:
#populate grocers table with 10 grocery stores, each with a list of 11 fruits (inventory)
#separated by commas; fruit numbers are random in 0..24, so duplicates are possible
for(1..10){
  my $sql = "INSERT INTO grocers(grocery_store_name,all_fruits) VALUES (?,?)";
  my $sth = $dbh->prepare($sql);

  my @insertValues;
  my $insertString = "";
  push(@insertValues,"grocery_store_name$_");
  for(0..10){
    $insertString .= "fruit" . int(rand(25)) . ",";
  }
  $insertString =~ s/,$//;
  push(@insertValues,$insertString);
  $sth->execute (@insertValues) or die "SQL Error: $DBI::errstr\n";
}

my @farmersRows;     #contains reference to each row from "select * from farmers" query
my @farmersMatch;    #contains reference to entire row from farmers which matches the regex
my @grocersRows;      #contains reference to each row from "select * from grocers" query
my @grocersMatch;     #contains reference to entire row from grocers which matches the regex

my %match; #@farmersMatch and @grocersMatch are synced by index.
            #This hash contains a one f.fruit_id to many g.grocery_store_id
             #wherever it is missing from g.all_fruits list
              #this hash is later used to create $finalQuery

#retrieve data from farmers table.
my $sth = $dbh->prepare("SELECT * FROM farmers");
$sth->execute() or die "Query failed\n";
while (my $ref = $sth->fetchrow_hashref()) {
  push(@farmersRows,$ref);
}

#retrieve data from grocers table.
$sth = $dbh->prepare("SELECT * FROM grocers");
$sth->execute() or die "Query failed\n";
while (my $ref = $sth->fetchrow_hashref()) {
  push(@grocersRows,$ref);
}
#DEBUG: print "$_->{fruit_id}" for(@farmersRows);
#DEBUG: print "$_->{grocery_store_id}" for(@grocersRows);

#compare data from both tables using the discussed regex in an n squared operation
#I don't know any way it can be done faster, or with pure SQL
for my $f (@farmersRows){
  for my $g (@grocersRows){
    #\Q...\E quotes the fruit name so any regex metacharacters in it match literally
    if($g->{all_fruits} !~ /(^|,)\Q$f->{fruit}\E(,|$)/){
      push(@farmersMatch, $f);
      push(@grocersMatch, $g);
      push(@{$match{$f->{fruit_id}}},$g->{grocery_store_id});
      #DEBUG: print "Found $f->{fruit}($f->{fruit_id}) NOT IN $g->{grocery_store_name}($g->{grocery_store_id})\nlist $g->{all_fruits}\n";
      #DEBUG: print "$f->{fruit_id} NOT IN $g->{grocery_store_id} $g->{all_fruits}\n";
    }
  }
}
#DEBUG: print Dumper(\%match);  #all relevant data should be in hash

#print match hash, one fruit_id to many grocery_store_id (where f.fruit_id is missing from g.all_fruits)
for my $k (sort {$a <=> $b} keys %match){
  print "fruit_id: $k is not in Grocery_store_id: (";
  my $s = "";
  for( @{$match{$k}}){
    $s .= "$_,";
  }
  chop($s);
  print "$s)\n";
}

print "\nHere is the final query\n\n";

my $finalQuery =
"
create or replace view join_on_regex as
select *
from (
";
for my $k (sort {$a <=> $b} keys %match){
  my $s = "";
  for( @{$match{$k}}){
    $s .= "$_,";
  }
  chop($s);
  $finalQuery .= "(select f.fruit_id, f.fruit, g.grocery_store_id, g.grocery_store_name, g.all_fruits
         from
         (select * from farmers where fruit_id=$k) f
         cross join
         (select * from grocers where grocery_store_id in($s)) g)
UNION ALL\n";
}
$finalQuery =~ s/UNION ALL\n$//;
$finalQuery .= ")t;";
print "$finalQuery\n";

$dbh->do($finalQuery) or die("Couldn't create view\n$DBI::errstr\n");
print "\n\nFinal query executed, to see result set run the following command:\n";
print "select * from join_on_regex order by fruit_id, grocery_store_id;\n";
$sth->finish();
$dbh->disconnect();

That was a little long, but it was the only way I could figure out to do it. To run it, pass the database information as command-line arguments. It prints a list showing which grocers.grocery_store_id inventory lists each farmers.fruit_id is missing from, then the SQL statement used to create the view join_on_regex, and finally the command to query the view. The output looks like this...

$ perl join.table.on.regex.pl DATABASE HOST PORT USERNAME PASSWORD

fruit_id: 1 is not in Grocery_store_id: (3,4,5,6,7,10)
fruit_id: 2 is not in Grocery_store_id: (1,3,6,7,8,9,10)
fruit_id: 3 is not in Grocery_store_id: (1,2,4,5,7,10)
fruit_id: 4 is not in Grocery_store_id: (1,2,3,4,5,6,8,9)
fruit_id: 5 is not in Grocery_store_id: (1,2,3,6,7,8,9,10)
fruit_id: 6 is not in Grocery_store_id: (1,2,4,6,7,9,10)
fruit_id: 7 is not in Grocery_store_id: (1,2,3,4,9)
fruit_id: 8 is not in Grocery_store_id: (2,3,4,5,6,7,8,10)
fruit_id: 9 is not in Grocery_store_id: (2,3,5,8,9)
fruit_id: 10 is not in Grocery_store_id: (2,3,4,5,6,7,9)

Here is the final query


create or replace view join_on_regex as
select *
from (
(select f.fruit_id, f.fruit, g.grocery_store_id, g.grocery_store_name, g.all_fruits
         from
         (select * from farmers where fruit_id=1) f
         cross join
         (select * from grocers where grocery_store_id in(3,4,5,6,7,10)) g)
UNION ALL
(select f.fruit_id, f.fruit, g.grocery_store_id, g.grocery_store_name, g.all_fruits
         from
         (select * from farmers where fruit_id=2) f
         cross join
         (select * from grocers where grocery_store_id in(1,3,6,7,8,9,10)) g)

<cut, goes on like this for 10 tables>

UNION ALL
(select f.fruit_id, f.fruit, g.grocery_store_id, g.grocery_store_name, g.all_fruits
         from
         (select * from farmers where fruit_id=10) f
         cross join
         (select * from grocers where grocery_store_id in(2,3,4,5,6,7,9)) g)
)t;

Final query executed, to see result set run the following command:
select * from join_on_regex order by fruit_id, grocery_store_id;

You can run the last select in a query editor like mysql-workbench or via the command line using mysqlsh and you will get a result set something like this...

+----------+---------+------------------+----------------------+--------------------------------------------------------------------------------------+
| fruit_id | fruit   | grocery_store_id | grocery_store_name   | all_fruits                                                                           |
+----------+---------+------------------+----------------------+--------------------------------------------------------------------------------------+
|        1 | fruit1  |                3 | grocery_store_name3  | fruit23,fruit23,fruit3,fruit0,fruit16,fruit6,fruit16,fruit17,fruit17,fruit22,fruit14 |
|        1 | fruit1  |                4 | grocery_store_name4  | fruit18,fruit2,fruit5,fruit18,fruit11,fruit9,fruit11,fruit5,fruit17,fruit23,fruit5   |
|        1 | fruit1  |                5 | grocery_store_name5  | fruit2,fruit20,fruit7,fruit5,fruit6,fruit0,fruit12,fruit5,fruit15,fruit2,fruit18     |
|        1 | fruit1  |                6 | grocery_store_name6  | fruit20,fruit0,fruit13,fruit3,fruit21,fruit9,fruit11,fruit23,fruit24,fruit7,fruit13  |
|        1 | fruit1  |                7 | grocery_store_name7  | fruit11,fruit4,fruit12,fruit7,fruit19,fruit9,fruit21,fruit22,fruit15,fruit19,fruit0  |
|        1 | fruit1  |               10 | grocery_store_name10 | fruit10,fruit9,fruit11,fruit17,fruit7,fruit14,fruit4,fruit19,fruit10,fruit9,fruit17  |
|        2 | fruit2  |                1 | grocery_store_name1  | fruit10,fruit11,fruit9,fruit0,fruit15,fruit1,fruit9,fruit8,fruit12,fruit10,fruit17   |
|        2 | fruit2  |                3 | grocery_store_name3  | fruit23,fruit23,fruit3,fruit0,fruit16,fruit6,fruit16,fruit17,fruit17,fruit22,fruit14 |
|        2 | fruit2  |                6 | grocery_store_name6  | fruit20,fruit0,fruit13,fruit3,fruit21,fruit9,fruit11,fruit23,fruit24,fruit7,fruit13  |
|        2 | fruit2  |                7 | grocery_store_name7  | fruit11,fruit4,fruit12,fruit7,fruit19,fruit9,fruit21,fruit22,fruit15,fruit19,fruit0  |
<cut result set is very long>

The tables are randomly generated for testing purposes. Go ahead and put some real sample data in the tables to verify that it works the way you wanted. This got quite a bit longer than I expected, but I couldn't find anything else that worked properly.
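
The core comparison in the script above can be tried in isolation. The following is a minimal Python sketch of the same whole-entry test, not a translation of the full script; the function name missing_pairs and the sample data are hypothetical. The anchors (^|,) and (,|$) ensure that fruit1 does not match inside fruit10.

```python
import re

def missing_pairs(fruits, stores):
    """Return (fruit, store_name) pairs where fruit is absent from the store's list."""
    pairs = []
    for fruit in fruits:
        # re.escape makes the fruit name match literally, even if it
        # contains regex metacharacters such as "." or "+"
        pattern = re.compile(r"(^|,)" + re.escape(fruit) + r"(,|$)")
        for name, all_fruits in stores.items():
            if not pattern.search(all_fruits):
                pairs.append((fruit, name))
    return pairs

# hypothetical sample data: store name -> comma-separated inventory
stores = {
    "store1": "fruit10,fruit11,fruit9",
    "store2": "fruit1,fruit2",
}
print(missing_pairs(["fruit1", "fruit9"], stores))
# [('fruit1', 'store1'), ('fruit9', 'store2')]
```

Like the Perl version, this compares every fruit with every store, so the work grows with the product of the two table sizes.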

tinazmu · Accepted Answer · 2025-05-07 23:17:55Z

You can use the regular CHARINDEX(substring, string) and reserve regular expressions for more complex pattern matching. Wrapping both the fruit and the list in commas ensures that only whole list entries match:

SELECT *
FROM table_a a
JOIN table_b b
ON CHARINDEX(',' || a.fruit || ',', ',' || b.all_fruits || ',') <> 0;

You could also use array features:

ON a.fruit = ANY(string_to_array(b.all_fruits, ','))

I can't test these, though!
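
The comma-wrapping idea can at least be checked outside Redshift. Here is a minimal sketch using Python's built-in sqlite3 module; it assumes SQLite's instr(string, substring) as a stand-in for CHARINDEX(substring, string), and the sample rows are hypothetical. The condition is flipped to = 0 so that it returns the non-matching pairs the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (fruit TEXT);
    CREATE TABLE table_b (id INTEGER, all_fruits TEXT);
    INSERT INTO table_a VALUES ('fruit1'), ('fruit9');
    INSERT INTO table_b VALUES (1, 'fruit10,fruit11,fruit9'), (2, 'fruit1,fruit2');
""")

# Without the wrapping commas, 'fruit1' is found inside 'fruit10' (false positive).
naive = conn.execute("SELECT instr('fruit10,fruit11', 'fruit1')").fetchone()[0]
wrapped = conn.execute("SELECT instr(',fruit10,fruit11,', ',fruit1,')").fetchone()[0]

# instr takes (string, substring), the reverse of CHARINDEX's argument order;
# both return 0 when the substring is not found.
non_matches = conn.execute("""
    SELECT a.fruit, b.id
    FROM table_a a
    JOIN table_b b
      ON instr(',' || b.all_fruits || ',', ',' || a.fruit || ',') = 0
    ORDER BY a.fruit, b.id
""").fetchall()
print(naive, wrapped, non_matches)
# 1 0 [('fruit1', 1), ('fruit9', 2)]
```

This shows only that the string logic behaves as intended. Redshift's own functions, and how well the join performs on a large Table b, would still need to be verified there.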

edited May 7 at 23:17

answered May 7 at 22:46

tinazmu
