I'm working with an omics dataset: a folder of 1000+ tab-separated .txt.gz files totalling about 1 GB. Each file looks roughly like this for a patient ABC:
| pos | ABC_count1 | ABC_count2 |
|---|---|---|
| 100 | 3 | 4 |
| 101 | 3 | 7 |
| 102 | 4 | 4 |
(and so on). Each file has ~300k rows and the 3 columns shown above, and note that the pos column does not necessarily overlap between files. I want to join all of them so that in the end I have a single dataframe/file like:
| pos | ABC_count1 | ABC_count2 | DEF_count1 | DEF_count2 |
|---|---|---|---|---|
| 100 | 3 | 4 | NA | NA |
| 101 | 3 | 7 | NA | NA |
| 102 | 4 | 4 | NA | NA |
| 200 | NA | NA | 1 | 16 |
| 201 | NA | NA | 3 | 4 |
| 202 | NA | NA | 9 | 20 |
My understanding is that this is a full outer join, and I can write one in SQL (via DuckDB) or with Python dataframe libraries such as pandas. I tried doing this with Dask, but the program always runs out of memory seemingly no matter what I set the limit to, despite having 256 GB of RAM. I also experimented with Polars and functools.reduce, but that had similar problems.
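For context, the Polars attempt followed roughly this pattern (a sketch rather than the exact code I ran; the data folder and output paths are placeholders):

```python
import functools
from pathlib import Path

import polars as pl

# Read every gzipped TSV eagerly; read_csv can decompress gzip itself.
frames = [
    pl.read_csv(f, separator="\t")
    for f in sorted(Path("data").rglob("*.txt.gz"))  # placeholder data folder
]

# Fold the frames together with repeated full outer joins on pos.
merged = functools.reduce(
    lambda left, right: left.join(right, on="pos", how="full", coalesce=True),  # how="outer" on older Polars
    frames,
)
merged.write_parquet("merged.parquet")  # placeholder output path
```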
The specific DuckDB approach was to read the files in as t0, t1, t2, ... and then run a chained join:

```sql
SELECT * FROM ((t0 FULL OUTER JOIN t1 USING (pos)) FULL OUTER JOIN t2 USING (pos)) -- and so on for all the files
```
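Building and running that query programmatically looked roughly like this (a sketch; the folder path, view names and output file are placeholders):

```python
import duckdb
from pathlib import Path

con = duckdb.connect()
files = sorted(Path("data").rglob("*.txt.gz"))  # placeholder data folder

# Register each gzipped TSV as a view t0, t1, t2, ... (DuckDB decompresses .gz itself;
# the '\t' escape puts a literal tab character into the SQL string).
for i, f in enumerate(files):
    con.execute(
        f"CREATE VIEW t{i} AS SELECT * FROM read_csv_auto('{f}', delim='\t', header=true)"
    )

# Chain the FULL OUTER JOINs on pos into one query string.
joined = "t0"
for i in range(1, len(files)):
    joined = f"({joined} FULL OUTER JOIN t{i} USING (pos))"

con.execute(f"COPY (SELECT * FROM {joined}) TO 'merged.parquet' (FORMAT parquet)")
```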
Maybe Dask isn't the smartest choice for this job, but I wanted to try it here. At first I tried a simple nested outer join, but after hitting recursion depth issues I turned the call graph into a tree instead. My code is below.
```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from pathlib import Path

data_path = Path("data")  # folder containing the .txt.gz files

if __name__ == "__main__":
    local_cluster = LocalCluster(
        dashboard_address=":8786",
        local_directory="/dev/shm/tmpdir",
        n_workers=2,
        threads_per_worker=6,
        memory_limit="70GB",  # I've tried so many different combinations of workers, threads and memory
    )
    client = Client(local_cluster)

    dfs = []
    for file in data_path.rglob("*.txt.gz"):
        patient_id = file.name.split("_")[1]  # extract the patient id
        patient_df = dd.read_csv(
            file,
            sep="\t",
            blocksize=None,  # gzip can't be split, so one partition per file
            engine="pyarrow",
            compression="gzip",
        )
        # Each file carries its own header row (pos, <patient>_count1, <patient>_count2).
        patient_df = patient_df.set_index("pos", sorted=False)
        patient_df = patient_df.repartition(partition_size="256MB")
        dfs.append(patient_df)

    merged_df = tree_reduction(dfs)
    merged_df.to_parquet("out_folder/", engine="pyarrow", write_index=True)
```
and tree_reduction is:
```python
def tree_reduction(dfs):
    if len(dfs) == 1:
        return dfs[0]
    elif len(dfs) == 2:
        return dd.merge(dfs[0], dfs[1], how="outer", left_index=True, right_index=True)
    else:
        midpoint = len(dfs) // 2
        left = tree_reduction(dfs[:midpoint])
        right = tree_reduction(dfs[midpoint:])
        return dd.merge(left, right, how="outer", left_index=True, right_index=True)
```
This consistently runs out of memory and crashes about 30 s after the merging starts. The DuckDB version also runs out of memory. So I wanted to ask:
- Is there a smarter way of doing this? Maybe only merging two files at a time with pandas, writing the result to disk, and repeating until I'm left with one file (see the sketch after this list)? I've found Dask isn't too great with gzip files in general.
- Why does the Dask code have memory problems and is there anything I can do to get around them without using a cluster with more RAM available? Is there a way to make a runtime/memory tradeoff?
- Are there other ways this problem can be simplified? I thought about building a column of unique positions, but I'm not sure how to use it to join all the data.
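For the first bullet, the pairwise, disk-backed merging I have in mind would look roughly like this (a sketch only; paths, file names and the Parquet intermediate format are placeholders):

```python
import pandas as pd
from pathlib import Path

def read_any(path: Path) -> pd.DataFrame:
    # Intermediate results are Parquet; the originals are gzipped TSVs with a header row.
    if path.suffix == ".parquet":
        return pd.read_parquet(path)
    return pd.read_csv(path, sep="\t", compression="gzip")

def merge_round(paths: list[Path], tmp_dir: Path, round_no: int) -> list[Path]:
    """Full-outer-join adjacent pairs on pos, write each result, return the new file list."""
    outputs = []
    for i in range(0, len(paths) - 1, 2):
        out = tmp_dir / f"round{round_no}_pair{i // 2}.parquet"
        merged = read_any(paths[i]).merge(read_any(paths[i + 1]), on="pos", how="outer")
        merged.to_parquet(out, index=False)
        outputs.append(out)
    if len(paths) % 2 == 1:  # carry an odd leftover file into the next round
        outputs.append(paths[-1])
    return outputs

def merge_all(files: list[Path], tmp_dir: Path) -> Path:
    # Repeat rounds until a single merged file remains.
    round_no = 0
    while len(files) > 1:
        files = merge_round(files, tmp_dir, round_no)
        round_no += 1
    return files[0]
```

Each step only holds one pair of tables (plus their merged result) in memory, so it trades extra disk I/O for bounded RAM.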