
I'm working with an omics dataset of 1000+ files: a folder of roughly 1GB of tab-separated .txt.gz files. Each file looks roughly like this for a patient ABC:

pos ABC_count1 ABC_count2
100 3 4
101 3 7
102 4 4

(and so on). Each file has ~300k rows and 3 columns as above, and note that the pos column does not necessarily overlap between files. I want to join all of them so that in the end I have a dataframe/file like:

pos ABC_count1 ABC_count2 DEF_count1 DEF_count2
100 3 4 NA NA
101 3 7 NA NA
102 4 4 NA NA
200 NA NA 1 16
201 NA NA 3 4
202 NA NA 9 20

My understanding is that this is a full outer join, which I can write in SQL (via DuckDB) or with Python dataframe libraries like pandas. I tried doing this with Dask, but the program always runs out of memory, seemingly no matter what I set the limit to, despite having 256GB of RAM. I also experimented with Polars and functools.reduce, but that had similar problems.
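The Polars attempt was essentially a fold of full outer joins over the per-file frames, roughly like this (a sketch from memory rather than the exact script; the path and options are illustrative, and it assumes Polars >= 1.0 where a full outer join is spelled how="full"):

# Rough reconstruction of the Polars + functools.reduce attempt (illustrative).
from functools import reduce
from pathlib import Path

import polars as pl

files = sorted(Path("data/").rglob("*.txt.gz"))  # path assumed

# read_csv decompresses .gz files in memory.
frames = [pl.read_csv(f, separator="\t") for f in files]

# Fold a full outer join over all frames, coalescing the shared "pos" key.
merged = reduce(
    lambda left, right: left.join(right, on="pos", how="full", coalesce=True),
    frames,
)

merged.write_parquet("merged.parquet")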

The specific DuckDB query, after reading the data in as t0, t1, t2, ..., chained the joins like this:

((t0 FULL OUTER JOIN t1 USING (pos)) FULL OUTER JOIN t2 USING (pos)) -- and so on for all the files
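Spelled out in Python, building and running that chained query looked roughly like this (a sketch rather than the exact script; the read_csv options and the final COPY step are assumptions):

# Sketch of the DuckDB attempt: register each file as a view, then chain full outer joins.
from pathlib import Path

import duckdb

con = duckdb.connect()
files = sorted(Path("data/").rglob("*.txt.gz"))  # path assumed

for i, f in enumerate(files):
    # DuckDB's read_csv decompresses .gz transparently; Python turns \t into a
    # real tab character inside the SQL string.
    con.execute(
        f"CREATE VIEW t{i} AS SELECT * FROM read_csv('{f}', delim='\t', header=true)"
    )

# Build ((t0 FULL OUTER JOIN t1 USING (pos)) FULL OUTER JOIN t2 USING (pos)) ...
join_expr = "t0"
for i in range(1, len(files)):
    join_expr = f"({join_expr} FULL OUTER JOIN t{i} USING (pos))"

con.execute(f"COPY (SELECT * FROM {join_expr}) TO 'merged.parquet' (FORMAT parquet)")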

Maybe Dask isn't the smartest choice for this job, but I wanted to try it on this problem. At first I just tried a simple nested outer join, but after some recursion depth issues I turned the call graph into a tree instead. My code is below.

from pathlib import Path

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

data_path = Path("data/")  # folder containing the .txt.gz files

if __name__ == "__main__":
    local_cluster = LocalCluster(
        dashboard_address=":8786",
        local_directory="/dev/shm/tmpdir",
        n_workers=2,
        threads_per_worker=6,
        memory_limit="70GB",  # I've tried so many different combinations of workers, threads and memory
    )
    client = Client(local_cluster)

    dfs = []
    for file in data_path.rglob("*"):
        patient_id = file.name.split("_")[1]  # extract the patient id
        patient_df = dd.read_csv(
            file,
            sep="\t",
            blocksize=None,  # gzip isn't splittable, so each file is one partition
            engine="pyarrow",
            compression="gzip",
        )  # the header row supplies the column names (pos, <ID>_count1, <ID>_count2)

        patient_df = patient_df.set_index("pos", sorted=False)
        patient_df = patient_df.repartition(partition_size="256MB")
        dfs.append(patient_df)

    merged_df = tree_reduction(dfs)

    merged_df.to_parquet(
        "out_folder/", engine="pyarrow", write_index=True
    )

and tree_reduction is:

def tree_reduction(dfs):
    """Merge a list of dataframes with full outer joins arranged as a balanced
    binary tree, keeping the recursion (and task graph) depth logarithmic."""
    if len(dfs) == 1:
        return dfs[0]
    elif len(dfs) == 2:
        return dd.merge(dfs[0], dfs[1], how="outer", left_index=True, right_index=True)
    else:
        midpoint = len(dfs) // 2
        left = tree_reduction(dfs[:midpoint])
        right = tree_reduction(dfs[midpoint:])
        return dd.merge(left, right, how="outer", left_index=True, right_index=True)

This consistently runs out of memory and crashes about 30s after the merging starts. DuckDB also runs out of memory. So I wanted to ask:

  1. Is there a smarter way of doing this? Maybe only merging two files at a time using pandas, writing to disk and doing that until I'm left with one file? I've found Dask isn't too great with gzip files generally.
  2. Why does the Dask code have memory problems and is there anything I can do to get around them without using a cluster with more RAM available? Is there a way to make a runtime/memory tradeoff?
  3. What other ways can this problem be simplified? I thought about building a column of unique positions, but I'm not sure how to use it to join all the data (my rough attempt at this idea is sketched just below this list).
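What I had in mind for the unique-positions idea in question 3 is roughly the following two-pass approach (untested, and I'm not sure it actually sidesteps the memory problem; the paths and dtypes are guesses): build the sorted union of pos first, then reindex each file onto that shared index so the final step is an aligned column concatenation instead of an outer join.

# Sketch of the "union of positions" idea: two passes over the files, no joins.
# Assumes each file has a header row like: pos <ID>_count1 <ID>_count2
from pathlib import Path

import numpy as np
import pandas as pd

data_path = Path("data/")  # assumed location of the .txt.gz files
files = sorted(data_path.rglob("*.txt.gz"))

# Pass 1: build the global sorted index of positions.
all_pos = set()
for f in files:
    pos = pd.read_csv(f, sep="\t", usecols=["pos"], compression="gzip")["pos"]
    all_pos.update(pos.to_numpy())
global_index = pd.Index(np.sort(np.fromiter(all_pos, dtype=np.int64)), name="pos")

# Pass 2: reindex each file onto the global index; missing positions become NA.
# Because every piece now shares the same index, the axis=1 concat is an aligned
# column append rather than an outer join.
pieces = []
for f in files:
    df = pd.read_csv(f, sep="\t", index_col="pos", compression="gzip")
    pieces.append(df.reindex(global_index).astype("Int64"))  # nullable ints keep NA

merged = pd.concat(pieces, axis=1)
merged.to_parquet("merged.parquet")

The second pass could also write each reindexed piece to its own Parquet file instead of keeping everything in memory, but I'm unsure whether that is the intended way to exploit the unique-positions column.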
  • > the pos column is not necessarily overlapping between the files But do they overlap in practice? A full outer join can generate a lot of key combinations, but with only ~1GB of zipped text files this should not be a problem, unless the same join-column values appear multiple times. Also, I'm not sure I understand your data structure correctly: based on your example, you'll end up with 2000+ columns? Could you do this differently and just concatenate the dataframes instead? Commented Jul 11 at 14:56
  • @GuillaumeEB They never overlap completely, and I know how annoying that is. The problem comes from domain-specific preprocessing upstream: if all values for a pos are null for a patient, that row is dropped before we receive the data. I thought this could be fixed by taking the union (followed by unique) of the pos values, but I'm not sure how to make progress with that. Commented Jul 12 at 22:28

1 Answer


The desired output is memory inefficient, since a lot of NA values will be generated. If the downstream postprocessing doesn't require these NA values, then the outer join should be avoided.

Instead, you can concatenate each patient's data row-wise and use a separate identifier column to select a specific patient's data.
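For example, something along these lines (a rough sketch; the path and column handling are assumptions based on your sample files):

# Sketch of the row-wise alternative: stack all files in "long" format with a
# patient column instead of widening into 2000+ columns full of NA.
from pathlib import Path

import pandas as pd

data_path = Path("data/")  # assumed location of the .txt.gz files

pieces = []
for f in sorted(data_path.rglob("*.txt.gz")):
    df = pd.read_csv(f, sep="\t", compression="gzip")
    patient_id = df.columns[1].split("_")[0]   # e.g. "ABC" from "ABC_count1"
    df.columns = ["pos", "count1", "count2"]   # normalise the column names
    df["patient"] = patient_id
    pieces.append(df)

long_df = pd.concat(pieces, ignore_index=True)
long_df.to_parquet("long_format.parquet")

# Accessing a specific patient is then a simple filter:
abc = long_df[long_df["patient"] == "ABC"]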


1 Comment

Hi. Unfortunately the NA values are required downstream, as this is a domain-specific task. I wish the latter approach were possible, but I think the problem would need to be reformulated to avoid the outer join.
