I'm working with an omics dataset: a folder of 1000+ tab-separated .txt.gz files totalling about 1 GB. Each file looks roughly like this for a patient ABC:
| pos | ABC_count1 | ABC_count2 |
|---|---|---|
| 100 | 3 | 4 |
| 101 | 3 | 7 |
| 102 | 4 | 4 |
(and so on). Each file has ~300k rows and the 3 columns shown above, and note that the pos column does not necessarily overlap between files. I want to join all of them so that in the end I have a single dataframe/file like:
| pos | ABC_count1 | ABC_count2 | DEF_count1 | DEF_count2 |
|---|---|---|---|---|
| 100 | 3 | 4 | NA | NA |
| 101 | 3 | 7 | NA | NA |
| 102 | 4 | 4 | NA | NA |
| 200 | NA | NA | 1 | 16 |
| 201 | NA | NA | 3 | 4 |
| 202 | NA | NA | 9 | 20 |
My understanding is that this is a full outer join, and I can write one in SQL (via DuckDB) or with Python dataframe libraries such as pandas. I tried doing this with Dask, but the program always runs out of memory seemingly no matter what I set the limit to, despite having 256 GB of RAM. I also experimented with Polars and functools.reduce, but that had similar problems.
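For context, the Polars attempt followed roughly this pattern (a sketch rather than the exact code I ran; the data folder and output paths are placeholders):

```python
import functools
from pathlib import Path

import polars as pl

# Read every gzipped TSV eagerly; read_csv can decompress gzip itself.
frames = [
    pl.read_csv(f, separator="\t")
    for f in sorted(Path("data").rglob("*.txt.gz"))  # placeholder data folder
]

# Fold the frames together with repeated full outer joins on pos.
merged = functools.reduce(
    lambda left, right: left.join(right, on="pos", how="full", coalesce=True),  # how="outer" on older Polars
    frames,
)
merged.write_parquet("merged.parquet")  # placeholder output path
```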
The specific DuckDB approach was to read the files in as t0, t1, t2, ... and then run a chained join:

```sql
SELECT * FROM ((t0 FULL OUTER JOIN t1 USING (pos)) FULL OUTER JOIN t2 USING (pos)) -- and so on for all the files
```
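Building and running that query programmatically looked roughly like this (a sketch; the folder path, view names and output file are placeholders):

```python
import duckdb
from pathlib import Path

con = duckdb.connect()
files = sorted(Path("data").rglob("*.txt.gz"))  # placeholder data folder

# Register each gzipped TSV as a view t0, t1, t2, ... (DuckDB decompresses .gz itself;
# the '\t' escape puts a literal tab character into the SQL string).
for i, f in enumerate(files):
    con.execute(
        f"CREATE VIEW t{i} AS SELECT * FROM read_csv_auto('{f}', delim='\t', header=true)"
    )

# Chain the FULL OUTER JOINs on pos into one query string.
joined = "t0"
for i in range(1, len(files)):
    joined = f"({joined} FULL OUTER JOIN t{i} USING (pos))"

con.execute(f"COPY (SELECT * FROM {joined}) TO 'merged.parquet' (FORMAT parquet)")
```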
Maybe Dask isn't the smartest choice for this job, but I wanted to try it here. At first I tried a simple nested outer join, but after hitting recursion depth issues I turned the call graph into a tree instead. My code is below.
```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from pathlib import Path

data_path = Path("data")  # folder containing the .txt.gz files

if __name__ == "__main__":
    local_cluster = LocalCluster(
        dashboard_address=":8786",
        local_directory="/dev/shm/tmpdir",
        n_workers=2,
        threads_per_worker=6,
        memory_limit="70GB",  # I've tried so many different combinations of workers, threads and memory
    )
    client = Client(local_cluster)

    dfs = []
    for file in data_path.rglob("*.txt.gz"):
        patient_id = file.name.split("_")[1]  # extract the patient id
        patient_df = dd.read_csv(
            file,
            sep="\t",
            blocksize=None,  # gzip can't be split, so one partition per file
            engine="pyarrow",
            compression="gzip",
        )
        # Each file carries its own header row (pos, <patient>_count1, <patient>_count2).
        patient_df = patient_df.set_index("pos", sorted=False)
        patient_df = patient_df.repartition(partition_size="256MB")
        dfs.append(patient_df)

    merged_df = tree_reduction(dfs)
    merged_df.to_parquet("out_folder/", engine="pyarrow", write_index=True)
```
and tree_reduction is:
```python
def tree_reduction(dfs):
    if len(dfs) == 1:
        return dfs[0]
    elif len(dfs) == 2:
        return dd.merge(dfs[0], dfs[1], how="outer", left_index=True, right_index=True)
    else:
        midpoint = len(dfs) // 2
        left = tree_reduction(dfs[:midpoint])
        right = tree_reduction(dfs[midpoint:])
        return dd.merge(left, right, how="outer", left_index=True, right_index=True)
```
This consistently runs out of memory and crashes about 30 s after the merging starts. The DuckDB version also runs out of memory. So I wanted to ask:
- Is there a smarter way of doing this? Maybe only merging two files at a time with pandas, writing the result to disk, and repeating until I'm left with one file (see the sketch after this list)? I've found Dask isn't too great with gzip files in general.
- Why does the Dask code have memory problems and is there anything I can do to get around them without using a cluster with more RAM available? Is there a way to make a runtime/memory tradeoff?
- Are there other ways this problem can be simplified? I thought about building a column of unique positions, but I'm not sure how to use it to join all the data.
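For the first bullet, the pairwise, disk-backed merging I have in mind would look roughly like this (a sketch only; paths, file names and the Parquet intermediate format are placeholders):

```python
import pandas as pd
from pathlib import Path

def read_any(path: Path) -> pd.DataFrame:
    # Intermediate results are Parquet; the originals are gzipped TSVs with a header row.
    if path.suffix == ".parquet":
        return pd.read_parquet(path)
    return pd.read_csv(path, sep="\t", compression="gzip")

def merge_round(paths: list[Path], tmp_dir: Path, round_no: int) -> list[Path]:
    """Full-outer-join adjacent pairs on pos, write each result, return the new file list."""
    outputs = []
    for i in range(0, len(paths) - 1, 2):
        out = tmp_dir / f"round{round_no}_pair{i // 2}.parquet"
        merged = read_any(paths[i]).merge(read_any(paths[i + 1]), on="pos", how="outer")
        merged.to_parquet(out, index=False)
        outputs.append(out)
    if len(paths) % 2 == 1:  # carry an odd leftover file into the next round
        outputs.append(paths[-1])
    return outputs

def merge_all(files: list[Path], tmp_dir: Path) -> Path:
    # Repeat rounds until a single merged file remains.
    round_no = 0
    while len(files) > 1:
        files = merge_round(files, tmp_dir, round_no)
        round_no += 1
    return files[0]
```

Each step only holds one pair of tables (plus their merged result) in memory, so it trades extra disk I/O for bounded RAM.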