
Commit 1d176e2

update reqs
1 parent a62aa40 commit 1d176e2

File tree

14 files changed: +122 additions, -91 deletions


LeafletSC.egg-info/PKG-INFO

Lines changed: 43 additions & 20 deletions
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: LeafletSC
-Version: 0.1.2
+Version: 0.1.4
 Summary: Alternative splicing quantification in single cells with Leaflet
 Home-page: https://github.com/daklab/Leaflet
 Author: Karin Isaev, Columbia University and NYGC
@@ -12,26 +12,27 @@ Classifier: Operating System :: OS Independent
 Requires-Python: >=3.9.15
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: gtfparse==1.3.0
-Requires-Dist: matplotlib==3.7.1
-Requires-Dist: numpy==1.23.5
-Requires-Dist: pandas==1.5.3
+Requires-Dist: gtfparse==2.5.0
+Requires-Dist: matplotlib
+Requires-Dist: numpy
+Requires-Dist: pandas
 Requires-Dist: pyranges==0.0.129
-Requires-Dist: scanpy==1.9.8
-Requires-Dist: scikit_learn==1.2.2
-Requires-Dist: scipy==1.10.1
-Requires-Dist: seaborn==0.13.2
-Requires-Dist: setuptools==69.1.1
-Requires-Dist: torch==2.2.1
-Requires-Dist: tqdm==4.66.2
-Requires-Dist: umap_learn==0.5.3
+Requires-Dist: scanpy
+Requires-Dist: scikit_learn
+Requires-Dist: scipy
+Requires-Dist: seaborn
+Requires-Dist: setuptools
+Requires-Dist: torch==1.12.1
+Requires-Dist: tqdm
+Requires-Dist: umap
+Requires-Dist: tables==3.4.4
 
 # LeafletSC
 
 LeafletSC is a binomial mixture model designed for the analysis of alternative splicing events in single-cell RNA sequencing data. The model facilitates understanding and quantifying splicing variability at the single-cell level. Below is the graphical model representation:
 
 <p align="center">
-<img src="https://github.com/daklab/Leaflet/assets/23510936/3e147ba5-7ee8-47ae-b84c-5e99e0551acf" width="500">
+<img src="https://github.com/daklab/Leaflet/assets/23510936/2c7981fe-91ec-4830-b010-b74ac4140940">
 </p>
 
 ## Compatibility with sequencing platforms
@@ -42,29 +43,50 @@ LeafletSC supports analysis from the following single-cell RNA sequencing platfo
 
 ## Getting Started
 
-LeafletSC is implemented in Python and requires Python version 3.9 or higher. You can easily install LeafletSC via PyPI using the following command:
+LeafletSC is implemented in Python and requires Python version 3.9 or higher. We recommend the following approach:
 
 ```bash
-pip install LeafletSC
+# create a conda environment with python 3.9
+conda create -n "LeafletSC" python=3.9.15 ipython
+# activate environment
+conda activate LeafletSC
+# install latest version of LeafletSC into this environment
+pip install LeafletSC==0.1.2
 ```
 
-Please also make sure you have regtools installed. Prior to using LeafletSC, please run regtools on your single-cell BAM files. Here is an example of what this might look like in a Snakefile:
+Once the package is installed, you can load it in python as follows:
+```python
+import LeafletSC
+
+# or specific submodules
+from LeafletSC.utils import *
+from LeafletSC.clustering import *
+```
+
+## Requirements
+Prior to using LeafletSC, please run **regtools** on your single-cell BAM files. Here is an example of what this might look like in a Snakefile:
 
 ```Snakemake
 {params.regtools_path} junctions extract -a 6 -m 50 -M 500000 {input.bam_use} -o {output.juncs} -s XS -b {output.barcodes}
 # Combine junctions and cell barcodes
 paste --delimiters='\t' {output.juncs} {output.barcodes} > {output.juncswbarcodes}
 ```
-
-Once you have your junction files, you can try out the mixture model tutorial under [Tutorials](Tutorials/run_binomial_mixture_model.ipynb)
+- Once you have your junction files, you can try out the mixture model tutorial under [Tutorials](Tutorials/run_binomial_mixture_model.ipynb)
+- While optional, we recommend running LeafletSC intron clustering with a gtf file so that junctions can be first mapped to annotated splicing events.
 
 ## Capabilities
 With LeafletSC, you can:
 
-- Infer cell states influenced by alternative splicing and identify significant splice junctions.
+- Infer cell states influenced by alternative splicing and identify differentially spliced regions.
 - Conduct differential splicing analysis between specific cell groups if cell identities are known.
 - Generate synthetic alternative splicing datasets for robust analysis testing.
 
+## How does it work?
+The full method can be found in our [paper](https://www.biorxiv.org/content/10.1101/2023.10.17.562774v3) while the graphical model is shown below:
+<p align="center">
+<img src="https://github.com/daklab/Leaflet/assets/23510936/3e147ba5-7ee8-47ae-b84c-5e99e0551acf">
+</p>
+
 ## If you use Leaflet, please cite our [paper](https://www.biorxiv.org/content/10.1101/2023.10.17.562774v3)
 
 ```
@@ -85,3 +107,4 @@ With LeafletSC, you can:
 2. Add 10X/split-seq mode in addition to smart-seq2
 3. Extend framework to seurat/scanpy anndata objects
 4. Add notes on generative model and inference method
+5. Clean up dependencies
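
An aside for readers of the README changes above: the phrase "binomial mixture model" can be made concrete with a generic sketch of the model class. This is an illustration under assumed notation, not the exact formulation from the linked paper:

```latex
% Sketch of a generic binomial mixture over junction counts (assumed notation,
% not the paper's exact model).
% For cell c with latent state z_c and junction j inside an intron cluster:
%   y_{cj} = reads supporting junction j in cell c
%   n_{cj} = total reads across junction j's intron cluster in cell c
\[
  z_c \sim \mathrm{Categorical}(\pi), \qquad
  y_{cj} \mid z_c = k \sim \mathrm{Binomial}\left(n_{cj},\, p_{kj}\right)
\]
% The empirical ratio y_{cj}/n_{cj} is the junc_ratio column computed in
% prep_model_input.py later in this commit.
```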

LeafletSC.egg-info/SOURCES.txt

Lines changed: 3 additions & 3 deletions
@@ -7,9 +7,9 @@ LeafletSC.egg-info/SOURCES.txt
 LeafletSC.egg-info/dependency_links.txt
 LeafletSC.egg-info/requires.txt
 LeafletSC.egg-info/top_level.txt
-LeafletSC/beta-binomial-mix/__init__.py
-LeafletSC/beta-binomial-mix/cellstate_consistency.py
-LeafletSC/beta-binomial-mix/model.py
+LeafletSC/beta_binomial_mix/__init__.py
+LeafletSC/beta_binomial_mix/cellstate_consistency.py
+LeafletSC/beta_binomial_mix/model.py
 LeafletSC/clustering/__init__.py
 LeafletSC/clustering/load_cluster_data.py
 LeafletSC/clustering/obtain_intron_clusters.py
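
The hyphen-to-underscore rename tracked above is what makes the subpackage importable: Python identifiers cannot contain hyphens, so the old directory name was unreachable with a normal import statement. A quick illustration, assuming the installed package:

```python
# Before the rename, the directory name was not a valid module path:
#   import LeafletSC.beta-binomial-mix.model   # SyntaxError
# After the rename, a plain import works:
from LeafletSC.beta_binomial_mix import model
```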

LeafletSC.egg-info/requires.txt

Lines changed: 13 additions & 12 deletions
@@ -1,13 +1,14 @@
-gtfparse==1.3.0
-matplotlib==3.7.1
-numpy==1.23.5
-pandas==1.5.3
+gtfparse==2.5.0
+matplotlib
+numpy
+pandas
 pyranges==0.0.129
-scanpy==1.9.8
-scikit_learn==1.2.2
-scipy==1.10.1
-seaborn==0.13.2
-setuptools==69.1.1
-torch==2.2.1
-tqdm==4.66.2
-umap_learn==0.5.3
+scanpy
+scikit_learn
+scipy
+seaborn
+setuptools
+torch==1.12.1
+tqdm
+umap
+tables==3.4.4
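
One note on the new `tables==3.4.4` entry: pandas' HDF5 I/O (`DataFrame.to_hdf`, used in prep_model_input.py below) is backed by the PyTables package, which pip installs as `tables`. A minimal round-trip sketch, with a hypothetical file name:

```python
import pandas as pd  # to_hdf/read_hdf require PyTables ("tables") to be installed

df = pd.DataFrame({"cell_id": ["c1", "c2"], "junc_count": [3, 5]})
df.to_hdf("example.h5", key="df", mode="w", complevel=9, complib="zlib")
assert pd.read_hdf("example.h5", key="df").equals(df)
```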
File renamed without changes.

build/lib/LeafletSC/beta-binomial-mix/cellstate_consistency.py renamed to build/lib/LeafletSC/beta_binomial_mix/cellstate_consistency.py

File renamed without changes.
File renamed without changes.

build/lib/LeafletSC/clustering/obtain_intron_clusters.py

Lines changed: 8 additions & 31 deletions
@@ -74,13 +74,8 @@
                     default="no",
                     help='yes if want to remove lowly used junctions in clusters, default is no')
 
-parser.add_argument('--strict_filter', dest='strict_filter',
-                    default=True,
-                    help='default is True, this means that only clusters with less junctions that the mean \
-                    junction count per cluster is included. This is meant to remove very complex \
-                    splicing events that might be hard to make sense of in the single cell context especially.')
-
-args = parser.parse_args()
+#args = parser.parse_args()
+args = parser.parse_args(args=[])
 
 #+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 # Utilities
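
A note on the `parse_args(args=[])` pattern introduced above: passing an explicit empty list makes argparse skip `sys.argv` entirely, so every option falls back to its default and the module can be imported (for example from a notebook) without argparse choking on foreign arguments, at the cost of silently ignoring any real command-line flags. A minimal sketch:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--singleton", default="no")

# With args=[], nothing is read from sys.argv; every value is its default.
args = parser.parse_args(args=[])
print(args.singleton)  # prints "no" regardless of the real command line
```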
@@ -89,7 +84,6 @@
 def process_gtf(gtf_file): #make this into a seperate script that processes the gtf file into gr object that can be used in the main scriptas input
 
     print("The gtf file you provided is " + gtf_file)
-    print("Now reading gtf file using gtfparse")
     print("This step may take a while depending on the size of your gtf file")
 
     # calculate how long it takes to read gtf_file and report it
@@ -129,9 +123,11 @@ def process_gtf(gtf_file): #make this into a seperate script that processes the
     gtf_exons_gr = gtf_exons_gr.drop_duplicate_positions(strand=True) # Why are so many gone after this?
 
     # Print the number of unique exons, transcript ids, and gene ids
+    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++")
     print("The number of unique exons is " + str(len(gtf_exons_gr.exon_id.unique())))
     print("The number of unique transcript ids is " + str(len(gtf_exons_gr.transcript_id.unique())))
     print("The number of unique gene ids is " + str(len(gtf_exons_gr.gene_id.unique())))
+    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++")
     return(gtf_exons_gr)
 
 def filter_junctions_by_shared_splice_sites(df):
@@ -153,7 +149,7 @@ def filter_group(group):
 # Run analysis and obtain intron clusters
 #+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
-def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, threshold_inc, min_intron, max_intron, min_junc_reads, singleton, strict_filter, junc_suffix, min_num_cells_wjunc, filter_low_juncratios_inclust):
+def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, threshold_inc, min_intron, max_intron, min_junc_reads, singleton, junc_suffix, min_num_cells_wjunc, filter_low_juncratios_inclust):
 
     #1. Check format of junc_files and convert to list if necessary
     # Can either be a list of folders with junction files or a single folder with junction files
@@ -170,10 +166,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
     # 3. Process each path
     for junc_path in junc_files:
 
-        # make sure junc_path has "/" at the end
-        #if not junc_path.endswith("/"):
-        #    junc_path = junc_path + "/"
-
         junc_path = Path(junc_path)
         print(f"Reading in junction files from {junc_path}")
 
@@ -183,11 +175,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
             print(f"No junction files found in {junc_path} with suffix {junc_suffix}")
             continue
 
-        #junc_files_in_path = glob.glob(junc_path + "*" + junc_suffix) # Adjusted to correctly form the glob pattern
-        #if not junc_files_in_path:
-        #    print(f"No junction files found in {junc_path} with suffix {junc_suffix}")
-        #    continue
-
         print(f"The number of regtools junction files to be processed is {len(junc_files_in_path)}")
 
         files_not_read = []
@@ -197,14 +184,14 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
             try:
                 juncs = pd.read_csv(junc_file, sep="\t", header=None)
                 juncs['file_name'] = junc_file # Add the file name as a new column
-                #juncs['cell_type'] = junc_file.split("/")[-1]
                 juncs['cell_type'] = junc_file
                 all_juncs_list.append(juncs) # Append the DataFrame to the list
             except Exception as e:
                 print(f"Could not read in {junc_file}: {e}")
                 files_not_read.append(junc_file)
 
-        print("The total number of files that could not be read is " + str(len(files_not_read)) + " as these had no junctions")
+        if(len(files_not_read) > 0):
+            print("The total number of files that could not be read is " + str(len(files_not_read)) + " as these had no junctions")
 
         # 5. Concatenate all DataFrames into a single DataFrame
         all_juncs = pd.concat(all_juncs_list, ignore_index=True) if all_juncs_list else pd.DataFrame()
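
Because the junction files are read with `header=None`, the columns come back integer-indexed. regtools `junctions extract` emits BED12-style records, so one could name the columns explicitly; the names below are the conventional BED12 set plus the barcode column pasted on in the Snakefile step, an assumption for illustration rather than something this commit defines:

```python
import pandas as pd

bed12_cols = ["chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb", "blockCount",
              "blockSizes", "blockStarts"]
# "sample.juncswbarcodes" is a hypothetical output of the Snakefile paste step.
juncs = pd.read_csv("sample.juncswbarcodes", sep="\t", header=None,
                    names=bed12_cols + ["barcode"])
```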
@@ -247,7 +234,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
         all_juncs["intron_length"] = all_juncs["chromEnd"] - all_juncs["chromStart"]
         mask = (all_juncs["intron_length"] >= min_intron) & (all_juncs["intron_length"] <= max_intron)
         all_juncs = all_juncs[mask]
-        print("Filtering based on intron length")
 
         # Filter for 'chrom' column to handle "chr" prefix
         all_juncs = all_juncs.copy()
@@ -264,7 +250,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
         all_juncs['junction_id'] = all_juncs['chrom'] + '_' + all_juncs['chromStart'].astype(str) + '_' + all_juncs['chromEnd'].astype(str)
 
         # Get total score for each junction and merge with all_juncs with new column "total_counts"
-
         all_juncs = all_juncs.groupby('junction_id').agg({'score': 'sum'}).reset_index().merge(all_juncs, on='junction_id', how='left')
 
         # rename score_x and score_y to total_junc_counts and score
@@ -319,7 +304,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
 
         # 9. if singleton is False, remove clusters with only one junction
         if singleton == False:
-            print(clusters.Count.value_counts())
             clusters = clusters[clusters.Count > 1]
             print("The number of clusters after removing singletons is " + str(len(clusters.Cluster.unique())))
 
@@ -349,7 +333,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
 
         # check if any clusters are singletons now and remove if have singleton == False
         if singleton == False:
-            print(filtered_clusters_df.Count.value_counts())
             filtered_clusters_df = filtered_clusters_df[filtered_clusters_df.Count > 1]
             print("The number of clusters after removing singletons is " + str(len(filtered_clusters_df.Cluster.unique())))
 
@@ -411,11 +394,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
         singleton=True
     else:
         singleton=False
-    # ensure strict_filter is boolean
-    if args.strict_filter == "True":
-        strict_filter=True
-    else:
-        strict_filter=False
 
     # print out all user defined arguments that were chosen
     print("The following arguments were chosen:" , flush=True)
@@ -431,7 +409,6 @@ def main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, thre
     print("junc_suffix: " + junc_suffix, flush=True)
     print("min_num_cells_wjunc: " + str(min_num_cells_wjunc), flush=True)
     print("singleton: " + str(singleton), flush=True)
-    print("strict_filter: " + str(strict_filter), flush=True)
     print("filter_low_juncratios_inclust: " + (filter_low_juncratios_inclust), flush=True)
 
-    main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, threshold_inc, min_intron, max_intron, min_junc_reads, singleton, strict_filter, junc_suffix, min_num_cells_wjunc, filter_low_juncratios_inclust)
+    main(junc_files, gtf_file, output_file, sequencing_type, junc_bed_file, threshold_inc, min_intron, max_intron, min_junc_reads, singleton, junc_suffix, min_num_cells_wjunc, filter_low_juncratios_inclust)

build/lib/LeafletSC/clustering/prep_model_input.py

Lines changed: 19 additions & 14 deletions
@@ -5,26 +5,34 @@
 from tqdm import tqdm
 import concurrent.futures
 import time
+import tables
 
-pd.options.mode.chained_assignment = None # default='warn'
-
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning, module="pandas.core.strings")
+#pd.options.mode.chained_assignment = None # default='warn'
+#import warnings
+#warnings.filterwarnings("ignore", category=FutureWarning, module="pandas.core.strings")
 
 parser = argparse.ArgumentParser(description='Read in file that lists junctions for all samples, one file per line and no header')
 
 parser.add_argument('--intron_clusters', dest='intron_clusters',
                     help='path to the file that has the intron cluster events and junction information from running intron_clustering.py')
-parser.add_argument('--output_file', dest='output_file',
+
+parser.add_argument('--output_file', dest='output_file',
+                    default="output_file",
                     help='how you want to name the output file, this will be the input for all Leaflet models')
+
 parser.add_argument('--has_genes', dest='has_genes',
+                    default="no",
                     help='yes if intron clustering was done with a gtf file, No if intron clustering was done in an annotation free manner')
-parser.add_argument('--chunk_size', dest='chunk_size', default=5000,
+
+parser.add_argument('--chunk_size', dest='chunk_size',
+                    default=5000,
                     help='how many lines to read in at a time, default is 5000')
+
 parser.add_argument('--metadata', dest='metadata',
                     default=None,
                     help='path to the metadata file, if provided, the output file will have cell type information')
-args = parser.parse_args()
+
+args, unknown = parser.parse_known_args()
 
 #+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 # Utilities
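
The switch from `parse_args()` to `parse_known_args()` above changes failure behavior: unrecognized flags are returned in a second list instead of aborting the script, which helps when the interpreter is launched by a host (such as a Jupyter kernel) that injects its own argv. A runnable sketch:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--chunk_size", default=5000, type=int)

# Unknown flags land in `unknown` instead of triggering a usage error:
args, unknown = parser.parse_known_args(["--chunk_size", "1000", "-f", "kernel.json"])
print(args.chunk_size)  # 1000
print(unknown)          # ['-f', 'kernel.json'], e.g. argv injected by a notebook kernel
```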
@@ -125,17 +133,11 @@ def main(intron_clusters, output_file, has_genes, chunk_size, metadata):
     print("The number of total cells evaluated is " + str(len(all_cells)))
 
     cells_types = clusts[["cell_type", "cell_id"]].drop_duplicates()
-    print(clusts.head())
     print("The number of cells per cell type is:")
     print(cells_types.groupby(["cell_type"])["cell_type"].count())
 
-    print("Ensuring that each cell-junction pair appears only once")
     summarized_data = summarized_data.drop_duplicates(subset=['cell_id', 'junction_id'], keep='last') #double check if this is still necessary
-
-    print("Merge cluster counts with summarized data")
-
     summarized_data = clust_cell_counts.merge(summarized_data)
-    print("Done merging cluster counts with summarized data")
 
     print(np.unique(summarized_data['cell_id'].values))
     summarized_data["junc_ratio"] = summarized_data["junc_count"] / summarized_data["Cluster_Counts"]
@@ -152,15 +154,18 @@ def main(intron_clusters, output_file, has_genes, chunk_size, metadata):
         # if "/" detected in name (cell_type) replace it with "_"
         if "/" in name:
             name = name.replace("/", "_")
-        print("saving " + name + " as hdf file")
         group.to_hdf(output_file + "_" + name + ".h5", key='df', mode='w', complevel=9, complib='zlib')
+        print("You can find the resulting file at " + output_file + "_" + name + ".h5")
 
     if metadata is None:
         # save summarized_data as hdf file
         summarized_data.to_hdf(output_file + ".h5", key='df', mode='w', complevel=9, complib='zlib')
+        print("You can find the resulting file at " + output_file + ".h5")
+
     print("Done generating input file for Leaflet model. This process took " + str(round(time.time() - start_time)) + " seconds")
 
 if __name__ == '__main__':
+
     intron_clusters=args.intron_clusters
     output_file=args.output_file
     has_genes=args.has_genes

dist/LeafletSC-0.1.2.tar.gz

-23.2 KB
Binary file not shown.
