Skip to content

Visualization by gwaslab

Import gwaslab package

import gwaslab as gl

Load sumstats

sumstats = gl.Sumstats("1kgeas.B1.glm.firth",fmt="plink2")

stdout:

Tue Dec 26 15:56:49 2023 GWASLab v3.4.22 https://cloufield.github.io/gwaslab/
Tue Dec 26 15:56:49 2023 (C) 2022-2023, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
Tue Dec 26 15:56:49 2023 Start to load format from formatbook....
Tue Dec 26 15:56:49 2023  -plink2 format meta info:
Tue Dec 26 15:56:49 2023   - format_name  : PLINK2 .glm.firth, .glm.logistic,.glm.linear
Tue Dec 26 15:56:49 2023   - format_source  : https://www.cog-genomics.org/plink/2.0/formats
Tue Dec 26 15:56:49 2023   - format_version  : Alpha 3.3 final (3 Jun)
Tue Dec 26 15:56:49 2023   - last_check_date  :  20220806
Tue Dec 26 15:56:49 2023  -plink2 to gwaslab format dictionary:
Tue Dec 26 15:56:49 2023   - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR
Tue Dec 26 15:56:49 2023   - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR
Tue Dec 26 15:56:49 2023 Start to initiate from file :1kgeas.B1.glm.firth
Tue Dec 26 15:56:50 2023  -Reading columns          : REF,ID,ALT,POS,OR,LOG(OR)_SE,Z_STAT,OBS_CT,A1,#CHROM,P,A1_FREQ
Tue Dec 26 15:56:50 2023  -Renaming columns to      : REF,SNPID,ALT,POS,OR,SE,Z,N,EA,CHR,P,EAF
Tue Dec 26 15:56:50 2023  -Current Dataframe shape : 1128732  x  12
Tue Dec 26 15:56:50 2023  -Initiating a status column: STATUS ...
Tue Dec 26 15:56:50 2023  NEA not available: assigning REF to NEA...
Tue Dec 26 15:56:50 2023  -EA,REF and ALT columns are available: assigning NEA...
Tue Dec 26 15:56:50 2023  -For variants with EA == ALT : assigning REF to NEA ...
Tue Dec 26 15:56:50 2023  -For variants with EA != ALT : assigning ALT to NEA ...
Tue Dec 26 15:56:50 2023 Start to reorder the columns...
Tue Dec 26 15:56:50 2023  -Current Dataframe shape : 1128732  x  14
Tue Dec 26 15:56:50 2023  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT
Tue Dec 26 15:56:50 2023 Finished sorting columns successfully!
Tue Dec 26 15:56:50 2023  -Column: SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT
Tue Dec 26 15:56:50 2023  -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category
Tue Dec 26 15:56:50 2023 Finished loading data successfully!
sumstats.data
SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT
1:15774:G:A 1 15774 A G 0.028283 NaN NaN NaN NaN 495 9999999 G A
1:15777:A:G 1 15777 G A 0.073737 NaN NaN NaN NaN 495 9999999 A G
1:57292:C:T 1 57292 T C 0.104675 NaN NaN NaN NaN 492 9999999 C T
1:77874:G:A 1 77874 A G 0.019153 0.462750 0.249299 0.803130 1.122280 496 9999999 G A
1:87360:C:T 1 87360 T C 0.023139 NaN NaN NaN NaN 497 9999999 C T
... ... ... ... ... ... ... ... ... ... ... ... ... ...
22:51217954:G:A 22 51217954 A G 0.033199 NaN NaN NaN NaN 497 9999999 G A
22:51218377:G:C 22 51218377 C G 0.033333 0.362212 -0.994457 0.320000 0.697534 495 9999999 G C
22:51218615:T:A 22 51218615 A T 0.033266 0.362476 -1.029230 0.303374 0.688618 496 9999999 T A
22:51222100:G:T 22 51222100 T G 0.039157 NaN NaN NaN NaN 498 9999999 G T
22:51239678:G:T 22 51239678 T G 0.034137 NaN NaN NaN NaN 498 9999999 G T

Check the lead variants in significant loci

sumstats.get_lead(sig_level=5e-8)

stdout:

Tue Dec 26 15:56:51 2023 Start to extract lead variants...
Tue Dec 26 15:56:51 2023  -Processing 1128732 variants...
Tue Dec 26 15:56:51 2023  -Significance threshold : 5e-08
Tue Dec 26 15:56:51 2023  -Sliding window size: 500  kb
Tue Dec 26 15:56:51 2023  -Found 43 significant variants in total...
Tue Dec 26 15:56:51 2023  -Identified 4 lead variants!
Tue Dec 26 15:56:51 2023 Finished extracting lead variants successfully!
SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT
1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9999999 G A
2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9999999 C T
7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9999999 T G
20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9999999 T C

Create mahattan plot

sumstats.plot_mqq(skip=2,anno=True)

stdout:

Tue Dec 26 15:59:17 2023 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 26 15:59:17 2023  -Genomic coordinates version: 99...
Tue Dec 26 15:59:17 2023    -WARNING!!! Genomic coordinates version is unknown...
Tue Dec 26 15:59:17 2023  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 26 15:59:17 2023  -Raw input contains 1128732 variants...
Tue Dec 26 15:59:17 2023  -Plot layout mode is : mqq
Tue Dec 26 15:59:17 2023 Finished loading specified columns from the sumstats.
Tue Dec 26 15:59:17 2023 Start conversion and sanity check:
Tue Dec 26 15:59:17 2023  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 26 15:59:17 2023  -Removed 0 varaints with CHR <=0...
Tue Dec 26 15:59:17 2023  -Removed 220793 variants with nan in P column ...
Tue Dec 26 15:59:17 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 26 15:59:17 2023  -Sumstats P values are being converted to -log10(P)...
Tue Dec 26 15:59:17 2023  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 26 15:59:17 2023  -Maximum -log10(P) values is 14.772946706439042 .
Tue Dec 26 15:59:17 2023 Finished data conversion and sanity check.
Tue Dec 26 15:59:17 2023 Start to create manhattan plot with 6866 variants:
Tue Dec 26 15:59:17 2023  -Found 4 significant variants with a sliding window size of 500 kb...
Tue Dec 26 15:59:17 2023 Finished creating Manhattan plot successfully!
Tue Dec 26 15:59:17 2023  -Annotating using column CHR:POS...
Tue Dec 26 15:59:17 2023  -Adjusting text positions with repel_force=0.03...
Tue Dec 26 15:59:17 2023 Start to create QQ plot with 6866 variants:
Tue Dec 26 15:59:17 2023 Expected range of P: (0,1.0)
Tue Dec 26 15:59:17 2023  -Lambda GC (MLOG10P mode) at 0.5 is   0.98908
Tue Dec 26 15:59:17 2023 Finished creating QQ plot successfully!
Tue Dec 26 15:59:17 2023  -Skip saving figures!
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)

Output image

QC check

sumstats.basic_check()

stdout:

Tue Dec 27 23:08:13 2022 Start to check IDs...
Tue Dec 27 23:08:13 2022  -Current Dataframe shape : 1122299  x  11
Tue Dec 27 23:08:13 2022  -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _)
Tue Dec 27 23:08:14 2022 Finished checking IDs successfully!
Tue Dec 27 23:08:14 2022 Start to fix chromosome notation...
Tue Dec 27 23:08:14 2022  -Current Dataframe shape : 1122299  x  11
Tue Dec 27 23:08:17 2022  -Vairants with standardized chromosome notation: 1122299
Tue Dec 27 23:08:19 2022  -All CHR are already fixed...
Tue Dec 27 23:08:21 2022 Finished fixing chromosome notation successfully!
Tue Dec 27 23:08:21 2022 Start to fix basepair positions...
Tue Dec 27 23:08:21 2022  -Current Dataframe shape : 1122299  x  11
Tue Dec 27 23:08:21 2022  -Converting to Int64 data type ...
Tue Dec 27 23:08:22 2022  -Position upper_bound is: 250,000,000
Tue Dec 27 23:08:24 2022  -Remove outliers: 0
Tue Dec 27 23:08:24 2022  -Converted all position to datatype Int64.
Tue Dec 27 23:08:24 2022 Finished fixing basepair position successfully!
Tue Dec 27 23:08:24 2022 Start to fix alleles...
Tue Dec 27 23:08:24 2022  -Current Dataframe shape : 1122299  x  11
Tue Dec 27 23:08:25 2022  -Detected 0 variants with alleles that contain bases other than A/C/T/G .
Tue Dec 27 23:08:25 2022  -Converted all bases to string datatype and UPPERCASE.
Tue Dec 27 23:08:27 2022 Finished fixing allele successfully!
Tue Dec 27 23:08:27 2022 Start sanity check for statistics ...
Tue Dec 27 23:08:27 2022  -Current Dataframe shape : 1122299  x  11
Tue Dec 27 23:08:27 2022  -Checking if  0 <=N<= inf  ...
Tue Dec 27 23:08:27 2022  -Removed 0 variants with bad N.
Tue Dec 27 23:08:27 2022  -Checking if  -37.5 <Z< 37.5  ...
Tue Dec 27 23:08:27 2022  -Removed 14 variants with bad Z.
Tue Dec 27 23:08:27 2022  -Checking if  5e-300 <= P <= 1  ...
Tue Dec 27 23:08:27 2022  -Removed 0 variants with bad P.
Tue Dec 27 23:08:27 2022  -Checking if  0 <SE< inf  ...
Tue Dec 27 23:08:27 2022  -Removed 0 variants with bad SE.
Tue Dec 27 23:08:27 2022  -Checking if  -10 <log(OR)< 10  ...
Tue Dec 27 23:08:27 2022  -Removed 0 variants with bad OR.
Tue Dec 27 23:08:27 2022  -Checking STATUS...
Tue Dec 27 23:08:28 2022  -Coverting STAUTUS to interger.
Tue Dec 27 23:08:28 2022  -Removed 14 variants with bad statistics in total.
Tue Dec 27 23:08:28 2022 Finished sanity check successfully!
Tue Dec 27 23:08:28 2022 Start to normalize variants...
Tue Dec 27 23:08:28 2022  -Current Dataframe shape : 1122285  x  11
Tue Dec 27 23:08:29 2022  -No available variants to normalize..
Tue Dec 27 23:08:29 2022 Finished normalizing variants successfully!

Create regional plot

sumstats.plot_mqq(mode="r",anno=True,region=(2,54513738,56513738),region_grid=True,build="19")
#2:55513738

stdout:

Tue Dec 26 15:58:10 2023 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 26 15:58:10 2023  -Genomic coordinates version: 19...
Tue Dec 26 15:58:10 2023  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 26 15:58:10 2023  -Raw input contains 1128732 variants...
Tue Dec 26 15:58:10 2023  -Plot layout mode is : r
Tue Dec 26 15:58:10 2023  -Region to plot : chr2:54513738-56513738.
Tue Dec 26 15:58:10 2023  -Extract SNPs in region : chr2:54513738-56513738...
Tue Dec 26 15:58:10 2023  -Extract SNPs in specified regions: 865
Tue Dec 26 15:58:10 2023 Finished loading specified columns from the sumstats.
Tue Dec 26 15:58:10 2023 Start conversion and sanity check:
Tue Dec 26 15:58:10 2023  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 26 15:58:10 2023  -Removed 0 varaints with CHR <=0...
Tue Dec 26 15:58:10 2023  -Removed 160 variants with nan in P column ...
Tue Dec 26 15:58:10 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 26 15:58:10 2023  -Sumstats P values are being converted to -log10(P)...
Tue Dec 26 15:58:10 2023  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 26 15:58:11 2023  -Maximum -log10(P) values is 14.772946706439042 .
Tue Dec 26 15:58:11 2023 Finished data conversion and sanity check.
Tue Dec 26 15:58:11 2023 Start to create manhattan plot with 705 variants:
Tue Dec 26 15:58:11 2023  -Extracting lead variant...
Tue Dec 26 15:58:11 2023  -Loading gtf files from:default

stderr:

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']

stdout:

Tue Dec 26 15:58:40 2023  -plotting gene track..
Tue Dec 26 15:58:40 2023  -Finished plotting gene track..
Tue Dec 26 15:58:40 2023  -Found 1 significant variants with a sliding window size of 500 kb...
Tue Dec 26 15:58:40 2023 Finished creating Manhattan plot successfully!
Tue Dec 26 15:58:40 2023  -Annotating using column CHR:POS...
Tue Dec 26 15:58:40 2023  -Adjusting text positions with repel_force=0.03...
Tue Dec 26 15:58:40 2023  -Skip saving figures!
(<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)

Output image

Create regional plot with LD information

gl.download_ref("1kg_eas_hg19")

stdout:

Tue Dec 27 22:44:52 2022 Start to download  1kg_eas_hg19  ...
Tue Dec 27 22:44:52 2022  -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Dec 27 22:52:33 2022  -Updating record in config file...
Tue Dec 27 22:52:35 2022  -Updating record in config file...
Tue Dec 27 22:52:35 2022  -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi
Tue Dec 27 22:52:35 2022 Downloaded  1kg_eas_hg19  successfully!
sumstats.plot_mqq(mode="r",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path("1kg_eas_hg19"),build="19")

stdout:

Tue Dec 26 15:58:41 2023 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 26 15:58:41 2023  -Genomic coordinates version: 19...
Tue Dec 26 15:58:41 2023  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 26 15:58:41 2023  -Raw input contains 1128732 variants...
Tue Dec 26 15:58:41 2023  -Plot layout mode is : r
Tue Dec 26 15:58:41 2023  -Region to plot : chr2:54531536-56731536.
Tue Dec 26 15:58:41 2023  -Checking prefix for chromosomes in vcf files...
Tue Dec 26 15:58:41 2023  -No prefix for chromosomes in the VCF files.
Tue Dec 26 15:58:41 2023  -Extract SNPs in region : chr2:54531536-56731536...
Tue Dec 26 15:58:41 2023  -Extract SNPs in specified regions: 967
Tue Dec 26 15:58:41 2023 Finished loading specified columns from the sumstats.
Tue Dec 26 15:58:41 2023 Start conversion and sanity check:
Tue Dec 26 15:58:41 2023  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 26 15:58:41 2023  -Removed 0 varaints with CHR <=0...
Tue Dec 26 15:58:41 2023  -Removed 172 variants with nan in P column ...
Tue Dec 26 15:58:41 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 26 15:58:41 2023  -Sumstats P values are being converted to -log10(P)...
Tue Dec 26 15:58:41 2023  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 26 15:58:41 2023  -Maximum -log10(P) values is 14.772946706439042 .
Tue Dec 26 15:58:41 2023 Finished data conversion and sanity check.
Tue Dec 26 15:58:41 2023 Start to load reference genotype...
Tue Dec 26 15:58:41 2023  -reference vcf path : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Dec 26 15:58:43 2023  -Retrieving index...
Tue Dec 26 15:58:43 2023  -Ref variants in the region: 71908
Tue Dec 26 15:58:43 2023  -Matching variants using POS, NEA, EA ...
Tue Dec 26 15:58:43 2023  -Calculating Rsq...
Tue Dec 26 15:58:43 2023 Finished loading reference genotype successfully!
Tue Dec 26 15:58:43 2023 Start to create manhattan plot with 795 variants:
Tue Dec 26 15:58:43 2023  -Extracting lead variant...
Tue Dec 26 15:58:44 2023  -Loading gtf files from:default

stderr:

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']

stdout:

Tue Dec 26 15:59:12 2023  -plotting gene track..
Tue Dec 26 15:59:12 2023  -Finished plotting gene track..
Tue Dec 26 15:59:13 2023  -Found 1 significant variants with a sliding window size of 500 kb...
Tue Dec 26 15:59:13 2023 Finished creating Manhattan plot successfully!
Tue Dec 26 15:59:13 2023  -Annotating using column CHR:POS...
Tue Dec 26 15:59:13 2023  -Adjusting text positions with repel_force=0.03...
Tue Dec 26 15:59:13 2023  -Skip saving figures!
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)

Output image