Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

README.md

ENCODE feature generation and preprocessing

Requirements

  • pyBigWig

Instructions to create Codalab bundle

Here are instructions to reproduce the Codalab bundle, in a directory path BUNDLE_ROOT_DIRECTORY.

  1. Download the human genome sequence (hg19 assembly) in FASTA format from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz and extract it into SEQUENCE_PATH.

  2. Run python prep_sequence.py --seq_path SEQUENCE_PATH --output_dir OUTPUT_DIR to write the fasta file found in SEQUENCE_PATH to a numpy array archive in OUTPUT_PATH. (The dataset loader assumes OUTPUT_PATH to be <bundle root directory>/sequence.npz.)

  3. Download the DNase accessibility data. This consists of whole-genome DNase files in bigwig format from https://guanfiles.dcmb.med.umich.edu/Leopard/dnase_bigwig/. Save these to filenames <bundle root directory>/DNASE.<celltype>.fc.signal.bigwig in the code.

  4. Run python prep_accessibility.py. This writes samples of each bigwig file to <bundle root directory>/qn.<celltype>.npy. These are used at runtime when the dataset loader is initialized, to perform quantile normalization on the DNase accessibility signals.

  5. Download the labels from the challenge into a label directory <bundle root directory>/labels/ created for this purpose:

  1. Run python prep_metadata_labels.py.