Experiments

This page explains how to replicate the experiments described in the ASGAL paper. To make the data analysis fully reproducible, we used the Snakemake workflow manager.

All the scripts have been tested on Ubuntu 18.04.


Prerequisites

sudo apt update
sudo apt-get install build-essential cmake git curl wget unzip samtools python python3 python3-pip \
                     python3-biopython python3-biopython-sql python3-pysam python-scipy python-pysam \
                     python-h5py emboss python-cutadapt libboost1.65-all-dev zlib1g-dev liblzma-dev \
                     libjemalloc1 libjemalloc-dev libghc-bzlib-dev snakemake
pip3 install --user gffutils
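
Once the packages are installed, a quick check that the main commands are on the PATH can save a failed run later. The following is a minimal sketch: the helper name check_tools is hypothetical (it is not part of the ASGAL repository), and the list of commands is only a subset of the packages above.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent command and
# returns non-zero if at least one command is absent.
check_tools() {
    local missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example: a few of the commands installed above.
check_tools cmake git curl wget unzip samtools python3 snakemake \
    || echo "install the missing tools before continuing"
```

The function prints nothing when every command is found.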


Simulated Data

Let us download the data and run the experiments in the following folder:

# use ${HOME} rather than a quoted ~, which the shell would not expand
SimFold="${HOME}/asgal_exp/SimData"
mkdir -p "${SimFold}"
  1. move to the folder containing the snakefile
    cd paper/experiments/SimulatedData
    
  2. download the tools:
    bash getTools.sh ${SimFold}
    
  3. download the files from here and move them to SimFold
    $ ls ${SimFold}
      5000000_reads.fastq.gz
      10000000_reads.fastq.gz
      gencode.v19.annotation.mult_iso_subsample_1000_genes.gtf
      GRCh37.p13.genome.fa.gz
    
  4. setup the input files (this script will split the input files and create the reduced annotations):
    bash setupData.sh ${SimFold}
    
  5. set the root folder in the config.yaml file to the folder chosen above (i.e. ${SimFold})

  6. run the experiments using snakemake
    # check if everything is okay
    snakemake -n all
    # run the experiments
    snakemake all
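
Between steps 3 and 4, it can help to confirm that the four downloaded files are actually in place. The sketch below assumes the file names listed in step 3; the helper check_inputs is hypothetical and is not provided by the repository.

```shell
# Print the name of every expected file that is absent from a folder.
# Usage: check_inputs FOLDER FILE...
# Returns non-zero if at least one file is absent.
check_inputs() {
    local folder=$1
    shift
    local status=0
    for name in "$@"; do
        if [ ! -f "${folder}/${name}" ]; then
            echo "not found: ${name}"
            status=1
        fi
    done
    return "$status"
}

# Fall back to the folder used on this page if SimFold is not set.
SimFold="${SimFold:-${HOME}/asgal_exp/SimData}"
check_inputs "${SimFold}" \
    5000000_reads.fastq.gz \
    10000000_reads.fastq.gz \
    gencode.v19.annotation.mult_iso_subsample_1000_genes.gtf \
    GRCh37.p13.genome.fa.gz \
    || echo "download the missing files before running setupData.sh"
```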
    

The outputs of the tools are stored in the folder: ${SimFold}/Results.

In the same folder, you can find four CSV files that summarize the results:

  • alignmentsAccuracy.csv contains the basewise accuracy of the ASGAL aligner and of STAR
  • alignmentsStatistics.csv contains statistics on the number of unique primary alignments, the number of mismatches, and the read truncations
  • full-annot-comparisons.csv contains the number of true positives, false positives, and false negatives reported by each tool (ASGAL, SplAdder, rMATS, and SUPPA) when considering the original gene annotations
  • full-novel-comparisons.csv contains the number of true positives, false positives, and false negatives reported by each tool (ASGAL, SplAdder, rMATS, and SUPPA) when considering the reduced annotations
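
To see at a glance which of these summaries have been produced, something along the following lines may be useful. It is only a sketch: the helper list_summaries is hypothetical, and it reports line counts without assuming anything about the columns of each file.

```shell
# For each of the four summary CSVs, print its line count,
# or note that it has not been produced yet.
list_summaries() {
    local results=$1
    for name in alignmentsAccuracy.csv alignmentsStatistics.csv \
                full-annot-comparisons.csv full-novel-comparisons.csv; do
        if [ -f "${results}/${name}" ]; then
            # Arithmetic expansion strips any padding that wc adds.
            echo "${name}: $(( $(wc -l < "${results}/${name}") )) lines"
        else
            echo "${name}: not found"
        fi
    done
}

# Fall back to the folder used on this page if SimFold is not set.
SimFold="${SimFold:-${HOME}/asgal_exp/SimData}"
list_summaries "${SimFold}/Results"
```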


Real Data

Let us download the data and run the experiments in the following folder:

# use ${HOME} rather than a quoted ~, which the shell would not expand
RealFold="${HOME}/asgal_exp/RealData"
mkdir -p "${RealFold}"
  1. move to the folder containing the snakefile
    cd paper/experiments/RealData
    
  2. set up the tools folder (we create a symbolic link to the Tools folder used before, so ${SimFold} must still be defined in the current shell):
    ln -s "${SimFold}/Tools" "${RealFold}/Tools"
    
  3. download the files from here and move them to RealFold
    $ ls ${RealFold}
      genes_information.tar.gz
      RealSamples.tar.gz
    
  4. download the other required data and setup all the data:
    bash setupData.sh ${RealFold}
    
  5. set the root folder in the config.yaml file to the folder chosen above (i.e. ${RealFold})

  6. run the experiments using snakemake
    # check if everything is okay
    snakemake -n all
    # run the experiments
    snakemake all
    

The outputs of the tools are stored in the folder: ${RealFold}/Results.
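
If the run stops because a tool cannot be found, the symbolic link created in step 2 is worth checking first. A minimal sketch follows; the helper is_dir_link is hypothetical, and the only assumption is the Tools path used on this page.

```shell
# Succeed only if the given path is a symbolic link
# that resolves to an existing directory.
is_dir_link() {
    [ -L "$1" ] && [ -d "$1" ]
}

# Fall back to the folder used on this page if RealFold is not set.
RealFold="${RealFold:-${HOME}/asgal_exp/RealData}"
if is_dir_link "${RealFold}/Tools"; then
    echo "Tools link is in place"
else
    echo "Tools link is missing or broken; repeat step 2"
fi
```

A link that exists but points to a folder that was moved or deleted fails the second test, so a dangling link is reported as broken.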