Experiments
This page explains how to replicate the experiments described in the ASGAL paper. To make the data analysis fully reproducible, we used the Snakemake workflow manager.
All the scripts have been tested on Ubuntu 18.04.
Prerequisites
sudo apt update
sudo apt-get install build-essential cmake git curl wget unzip samtools python python3 python3-pip \
python3-biopython python3-biopython-sql python3-pysam python-scipy python-pysam \
python-h5py emboss python-cutadapt libboost1.65-all-dev zlib1g-dev liblzma-dev \
libjemalloc1 libjemalloc-dev libghc-bzlib-dev snakemake
pip3 install --user gffutils
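After installing the packages, a quick check that the main executables are on the PATH can save a failed run later. The sketch below is only an illustration: the helper name `missing_tools` is hypothetical, and the list of commands is an assumed, partial selection taken from the package list above (the workflow may call others).

```shell
# Print every command from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of the ASGAL scripts.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed, partial list of executables provided by the packages above.
missing=$(missing_tools git cmake curl wget unzip samtools python3 pip3 snakemake)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All listed tools found."
fi
```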
Simulated Data
Let us download the data and run the experiments in the following folder:
SimFold="${HOME}/asgal_exp/SimData"
mkdir -p ${SimFold}
- move to the folder containing the snakefile
cd paper/experiments/SimulatedData
- download the tools:
bash getTools.sh ${SimFold}
- download the files from here and move them to SimFold
$ ls ${SimFold}
5000000_reads.fastq.gz
10000000_reads.fastq.gz
gencode.v19.annotation.mult_iso_subsample_1000_genes.gtf
GRCh37.p13.genome.fa.gz
- setup the input files (this script will split the input files and create the reduced annotations):
bash setupData.sh ${SimFold}
- change the root folder in the config.yaml file to the desired folder (i.e. ${SimFold})
- run the experiments using snakemake
# check if everything is okay
snakemake -n all
# run the experiments
snakemake all
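If the dry run complains about missing inputs, one thing to rule out is an incomplete download. The sketch below lists which of the four downloaded files named earlier are absent from the data folder; it is meant to be run before setupData.sh, since this page does not say whether that script moves or renames them. The helper name `missing_files` is hypothetical, and the default path is an assumption that matches the folder used above.

```shell
# Print each expected file that is absent from a directory.
# Usage: missing_files DIR FILE...
# (missing_files is a hypothetical helper, not part of the ASGAL scripts.)
missing_files() {
    dir=$1
    shift
    for f in "$@"; do
        [ -f "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Assumed default; reuses SimFold if it is already set in the shell.
SimFold="${SimFold:-$HOME/asgal_exp/SimData}"
missing_files "$SimFold" \
    5000000_reads.fastq.gz \
    10000000_reads.fastq.gz \
    gencode.v19.annotation.mult_iso_subsample_1000_genes.gtf \
    GRCh37.p13.genome.fa.gz
```

No output means all four files were found.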
The outputs of the tools are stored in the folder ${SimFold}/Results.
In the same folder, you can find four CSV files that summarize the results:
- alignmentsAccuracy.csv reports the basewise accuracy of the ASGAL aligner and of STAR
- alignmentsStatistics.csv contains statistics on the number of unique primary alignments, the number of mismatches, and read truncations
- full-annot-comparisons.csv contains the number of true positives, false positives, and false negatives reported by each tool (ASGAL, SplAdder, rMATS, and SUPPA) when considering the original gene annotations
- full-novel-comparisons.csv contains the number of true positives, false positives, and false negatives reported by each tool (ASGAL, SplAdder, rMATS, and SUPPA) when considering the reduced annotations
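The true positive, false positive, and false negative counts in the two comparison files can be turned into the standard precision, recall, and F-measure. The sketch below assumes nothing about the column layout of the CSV files, which is not documented on this page: the hypothetical helper `prf` simply takes the three counts as arguments, and the numbers in the example call are made up for illustration.

```shell
# Print precision, recall and F-measure for given TP, FP and FN counts.
# Usage: prf TP FP FN
# (prf is a hypothetical helper, not part of the ASGAL scripts.)
prf() {
    awk -v tpos="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
        # Guard each ratio against a zero denominator.
        p = (tpos + fpos > 0) ? tpos / (tpos + fpos) : 0
        r = (tpos + fneg > 0) ? tpos / (tpos + fneg) : 0
        f = (p + r > 0) ? 2 * p * r / (p + r) : 0
        printf "precision=%.3f recall=%.3f f1=%.3f\n", p, r, f
    }'
}

# Made-up counts, for illustration only.
prf 80 20 10
# prints: precision=0.800 recall=0.889 f1=0.842
```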
Real Data
Let us download the data and run the experiments in the following folder:
RealFold="${HOME}/asgal_exp/RealData"
mkdir -p ${RealFold}
- move to the folder containing the snakefile
cd paper/experiments/RealData
- setup the tools folder (we will create a symbolic link to the Tools folder used before):
ln -s ${SimFold}/Tools ${RealFold}/Tools
- download the files from here and move them to RealFold
$ ls ${RealFold}
genes_information.tar.gz
RealSamples.tar.gz
- download the other required data and setup all the data:
bash setupData.sh ${RealFold}
- change the root folder in the config.yaml file to the desired folder (i.e. ${RealFold})
- run the experiments using snakemake
# check if everything is okay
snakemake -n all
# run the experiments
snakemake all
The outputs of the tools are stored in the folder ${RealFold}/Results.
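Because the real-data run reuses the Tools folder through a symbolic link, a dangling link (for example, if the simulated-data folder was moved) is an easy failure to overlook. The sketch below creates or refreshes a link and confirms that it resolves to a directory. The helper name `link_dir` is hypothetical, and the `-f` and `-n` options assume GNU `ln`, as shipped with Ubuntu.

```shell
# Create (or refresh) a symbolic link and confirm it resolves to a directory.
# Usage: link_dir TARGET LINK
# (link_dir is a hypothetical helper, not part of the ASGAL scripts.)
link_dir() {
    # -f replaces an existing link; -n avoids descending into a link
    # that already points to a directory.
    ln -sfn "$1" "$2" && [ -d "$2" ]
}
```

It could be called as `link_dir "${SimFold}/Tools" "${RealFold}/Tools"`; a non-zero exit status means the link could not be created or does not point to an existing directory.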