Documentation

ASGAL is composed of different modules written in c++ and python. It has been developed and tested on Ubuntu Linux, but should work on any *nix-based system.

Installation

To compile ASGAL and the 3rd party libraries it uses (sdsl-lite and lemon), install:

On an ubuntu system (18.04), the following commands suffice:

sudo apt update
sudo apt install -y build-essential cmake make wget git python3-pip samtools zlib1g-dev python3 python3-pip python3-setuptools python3-biopython python3-biopython-sql python3-pysam python3-pandas python3-setuptools
pip3 install --user gffutils

Then, to download and compile sdsl-lite, lemon, Salmon and ASGAL:

git clone --recursive https://github.com/AlgoLab/galig.git
cd galig
make prerequisites
make

This creates the executable in the bin folder. The python scripts, instead, are in the scripts folder.


Input

ASGAL takes as input:

  • a reference chromosome (in FASTA format)
  • a gene annotation (in GTF format)
  • an RNA-Seq sample (in FASTA or FASTQ format, it can be gzipped)


Note 1 (genome-wide analysis)

ASGAL tool takes as input the annotation of a single gene and the relative chromosome: if you want to use the tool in a genome-wide analysis, please refer to this page.

Note 2 (paired-end sample)

At the moment, ASGAL is not able to directly manage paired-end samples: the only way to use both the fastq files (the reverse and the forward one) consists in merging them into a unique fastq file and then using this file as input for ASGAL. In this way, ASGAL will align each read independently and then it will use the alignments to detect the AS events. Note that this is needed only if you run ASGAL on a single gene: if you use the genome-wide pipeline, the merge is done automatically.


Usage

ASGAL is composed of three different modules, each one performing a different task. We made available a script that runs the full pipeline.

Full Pipeline Script

To run ASGAL pipeline on a single gene, run the following command:

./asgal -g [genome] -a [annotation] -s [sample] -o outputFolder

This command will produce four files in the output folder:

  • a .mem file containing the alignments to the splicing graph
  • a .sam file, containing the alignments to the splicing graph mapped to the reference genome
  • a .events.csv file, containing the alternative splicing events detected in the RNA-Seq sample
  • a .log file, containing the log of the execution

We will now specify in more detail the three steps of ASGAL pipeline.

Step 1 - Splice-Aware Aligner

To build the splicing graph of the input gene and to align the input sample to it, run the following command:

./bin/SpliceAwareAligner -g [reference] -a [annotation] -s [sample] -o outputFolder/output.mem

In this way, the alignments to the splicing graph are computed and stored in the output.mem file.

Step 2 (optional) - SAM Formatter

The file obtained in the previous step can be converted into a SAM file (that can be open, for example, with IGV) using:

python3 ./scripts/formatSAM.py -m output.mem -g [reference] -a [anotation] -o outputFolder/output.sam
Observation

The SAM file produce by ASGAL (output.sam) contains the alignments to the splicing graph mapped to the reference genome. These alignments must not be confused with the spliced alignments to the reference genome computed by any spliced aligner (such as STAR or HISAT2). Indeed, our alignments could contain long insertions representing the portions of reads that could align to an intron (and that could represent, for example, a cassette exon).

Step 3 - Events Detector

To analyze the alignments and detect the possible presence of alternative splicing events, run:

python3 ./scripts/detectEvents.py -g [reference] -a [annotation] -m output.mem -o outputFolder/output.events.csv
Output Format

The alternative splicing events are stored in the file output.events.csv, a space-separated value file where each line represents an event. Each alternative splicing event is associated to the intron inducing it and it is described by:

  1. type of event:
    • ES, Exon Skipping
    • A3, Alternative acceptor (3’) site
    • A5, Alternative donor (5’) site
    • IR, Intron Retention
  2. genomic positions (start and end) of the intron inducing the event
  3. number of reads confirming the event
  4. list of annotated transcripts with respect to which the event is expressed

For example,

Type Start End Support Transcripts
ES 2000 2500 150 Transcript1/Transcript2/Transcript3

this entry describes an exon skipping event, induced by 150 reads that support the intron starting at position 2000 and ending at position 2500. The event is expressed with respect to the annotated transcript Transcript1, Transcript2, and Transcript3.


Example

We want align the RNA-Seq sample contained in the example folder to gene CG13375 of Drosophila Melanogaster.

cd example

# Extract the RNA-Seq sample
tar xfz input.tar.gz

# Download reference and annotation from ensembl
wget ftp://ftp.ensembl.org/pub/release-91/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.dna.chromosome.X.fa.gz
gunzip Drosophila_melanogaster.BDGP6.dna.chromosome.X.fa.gz
mv Drosophila_melanogaster.BDGP6.dna.chromosome.X.fa input/genome.fa
wget ftp://ftp.ensembl.org/pub/release-91/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.91.chr.gtf.gz
gunzip Drosophila_melanogaster.BDGP6.91.chr.gtf.gz

# Extract the annotation of the gene of interest
grep "CG13375" Drosophila_melanogaster.BDGP6.91.chr.gtf > input/annotation.gtf
rm Drosophila_melanogaster.BDGP6.91.chr.gtf

# Run ASGAL
../asgal -g input/genome.fa -a input/annotation.gtf -s input/sample_1.fa -o CG13375

We should obtain a novel exon skipping events:

Type,Start,End,Support,Transcripts
ES,287042,289040,411,FBtr0300326/FBtr0070103/FBtr0342963
Observation

The correctness of the output can be verified by opening the genome, the annotation and the alignments with IGV and by generating a sashimi plot:

From the picture, we can see that 411 reads support the skipping of the third exon and this event involves the three transcript of the gene.


Command Line Arguments

  • all programs show usage information with -h (--help)

Full Pipeline Script

File:

./asgal

Required parameters:

-g,--genome INFILE          FASTA input file containing the reference
-a,--annotation INFILE      GTF input file containing the gene annotation
-s,--sample INFILE          FASTA/Q input file containing the RNA-Seq reads (can be a gzipped file)
-o,--output OUTFOLD         output folder

Optional parameters:

-l,--L <int>                minimum lenght of MEMs used to build the
                            alignments (default: 15)
-e,--eps <int>              error rate, a value from 0 to 100 used to
                            compute the maximum number of allowed errors
                            (default: 3)
-w,--support <int>          minimum number of reads needed to confirm an event
                            (default: 3)
--allevents                 output all events, not only the novel ones
                            (default: false, ie only novels)
--verbose                   verbose print (default: False)
--debug                     debug print (default: False)

Splice-Aware Aligner

File:

./bin/SpliceAwareAligner

Required parameters:

-g,--genome INFILE          FASTA input file containing the reference
-a,--annotation INFILE      GTF input file containing the gene annotation
-s,--sample INFILE          FASTA/Q input file containing the RNA-Seq reads (can be a gzipped file)
-o,--output OUTFILE         output file containing the alignments to the
                            splicing graph

Optional parameters:

-l,--L <int>                minimum lenght of MEMs used to build the
                            alignments (default: 15)
-e,--eps <int>              error rate, a value from 0 to 100 used to
                            compute the maximum number of allowed errors
                            (default: 3)
Observations
  • an higher value of L improves the speed but reduces the number of aligned reads
  • the number of allowed errors is computed for each read as the ratio between the error rate and its length


SAM Formatter

File:

./scripts/formatSAM.py

Required parameters:

-g,--genome INFILE          FASTA input file containing the reference
-a,--annotation INFILE      GTF input file containing the gene annotation
-m,--mems INFILE            input file containing the alignments to
                            the splicing graph
-o,--output OUTFILE         SAM output file

Optional paramter:

-e,--erate <int>            error rate, a value from 0 to 100 (default: 3)
Observations
  • the error rate should be the same used to compute the alignments


Events Detector

File:

./scripts/detectEvents.py

Required parameters:

-g,--genome INFILE          FASTA input file containing the reference
-a,--annotation INFILE      GTF input file containing the gene annotation
-m,--mems INFILE            input file containing the alignments to
                            the splicing graph
-o,--output OUTFILE         output file containing the events

Optional parameters:

-e,--erate <int>            error rate, a value from 0 to 100 (default: 3)
-w,--support <int>          minimum number of reads needed to confirm an event
                            (default: 3)
--allevents                 output all events, not only the novel ones
                            (default: only novels)
Observation
  • the error rate should be the same used to compute the alignments
  • by default, ASGAL outputs only the alternative splicing events that are novel with respect to the given annotation, that are the the events induced by an intron not contained in the annotation