Anton Larsson

Research summary - stitcher.py


The problem

Unprocessed RNA transcripts are composed of introns and exons; the introns are removed and the exons are spliced together to form the mature transcript. Recent research has shown that a gene can undergo what is known as alternative splicing, where the available exons are spliced together in various combinations. These splicing differences can have a dramatic impact on the function the resulting protein can perform. We wanted to use widely available short-read sequencing technology to measure this variation in splicing across individual cells. The approach we chose was in silico reconstruction of RNA molecules from short sequencing reads.

The solution

In the wet-lab method that stitcher.py is designed for, the end of each RNA molecule is tagged with a unique sequence, the unique molecular identifier (UMI). stitcher.py groups the resulting sequencing reads by UMI, so that the reads in a group come from the same molecule and, ideally, together cover more of its sequence than any single read. For each position, the software then calculates the most likely nucleotide by combining the base calls of all reads covering that position, weighted by their error probabilities (encoded as PHRED quality scores). Once the RNA sequence has been reconstructed, at least partially, it can be assigned to an equivalence class of compatible splice variants.
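
The grouping and consensus steps described above can be illustrated with a minimal sketch. This is not the code of stitcher.py itself: the dictionary-based read representation, the function names (group_by_umi, call_consensus) and the assumption that a wrong base call is equally likely to be any of the other three nucleotides are all hypothetical simplifications chosen for illustration. The real tool works on aligned reads in SAM/BAM files.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def phred_to_error(q):
    """Convert a PHRED score to an error probability, P = 10^(-Q/10).

    The value is capped at 0.75 so that a Q of 0 is treated as an
    uninformative base call rather than producing log(0) below.
    """
    return min(10 ** (-q / 10), 0.75)


def group_by_umi(reads):
    """Group reads (hypothetical dicts with a 'umi' key) by their UMI."""
    groups = defaultdict(list)
    for read in reads:
        groups[read["umi"]].append(read)
    return dict(groups)


def call_consensus(reads):
    """Return {position: most likely base} for reads from one molecule.

    Each read is assumed to be a dict with 'start' (position of its first
    base), 'seq' (bases) and 'qual' (one PHRED score per base). For every
    position, the log-likelihood of each candidate base is summed over all
    reads covering it, and the highest-scoring base is kept.
    """
    loglik = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for read in reads:
        for offset, (base, q) in enumerate(zip(read["seq"], read["qual"])):
            p_err = phred_to_error(q)
            for candidate in BASES:
                # Assumption: an erroneous call is spread evenly over
                # the three other bases.
                p = 1 - p_err if candidate == base else p_err / 3
                loglik[read["start"] + offset][candidate] += math.log(p)
    return {pos: max(scores, key=scores.get)
            for pos, scores in sorted(loglik.items())}


if __name__ == "__main__":
    example_reads = [
        {"umi": "AAGT", "start": 0, "seq": "ACGT", "qual": [30, 30, 30, 30]},
        {"umi": "AAGT", "start": 2, "seq": "GTTA", "qual": [30, 30, 30, 30]},
        # Low-quality mismatch at position 2, outvoted by the reads above.
        {"umi": "AAGT", "start": 2, "seq": "CTTA", "qual": [5, 30, 30, 30]},
        {"umi": "CCTA", "start": 0, "seq": "AC", "qual": [20, 20]},
    ]
    for umi, group in group_by_umi(example_reads).items():
        consensus = call_consensus(group)
        print(umi, "".join(consensus.values()))
```

Summing log-likelihoods rather than taking a simple majority vote lets a single high-quality base call outweigh several low-quality ones, which is the point of using the PHRED scores at all.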

The tools

This project used Python with the packages numpy, scipy and pysam. It also required a thorough understanding of the Sequence Alignment/Map (SAM) format.

All code can be found on GitHub here.

The impact

This method has the potential to help uncover and characterize fundamental aspects of human cell biology.

The paper

Read an associated paper here.