Short read mapping with mismatches and junctions
Consider the following BWT-based algorithm for junction read detection in RNAseq analysis. For a given short read A of length l and a collection of exons,
1) Align A to all exons with BWT from both ends of A to get partial alignment of A to subsequences in the exons (hint: you might need to inverse both A and the exons for BWT alignment to get the alignments starting from two ends; ignore the complementary of A and gaps for simplicity of this assignment).
2) Keep all partial alignments of A with length between (l/4, 3l/4) that 1) begin from the beginning of A and end at the end of an exon or 2) end the end of A and begin at the beginning of an exon.
3) Find the matched partial alignment pairs (one from 1) and one from 2) and their length add to l) as a junction mapping.
Problems:
1. Revise your BWT algorithm implemented in HW3 to allow up to two mismatches in the FM indexing. Read Figure 2 in the Bowtie paper Download Bowtie paper for details. Design a test case of a few reads against a short reference sequence.
2. Implement the junction-read alignment algorithm. The input of your algorithm is a short read file and a fasta file of exons and the output is a file of junction reads listed with the linked exons by the read. Test your algorithm on the toy dataset here Download here. The junction read information is in the excel file.
3. Download human genome annotation and find one RNA sequencing dataset from SRA Links to an external site. to show your algorithm is working. Hint: you can just work on smaller chromosomes to demonstrate how your algorithm works.