Homework 4: Implementation of BWT and FM Indexing for short read mapping
- Due Dec 2, 2021 by 11:59pm
- Points 100
- Submitting a file upload
- Available until Dec 5, 2021 at 11:59pm
This assignment is on the implementation of Burrows-Wheller Transform (BWT) and Ferragina & Manzini Indexing (FM indexing) for NGS short read mapping. Below are the instruction for each step of implementation.
1. (20 points) Implement function bwt that generates the reversible transformation of a sequence:
- Syntax: bwtSeq = bwt(seq)
- Input: seq that is a single letter-code representation of a nucleotide sequence.
- Output: bwtSeq that is BWT of seq.
Use file 'sample1.fa
Download sample1.fa' provided in section Datasets to get the input sequence, and report the BWT of that sequence to file 'bwtSample1.fa'.
2. (20 points) Implement function reverseBwt that reverses the Burrows-Wheeler Transform:
- Syntax: seq = reverseBwt(bwtSeq)
- Input: bwtSeq that is the BWT of a sequence seq.
- Output: seq that is the original nucleotide sequence such that BWT(seq) = bwtSeq.
Use file 'bwtHomoSapiens.fa
Download bwtHomoSapiens.fa' provided in section Datasets to get the input sequence, and report the reversed BWT of that sequence to file 'rBwtSample2.fa'.
3. (60 points in total) Short read mapping:
Two fastq files storing the subset of reads are provided in section Datasets. The BWT of chromosome Y also has been generated for you. Your task is mapping these short reads to chrY applying two following methods:
3.1. (30 points) Implement function fullindexFM to map short reads to the chromosome. You need three big arrays of the same size of the chromosomes: 1) the array storing BWT, 2) the array storing the reference positions to the beginning of the sequence and 3) the array storing the rank of the nucleotides.
Report how many mapped reads and their starting position on chrY to text file 'mapping_fullindexFM.txt'.
3.2. (10 points) Implement function hybridFM that uses hybrid solution by storing 30% sample of the reference position array (i.e. marking every (3*k+1)-th row), and "walk left".
Report how many mapped reads and their starting position on chrY to text file 'mapping_FMhybrid.txt'.
3.3. (10 points) Based your implementation in 3.2, implement function checkpointingFM that uses checkpointing in FM index by pre-calculating cumulative counts for A/C/G/T up to every (6*k+1)-th checkpoint in BWT (~15% reduced).
Report how many mapped reads and their starting positions on chrY to text file 'mapping_FMCheckpointing.txt'.
3.4. (10 points) Compare the running time and space used between method 3.1, 3.2 and 3.3. Report your comparison in README file. Some useful (only matlab) functions and examples that help you to analyze your code performance can be found here Links to an external site..
Report how many mapped reads and their starting position on chrY to text file 'mapping_FMhybrid.txt'.
3.3. (10 points) Based your implementation in 3.2, implement function checkpointingFM that uses checkpointing in FM index by pre-calculating cumulative counts for A/C/G/T up to every (6*k+1)-th checkpoint in BWT (~15% reduced).
Report how many mapped reads and their starting positions on chrY to text file 'mapping_FMCheckpointing.txt'.
3.4. (10 points) Compare the running time and space used between method 3.1, 3.2 and 3.3. Report your comparison in README file. Some useful (only matlab) functions and examples that help you to analyze your code performance can be found here Links to an external site..
Note: Three files 'mapping_fullindexFM.txt', 'mapping_FMhybrid.txt', and 'mapping_FMCheckpointing.txt' should be in the following format:
x out of y short read pairs map on chrY:20M~40M:
<read header> (p_1, p_2,...), (q_1, q_2,...)
...
where p_i and q_i are the starting positions of read 1 and read 2 in 'SRR089545_1.fq' and 'SRR089545_2.fq' respectively.
* In your program, always load data files with relative path in the same folder of the source files (i.e. "../files/data_filename") such that TA can grade your submission simply. Don’t use absolute path.
Datasets:
-
Click here Download here for a toy sample used in part 1.
-
Click here Download here for a bwt of sequence (GL000190.1, Homo sapiens chromosome Y genomic contig ~ 0.3M bp) used in part 2.
- Work on short region of chrY:25,000,000~26,000,000. Note that you need to report the real locations of mapped reads over the whole chromosome Y, not the location in chrY:25M~26M.
-
-
Click bwtChrYnew(25M-26M).fa.zip Download bwtChrYnew(25M-26M).fa.zip for the BWT of chrY:25M~26M.
- Click short_reads(chrY-25M-26M).zip Download short_reads(chrY-25M-26M).zip for short reads on from chrY:25M~26M.
-
- Source file (your code containing at least 5 functions listed above. You can submit other additional functions.)
- Readme file (text)
- Output files including (1) 'bwtSample1.fa'; (2) 'rBwtSample2.fa'; (3) 'mapping_FMhybrid.txt'; and (4) 'mapping_FMCheckpointing.txt'.
All files and source code should be under HW4_name dir and submitted zipped dir.