Homework 3: Phylogenetic tree and parsimony analysis
- Due Nov 14, 2021 by 11:59pm
- Points 100
- Submitting a file upload
- Available until Nov 17, 2021 at 11:59pm
In this homework assignment, you will use a collection of small subunit ribosomal RNAs to infer a phylogenetic tree and a parsimony reconstruction of the ancestral sequences. You will build the tree with the neighbor-joining algorithm and implement the Sankoff algorithm yourself for the reconstruction.
Dataset:
- Click here for the rRNA sequences (38 sequences total).
- Use this scoring matrix to implement Problem 3.
Problems:
1. (20 points): Use the Matlab functions multialign, seqpdist, and seqneighjoin, or the web tool Clustal Omega, to align the rRNA sequences and build a phylogenetic tree.
2. (60 points): Implement the Sankoff algorithm and test your algorithm with a few toy examples to demonstrate that your implementation is correct.
* As a toy example, given the input ['AUUCGUGAUU', 'AUUGAA-AUU', 'GUCCUCGGUU', 'GA-CACGAUC'], your implementation should return ['AUUCGUGAUU', 'AUUGAA-AUU', 'GUCCUCGGUU', 'GA-CACGAUC', 'AUUCAUGAUU', 'GUUCACGAUU', 'AUUCAUGAUU'] (the four input sequences followed by the three inferred ancestral sequences).
3. (20 points): Use your Sankoff algorithm for a parsimony analysis with the multiple sequence alignment of the rRNA sequences using the tree structure inferred by the neighbor-joining algorithm.
* You may use BioPython to generate the tree structure for this homework.
* You are not allowed to use any existing implementations of the Sankoff algorithm.
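For reference, the per-column recurrence usually associated with the Sankoff algorithm can be written as below. The symbols S_v(a) (minimum cost of the subtree under node v when v carries character a), δ(a, b) (substitution cost), and C(v) (children of v) are notation assumed here, not taken from the slides, and the formula assumes the scoring matrix is read as costs to be minimized.

```latex
S_v(a) =
\begin{cases}
0 & \text{if } v \text{ is a leaf observed as } a,\\
\infty & \text{if } v \text{ is a leaf observed as another character},\\
\sum_{c \in C(v)} \min_{b} \bigl[\delta(a,b) + S_c(b)\bigr] & \text{if } v \text{ is an internal node},
\end{cases}
\qquad
\text{total score} = \sum_{\text{columns}} \min_{a} S_{\text{root}}(a)
```

The ancestral characters are then typically recovered by a top-down pass that starts from the minimizing character at the root and, at each child, picks the character b that achieved the minimum.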
Submission:
For this assignment, you should submit the following files:
- Problem 1: Submit (1) a file showing the multiple sequence alignment results (PDF), (2) a text file showing the pairwise distances (.txt), and (3) an image file of the tree (PDF).
- Problem 2: Submit (1) your source code for the Sankoff algorithm (.py or .m), (2) a source file that calls the function to solve a toy example (.py or .m), and (3) a PDF file showing a toy example and its result in a format like page 83 of the phylogeny slides.
- Problem 3: Submit (1) the source file that applies your Sankoff implementation from Problem 2 to the multiple sequence alignment from Problem 1, using the tree inferred in Problem 1, and (2) a file reporting the total parsimony score and the inferred most likely sequence of each internal node, presented as a multiple sequence alignment together with the given 38 rRNA sequences. (You can label the internal nodes in the tree and indicate the sequence corresponding to each label.)
- Please submit a README.txt that states which files belong to each problem, what their inputs are, and how to compile/run the scripts. (5 points will be deducted if no README.txt is submitted.)
(Note: You may want to use the Matlab function dir to obtain a list of the file names in the current folder so that you do not have to enter every single rRNA sequence file manually.)
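For Python submissions, a rough counterpart to dir can be sketched with the standard pathlib module. The helper name list_sequence_files and the default pattern "*.fasta" are assumptions made for illustration; adjust the pattern to the actual extension of the rRNA files.

```python
from pathlib import Path


def list_sequence_files(folder=".", pattern="*.fasta"):
    """Return the paths in `folder` whose names match `pattern`, sorted by name.

    Hypothetical helper: the "*.fasta" default is an assumption about how the
    rRNA sequence files are named.
    """
    return sorted(Path(folder).glob(pattern))


if __name__ == "__main__":
    # Print every matching file in the current folder, one per line.
    for path in list_sequence_files():
        print(path.name)
```

Sorting the result keeps the sequence order stable across runs, which makes it easier to match rows of the distance matrix to file names.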