Homework 2: Detecting CpG islands in Human Genome with a Hidden Markov Model

Due Oct 31, 2021 by 11:59pm
Points 100
Submitting a file upload
Available until Nov 3, 2021 at 11:59pm

This assignment was locked Nov 3, 2021 at 11:59pm.

Implement a Hidden Markov Model with 8 states for CpG island identification and apply your model to search for CpG island in a large chunk of DNA from chromosome 21 in the human genome. You need to implement the 8-state HMM described in our class with the following transition matrix (p=0.98, q=0.999):

Link to google sheets Links to an external site.

The emission probabilities at all the states are: the state A+ emits A with probability 1, while eA+(b) = 0 for all b different from A, etc. Test your implementation on a toy example for validation and then apply it to searching a DNA segment of about 3.4M base pairs from Human chromosome 21 for CpG islands.

Dataset:

Click here Download here for the toy samples
Click here Download here for the sequences (Contig NT_011515.11, ~3.4M basepairs from 43507093 to 46944323 on Human chromosome 21).
Click here Download here for the location of the genes on human chromosome 21.

* The indexing in human chromosome 21 is assumed to use Matlab, but it's no big deal. You can use python's 0-start and Matlab's 1-start indexing. But you can check the sequence to figure out which is more likely to be better.

* You can ignore the ’N’s when you calculate the emission probability.

* The initial probability (or initial transition probability) should be set equally for each state.

Input and Output Format:

Your program should take one command line argument specifying the name of a file containing the nucleotide sequence which you will search for CpG islands. You can use python or Matlab. The command line should be of the form programName sequence.fasta.

Your program should output the number of CpG island candidates found, followed by the length of each region along with the sequence index numbers and the downstream gene of the CpG island (note: you should examine both positive strand and negative strand). The output should be of the form (example):

Total CpG islands found: 3; 2 out of 3 islands are followed by a coding region
CpG island 1: 348 bp (739 - 1086) gene_name (+), gene_name (-)
CpG island 2: 1148 bp (1508 - 2655) No gene
CpG island 3: 971 bp (4029 - 4999) gene_name (+)

Problems:

1. (40 points): Implement the Hidden Markov model and test your code on the toy samples (don't forget your comments).
2. (40 points): Use your algorithm to analyze the provided DNA sequences. Report all the identified CpG islands with lengths of at least 200 base pairs.
3. (20 points): Identify the downstream genes that immediately follow your CpG islands (starting within 500 basepairs).

Submission:

For this problem you should submit the following files:

Source code (.py or .m)
- Main file (main.py or main.m): call functions to apply HMM to toy samples and human chromosome 21, and the code to identify the downstream genes of the CpG islands.
- Hidden Markov Model (HMM.py or .m): implementation of HMM
- Identify the downstream genes (GeneIdentifier.py or .m): implementation of gene identification
Result (PDF)
Explanation/ReadMe of main file (PDF)

Rubric

Title:

Find a Rubric

Title

Title
Criteria	Ratings	Pts
Description of criterion threshold: 5 pts Edit criterion description Delete criterion row	5 to >0 pts Full Marks blank 0 to >0 pts No Marks blank_2 This area will be used by the assessor to leave comments related to this criterion.	pts / 5 pts --
Description of criterion threshold: 5 pts Edit criterion description Delete criterion row	5 to >0 pts Full Marks blank 0 to >0 pts No Marks blank_2 This area will be used by the assessor to leave comments related to this criterion.	pts / 5 pts --