Restricted mismatch-kernel for protein classification
In this project, you will implement a restricted version of the mismatch kernel and apply it to compute sequence similarity between SCOP sequences. The restricted-mismatch kernel allows mismatches only between pre-specified residue pairs. With this restriction, only meaningful/plausible mismatches are used in the mismatch neighborhood of each k-mer. Derive the allowed mismatch pairs by applying a cutoff to the substitution scores in the PAM120 matrix (page 83 in the textbook).
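The two ideas above (thresholding substitution scores, then restricting the mismatch neighborhood) can be sketched as follows. This is only an illustration under stated assumptions: the function names `allowed_pairs` and `restricted_neighborhood` are hypothetical, and `TOY_SCORES` holds made-up numbers, not PAM120 values. Substitute the real matrix and your own cutoff.

```python
from itertools import combinations, product

# Made-up scores for illustration only; these are NOT PAM120 values.
TOY_SCORES = {("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
              ("D", "E"): 3, ("I", "D"): -3}


def allowed_pairs(scores, cutoff):
    """Return the unordered residue pairs whose score is >= cutoff."""
    pairs = set()
    for (a, b), score in scores.items():
        if a != b and score >= cutoff:
            pairs.add(frozenset((a, b)))
    return pairs


def restricted_neighborhood(kmer, m, pairs):
    """Return all k-mers reachable from kmer with at most m substitutions,
    where every substitution must be one of the allowed pairs."""
    # substitutes[c] = residues that are allowed to replace c
    substitutes = {}
    for pair in pairs:
        a, b = tuple(pair)
        substitutes.setdefault(a, set()).add(b)
        substitutes.setdefault(b, set()).add(a)

    neighbors = {kmer}  # zero mismatches: the k-mer itself
    for n_mismatches in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_mismatches):
            choices = [sorted(substitutes.get(kmer[i], ())) for i in positions]
            for replacement in product(*choices):
                chars = list(kmer)
                for i, c in zip(positions, replacement):
                    chars[i] = c
                neighbors.add("".join(chars))
    return neighbors


pairs = allowed_pairs(TOY_SCORES, cutoff=2)
print(sorted(restricted_neighborhood("ID", 1, pairs)))  # ['ID', 'IE', 'LD', 'VD']
```

Passing every pair of distinct residues as `pairs` yields the ordinary (unrestricted) mismatch neighborhood, so the same routine can serve both kernels.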
Problems:
1. Use the Needleman-Wunsch algorithm (your implementation from HW1 or another existing program) to compute sequence alignment scores between all pairs of SCOP sequences. Visualize the similarity matrix with a tool such as matrix2png, MATLAB, or Python. You do not need to normalize the matrix, since some scores are negative.
2. Implement the mismatch kernel and the restricted-mismatch kernel. Your program reads the sequence file as input and outputs an N by N similarity matrix, where N is the number of sequences in the file. Hint: you only need to change the restricted-mismatch kernel slightly to get the mismatch kernel. Feel free to modify the source code from https://cbio.mskcc.org/leslielab/software/string_kernels.html.
3. Use your restricted-mismatch kernel and mismatch kernel to compute the kernel value between all pairs of SCOP sequences. Try different parameters, such as k = 4, 5, 6, 7 and m = 0, 1, 2, and normalize each matrix by dividing the entry at (i,j) by the square root of the entry at (i,i) and the square root of the entry at (j,j). Visualize the normalized similarity matrices produced by the restricted-mismatch kernel as in problem 1.
4. Compare the normalized similarity matrices and discuss your observations. (Hint: design some measure of pairwise sequence similarity within superfamilies and between superfamilies from the similarity matrices.)
Note: If you are interested, feel free to test classification of the sequences using an SVM, as in the experiment in the reference paper.
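The normalization in problem 3 and one possible measure for the hint in problem 4 can be sketched as follows. This is a minimal illustration, not a required design: the function names are hypothetical, the toy matrix is made up, and the mean of off-diagonal entries is only one of many reasonable within/between measures. The sketch also assumes every diagonal entry is positive (a sequence shorter than k would give a zero diagonal and needs separate handling).

```python
import math


def normalize_kernel(K):
    """Divide each entry K[i][j] by sqrt(K[i][i]) * sqrt(K[j][j])."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]


def within_between_means(S, labels):
    """Mean similarity over pairs with the same label (within) and over
    pairs with different labels (between); the diagonal is excluded."""
    within, between = [], []
    n = len(S)
    for i in range(n):
        for j in range(i + 1, n):
            target = within if labels[i] == labels[j] else between
            target.append(S[i][j])

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)


# Made-up 3 x 3 kernel matrix and superfamily labels, for illustration only.
K = [[4, 2, 0],
     [2, 9, 1],
     [0, 1, 1]]
labels = ["1.1.1", "1.1.1", "1.2.1"]

S = normalize_kernel(K)
within, between = within_between_means(S, labels)
print(within, between)
```

After normalization every diagonal entry equals 1, so matrices computed with different k and m become directly comparable.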
Dataset:
A non-redundant protein domain dataset from SCOP v1.53 is provided for download. The header of each sequence gives its classification. For example,
"d1b0b__ 1.1.1.1.2": class 1, fold 1.1, superfamily 1.1.1, family 1.1.1.1.
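The header convention above can be turned into classification labels with a few lines of Python. This is a sketch only: the function name `parse_scop_header` is hypothetical, and it assumes each header consists of the domain name, whitespace, and the dotted code, optionally preceded by a FASTA-style ">".

```python
def parse_scop_header(header):
    """Split a header such as 'd1b0b__ 1.1.1.1.2' into SCOP levels.

    Assumes the form '<name> <dotted code>'; a leading '>' is ignored.
    """
    name, code = header.lstrip(">").split()[:2]
    parts = code.split(".")
    return {
        "name": name,
        "class": parts[0],
        "fold": ".".join(parts[:2]),
        "superfamily": ".".join(parts[:3]),
        "family": ".".join(parts[:4]),
    }


info = parse_scop_header("d1b0b__ 1.1.1.1.2")
print(info["superfamily"])  # 1.1.1
```

The "superfamily" value can then serve as the grouping label when comparing similarities within and between superfamilies.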
References:
1. Mismatch string kernels for discriminative protein classification.
Christina S. Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William Stafford Noble. Bioinformatics 20(4):467-476 (2004)
2. Matrix2png: http://www.chibi.ubc.ca/matrix2png