Exercise I
- Due Sep 28, 2021 by 11:59pm
- Points 100
- Submitting a file upload
Exercise I: Analysis of GC content in the human genome.
GC content is the frequency of the G/C nucleotides in a DNA sequence. It often reveals important characteristics of a genome. In this exercise, you will analyze the GC content of the human genome and its genes.
Instructions:
Step 1: Individual work on programming
1. Download the human genome sequences (version hg19) from the UCSC genome browser. The filenames of the chromosomes are chr*.fa.gz (* is the chromosome # in {1,2,...,22,X,Y}). Choose any chromosome for your implementation.
2. Calculate the frequency of each of the four nucleotides (the count of that nucleotide divided by the length of the sequence).
3. Besides the global base composition of a genome, it is also interesting to analyze the local fluctuations of GC content along the sequence. We can measure local base composition by sliding a window of size k along the sequence and reporting the frequencies within each window. This produces four vectors, one per base, each of length L - k + 1, where L is the length of the sequence.
4. Download the human gene annotation file (hg19). Understand the format of the file and find out the locations of genes/transcripts. The format of the file is as follows:
Column 1: gene names
Column 2: transcript names
Column 3/4: start/end positions of the genes
Column 5: chromosome number
Column 6: the strand, either '+' or '-', where '+' means the gene is on the forward (5' to 3') strand and '-' means it is on the reverse (3' to 5') strand
Note that you need to know whether the gene is on the forward or the reverse strand, as you will use this information in step 5.
5. Plot the locations of the genes together with the GC content. Use Excel or MATLAB to plot the frequencies. Note that if your sequence is very long, you will probably need to sample a subset of positions to plot. Scripts for plotting in MATLAB/Python are provided for generating the figure below.
Run the Python script as "python exercise1.py chrx.fa hg19_annotation.txt chrx the_start_position the_end_position", where chrx is the chromosome name, such as chr22. The MATLAB function takes the same parameters.
6. Submit your source code (which may be modified from the provided code) and the plots of the chromosomes/genes you explored in Canvas. The plots show that you have studied the problem in this exercise and count as class participation. Your submission will not be graded with specific comments.
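For items 2 and 3 above, a minimal Python sketch of the global base frequencies and the sliding-window GC content might look like the following. The function names are illustrative and are not taken from the provided exercise1.py. The sketch assumes a single-record FASTA file, folds lowercase letters to uppercase, and keeps N characters in the sequence length, so the four frequencies may sum to less than 1.

```python
import gzip
from collections import Counter


def read_fasta(lines):
    """Join the sequence lines of a single-record FASTA, skipping '>' headers."""
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


def load_chromosome(path):
    """Read a gzipped FASTA file such as chr22.fa.gz (the path is an assumption)."""
    with gzip.open(path, "rt") as handle:
        return read_fasta(handle)


def base_frequencies(seq):
    """Count of each of A, C, G, T divided by the sequence length."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}


def sliding_gc(seq, k):
    """GC fraction in every window of size k; returns L - k + 1 values."""
    if k <= 0 or k > len(seq):
        raise ValueError("window size k must satisfy 1 <= k <= len(seq)")
    is_gc = [1 if base in "GC" else 0 for base in seq]
    window = sum(is_gc[:k])          # GC count in the first window
    fractions = [window / k]
    for i in range(k, len(seq)):
        # Slide by one base: add the entering base, drop the leaving one.
        window += is_gc[i] - is_gc[i - k]
        fractions.append(window / k)
    return fractions
```

The running sum avoids recounting each window from scratch, so the cost is proportional to L rather than L times k. For a whole chromosome the intermediate Python lists are large, so a NumPy cumulative sum over the same 0/1 vector may be a more memory-friendly variant.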
Step 2: Group work on data analysis
1. Download the chromosome that is assigned to your group. The assigned chromosome has the same number as your group, e.g. group x will analyze chromosome x, where x = 1, 2, ..., 13. Your group needs to study the disease-gene association database OMIM and pick a few disease genes on your assigned chromosome that you are interested in studying. Visualize the GC content of the chromosome and zoom into the regions around the genes. Select a few genes with high GC content in their promoter regions. Go to GeneCards and GO to find out their biological functions. Note that those websites are probably already using hg38. To visualize the genes in hg19, you can go to the UCSC genome browser and choose hg19 as your reference genome.
2. You can use MATLAB or Python for the exercise. Your group will need to present your findings in a recorded five-minute talk uploaded to Zoom. Use up to four slides (to be submitted along with the individual submission) to show the GC content of the chromosome and the regions around the genes that you are interested in exploring. Tell us about their functions and disease associations.
3. One of your group members should submit your slides and the video with this link.
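For the promoter regions mentioned in item 1 of this step, the strand column of the annotation file decides which side of the gene is upstream. The sketch below is one way to compute promoter GC content under several assumptions that the exercise does not state: the annotation columns are whitespace-separated, the start/end positions can be used directly as 0-based indices into the sequence string, and the promoter is taken as a fixed 1000 bases upstream of the transcription start. All function names and the `upstream` parameter are hypothetical.

```python
def parse_annotation_line(line):
    """Split one annotation line into the six columns listed in Step 1, item 4."""
    gene, transcript, start, end, chrom, strand = line.split()[:6]
    return gene, transcript, int(start), int(end), chrom, strand


def promoter_interval(start, end, strand, upstream=1000):
    """Half-open interval [lo, hi) covering `upstream` bases before the gene.

    On the '+' strand transcription begins at `start`, so the promoter lies
    just before it. On the '-' strand transcription begins at `end`, so the
    promoter lies just after it in forward-strand coordinates.
    """
    if strand == "+":
        return max(0, start - upstream), start
    return end, end + upstream


def gc_fraction(seq, lo, hi):
    """GC fraction of seq[lo:hi]; the value is the same on either strand."""
    region = seq[lo:hi].upper()
    if not region:
        return 0.0
    return sum(base in "GC" for base in region) / len(region)
```

Because G pairs with C, the GC fraction of a region does not change when the reverse strand is read, so no reverse complement is needed here. Only the location of the promoter depends on the strand.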
Step 3: Watch two videos of the presentations by other groups
1. Each of you needs to watch the videos from two other groups. Let x be your student number. You should watch the videos by groups (x mod 13) + 1 and ((x + 1) mod 13) + 1, and write one short paragraph about what you learned from each video.
2. You will submit a pdf of the two paragraphs in another submission link.
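As a quick check of the group formula in item 1, the arithmetic can be sketched in Python. The function name is hypothetical, and the second term is read here as ((x + 1) mod 13) + 1, which is an interpretation of the original wording.

```python
def groups_to_watch(x, n_groups=13):
    """Return the two group numbers whose videos student number x should watch."""
    first = (x % n_groups) + 1
    second = ((x + 1) % n_groups) + 1
    return first, second
```

For example, a student number of 12 gives groups 13 and 1, since 12 mod 13 = 12 and 13 mod 13 = 0.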
Grading for class participation:
This is an exercise rather than a homework assignment. The goal of this class participation is to learn about GC content, how to use programming tools to analyze GC content in genome data, and how to obtain gene information for the human genome.
1. The TA will read your submissions to confirm your participation and the quality of the video.
2. As long as they are reasonably completed, you will get all the points. You will not receive any feedback unless something significant is wrong/missing.