Exercise III: Analyzing a family trio to assess risk of familial diseases
- Due Dec 5, 2021 by 11:59pm
- Points 100
- Submitting a file upload
- Available until Dec 8, 2021 at 11:59pm
In this exercise, you will learn how to analyze genome short-read sequencing data with various software for disease association. There is no programming task in this exercise.
(1) A family trio of samples: We picked three samples from the same family in the 1000 Genomes Project. The family members are NA19238 (mother), NA19239 (father), and NA19240 (daughter).
(2) Exome sequencing data: In this exercise, we work only on the exome sequencing data, which are significantly smaller than the whole-genome sequencing data. The URLs of the paired-end exome sequencing data for each member of the family are listed below. (Note: you only need to download one pair of data files, such as the pair for NA19238, to proceed to step (3). After you complete most of step (3), you can come back and process the other two pairs; this saves disk space.)
NA19238 (mother)
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR071/SRR071195/SRR071195_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR071/SRR071195/SRR071195_2.fastq.gz
NA19239 (father)
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR792/SRR792097/SRR792097_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR792/SRR792097/SRR792097_2.fastq.gz
NA19240 (daughter)
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR151/007/SRR1518137/SRR1518137_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR151/007/SRR1518137/SRR1518137_2.fastq.gz
You can use the UNIX “wget” command to download the exome sequencing data:
wget 'URL/Text'
and then decompress the gz files:
gzip -d filename.gz
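The download step can be scripted so that each family member is fetched by name. The following is a minimal sketch, not part of the assignment: the function names fastq_urls and download_sample and the labels mother, father, and daughter are illustrative choices, while the URLs are the ones listed above.

```shell
#!/usr/bin/env bash
# Print the two FASTQ URLs (read 1 and read 2) for one family member.
# The labels mother/father/daughter are illustrative names for this sketch.
fastq_urls() {
    local base
    case "$1" in
        mother)   base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR071/SRR071195/SRR071195" ;;
        father)   base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR792/SRR792097/SRR792097" ;;
        daughter) base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR151/007/SRR1518137/SRR1518137" ;;
        *) echo "unknown sample: $1" >&2; return 1 ;;
    esac
    printf '%s_1.fastq.gz\n%s_2.fastq.gz\n' "$base" "$base"
}

# Download both files for one sample; wget -c resumes a partial download.
download_sample() {
    local url
    fastq_urls "$1" | while read -r url; do
        wget -c "$url"
    done
}

# Example (commented out because the files are large):
# download_sample mother
```

Downloading one sample at a time in this way matches the advice to process the trio one member after another.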
(3) Short-read alignment and SNP calling. Follow the steps below to process the files for SNP calling. Process each pair of exome sequence files one at a time.
Note: You may need 30 GB of disk space for processing if you work with all 24 chromosomes. If so, remember to remove unnecessary intermediate files to save space. When using the bowtie / samtools / bcftools commands, you should also provide the path to the directory containing them if they are not in your current directory. Steps (3.1), (3.2), and (3.3) only need to be done once.
(3.1) Download the hg38 canonical reference genome chromosomes. To save computing time, you can use just one chromosome or concatenate a few chromosomes to construct the reference genome sequence file. To concatenate the downloaded FASTA files into a single FASTA file called “hg38_canonical.fa”, use the UNIX “cat” command:
cat *.fa > hg38_canonical.fa
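Note that a shell glob such as *.fa expands in sorted name order (chr1, chr10, chr11, ..., chr2, ...). That works for alignment, but if you prefer the chromosomes in karyotype order, the concatenation can be written explicitly. The sketch below assumes the per-chromosome files are uncompressed and named chr1.fa ... chr22.fa, chrX.fa, chrY.fa; those file names and the function names are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Print the 24 canonical chromosome names in karyotype order.
canonical_chroms() {
    local i
    for i in $(seq 1 22) X Y; do
        echo "chr$i"
    done
}

# Concatenate the per-chromosome FASTA files found in directory $1 into file $2.
# Chromosomes that were not downloaded are skipped, so a partial reference
# (for example only chr21 and chr22) is also possible.
build_reference() {
    local dir="$1" out="$2" c
    : > "$out"                       # create or truncate the output file
    for c in $(canonical_chroms); do
        if [ -f "$dir/$c.fa" ]; then
            cat "$dir/$c.fa" >> "$out"
        fi
    done
}

# Example: build_reference . hg38_canonical.fa
```

Because the files are named explicitly, the output file is never read as one of its own inputs, even if the function is run twice in the same directory.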
(3.2) Obtain Bowtie 2. You can read more about bowtie2 and its options in the manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
(3.3) Index the reference genome so that reads can be aligned quickly. Refer to the documentation of the bowtie2-build command, or to its example, to see how to use the "bowtie2-build" command. (Let <reference_in> = hg38_canonical.fa and <bt2_base> = hg38_canonical.)
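Since indexing is slow and only has to happen once, it can help to check for an existing index before rebuilding. The sketch below relies on the fact that bowtie2-build writes six files per index (base.1.bt2 through base.4.bt2, plus base.rev.1.bt2 and base.rev.2.bt2, or the .bt2l equivalents for large indexes). The function names are illustrative.

```shell
#!/usr/bin/env bash
# Succeed if all six Bowtie 2 index files for base name $1 are present.
# Small indexes end in .bt2; large ones end in .bt2l and are accepted too.
index_exists() {
    local base="$1" part
    for part in 1 2 3 4 rev.1 rev.2; do
        [ -f "$base.$part.bt2" ] || [ -f "$base.$part.bt2l" ] || return 1
    done
}

# Build the index from FASTA file $1 under base name $2, unless it already exists.
build_index_once() {
    local fasta="$1" base="$2"
    if index_exists "$base"; then
        echo "index $base already present, skipping"
    else
        bowtie2-build "$fasta" "$base"
    fi
}

# Example: build_index_once hg38_canonical.fa hg38_canonical
```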
(3.4) Create a SAM-format alignment file with the bowtie2 command. (You may also refer to the paired-end example in its documentation.)
To save space, you can delete the two exome sequence files after obtaining the SAM file.
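A paired-end bowtie2 call takes the index base name (-x), the two mate files (-1 and -2), and the SAM output (-S). The wrapper below is a sketch under a few assumptions: the function name align_pair, the thread count of 2, and the output name <sample>.sam are illustrative, and the default index base hg38_canonical is the one chosen in step (3.3).

```shell
#!/usr/bin/env bash
# Align one read pair against a Bowtie 2 index and write <name>.sam.
# Arguments: sample name, read-1 FASTQ, read-2 FASTQ, optional index base name.
align_pair() {
    local name="$1" r1="$2" r2="$3" index="${4:-hg38_canonical}"
    local f
    # Fail early if an input is missing or empty, before starting a long run.
    for f in "$r1" "$r2"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty FASTQ: $f" >&2
            return 1
        fi
    done
    # -p: threads, -x: index base, -1/-2: the two mates, -S: SAM output file.
    bowtie2 -p 2 -x "$index" -1 "$r1" -2 "$r2" -S "$name.sam"
}

# Example: align_pair mother SRR071195_1.fastq SRR071195_2.fastq
```

Checking the inputs first avoids discovering a failed download only after the aligner has started.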
(3.5) Download and install samtools and bcftools.
For samtools details and options, refer to: http://samtools.sourceforge.net/samtools.shtml
(3.6) Filter out the unmapped reads in the SAM file (obtained in step (3.4)) and then convert it into a BAM file:
samtools view -F 4 -ubS filename.sam > filename.bam
Refer to the “view” command in the samtools manual.
To save space, you can delete the SAM file once you have the BAM file.
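To see what -F 4 does: the second column of every SAM record is a bitwise FLAG, and bit 0x4 means "read unmapped". The option -F 4 drops every record with that bit set. The awk sketch below imitates this on SAM text purely for illustration (the function name drop_unmapped is made up here); for real data, use samtools view as shown above.

```shell
#!/usr/bin/env bash
# Imitate "samtools view -h -F 4" on SAM text read from stdin:
# keep header lines (starting with @) and records whose FLAG (column 2)
# does not have the 0x4 "read unmapped" bit set.
drop_unmapped() {
    awk -F '\t' '
        /^@/ { print; next }                 # header lines pass through
        int($2 / 4) % 2 == 0 { print }       # bit 0x4 clear: read is mapped
    '
}

# Example: drop_unmapped < filename.sam | head
```

For instance, FLAG 99 (1+2+32+64) is a mapped first mate and is kept, while FLAG 77 (1+4+8+64) has the unmapped bit set and is removed.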
(3.7) Use samtools sort to convert the BAM file to a sorted BAM file. Sorted BAM is a useful format because the alignments are compressed (convenient for long-term storage), and sorted (convenient for variant discovery).
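samtools sort -o filename.sorted.bam filename.bam
Refer to the “sort” command in the samtools manual. (Recent samtools releases require the -o option to name the output file; the older form "samtools sort filename.bam prefix" is no longer accepted.)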
(3.8) Index the BAM and reference FASTA files for rapid access:
Use “index” command to index the sorted alignment of three BAM files “mother.bam”, “father.bam”, and “daughter.bam”.
Use the “faidx” command to index the reference FASTA file “hg38_canonical.fa”. Refer to the “index” and “faidx” commands in the samtools manual.
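The two indexing commands can be run for the whole trio in one loop. This is a sketch assuming the sorted BAM files are named mother.bam, father.bam, and daughter.bam as above; samtools index writes <file>.bam.bai and samtools faidx writes <file>.fa.fai. The helper names are illustrative.

```shell
#!/usr/bin/env bash
# Succeed if BAM file $1 has no .bai index yet, or is newer than its index.
needs_index() {
    [ ! -f "$1.bai" ] || [ "$1" -nt "$1.bai" ]
}

# Index the three sorted BAM files and the reference FASTA, skipping work
# that has already been done.
index_trio() {
    local s
    for s in mother father daughter; do
        if needs_index "$s.bam"; then
            samtools index "$s.bam"          # writes $s.bam.bai
        fi
    done
    if [ ! -f hg38_canonical.fa.fai ]; then
        samtools faidx hg38_canonical.fa     # writes hg38_canonical.fa.fai
    fi
}

# Example: index_trio
```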
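(3.9) Use the “mpileup” command, followed by “bcftools call”, to call all non-reference bases in the sorted BAM file. In recent releases, mpileup for variant calling is provided by bcftools rather than samtools:
bcftools mpileup -Ou -f hg38_canonical.fa filename.sorted.bam | bcftools call -mv -Oz -o calls.vcf.gz
Find more details about these commands in the bcftools manual.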
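You can now remove the BAM file and keep only the VCF file and the hg38_canonical files. (If you want to view the read alignments in IGV for steps (4) and (6), keep the sorted BAM file and its index until you have taken your screenshots.)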
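A quick sanity check on the result is to count the SNP records in calls.vcf.gz. The sketch below reads the compressed VCF with gzip and counts records whose REF and ALT alleles are all single bases; the function name count_snps is illustrative, and this simple rule is an approximation that ignores symbolic alleles.

```shell
#!/usr/bin/env bash
# Count SNP records in a gzip-compressed VCF file given as $1.
# A record counts as a SNP when REF (column 4) is one base and every
# comma-separated ALT allele (column 5) is one base as well.
count_snps() {
    gzip -dc "$1" | awk -F '\t' '
        /^#/ { next }                                         # skip header lines
        length($4) == 1 && $5 ~ /^[ACGT](,[ACGT])*$/ { n++ }
        END { print n + 0 }                                   # print 0 if none
    '
}

# Example: count_snps calls.vcf.gz
```

A count of zero, or a count far larger than expected for the regions you aligned, is a hint to re-check the earlier steps.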
(4) Install IGV to visualize the read alignments and the SNPs you called across the family trio. (Below is a quick IGV tutorial.)
(5) Select a few disease genes from OMIM. Compare the SNPs in the region/gene that you selected across the family members. Observe the relationship of the SNPs between the parents and the child. Report your findings on the potential disease risk of the family by referring to the known SNPs in dbGaP (http://www.ncbi.nlm.nih.gov/projects/gapplusprev/sgap_plus.htm) or those in OMIM. Visualize the SNPs in IGV.
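When comparing a SNP across the trio, one useful question is whether the child's genotype is consistent with Mendelian inheritance, that is, whether it can be formed from one maternal and one paternal allele. The function below is a small sketch of that check for unphased diploid genotypes written as in the VCF GT field (0/0, 0/1, 1/1). The name mendelian_ok is illustrative, and missing (./.) or phased (0|1) genotypes are not handled. Keep in mind that a site absent from one sample's VCF may be homozygous reference or simply not covered by reads.

```shell
#!/usr/bin/env bash
# Succeed if the child genotype can be formed by taking one allele from the
# mother and one from the father. Arguments: mother, father, child genotypes.
mendelian_ok() {
    local mother="$1" father="$2" child="$3"
    local m1="${mother%/*}" m2="${mother#*/}"
    local f1="${father%/*}" f2="${father#*/}"
    local c1="${child%/*}"  c2="${child#*/}"
    local m f
    for m in "$m1" "$m2"; do
        for f in "$f1" "$f2"; do
            # The child is unphased, so either allele may be the maternal one.
            if { [ "$m" = "$c1" ] && [ "$f" = "$c2" ]; } ||
               { [ "$m" = "$c2" ] && [ "$f" = "$c1" ]; }; then
                return 0
            fi
        done
    done
    return 1
}

# Example:
# mendelian_ok 0/1 0/0 0/1 && echo consistent
```

A site that fails this check is more often a calling or coverage artifact than a true de novo mutation, so inspect the reads in IGV before drawing conclusions.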
(6) You only need to submit a PDF file containing a screenshot of the SNPs and the read alignments around them in the father, mother, and child samples.
Note: Process the samples in the trio one by one. After you have the results from one sample, delete the intermediate files to free space for the next sample. Don't try to run more than two active jobs at the same time.