CS313 Project 4

Videos/Reading

Introduction to Next Generation Sequencing

How to turn in this Project

You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).

Hardcopy Submission

Your hardcopy packet should consist of:

The cover page;
Your modified Assembler.java file from Task 1.
Your newly developed Mapper.java file from Task 2.

Staple these together and submit them on the due date.

Softcopy Submission

You should submit your final version of your Assembler directory and your Mapper directory to the drop/project4 directory in your account on the CS server.

A local biotech company recently contacted your professor to ask if the professor knew of any sharp students that could help the company over the summer. Scientists at the company have performed numerous DNA-seq and RNA-seq experiments and are overwhelmed trying to analyze the large data sets. The company is hoping to hire someone for the summer to develop some in-house tools to map the sequencing reads to a genome. Your professor gives your name to the company and when the company contacts you and offers you the position, you figure "why not?", after all you have a keen understanding of sequencing algorithms, working in the "real world" would be a welcome respite from the ivory tower of academia and, quite frankly, the pay is good to boot.

Task 1: Genome Assembly

High-throughput DNA sequencing (DNA-seq) experiments generate large numbers of short sequencing reads. In this task, you will develop an application that assembles a genome from a set of DNA-seq reads. We are providing some starter code, Assembler.java, that can be found in the /home/cs313/download/Assembler directory on the CS server. The starter code implements some but not all of the methods described in the Assembler contract found here. You should begin by studying the starter code. Your job is to complete the Assembler class by implementing the three methods described below. Along with the starter code, we have provided some files of sequencing reads in the data directory, which can be used to test your Assembler application.

public void populate_kmer_dictionary(String filename)
Reads in a file of sequencing reads. For each k-mer in a sequencing read, the k-mer is added to the dictionary and the number of occurrences of the k-mer is updated appropriately.
public void extendGenomeSequenceForward()
Attempts to extend the genome sequence forward (to the right) repeatedly, one character at a time. The genome sequence is extended and a character added to its end if there exists a nucleotide character that can be added to the end of the k-1 final characters of the genome sequence to form a k-mer that occurs in the k-mer dictionary. If there are multiple individual nucleotide characters that can be added to the final k-1 characters in the genome sequence to form k-mers in the dictionary, then the character resulting in the k-mer with the largest number of occurrences in the dictionary is chosen. Each time the genome sequence is extended by a character, the corresponding k-mer is removed from the dictionary so that it will not be used in future extensions.
public void extendGenomeSequenceBackward()
Attempts to extend the genome sequence backward (to the left) repeatedly, one character at a time. The genome sequence is extended and a character added to its front if there exists a nucleotide character that can be added to the front of the k-1 first characters of the genome sequence to form a k-mer that occurs in the k-mer dictionary. If there are multiple individual nucleotide characters that can be prepended to the first k-1 characters in the genome sequence to form k-mers in the dictionary, then the character resulting in the k-mer with the largest number of occurrences in the dictionary is chosen. Each time the genome sequence is extended by a character, the corresponding k-mer is removed from the dictionary so that it will not be used in future extensions.

As examples, when your Assembler is executed on the file readsSmall.txt, it may output an assembled genome of length 88, and when executed on the file reads.txt, it may output an assembled genome of length 7815.

Task 2: Mapping Sequencing Reads to a Reference Genome

For organisms whose genome sequences are known, one of the first steps following DNA-seq or RNA-seq experiments is mapping the short sequencing reads to the organism's genome, i.e., the reference genome. Most sequencing reads are likely to correspond to a region from the reference genome, but some reads may not map to the reference genome for a variety of reasons, including errors in the sequencing process or reads that correspond to contaminating nucleic acid sequences from impure samples.

Mapping a sequencing read to a reference genome involves determining if the short nucleotide sequence of the read can be found somewhere in the genome sequence and, if so, at what location. This problem is analagous to searching for a small substring in a much larger string. Because high-throughput sequencing experiments generate large numbers of reads, often millions or billions, we need to map each read to a reference genome as efficiently as possible, e.g., in O(n) time rather than O(m) time, where n represents the length of the read and m represents the length of the reference genome.

One approach for quickly searching for a short read sequence in a large reference genome involves using a Burrows-Wheeler Transform (BWT). As a pre-processing step, a BWT is computed for the reference genome. Using this BWT, each read sequence can be checked quickly to determine if it is contained in the genome sequence.

In the /home/cs313/download/Mapper directory on the CS server, we have provided you with a BWT class that computes the BWT of a genome sequence from a Fasta file. The contract for the BWT class can be found here. You should begin by studying this contract. Your task is to write an application, Mapper, that takes in two-command line arguments: the name of a Fasta file containing a reference genome sequence and the name of a file containing a set of short sequencing reads. Your Mapper application should construct a BWT of the reference genome sequence using the provided BWT class. Then, your Mapper application should read in the file of sequencing reads and attempt to map each read to the BWT using the algorithm studied in class. Keep in mind that a sequencing read could correspond to either strand of the reference genome, so you should check both. Ultimately, your Mapper application should print out:

the number of reads in the file
the number of reads that could not be mapped to the reference genome
the number of reads that map to exacly one location in the reference genome
the number of reads that map to more than one location in the reference genome

For example, when executed on the genome of the human papillomavirus and the file of sequencing reads found in the data directory at /home/cs313/download/Mapper on the CS server, your Mapper application might print out the following:

[cs313@CS Mapper] java Mapper data/papillomavirus.txt data/reads.txt

Number of total reads processed:                 52100
Number of unmapped reads:                        1627
Number of reads mapping to exactly one location: 50382
Number of reads mapping to multiple locations:   91