Videos/Reading
How to turn in this Project
You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).Hardcopy Submission
Your hardcopy packet should consist of:- The cover page;
- Your modified
Assembler.java
file from Task 1. - Your newly developed
Mapper.java
file from Task 2.
Softcopy Submission
You should submit your final version of your
Assembler
directory and your
Mapper
directory to the drop/project4
directory
in your account on the CS server.
A local biotech company recently contacted your professor to ask if the professor knew of any sharp students that could help the company over the summer. Scientists at the company have performed numerous DNA-seq and RNA-seq experiments and are overwhelmed trying to analyze the large data sets. The company is hoping to hire someone for the summer to develop some in-house tools to map the sequencing reads to a genome. Your professor gives your name to the company and when the company contacts you and offers you the position, you figure "why not?", after all you have a keen understanding of sequencing algorithms, working in the "real world" would be a welcome respite from the ivory tower of academia and, quite frankly, the pay is good to boot.
Task 1: Genome Assembly
High-throughput DNA sequencing (DNA-seq) experiments generate
large numbers of short sequencing reads. In this task, you will
develop an application that assembles a genome from a set of
DNA-seq reads. We are providing some starter code,
Assembler.java
, that can be found in the
/home/cs313/download/Assembler
directory on the CS
server. The starter code implements some but not all of the methods
described in the Assembler
contract found here. You should begin by studying the starter
code. Your job is to complete the Assembler
class by
implementing the three methods described below. Along with the starter
code, we have provided some files of sequencing reads in the
data
directory, which can be used to test your
Assembler
application.
-
public void populate_kmer_dictionary(String filename)
Reads in a file of sequencing reads. For each k-mer in a sequencing read, the k-mer is added to the dictionary and the number of occurrences of the k-mer is updated appropriately. -
public void extendGenomeSequenceForward()
Attempts to extend the genome sequence forward (to the right) repeatedly, one character at a time. The genome sequence is extended and a character added to its end if there exists a nucleotide character that can be added to the end of the k-1 final characters of the genome sequence to form a k-mer that occurs in the k-mer dictionary. If there are multiple individual nucleotide characters that can be added to the final k-1 characters in the genome sequence to form k-mers in the dictionary, then the character resulting in the k-mer with the largest number of occurrences in the dictionary is chosen. Each time the genome sequence is extended by a character, the corresponding k-mer is removed from the dictionary so that it will not be used in future extensions. -
public void extendGenomeSequenceBackward()
Attempts to extend the genome sequence backward (to the left) repeatedly, one character at a time. The genome sequence is extended and a character added to its front if there exists a nucleotide character that can be added to the front of the k-1 first characters of the genome sequence to form a k-mer that occurs in the k-mer dictionary. If there are multiple individual nucleotide characters that can be prepended to the first k-1 characters in the genome sequence to form k-mers in the dictionary, then the character resulting in the k-mer with the largest number of occurrences in the dictionary is chosen. Each time the genome sequence is extended by a character, the corresponding k-mer is removed from the dictionary so that it will not be used in future extensions.
As examples, when your Assembler
is executed on the file
readsSmall.txt
, it may output an assembled genome of length 88,
and when executed on the file reads.txt
, it may output
an assembled genome of length 7815.
Task 2: Mapping Sequencing Reads to a Reference Genome
For organisms whose genome sequences are known, one of the first steps following DNA-seq or RNA-seq experiments is mapping the short sequencing reads to the organism's genome, i.e., the reference genome. Most sequencing reads are likely to correspond to a region from the reference genome, but some reads may not map to the reference genome for a variety of reasons, including errors in the sequencing process or reads that correspond to contaminating nucleic acid sequences from impure samples.
Mapping a sequencing read to a reference genome involves determining if the short nucleotide sequence of the read can be found somewhere in the genome sequence and, if so, at what location. This problem is analagous to searching for a small substring in a much larger string. Because high-throughput sequencing experiments generate large numbers of reads, often millions or billions, we need to map each read to a reference genome as efficiently as possible, e.g., in O(n) time rather than O(m) time, where n represents the length of the read and m represents the length of the reference genome.
One approach for quickly searching for a short read sequence in a large reference genome involves using a Burrows-Wheeler Transform (BWT). As a pre-processing step, a BWT is computed for the reference genome. Using this BWT, each read sequence can be checked quickly to determine if it is contained in the genome sequence.
In the /home/cs313/download/Mapper
directory on the CS
server, we have provided you with a BWT
class that computes
the BWT of a genome sequence from a Fasta file.
The contract
for the BWT
class can be found here. You should
begin by studying this contract. Your task is to write an
application, Mapper
, that takes in two-command
line arguments: the name of a Fasta file containing a reference
genome sequence and the name of a file containing a set of
short sequencing reads. Your Mapper
application
should construct a BWT of the reference genome sequence using
the provided BWT
class. Then, your
Mapper
application should read in the file
of sequencing reads and attempt to map each read to the BWT using
the algorithm
studied in class. Keep in mind that a sequencing read could
correspond to either strand of the reference genome, so you
should check both. Ultimately, your Mapper
application should print out:
- the number of reads in the file
- the number of reads that could not be mapped to the reference genome
- the number of reads that map to exacly one location in the reference genome
- the number of reads that map to more than one location in the reference genome
For example, when executed on the genome of the human
papillomavirus and the file of sequencing reads found in the
data
directory at
/home/cs313/download/Mapper
on the CS server,
your Mapper
application might print out the
following:
|