A tool for predicting whether a prokaryotic genome
sequence corresponds to a coding or noncoding transcript
Popcorn stands for PrOkaryotic Prediction of Coding OR Noncoding. Popcorn uses machine learning to predict whether prokaryotic genomic sequences are coding or noncoding. While Popcorn can be used for sequences of any length, it is designed especially for the more challenging cases, namely short sequences such as those corresponding to sORFs (short ORFs) and sRNAs (small regulatory noncoding RNAs). A typical use case follows an RNA-seq experiment, in which many observed transcripts often do not map to annotated genes. Do these transcripts correspond to novel sORFs, sRNAs, or something else? Popcorn can help quickly identify which transcripts are most likely to be protein coding. At its core, Popcorn employs a neural network that has been trained on thousands of documented sORFs, noncoding RNAs, and other coding and noncoding prokaryotic genomic sequences.
As input, Popcorn requests the name of a genome and one or more prokaryotic genomic sequences.
When a user enters a few characters from the genome name, a drop-down box appears from which the user can select their genome of interest. If the genomic origin of a sequence is unknown, the genome field may be left blank. In this case, Popcorn will use generic prokaryotic information to predict whether the sequence is coding or not. When available, the name of a genome should be provided, as Popcorn will then be able to use genome-specific information and its results will be more accurate.
One or more genomic sequences may be submitted. If multiple sequences are submitted, the sequences should either be in FASTA format or else different sequences should be separated by blank lines. Sequences may contain either DNA or RNA nucleotides.
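The two accepted input layouts can be illustrated with a minimal Python sketch. It is not Popcorn's own parser: the function name `split_sequences`, the generated names such as `sequence_1`, and the choice to normalize RNA to DNA by replacing U with T are all assumptions made for illustration.

```python
import re


def split_sequences(text):
    """Split submitted text into (name, sequence) pairs.

    Hypothetical helper, not part of Popcorn. Input starting with '>' is
    treated as FASTA; anything else is treated as sequences separated by
    blank lines.
    """
    text = text.strip()
    records = []
    if text.startswith(">"):
        # FASTA: a '>' header line followed by one or more sequence lines.
        name, chunks = None, []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:].strip(), []
            else:
                chunks.append(line)
        if name is not None:
            records.append((name, "".join(chunks)))
    else:
        # Plain text: one or more blank lines separate different sequences.
        blocks = re.split(r"\n\s*\n", text)
        for i, block in enumerate(blocks, start=1):
            seq = "".join(block.split())  # drop line breaks inside a sequence
            if seq:
                records.append((f"sequence_{i}", seq))
    # Assumption: treat RNA and DNA alike by upper-casing and mapping U to T.
    return [(name, seq.upper().replace("U", "T")) for name, seq in records]
```

Called on `">a\nACGU\n>b\nGGG"`, this sketch yields two records named `a` and `b`; called on two blocks of nucleotides separated by a blank line, it yields `sequence_1` and `sequence_2`.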
As output, for each sequence, Popcorn reports the probability that the sequence is protein coding, as determined by the machine learning algorithm. If the coding probability is greater than 0.5, then the sequence is predicted to be CODING, and if the coding probability is less than 0.5, then the sequence is predicted to be NONCODING.
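The decision rule on the coding probability can be written as a short Python sketch. The function name `classify` is hypothetical, and the handling of a probability of exactly 0.5 is an assumption: the description above does not say what is reported in that case, so the sketch returns a separate `UNDETERMINED` label rather than guessing.

```python
def classify(coding_probability, threshold=0.5):
    """Map a coding probability to a label (hypothetical helper).

    Follows the rule stated above: above 0.5 is CODING, below 0.5 is
    NONCODING.
    """
    if not 0.0 <= coding_probability <= 1.0:
        raise ValueError("coding_probability must be between 0 and 1")
    if coding_probability > threshold:
        return "CODING"
    if coding_probability < threshold:
        return "NONCODING"
    # Assumption: the exact-threshold case is not specified by the source.
    return "UNDETERMINED"


# Example: label a few made-up probabilities (illustrative values only).
for p in (0.93, 0.12):
    print(p, classify(p))
```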
The Popcorn website enables searching any prokaryotic genome listed as reference or representative by NCBI's RefSeq, as well as many others. There are thousands of such genomes. The site cannot support searching every prokaryotic genome, however, because there are too many. If you are interested in using Popcorn to search a genome other than those provided, you have two options.
Option #1. Download the source code and run Popcorn on your own machine.
Option #2. We do our best to support reasonable requests for individual genomes to be added to the Popcorn website. We receive many such requests, so we kindly ask for your understanding and patience. Unfortunately, we cannot process batch requests, e.g., "please add these 20 genomes." If you would like to request that a genome be added to the website, please follow these steps:
Popcorn source code is available on GitHub
Popcorn is currently under review. If/when a manuscript describing Popcorn is published, a citation to the paper will be provided here.
Contact Us