Popcorn

What is Popcorn?

Popcorn stands for PrOkaryotic Prediction of Coding OR Noncoding. Popcorn uses machine learning to predict whether prokaryotic genomic sequences are coding or noncoding. While Popcorn can be used for sequences of any length, it is designed especially for the more challenging case of short sequences, such as those corresponding to sORFs (short ORFs) and sRNAs (small regulatory noncoding RNAs). An example use case is the analysis that follows an RNA-seq experiment, where many transcripts are commonly observed that do not map to annotated genes. Do these transcripts correspond to novel sORFs, sRNAs, or something else? Popcorn can help quickly identify which transcripts are most likely to be protein coding. At its core, Popcorn employs a neural network that has been trained on thousands of documented sORFs, noncoding RNAs, and other coding and noncoding prokaryotic genomic sequences.

Input

As input, Popcorn requests the name of a genome and one or more prokaryotic genomic sequences.

When a user enters a few characters from the genome name, a drop-down box appears from which the user can select their genome of interest. If the genomic origin of a sequence is unknown, the genome may be left blank. In this case, Popcorn will use generic prokaryotic information to predict whether the sequence is coding or not. When available, the name of a genome should be provided, as Popcorn will then be able to use genome-specific information and its results will be more accurate.

One or more genomic sequences may be submitted. If multiple sequences are submitted, they should either be in FASTA format or be separated from one another by blank lines. Sequences may contain either DNA or RNA nucleotides.
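The two accepted input layouts can be illustrated with a short Python sketch. This is not Popcorn's own parser. The function name parse_sequences, the generated names seq1, seq2, and so on, and the conversion of RNA to DNA letters (U to T) are all assumptions made for illustration.

```python
import re


def parse_sequences(text):
    """Split submitted text into (name, sequence) pairs.

    Hypothetical helper: handles FASTA input (lines starting with '>')
    or plain sequences separated by blank lines.
    """
    records = []
    if text.lstrip().startswith(">"):
        # FASTA: a header line starts a new record; other lines extend it.
        name, chunks = None, []
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:].strip(), []
            elif line:
                chunks.append(line)
        if name is not None:
            records.append((name, "".join(chunks)))
    else:
        # Plain text: blank lines separate sequences; names are generated.
        blocks = re.split(r"\n\s*\n", text.strip())
        for i, block in enumerate(blocks, start=1):
            seq = "".join(block.split())
            if seq:
                records.append((f"seq{i}", seq))
    # Assumption: normalize to uppercase DNA letters so that DNA and RNA
    # input are treated alike.
    return [(n, s.upper().replace("U", "T")) for n, s in records]
```

For example, parse_sequences("ACGU\nAC\n\nGGG") yields two records, because the blank line marks the boundary between the first and second sequence.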

Output

As output, for each sequence, Popcorn reports the probability that the sequence is protein coding, as determined by the machine learning algorithm. If the coding probability is greater than 0.5, then the sequence is predicted to be CODING, and if the coding probability is less than 0.5, then the sequence is predicted to be NONCODING.
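The decision rule above can be written as a small Python sketch. The function name label_prediction is hypothetical. Because the rule does not say how a probability of exactly 0.5 is labeled, this sketch returns None in that case rather than guess.

```python
def label_prediction(coding_probability, threshold=0.5):
    """Map a coding probability to the label described in the Output section."""
    if not 0.0 <= coding_probability <= 1.0:
        raise ValueError("coding_probability must be between 0 and 1")
    if coding_probability > threshold:
        return "CODING"
    if coding_probability < threshold:
        return "NONCODING"
    # Exactly at the threshold: the behavior is not documented, so no
    # label is assigned here (an assumption of this sketch).
    return None
```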

Interested in searching a different genome from those provided?

The Popcorn website enables searching any prokaryotic genome listed as reference or representative by NCBI's RefSeq, as well as many others. There are thousands of such genomes. The site cannot support searching every prokaryotic genome because there are too many. If you are interested in using Popcorn to search a genome other than those provided, you have two options.

Option #1. Download the source code and run Popcorn on your own machine.

Option #2. We do our best to support reasonable requests for individual genomes to be added to the Popcorn website. We receive many such requests, so we kindly ask for your understanding and patience. Unfortunately, we cannot process batch requests, e.g., "please add these 20 genomes." If you would like to request that a genome be added to the website, please follow these steps:

1. Kindly double-check that the genome is not already available on the website.
2. Find the NCBI RefSeq genome assembly identifier for your genome of interest, e.g., GCF_?????????.1. This assembly accession identifier can be found in the first column here for archaea and here for bacteria. We use this identifier to add a genome. If your genome of interest does not have such an identifier, we will not be able to add it to the website and you will need to use Option #1 above and run Popcorn on your own machine.
3. Submit the Contact Form and provide the assembly accession identifier (GCF_?????????.1) for your genome of interest.
4. Over the next week or so, check the website for inclusion of your genome (make sure to refresh the Popcorn webpage). We process requests about once a week. We don't normally notify you when requests have been processed, so please check the website.
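Before submitting a request, it may help to check that an identifier has the expected shape. The Python sketch below is not part of Popcorn, and the function name is hypothetical. It assumes that each ? in GCF_?????????.1 stands for a digit and that the version after the dot may be any positive integer, not only 1.

```python
import re

# Assumed shape: "GCF_", nine digits, a dot, then a version number.
ACCESSION_RE = re.compile(r"GCF_\d{9}\.[1-9]\d*")


def is_refseq_assembly_accession(identifier):
    """Return True if the string looks like a RefSeq assembly accession."""
    return ACCESSION_RE.fullmatch(identifier.strip()) is not None
```

A string beginning with GCA_ rather than GCF_ fails this check, as does one with fewer than nine digits. Passing the check only confirms the format. It does not confirm that the assembly exists in RefSeq.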

Source Code

Popcorn source code is available on GitHub.

Citing Popcorn

Popcorn: prediction of short coding and noncoding genomic sequences in prokaryotes. Alison Kyrouz, Lian Liu, Lixin Qin, and Brian Tjaden. Bioinformatics, 41(5):btaf250, 2025.