The GC content of a genomic sequence is the percentage of its nucleotides that are either guanine (G) or cytosine (C). Different genomes have widely different GC contents. For example, the genomes of bacteria in the genus Anaeromyxobacter have a GC content of about 75%, whereas the genomes of bacteria in the genus Buchnera have a GC content of about 25%.
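As a concrete illustration of the definition, here is a minimal sketch in Python. It assumes the sequence is given as a plain string of the letters A, C, G and T. The function name gc_content is chosen here for illustration, and any other characters (such as N for an unknown base) are simply counted as non-GC positions.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of nucleotides in `sequence` that are G or C."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    # Normalize case so that 'g' and 'G' are counted alike.
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


if __name__ == "__main__":
    print(gc_content("ATGC"))  # two of the four nucleotides are G or C
```

For a whole genome, the same calculation would be applied to the full sequence, for example after reading it from a file.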
GC content differs not only between genomes but also within a genome. For example, regions of a genome that correspond to genes may have a higher GC content than regions of the same genome that do not correspond to genes. As a result, GC content can serve as a clue when trying to locate genes in a genome: we may hypothesize that a region of high GC content is more likely to be a gene than a region of low GC content.
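One simple way to look at variation within a genome is to compute GC content in consecutive windows along the sequence. The sketch below is only an illustration of that idea, not a gene-finding method. The function name windowed_gc_content and the window_size and step parameters are assumptions made for this example.

```python
def windowed_gc_content(sequence: str, window_size: int, step: int):
    """Return (start, gc_percent) pairs for windows along `sequence`.

    Windows are `window_size` nucleotides long and start every `step`
    positions; a trailing partial window is ignored.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    sequence = sequence.upper()
    results = []
    for start in range(0, len(sequence) - window_size + 1, step):
        window = sequence[start:start + window_size]
        gc_count = window.count("G") + window.count("C")
        results.append((start, 100.0 * gc_count / window_size))
    return results


if __name__ == "__main__":
    # Hypothetical toy sequence: an AT-rich stretch followed by a GC-rich one.
    for start, percent in windowed_gc_content("ATATATATGCGCGCGC", 8, 8):
        print(start, percent)
```

Windows whose GC content is well above that of the genome as a whole would then be candidate regions to examine more closely. A realistic window size would be far larger than in this toy example.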
GC content information can also be useful when searching for patterns in genomic data. For instance, in eukaryotic genomes, many genes are preceded by the nucleotide sequence TATAAA, known as a TATA box. When trying to identify where genes are located in a genome, the presence of a TATA box may signal the beginning of a gene. However, the presence of a TATA box may be a more useful signal in a genome with high GC content where TATA boxes are less likely to occur by chance, as opposed to genomes with low GC content where TATA boxes are more likely to occur by chance.
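To make the "by chance" argument concrete, the sketch below estimates the probability that a given motif appears at one fixed position in a random sequence. It rests on a deliberately simple model that is an assumption of this example, not something real genomes obey exactly: each nucleotide is drawn independently, G and C are equally frequent, and A and T are equally frequent. The function name motif_probability is likewise chosen for illustration.

```python
def motif_probability(motif: str, gc_percent: float) -> float:
    """Probability that `motif` occurs at one fixed position by chance.

    Assumed model: independent nucleotides, P(G) = P(C) = gc/2 and
    P(A) = P(T) = (1 - gc)/2, where gc is the GC content as a fraction.
    """
    if not 0.0 <= gc_percent <= 100.0:
        raise ValueError("gc_percent must be between 0 and 100")
    gc_fraction = gc_percent / 100.0
    p_gc = gc_fraction / 2.0          # probability of one specific G or C
    p_at = (1.0 - gc_fraction) / 2.0  # probability of one specific A or T
    probability = 1.0
    for nucleotide in motif.upper():
        if nucleotide in "GC":
            probability *= p_gc
        elif nucleotide in "AT":
            probability *= p_at
        else:
            raise ValueError(f"unexpected nucleotide: {nucleotide!r}")
    return probability


if __name__ == "__main__":
    high_gc = motif_probability("TATAAA", 75.0)  # GC-rich genome
    low_gc = motif_probability("TATAAA", 25.0)   # GC-poor genome
    print(high_gc, low_gc, low_gc / high_gc)
```

Under this model, every letter of TATAAA has probability 0.125 in a genome with 75% GC content and 0.375 in a genome with 25% GC content. A chance occurrence at any given position is therefore 3^6 = 729 times more likely in the low-GC genome.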