Abstract
Correspondence analysis of 28 proteomes selected to
span the entire realm of prokaryotes revealed universal
biases in the proteins’ amino acid distribution. Integral
Inner Membrane Proteins always form an individual
cluster, which can then be used to predict protein
localisation in unknown proteomes, independently of
the organism’s biotope or kingdom. Orphan proteins are
consistently rich in aromatic residues. Another ubiquitous
bias is that the amino acid composition is driven by
the G+C content of the first codon position. An
unexpected bias is driven, in many proteomes, by the
AAN box of the genetic code, suggesting some functional
biochemical relationship between asparagine and lysine.
Less significant biases are driven by the rare amino
acids cysteine and tryptophan. Some biases allow identification
of species-specific functions or localisations, such as
surface or exported proteins. Errors in genome annotations
are also revealed by correspondence analysis,
making it useful for quality control and correction.
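
As an illustration of the method named above, the following is a minimal sketch of correspondence analysis applied to a table of amino acid counts (one row per protein, one column per amino acid). It is not the authors' implementation: the function names `composition_table` and `correspondence_analysis` are hypothetical, the analysis is the textbook formulation based on a singular value decomposition of standardized residuals, and it assumes that every protein and every amino acid has a non-zero total count.

```python
import numpy as np

# The 20 standard amino acids, in one-letter code (column order of the table).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_table(sequences):
    """Count each amino acid in each protein sequence (rows = proteins)."""
    return np.array([[seq.count(a) for a in AMINO_ACIDS] for seq in sequences])


def correspondence_analysis(counts, n_axes=2):
    """Textbook correspondence analysis of a contingency table.

    Returns the principal coordinates of rows and columns on the first
    n_axes axes, and the inertia (squared singular value) of every axis.
    Assumes all row and column totals are strictly positive.
    """
    counts = np.asarray(counts, dtype=float)
    if (counts.sum(axis=0) == 0).any() or (counts.sum(axis=1) == 0).any():
        raise ValueError("every row and column must have a non-zero total")

    P = counts / counts.sum()   # correspondence matrix
    r = P.sum(axis=1)           # row masses
    c = P.sum(axis=0)           # column masses

    # Standardized residuals from the independence model r * c.
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)

    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates: proteins (rows) and amino acids (columns).
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords[:, :n_axes], col_coords[:, :n_axes], sv ** 2
```

In such a sketch, proteins with similar composition fall close together on the leading axes, and the amino acids plotted on the same axes indicate which residues drive each bias; clusters would then have to be identified by a separate step that is not shown here.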