Whole Genome Phylogeny via Complete Composition Vectors

  • Technical report TR05-06. The availability of complete genomic sequences allows us to infer the evolutionary footprints between species using a global strategy. However, the length of these genomic sequences challenges both computational efficiency and the optimal representation of information in phylogenetic analyses. In this paper, a new method called complete composition vector (CCV) is described to infer evolutionary relationships between species using their complete genomic sequences. In this method, the character string frequencies in the complete genomic sequence of each species are represented by a complete composition vector in a high-dimensional space. After the random mutation background is filtered out, the cosines of the angles between the representing vectors are converted into pairwise evolutionary distances, from which the phylogenetic tree is constructed using the neighbor-joining algorithm. The method bypasses the complexity of performing multiple sequence alignments and avoids the ambiguity of choosing individual genes, while it is expected to retain the rich evolutionary information contained in the whole genomic sequence. To verify its strengths, the method was applied to infer the evolutionary footprints of coronaviruses and microbes. On a typical desktop PC, it took only one and a half days to construct the phylogeny for 109 species, comprising 103 microbes and 6 eukaryotes. The phylogenetic trees generated by our method are highly consistent with those annotated by biologists. | TRID-ID TR05-06
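
    The pipeline summarized in the abstract (string frequencies, background filtering, cosine, pairwise distance) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the details here are assumptions rather than the report's actual definitions: the background is estimated with a Markov-style prediction from shorter strings, the distance is taken as (1 - cos) / 2, and the string lengths 3 to 5 and all function names are hypothetical. The neighbor-joining step is not implemented; the resulting distances would be handed to a separate tree-building routine.

    ```python
    from collections import Counter
    from math import sqrt


    def kmer_freqs(seq, k):
        """Relative frequency of every length-k substring observed in seq."""
        total = len(seq) - k + 1
        if total <= 0:
            return {}
        counts = Counter(seq[i:i + k] for i in range(total))
        return {s: c / total for s, c in counts.items()}


    def composition_vector(seq, k):
        """Background-filtered k-string vector for one string length k >= 3.

        ASSUMPTION: the expected frequency of a string is predicted from its
        (k-1)-prefix, (k-1)-suffix and (k-2)-middle; the component stored is
        the relative deviation (observed - expected) / expected.
        Strings never observed in seq are skipped (a simplification).
        """
        f_k = kmer_freqs(seq, k)
        f_k1 = kmer_freqs(seq, k - 1)
        f_k2 = kmer_freqs(seq, k - 2)
        vec = {}
        for s, observed in f_k.items():
            middle = f_k2.get(s[1:-1], 0.0)
            if middle == 0.0:
                continue
            expected = f_k1.get(s[:-1], 0.0) * f_k1.get(s[1:], 0.0) / middle
            if expected > 0.0:
                vec[(k, s)] = (observed - expected) / expected
        return vec


    def complete_composition_vector(seq, k_min=3, k_max=5):
        """Concatenate the per-length vectors (hypothetical range of k)."""
        vec = {}
        for k in range(k_min, k_max + 1):
            vec.update(composition_vector(seq, k))
        return vec


    def cosine(u, v):
        """Cosine of the angle between two sparse vectors stored as dicts."""
        dot = sum(x * v[key] for key, x in u.items() if key in v)
        norm_u = sqrt(sum(x * x for x in u.values()))
        norm_v = sqrt(sum(x * x for x in v.values()))
        if norm_u == 0.0 or norm_v == 0.0:
            return 0.0
        return dot / (norm_u * norm_v)


    def cv_distance(u, v):
        """ASSUMPTION: map cosine in [-1, 1] to a distance in [0, 1]."""
        return (1.0 - cosine(u, v)) / 2.0


    def distance_matrix(genomes):
        """Pairwise distances for a dict mapping species name -> sequence."""
        vectors = {name: complete_composition_vector(s) for name, s in genomes.items()}
        names = sorted(vectors)
        matrix = {}
        for i, a in enumerate(names):
            matrix[(a, a)] = 0.0
            for b in names[i + 1:]:
                d = cv_distance(vectors[a], vectors[b])
                matrix[(a, b)] = d
                matrix[(b, a)] = d
        return matrix


    if __name__ == "__main__":
        # Toy sequences for illustration only; real inputs are whole genomes.
        toy = {
            "sp1": "ACGTACGGTCAGTTACGATCGATTGCA",
            "sp2": "TTGACCGATAGGCTAACGTTAGCCATG",
        }
        for pair, d in sorted(distance_matrix(toy).items()):
            print(pair, round(d, 4))
    ```

    Storing each vector as a sparse dict keyed by (length, string) keeps memory proportional to the number of strings actually observed, which matters because the full space grows exponentially with the string length.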

  • License
    Attribution 3.0 International