Scalable de Novo Genome Assembly Using a Pregel-Like Graph-Parallel System

Academic Article


  • De novo genome assembly is the process of stitching short DNA sequences together to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present a toolkit, called PPA-Assembler, for de novo genome assembly in a distributed setting. The operations in our toolkit provide strong performance guarantees and can be combined to implement various sequencing strategies. PPA-Assembler adopts the popular de Bruijn graph based approach for sequencing, and each operation is implemented as a program in Google's Pregel framework, which can be easily deployed in a generic cluster. Experiments on large real and simulated datasets demonstrate that PPA-Assembler is much more efficient than state-of-the-art assemblers while providing comparable sequencing quality. PPA-Assembler has been open-sourced at
  • Digital Object Identifier (doi)

  • PubMed ID

  • 23719529
  • Author List

  • Guo G; Chen H; Yan D; Cheng J; Chen JY; Chong Z
  • Start Page

  • 731
  • End Page

  • 744
  • Volume

  • 18
  • Issue

  • 2
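
The de Bruijn graph based approach named in the abstract can be illustrated with a minimal single-machine sketch. This is not PPA-Assembler's code and does not use Pregel. The function names `build_de_bruijn` and `assemble_contigs`, the toy reads, and the choice of k = 4 are all assumptions made for illustration. The sketch ignores sequencing errors and reverse complements, which a real assembler must handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer in a read becomes an edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble_contigs(graph):
    """Merge unambiguous paths (nodes with in-degree 1 and out-degree 1)
    into contigs. Isolated cycles are skipped in this simplified sketch."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    nodes = set(graph) | set(in_deg)

    def is_simple(node):
        return in_deg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for start in nodes:
        if is_simple(start):
            continue  # contigs begin only at branching or end nodes
        for nxt in graph.get(start, ()):
            contig = start + nxt[-1]
            cur = nxt
            while is_simple(cur):
                cur = next(iter(graph[cur]))
                contig += cur[-1]
            contigs.append(contig)
    return sorted(contigs)


if __name__ == "__main__":
    # Hypothetical overlapping reads, error-free, from one short sequence.
    reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
    print(assemble_contigs(build_de_bruijn(reads, k=4)))
```

In a Pregel-style system, each (k-1)-mer would typically be a vertex that exchanges messages with its neighbors over supersteps instead of being walked sequentially. The way PPA-Assembler partitions this work is described in the paper itself, not in this sketch.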