A hybrid framework for protein sequence clustering and classification using signature motif information

Academic Article

Abstract

  • In this paper, we propose an unsupervised hybrid framework for protein sequence clustering and classification that incorporates protein structural motif information. The framework consists of three stages: protein structural motif scanning, hybrid clustering, and sequence classification. Incorporating the structural motifs detected by the ScanProsite service provides a better measure of sequence similarity. The proposed two-phase hybrid clustering approach combines the strengths of hierarchical and partition clustering. Phase I uses hierarchical agglomerative clustering to pre-cluster multi-aligned sequences. Phase II performs partition clustering, which initializes its partition from the result of Phase I and uses profile Hidden Markov Models (HMMs) to represent clusters. The profile HMMs are then stored in a database for classifying unknown sequences, which is done by finding the best alignment of a sequence to each existing profile HMM. Our experiments demonstrate the effectiveness and efficiency of the proposed framework for biological sequence clustering and classification. © 2009 IOS Press and the author(s). All rights reserved.
    Author List

  • Chen WB; Zhang C
  • Start Page

  • 353
  • End Page

  • 365
  • Volume

  • 16
  • Issue

  • 4
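
The two-phase hybrid clustering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several simplifying assumptions: the `distance` function (aligned mismatch fraction blended with a Jaccard term over motif sets, weighted by a hypothetical `motif_weight`) is invented for illustration, the motif labels `M1` and `M2` are placeholders rather than real ScanProsite output, and cluster medoids stand in for the profile HMMs that the paper uses to represent clusters.

```python
from itertools import combinations


def distance(a, b, motifs_a=frozenset(), motifs_b=frozenset(), motif_weight=0.5):
    """Hypothetical distance: aligned mismatch fraction blended with motif-set dissimilarity."""
    mismatch = sum(x != y for x, y in zip(a, b)) / max(len(a), 1)
    union = motifs_a | motifs_b
    if not union:
        return mismatch  # no motif hits on either side: use sequence mismatch alone
    jaccard_dist = 1.0 - len(motifs_a & motifs_b) / len(union)
    return (1.0 - motif_weight) * mismatch + motif_weight * jaccard_dist


def agglomerate(dist, k):
    """Phase I: average-linkage agglomerative clustering, merged down to k clusters."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]  # a < b, so deleting b leaves index a valid
        del clusters[b]
    return clusters


def medoid(dist, cluster):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))


def refine(dist, clusters, max_iter=20):
    """Phase II: partition step seeded by Phase I (medoids stand in for profile HMMs)."""
    for _ in range(max_iter):
        medoids = [medoid(dist, c) for c in clusters]
        new = [[] for _ in medoids]
        for i in range(len(dist)):
            nearest = min(range(len(medoids)), key=lambda m: dist[i][medoids[m]])
            new[nearest].append(i)
        new = [c for c in new if c]  # drop clusters that lost all members
        if new == clusters:
            break
        clusters = new
    return clusters, [medoid(dist, c) for c in clusters]


def classify(seq, seqs, medoids):
    """Assign an unknown sequence to the cluster with the closest representative."""
    return min(range(len(medoids)), key=lambda m: distance(seq, seqs[medoids[m]]))


# Toy, already-aligned sequences with placeholder motif labels (assumptions).
seqs = ["ACDEFG", "ACDEFH", "ACDQFG", "WYWYKK", "WYWYKR", "WYWHKK"]
motifs = [frozenset({"M1"})] * 3 + [frozenset({"M2"})] * 3

n = len(seqs)
dist = [[distance(seqs[i], seqs[j], motifs[i], motifs[j]) for j in range(n)]
        for i in range(n)]

clusters, medoids = refine(dist, agglomerate(dist, k=2))
label = classify("ACDEFK", seqs, medoids)
print(clusters, medoids, label)
```

Seeding the partition step with the hierarchical result avoids the sensitivity to random initialization that partition methods usually have, which is the motivation the abstract gives for combining the two. In a fuller version, each cluster would be summarized by a profile HMM and `classify` would score the unknown sequence against each model.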