This paper proposes a multimodal framework that clusters spam images so that images originating from the same spam source are grouped together. Identifying the common sources of spam images provides evidence for tracking spam gangs. To this end, text recognition and visual feature extraction are performed on each image. A two-level clustering method is then applied: images with visually similar illustrations are first grouped together, and the resulting first-level clusters are then refined using the textual clues, where available, contained in the spam images. Our experimental results show the effectiveness of the proposed framework. © 2009 ACADEMY PUBLISHER.
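
The two-level idea can be illustrated with a minimal sketch. It is not the method evaluated in the paper. Several things in it are assumptions made only for illustration: the visual features are plain numeric vectors compared by Euclidean distance, the recognized text is a list of tokens compared by Jaccard similarity, clusters are formed by single-link thresholding, and images without recognized text stay together within their visual cluster. The function names, the thresholds `visual_eps` and `text_min_sim`, and the toy data are likewise hypothetical.

```python
from itertools import combinations
from math import dist


def threshold_clusters(items, is_similar):
    """Single-link grouping: items joined by a chain of similar pairs share a cluster."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if is_similar(items[i], items[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def jaccard(a, b):
    """Jaccard similarity of two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def two_level_cluster(images, visual_eps=0.5, text_min_sim=0.5):
    """Level 1: group by visual similarity. Level 2: refine by text where present.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    level1 = threshold_clusters(
        images, lambda a, b: dist(a["visual"], b["visual"]) <= visual_eps
    )
    refined = []
    for cluster in level1:
        with_text = [im for im in cluster if im["text"]]
        without_text = [im for im in cluster if not im["text"]]
        if with_text:
            # Split the visual cluster wherever the recognized text disagrees.
            refined.extend(threshold_clusters(
                with_text,
                lambda a, b: jaccard(a["text"], b["text"]) >= text_min_sim,
            ))
        if without_text:
            # Assumption: with no textual clue, keep the visual grouping as is.
            refined.append(without_text)
    return [sorted(im["id"] for im in c) for c in refined]


# Made-up toy data: "visual" stands in for extracted visual features,
# "text" for tokens produced by text recognition (empty if none found).
images = [
    {"id": "a", "visual": (0.0, 0.0), "text": ["cheap", "watches"]},
    {"id": "b", "visual": (0.1, 0.0), "text": ["cheap", "watches", "now"]},
    {"id": "c", "visual": (0.0, 0.2), "text": ["stock", "alert"]},
    {"id": "d", "visual": (5.0, 5.0), "text": []},
    {"id": "e", "visual": (5.1, 5.0), "text": []},
]

clusters = two_level_cluster(images)
print(clusters)
```

On this toy input, the first level places `a`, `b`, and `c` in one visual cluster and `d` and `e` in another. The second level then separates `c` from `a` and `b` because its text differs, while `d` and `e`, which carry no text, remain grouped by visual similarity alone.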