Owing to the "semantic gap" between machine-readable low-level features (e.g., visual features such as color and texture) and high-level human concepts, it is inherently hard for a machine to automatically identify and retrieve events from videos according to their semantics merely by reading pixels and frames. This paper proposes a human-centered framework for mining and retrieving events and applies it to indoor surveillance video databases. The goal is to locate video sequences containing events of interest to the user of the surveillance video database. The framework starts by tracking objects. Since surveillance video cannot be segmented into natural shots, Common Appearance Intervals (CAIs), which play a role analogous to shots in movies, are used to segment it; this segmentation also provides an efficient indexing scheme for retrieval. The resulting trajectories are spatiotemporal in nature, and features extracted from them are used to construct event models. In the retrieval phase, the database user interacts with the machine and provides relevance feedback on the retrieval results. The proposed learning algorithm learns from the spatiotemporal data, the event model, and the feedback, and returns refined results to the user. Specifically, the learning algorithm is a Coupled Hidden Markov Model (CHMM), which models the interactions of objects in CAIs and recognizes hidden patterns among them. This iterative learning-and-retrieval process helps bridge the semantic gap, and the experimental results show the effectiveness of the proposed framework: retrieval accuracy increases over the iterations and compares favorably with other methods. © 2009 Elsevier B.V. All rights reserved.
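A CHMM couples two Markov chains so that each object's next hidden state depends on the previous hidden states of both objects, which is what lets the model capture object interactions. The abstract gives no implementation details; as a hedged illustration only, the sketch below computes the exact forward (log-likelihood) recursion of a two-chain CHMM over the joint state space, with randomly generated toy parameters. All variable names, dimensions, and parameter values are assumptions for illustration, not the paper's actual event model.

```python
import numpy as np

# Toy dimensions: states per chain, observation symbols (all assumed).
nA, nB, nObs = 2, 2, 3
rng = np.random.default_rng(0)

def normalize(x, axis=-1):
    """Make the last axis a proper probability distribution."""
    return x / x.sum(axis=axis, keepdims=True)

# Coupled transitions: P(a_t | a_{t-1}, b_{t-1}) and P(b_t | a_{t-1}, b_{t-1}).
# The dependence on BOTH previous states is the "coupling" of the CHMM.
transA = normalize(rng.random((nA, nB, nA)))
transB = normalize(rng.random((nA, nB, nB)))
# Per-chain emissions: P(obsA_t | a_t) and P(obsB_t | b_t).
emitA = normalize(rng.random((nA, nObs)))
emitB = normalize(rng.random((nB, nObs)))
# Initial state distributions (factored across chains for simplicity).
initA = normalize(rng.random(nA))
initB = normalize(rng.random(nB))

def chmm_forward(obsA, obsB):
    """Exact forward pass over the joint (nA x nB) state space.

    Returns log P(obsA, obsB) under the coupled model, with per-step
    rescaling for numerical stability.
    """
    # alpha[i, j] = P(states a_t=i, b_t=j and observations up to t).
    alpha = np.outer(initA * emitA[:, obsA[0]], initB * emitB[:, obsB[0]])
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    for t in range(1, len(obsA)):
        # alpha'[i,j] = sum_{p,q} alpha[p,q] P(i | p,q) P(j | p,q)
        alpha = np.einsum('pq,pqi,pqj->ij', alpha, transA, transB)
        alpha *= np.outer(emitA[:, obsA[t]], emitB[:, obsB[t]])
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik
```

In a retrieval setting, one CHMM per event class would be trained (e.g., via EM), and a candidate CAI would be scored by this likelihood under each model; the joint-state forward pass is exact but costs O(T (nA nB)^2)-like work, which is why approximate inference is often used for larger state spaces.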