Spatio-temporal Local Features


The idea of extracting salient regions from an image and describing them, so that this reduced data can be used for classification, is based on techniques for 2D images. This approach has been extended into the time dimension, and further extended with a one-dimensional Gabor filter to obtain a denser distribution of points.
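The temporal extension described above can be illustrated with a quadrature pair of 1D Gabor filters applied to the intensity of a single pixel over time. This is only a minimal sketch under assumptions: the text does not give the filter definition, so the kernel shape, the parameter names `tau` and `omega`, and the squared-sum response are one common formulation and not necessarily the one used here. Spatial smoothing of each frame is left out.

```python
import math

def gabor_pair(tau, omega):
    """Build even and odd 1D Gabor kernels (assumed formulation)."""
    half = int(math.ceil(2 * tau))
    ts = range(-half, half + 1)
    envelope = [math.exp(-t * t / tau ** 2) for t in ts]
    even = [-math.cos(2 * math.pi * t * omega) * e for t, e in zip(ts, envelope)]
    odd = [-math.sin(2 * math.pi * t * omega) * e for t, e in zip(ts, envelope)]
    return even, odd

def correlate_same(signal, kernel):
    """Correlate signal with kernel, zero-padded, output as long as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += signal[j] * w
        out.append(acc)
    return out

def temporal_response(signal, tau=2.0, omega=0.25):
    """Squared-sum response of the quadrature pair along the time axis."""
    even, odd = gabor_pair(tau, omega)
    r_even = correlate_same(signal, even)
    r_odd = correlate_same(signal, odd)
    return [a * a + b * b for a, b in zip(r_even, r_odd)]
```

In this sketch, a pixel whose intensity oscillates at a frequency near `omega` (in cycles per frame) produces a much stronger response than a pixel of the same amplitude that does not change. Thresholding or taking local maxima of this response over the whole video volume would then yield the interest points.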

 

This gives us a reasonably easy way of generating ground truth and thus training data: the frame number of a video gives the classification ground truth. Examples can be found below for actions and gestures where the application should react with an alarm. One important property of this kind of training data is the chosen background: it is highly structured, and some very salient colors can be found in the scene. If purely spatial corner detection were used for event classification, the most salient location in the scene would clearly be the checkerboard at the bottom right.

To classify the actions shown in the videos, we aim to group the retrieved data into meaningful clusters of descriptors. The data should discriminate the trained actions into two classes: usual and emergency situations. This is done with k-means, which is parameterized by the number of clusters and by the distance measure used as its minimization function. The experiment is run with a decreasing number of clusters. We evaluate two different distance measures for the classification of descriptors.
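The clustering step can be sketched as a small k-means with a pluggable distance function. The text does not name the two distance measures that were evaluated, so `euclidean` and `manhattan` below are stand-ins chosen purely for illustration. The deterministic initialization (first `k` points) and the fixed iteration count are also assumptions. Note that updating centers by the mean only strictly minimizes the squared Euclidean distance; with other measures it is a heuristic.

```python
def euclidean(a, b):
    """Euclidean (L2) distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Manhattan (L1) distance between two descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans(points, k, distance, iterations=20):
    """Cluster descriptors into k groups using the given distance function."""
    # Assumed initialization: take the first k points as starting centers.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

An experiment with a decreasing number of clusters would call `kmeans` in a loop, for example over `k = 8, 4, 2`, once per distance function, and compare how well the resulting clusters separate usual from emergency frames.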

Spatio-temporal features can detect salient movement and are very stable against structured and moving backgrounds. Furthermore, linear camera movements can be disregarded, which leads to a very stable estimate of the important movement and its properties. Reliable classification into several different action classes is possible, and extending these abilities depends entirely on the classification scheme used. The main drawback of this method is the computationally expensive extraction of local information from the visual data. One way to use these features in a production home monitoring system would be to apply this approach for a more precise classification after another technique has already detected something out of the ordinary: a more sophisticated classification can then be done a few seconds after the accident, using previously buffered visual data. Combining this approach with faster methods such as simple hardware frame differencing might lead to a highly flexible monitoring system that can deal with a large number of incidents.
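The two-stage idea above can be sketched as a ring buffer of recent frames combined with a cheap frame-differencing trigger. Everything here is hypothetical and only meant as an illustration: the class name `BufferedTrigger`, the `classify` callback standing in for the expensive spatio-temporal classifier, the buffer size, and the threshold are all assumptions. Frames are represented as flat lists of grayscale values, and the frame differencing is done in software rather than in hardware.

```python
from collections import deque

def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two frames of equal size."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

class BufferedTrigger:
    """Keep recent frames and run the slow classifier only when motion is seen."""

    def __init__(self, classify, buffer_size=50, threshold=10.0):
        self.classify = classify                  # expensive second-stage classifier
        self.buffer = deque(maxlen=buffer_size)   # oldest frames drop out automatically
        self.threshold = threshold

    def push(self, frame):
        """Add a frame; return the classifier result if the trigger fires, else None."""
        previous = self.buffer[-1] if self.buffer else None
        self.buffer.append(frame)
        if previous is not None and mean_abs_diff(previous, frame) > self.threshold:
            return self.classify(list(self.buffer))
        return None
```

In this sketch the classifier runs as soon as the trigger fires. A real system would more likely wait a few seconds so that the buffer also contains the frames following the incident before handing it to the classifier.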