The project set-up

The final system should have the following set-up:


Low-cost webcams will be installed in all rooms of the home of an elderly or disabled person. Each room will have 1 to 4 cameras, depending on its layout. Multiple cameras are used mainly to avoid occlusions of the person in the video sequences and to allow 3D-based analysis algorithms. To avoid complicated wiring in the consumer's flat, wireless webcams can be used, which are now widely available.


All cameras of one home will be connected to a central computer, which will process the image sequences and ultimately perform the fall detection. Depending on the actual hardware, some parts of the preprocessing (such as simple motion detection) may be done in the cameras themselves, since some cameras already have this feature built in.


The processing of video data and the fall detection will be done fully automatically, so that no privacy issues arise and the consumer does not feel watched. This is all the more important because the camera system should also be installed in private places such as the bathroom and bedroom, which are the most likely places for a fall.


When a fall is detected, an alert signal (e.g. an SMS) will be sent to a designated assistant or a call center. On receiving a fall alarm, the assistant can call the fallen person and ask whether they need immediate help or can resolve the situation themselves (e.g. because it was a false alarm, because they are able to stand up on their own, or because assistance such as visiting family is nearby). A way of letting the user know that the system has “seen” them fall should also be considered, since the fallen person needs reassurance that help is on the way.
Additionally, a way of “clearing the system” should be considered, e.g. for situations when the elderly person has visitors, such as lively grandchildren who may throw themselves on the ground and trigger the system unnecessarily.

Figure: project set-up

 

Automatic fall detection



When using multiple cameras, a key question is at which point in the overall detection process the video data streams are fused. We compare two fundamentally different detection schemes, which we label early fusion and late fusion.

In the early fusion scheme, the motion detected by calibrated cameras is fused to obtain a 3D reconstruction of the human. Such methods offer a robust estimation of the human posture and thus view-invariant features for fall detection. The drawback of 3D reconstruction is that it requires camera calibration and higher computational effort. In the late fusion scheme, feature extraction and fall detection are performed individually for each camera. In a final voting step, the individual decisions are fused into an overall decision. This scheme tries to overcome the view dependence of the extracted features with a well-adapted fusion strategy that needs no camera calibration and less computational effort.
Figure: early fusion scheme
Figure: late fusion scheme


In our methodology we focus on simplicity and low computational effort, and therefore on fast processing without the need for high-end hardware, since the system has to be as cheap as possible to be affordable for the elderly. These design goals rule out, for instance, sophisticated model-based approaches to posture recognition. For both the early and the late fusion approach, posture recognition is kept simple and estimates only the general orientation of the human body, i.e. standing/vertical or lying/horizontal. This is achieved by using fuzzy logic to combine the extracted features into confidence values for the different posture states. In the late fusion approach, fuzzy logic is also used to fuse the confidence values of the individual cameras into a final estimate of the state and for the final fall detection.
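To make the late fusion step more concrete, the following minimal sketch fuses per-camera fall confidences into one decision. It is only an illustration under stated assumptions: the text does not specify the fuzzy fusion rules, so a weighted mean followed by a threshold is substituted here, and the function name `fuse_late`, the weights and the threshold of 0.5 are all hypothetical.

```python
def fuse_late(camera_confidences, weights=None, threshold=0.5):
    """Fuse per-camera fall confidences (each in [0, 1]) into one decision.

    Assumption: a weighted mean stands in for the fuzzy fusion step,
    whose actual rules are not given in the text.
    Returns (fused_confidence, fall_detected).
    """
    if not camera_confidences:
        raise ValueError("at least one camera confidence is required")
    if weights is None:
        # Hypothetical default: every camera counts equally.
        weights = [1.0] * len(camera_confidences)
    if len(weights) != len(camera_confidences):
        raise ValueError("one weight per camera is required")
    total = sum(weights)
    fused = sum(w * c for w, c in zip(weights, camera_confidences)) / total
    return fused, fused >= threshold


# Three cameras: two see the fall clearly, one is occluded.
fused, detected = fuse_late([0.9, 0.8, 0.1])
```

Because each camera contributes only a single confidence value, this kind of fusion needs no camera calibration, which matches the motivation given for the late fusion scheme.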


The steps of human detection, feature extraction and estimation of posture and fall confidence values are nearly identical in both approaches. They differ only in the kind of motion information on which feature extraction is performed (2D pixels for late fusion, 3D voxels for early fusion) and in whether posture and fall confidence values are estimated once (early fusion) or individually for each camera (late fusion).


Feature extraction

Inspired by previous work, we implement a collection of straightforward, semantically driven features. We distinguish between intra-frame features, which are computed within each frame and describe the character of the object, i.e. the posture, and an inter-frame feature, which expresses the character of the change between consecutive frames.


Intra-frame features

  • Bounding Box Aspect Ratio: The height of the bounding box (green in the image) surrounding the human divided by its width.
  • Orientation: The orientation of the major axis of the ellipse (blue in the image) fitted to the human.
  • Axis Ratio: The ratio between the lengths of the major axis and the minor axis of the ellipse fitted to the human.
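As an illustration, the three intra-frame features can be computed from a 2D binary foreground mask roughly as follows. This is a sketch, not the project's implementation: it assumes NumPy, fits the ellipse through the second-order moments (covariance) of the foreground pixel coordinates, and measures orientation in degrees from the horizontal, folded to [0, 90]. The function name `intra_frame_features` and the angle convention are assumptions.

```python
import numpy as np


def intra_frame_features(mask):
    """Return (aspect_ratio, orientation_deg, axis_ratio) for a 2D boolean mask.

    Returns None if the mask holds fewer than two foreground pixels.
    """
    ys, xs = np.nonzero(mask)
    if xs.size < 2:
        return None
    # Bounding box aspect ratio: height divided by width.
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    aspect_ratio = height / width
    # Moment-based ellipse fit: eigen-decomposition of the coordinate covariance.
    # y is negated so that "up" in the image is the positive direction.
    cov = np.cov(np.vstack([xs, -ys]))
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    minor_var, major_var = eigvals
    major_vec = eigvecs[:, 1]
    # Angle of the major axis from the horizontal, folded to [0, 90] degrees
    # (assumed convention: 90 = upright, 0 = horizontal).
    angle = abs(np.degrees(np.arctan2(major_vec[1], major_vec[0]))) % 180.0
    orientation = 180.0 - angle if angle > 90.0 else angle
    # Axis lengths scale with the square root of the variances.
    axis_ratio = np.sqrt(major_var / max(minor_var, 1e-9))
    return float(aspect_ratio), float(orientation), float(axis_ratio)


# An upright 40 x 10 pixel silhouette.
mask = np.zeros((60, 30), dtype=bool)
mask[10:50, 10:20] = True
features = intra_frame_features(mask)
```

For the upright rectangle above, the aspect ratio is 4, the orientation is close to 90 degrees and the axis ratio is close to 4; a lying silhouette would give an aspect ratio below 1 and an orientation near 0.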

Inter-frame feature

  • Motion Speed: The relative number of new motion pixels/voxels in the current frame compared to the previous frame.
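One possible reading of the motion speed feature is sketched below. The normalization is an assumption, since the text only says "relative number": here the count of motion pixels (or voxels) that are set in the current frame but not in the previous one is divided by the total number of motion pixels in the current frame. The function name `motion_speed` is hypothetical, and NumPy is assumed.

```python
import numpy as np


def motion_speed(prev_mask, curr_mask):
    """Fraction of current motion pixels/voxels that were not set in the previous frame.

    Works for 2D pixel masks and 3D voxel grids alike, as long as both
    masks have the same shape. Returns 0.0 if the current frame is empty.
    """
    prev_mask = np.asarray(prev_mask, dtype=bool)
    curr_mask = np.asarray(curr_mask, dtype=bool)
    total = np.count_nonzero(curr_mask)
    if total == 0:
        return 0.0
    new = np.count_nonzero(curr_mask & ~prev_mask)
    return new / total


# A blob four columns wide that moves two columns to the right.
prev = np.zeros((4, 8), dtype=bool)
curr = np.zeros((4, 8), dtype=bool)
prev[:, 0:4] = True
curr[:, 2:6] = True
speed = motion_speed(prev, curr)  # half of the current pixels are new
```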

Figure: extracted features

Fuzzy-based estimation of posture and fall confidence values

In line with previous work, we define three posture states in which the human may reside: standing, in-between and lying. Sets of fuzzy thresholds in the form of trapezoidal functions, determined primarily empirically, are used to interpret the intra-frame features and relate them to the postures. Each feature value thus yields a confidence value in the range [0,1] for each posture, where the confidences of one feature sum to 1 across all postures. The confidence value of each posture is then determined as a weighted sum of the corresponding feature confidences. As an example, the membership functions for the orientation are shown in the following Figure.

Figure: fuzzy function for the Orientation feature
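The mapping from a feature value to posture confidences can be sketched as follows. The trapezoid breakpoints (in degrees) and the feature weights are invented for illustration and are not the empirically determined values of the project; the names `trapezoid`, `orientation_membership` and `posture_confidences` are likewise hypothetical. The breakpoints are chosen so that the three memberships sum to 1 for every orientation in [0, 90].

```python
POSTURES = ("standing", "in_between", "lying")


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge


def orientation_membership(angle_deg):
    """Posture confidences for an orientation in degrees (hypothetical thresholds)."""
    return {
        "lying": trapezoid(angle_deg, 0, 0, 20, 40),
        "in_between": trapezoid(angle_deg, 20, 40, 50, 70),
        "standing": trapezoid(angle_deg, 50, 70, 90, 90),
    }


def posture_confidences(memberships, weights):
    """Weighted sum of per-feature confidences; weights are assumed to sum to 1."""
    return {
        posture: sum(weights[f] * memberships[f][posture] for f in memberships)
        for posture in POSTURES
    }


# Two features voting on the posture; the second membership is made up.
memberships = {
    "orientation": orientation_membership(60.0),
    "aspect_ratio": {"standing": 1.0, "in_between": 0.0, "lying": 0.0},
}
result = posture_confidences(memberships, {"orientation": 0.5, "aspect_ratio": 0.5})
```

Since every feature's confidences sum to 1 and the weights sum to 1, the combined posture confidences also sum to 1.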


From the computed confidence values for the different postures, a confidence value for a fall event is computed for every frame. To this end, we combine the intra- and inter-frame features under the assumption that a fall is characterized by a relatively high motion speed followed by a period in a lying posture. Because the lying period has to be observed first, the alarm is raised with a delay of several frames, as can be seen in the following Figure.

Figure: confidence of the fall
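The rule "high motion speed followed by a lying period" and the resulting alarm delay can be illustrated with a small sketch. The combination used here is an assumption, not the project's exact rule: the fall confidence of a frame is the minimum of (a) the lowest lying confidence over the last `lying_frames` frames and (b) the highest motion speed in the `lookback` frames before that lying period. Both parameter names and the function name `fall_confidences` are hypothetical.

```python
def fall_confidences(motion_speed, lying_conf, lying_frames=5, lookback=10):
    """Per-frame fall confidence from motion speed and lying confidence sequences.

    Assumed rule: a fall is a motion peak followed by `lying_frames`
    consecutive frames with a high lying confidence.
    """
    out = []
    for t in range(len(lying_conf)):
        if t + 1 < lying_frames:
            out.append(0.0)  # not enough history yet
            continue
        start = t + 1 - lying_frames  # first frame of the lying period
        lying = min(lying_conf[start:t + 1])
        # Strongest motion shortly before the lying period began.
        window = motion_speed[max(0, start - lookback):start]
        motion = max(window, default=0.0)
        out.append(min(motion, lying))
    return out


# The person falls around frames 2-3 and lies still from frame 4 on.
motion = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1]
lying = [0.0, 0.0, 0.0, 0.2, 1.0, 1.0, 1.0, 1.0]
conf = fall_confidences(motion, lying, lying_frames=3, lookback=4)
```

In this example the fall confidence only becomes high at frame 6, three frames after the motion peak, which mirrors the several-frame delay described above.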