Department of Informatics Artificial Intelligence and Machine Learning Group

[processed by Pascal Engeli] More Stable Replacement for Non-Maximum-Suppression in Object and Face Detectors

Given an input image, object detection consists of both localizing and classifying one or several objects in that image. A two-stage deep object detector such as Fast R-CNN or Faster R-CNN performs this in two stages. In the first stage, many candidate object locations are generated at various scales and positions, so that each real object in the image might be proposed several times. In the second stage, the content of each proposed bounding box is classified (including a "garbage" class for regions that contain no object). Among all overlapping bounding boxes that are classified as the same object, the so-called Non-Maximum-Suppression (NMS) algorithm discards all bounding boxes except the one with the maximum prediction score for that object.
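To make the suppression step concrete, the following is a minimal sketch of greedy NMS for the detections of a single class. It is not the implementation of any particular detector; the function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box layout, and the threshold of 0.5 are assumptions chosen for illustration.

```python
import numpy as np


def iou(box, boxes):
    """Intersection over union between one box and an array of boxes.

    Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates.
    """
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive greedy NMS."""
    order = np.argsort(-scores)  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Discard every remaining box that overlaps the winner too strongly.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)
```

In this toy example the second box overlaps the first one heavily and is discarded, although it carries almost the same score. Which of two such boxes wins can change from frame to frame.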

In video sequences, the detected bounding boxes often jitter considerably, because a different predicted bounding box can win the NMS competition in each frame, or because the location proposals differ between frames. To obtain a more stable detection with less jitter, one could consider alternatives to NMS, e.g., computing a weighted average of the overlapping bounding boxes. Moreover, such a more stable bounding box may well yield a better classification of the contained object than the NMS winner does.

The task in this Master's thesis is to design and implement strategies for replacing the NMS used in object detectors with a weighted average of the overlapping bounding boxes. How these weights are computed is an essential part of the thesis, and different algorithms and sources for the weights should be investigated. Additionally, the final classification score for the computed bounding box must be extracted in an additional step, which possibly requires modifying the network topology or the extraction stages.
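One possible starting point is sketched below: boxes are grouped around the highest-scoring detection, and each group is replaced by the score-weighted mean of its members. This is only an illustration under stated assumptions, not the method to be developed in the thesis. Using the classification scores as weights, the IoU threshold of 0.5, and the names `iou` and `weighted_box_merge` are all assumptions. The returned score is simply the maximum score of the group and serves as a placeholder, since re-scoring the merged box is left open by the task description.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def weighted_box_merge(boxes, scores, iou_threshold=0.5):
    """Replace each group of overlapping boxes by its score-weighted mean."""
    order = np.argsort(-scores)  # highest score first
    merged_boxes, merged_scores = [], []
    while order.size > 0:
        best = order[0]
        # Group every remaining box that overlaps the current top box.
        in_group = iou(boxes[best], boxes[order]) > iou_threshold
        in_group[0] = True  # the top box always belongs to its own group
        members = order[in_group]
        weights = scores[members]  # assumption: scores serve as weights
        mean_box = (weights[:, None] * boxes[members]).sum(axis=0) / weights.sum()
        merged_boxes.append(mean_box)
        merged_scores.append(float(scores[best]))  # placeholder score
        order = order[~in_group]
    return np.array(merged_boxes), np.array(merged_scores)


boxes = np.array([[10, 10, 50, 50], [14, 14, 54, 54], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.75, 0.25, 0.5])
merged_boxes, merged_scores = weighted_box_merge(boxes, scores)
```

Here the two overlapping boxes are merged into `(11, 11, 51, 51)`, which lies between them and closer to the higher-scoring one, while the isolated third box is returned unchanged.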

Face and facial landmark detectors work in a similar fashion and use the same NMS technique, although the task is slightly different. In MTCNN, for example, all faces are detected first, and NMS is performed to reduce overlapping face region proposals to a single bounding box. Facial landmark locations are predicted only for this surviving bounding box. The idea here would be to predict facial landmarks for more than one of the overlapping bounding boxes and to compute weighted averages of these landmarks as the final output. When these more stable landmarks are used for face alignment, face processing tasks such as Facial Attribute Prediction may achieve better performance.
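The landmark averaging described above can be sketched as follows. The example assumes that landmarks have already been predicted for each of the N overlapping face boxes and are available in image coordinates as an array of shape (N, K, 2). The function name `average_landmarks` and the use of detection scores as weights are hypothetical choices for illustration.

```python
import numpy as np


def average_landmarks(landmarks, scores):
    """Score-weighted mean of K landmarks predicted for N overlapping boxes.

    landmarks: array of shape (N, K, 2) with (x, y) image coordinates.
    scores:    array of shape (N,) used as weights (an assumption).
    """
    landmarks = np.asarray(landmarks, dtype=float)
    weights = np.asarray(scores, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    # Contract the box axis: (N,) x (N, K, 2) -> (K, 2)
    return np.tensordot(weights, landmarks, axes=1)


# Two overlapping face boxes, two landmarks each (toy values).
landmarks = [[[10, 20], [30, 40]],
             [[14, 24], [34, 44]]]
scores = [0.75, 0.25]
stable = average_landmarks(landmarks, scores)
```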

Note: The goal of this Master's thesis is not to train object or face detectors, but only to improve their results by replacing NMS with a more stable alternative.

Requirements

  • A reasonable understanding of deep neural networks.
  • Programming experience in Python and a deep learning framework.
  • Decent understanding of written English.