The Viola-Jones + SORT tracking system developed in this project has been demonstrated to exhibit accurate detections for well-lit, well-spaced flies with fast inference times and low computational overhead. This makes it an ideal candidate for deployment on the Raspberry Pi based ethoscope platform for real-time online tracking. The Faster R-CNN + SORT tracking system has been shown to exhibit high object detection accuracy, with an Absolute Deviation an order of magnitude smaller than that of the Viola-Jones method and a Mean Fly Count almost equal to the true fly count, despite the particularly difficult test cases in the test video. Though its detection framerate is too low for real-time application, it can find its use as a robust offline tracker.
Part of the reason for the gulf in accuracy between the Faster R-CNN and Viola-Jones methods may be the granularity of features each system is able to detect. As discussed in the background section, the Viola-Jones method only looks for pre-defined identifying features consisting of lines or edges [2]. However, the weights of the convolutional layers in the Faster R-CNN can be optimised to return outputs that best highlight object-specific regions [3][8]. This plasticity allows the network to home in on more nuanced features, yielding more robust detections.
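For illustration, a minimal sketch of how a pre-trained Viola-Jones (Haar) cascade is applied with OpenCV; the cascade file `fly_cascade.xml` and the parameter values are hypothetical placeholders, not those used in this project:

```python
import cv2

# Hypothetical cascade file; in practice this would be the cascade
# trained for this project, not a stock OpenCV model.
cascade = cv2.CascadeClassifier("fly_cascade.xml")

frame = cv2.imread("frame_0001.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# detectMultiScale slides a fixed set of Haar (line/edge) features
# over the image at several scales; the features themselves are
# hand-designed, unlike the learned filters of a CNN.
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 1)
```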
Surprisingly, the extended DeepSORT + YOLOv4 pipeline performed worse at tracking objects than any of the SORT pipelines. This could be because the visual appearance of fruit flies is not distinctive enough to re-identify individual flies between frames, leading to the creation of new tracks; a sketch of the appearance-matching step affected by this failure mode is given below. The decreased performance may also be attributed to the underperforming YOLOv4 object detection component of the pipeline, with many false positives occurring during video inference. This in turn could be attributed to the fact that the YOLOv4 pipeline was not optimised through the same rigorous process (using mAP calculations on the test set) as the Faster R-CNN pipeline. It could also stem from inherent flaws in the YOLO object detection system: as discussed in the background section, YOLO struggles to identify small objects [4].
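A minimal, self-contained sketch of the cosine-distance appearance matching DeepSORT performs; the embeddings below are random stand-ins for the outputs of its re-identification network, constructed to mimic visually interchangeable flies:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Stand-in appearance embeddings: every fly is a small perturbation of
# the same base vector, mimicking visually near-identical fruit flies.
# A real DeepSORT pipeline would obtain these from its re-ID CNN.
base = unit(rng.normal(size=128))
tracks = np.stack([unit(base + 0.05 * rng.normal(size=128)) for _ in range(3)])
detections = np.stack([unit(base + 0.05 * rng.normal(size=128)) for _ in range(3)])

# Cosine-distance cost matrix (tracks x detections), as used by
# DeepSORT's appearance gate.
cost = 1.0 - tracks @ detections.T
print(np.round(cost, 4))
# All entries are nearly equal, so the appearance term cannot tell the
# flies apart: the assignment becomes ambiguous and spurious new track
# IDs are created.
```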
While the performance metrics used to assess the pipelines in this report indicate promising results, it is important to note that these metrics have severe limitations. For example, the object detection metrics assume all detections are true positives. This was particularly problematic in the case of YOLOv4, where several false positives led to misleading performance metrics. A more robust measure of performance is Mean Average Precision (mAP). mAP quantifies the precision of the detector (the fraction of predicted detections that are true positives) while taking into account multiple definitions of a positive detection, using different IoU thresholds between ground-truth and predicted bounding boxes [13]; a sketch of the underlying IoU computation is given below. However, this requires ground-truth positives, which would have to be curated manually for all frames of the test video and was out of the scope of this project. In a similar vein, the methods used to assess object tracking assume that the number of IDs and the mean track length can be used as proxies for tracking performance. This approach does not compare the computationally predicted tracks against the true paths taken by the flies. Again, that comparison requires a ground truth for object tracking, which was outside the scope of this project, so a more robust analysis could not be conducted.
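A minimal sketch of the IoU computation on which these thresholds are based, for axis-aligned boxes given as (x1, y1, x2, y2); the example boxes are arbitrary:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# ~0.54, so this prediction would count as a true positive at an IoU
# threshold of 0.5 but as a false positive at 0.75.
print(iou((10, 10, 30, 30), (15, 12, 33, 28)))
```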
Future work could re-assess the developed pipelines against a ground truth, enabling a more robust analysis. In addition, alternative smaller CNNs (such as YOLOv4 Tiny) could be investigated for real-time tracking applications: their smaller size leads to much faster inference, e.g. YOLOv4 Tiny has only 29 convolutional layers as opposed to the 137 featured in the full YOLOv4 and, as a result, achieves inference speeds of up to 371 FPS [15].
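A minimal sketch of how such a lighter model could be trialled with OpenCV's DNN module, assuming the standard Darknet `yolov4-tiny.cfg` and `yolov4-tiny.weights` files are available locally; file names and thresholds are placeholders:

```python
import cv2

# Load the Darknet model; OpenCV's DNN module can run it on the CPU,
# the relevant setting for a Raspberry Pi style deployment.
net = cv2.dnn.readNetFromDarknet("yolov4-tiny.cfg", "yolov4-tiny.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("frame_0001.png")

# Placeholder thresholds, to be tuned (e.g. via mAP on a held-out set,
# as was done for the Faster R-CNN pipeline).
class_ids, confidences, boxes = model.detect(frame, confThreshold=0.4,
                                             nmsThreshold=0.4)
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 1)
```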
The success of Faster R-CNN deployment for multiple object detection may in turn be used to develop systems that can capture more behavioural data than just physical location. For example, Mask R-CNN, an extension of Faster R-CNN that returns an object mask for each detection, has been used successfully in human pose estimation [16]. This could potentially be extended to Drosophila, allowing researchers to capture small-scale micromovements in multiple fruit flies simultaneously.
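As an illustration of how per-fly masks could be obtained, a minimal sketch using torchvision's pre-trained Mask R-CNN; the COCO-pretrained model shown here would of course need fine-tuning on annotated Drosophila data before it could produce meaningful fly masks:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained Mask R-CNN; fine-tuning on labelled fly images would
# be required for it to detect Drosophila.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("frame_0001.png").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]

# Each detection comes with a box, a score, and a soft per-pixel mask;
# thresholding the mask yields the object silhouette from which posture
# or micromovement features could be derived.
keep = output["scores"] > 0.8
masks = output["masks"][keep, 0] > 0.5  # (N, H, W) boolean masks
print(masks.shape)
```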