Methods

Test Video

To evaluate the performance of each object detection method, video inference was performed on the same test video (Figure 5). This video was selected because it presents several particularly difficult edge cases for object detection:


1. The video features steep intensity gradients between shaded and lit regions, which makes detection particularly difficult for flies lying on an intensity boundary.
2. Four wells are situated in the corners with an intensity similar to that of the flies, which often leads to occlusion.
3. The sequence contains many flies that frequently overlap, making it particularly difficult for the classifier to separate individual instances.

Figure 5. Test video used to evaluate Tracking Systems

Object Detection - Performance Metrics

Since a ground truth (containing the true bounding box coordinates of each fruit fly) was not available for this project, the following metrics were developed:

Mean Fly Count:

The average number of flies predicted in each frame of the video.

Absolute Deviation:

The sum of the absolute deviations between the predicted and true number of flies across the whole video.
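The sketch below shows one way these two metrics can be computed, assuming counts holds the per-frame number of predicted flies and true_count is the known number of flies in the test video (both names are illustrative).

def mean_fly_count(counts):
    # Average number of predicted flies per frame
    return sum(counts) / len(counts)

def absolute_deviation(counts, true_count):
    # Sum over all frames of |predicted count - true count|
    return sum(abs(c - true_count) for c in counts)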

Object Detection - Viola-Jones

As mentioned previously, object detection was first performed with the Viola-Jones framework to allow for fast, real-time detections. To this end, the framework was trained on a bank of 7211 25x25 cropped images of flies (serving as positives) and 5967 background images (serving as negatives). This dataset had been created by the Gilestro Lab prior to this project using an automated method, so it was manually curated to remove any false positives that may have been erroneously included. The Viola-Jones framework was then trained in OpenCV with 90% of the dataset used for training and 10% for validation [10]; training ran for approximately 12 hours. After training, video inference was performed on the test video and the performance metrics were computed.
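For illustration, a minimal sketch of running a trained cascade over the test video with OpenCV is given below; the cascade file name, video path and detectMultiScale parameters are assumptions, not the values used in the project.

import cv2

cascade = cv2.CascadeClassifier("fly_cascade.xml")   # hypothetical trained cascade file
cap = cv2.VideoCapture("test_video.mp4")             # hypothetical test video path
frame_counts = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detectMultiScale returns one (x, y, w, h) bounding box per detected fly
    boxes = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=3, minSize=(25, 25))
    frame_counts.append(len(boxes))
cap.release()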

To improve the performance of the Viola-Jones model, an enriched dataset was created. This was achieved by running the curated Viola-Jones model on a second test video and using its output to find frames where the number of detected flies decreased relative to the previous frame. Such a decrease potentially indicated a false negative, so a 25x25 image was generated at the coordinates corresponding to the missing fly. This bank of enriched images was then manually curated to remove false positives before being added to the positive instances of the curated dataset, providing an additional 320 images designed to correct for the shortcomings of the classifier. Video inference was then conducted again with a Viola-Jones model trained on the enriched dataset.
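The sketch below illustrates the mining step described above; matching boxes between consecutive frames by centre distance is an assumption about how the missing fly's coordinates were located, and all names are illustrative.

def box_centre(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def mine_missed_flies(prev_boxes, curr_boxes, prev_frame, max_dist=15):
    # Only mine when the detection count has dropped between consecutive frames
    patches = []
    if len(curr_boxes) >= len(prev_boxes):
        return patches
    for box in prev_boxes:
        cx, cy = box_centre(box)
        matched = any(abs(cx - box_centre(b)[0]) < max_dist and
                      abs(cy - box_centre(b)[1]) < max_dist for b in curr_boxes)
        if not matched:
            # Crop a 25x25 patch centred on the box that no longer has a detection
            x, y = int(cx) - 12, int(cy) - 12
            patch = prev_frame[y:y + 25, x:x + 25]
            if patch.shape[:2] == (25, 25):
                patches.append(patch)
    return patches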

Object Detection - CNN Dataset Creation

To create the dataset used to train the Faster R-CNN and YOLOv4 models, 1000 frames of experimental footage were annotated with bounding boxes designating the positions of each fruit fly. These positions were initialised with bounding box predictions generated through video inference with the curated Viola-Jones framework and were then manually adjusted to remove false positives and to add boxes for flies the detector had missed. This produced 32,603 annotations in total.

Each frame was resized to reduce training time and contrast stretched to normalise pixel intensities and make edges more apparent to the classifier. For each frame, two augmented versions were created with transforms applied at random (flipping, cropping, altering saturation, applying noise). This effectively tripled the size of the dataset, providing new instances for the network to train on while reducing the risk of overfitting.

The data was partitioned into training (70%), validation (20%) and testing (10%) subsets. Bounding box adjustment, frame pre-processing, frame augmentation and dataset partitioning were all performed using Roboflow [11].
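The resizing and contrast stretching were performed in Roboflow; the NumPy/OpenCV sketch below is only intended to show what these operations amount to, and the target frame size is an assumption.

import cv2
import numpy as np

def preprocess(frame, target_size=(416, 416)):
    # target_size is an assumed value, not the size used in the project
    frame = cv2.resize(frame, target_size)
    # Contrast stretch: map the darkest pixel to 0 and the brightest to 255
    lo, hi = frame.min(), frame.max()
    stretched = (frame.astype(np.float32) - lo) / max(hi - lo, 1) * 255.0
    return stretched.astype(np.uint8)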

Object Detection - Faster R-CNN

A Faster R-CNN model was implemented using the Detectron2 library in Python [12]. The model was initialised with pre-trained weights generated from the COCO dataset, an extensive object detection dataset commonly used to benchmark object detection methods [13]. This reduced training time: because objects in COCO share some similarities with the fruit flies, fewer iterations were required to optimise the weights from this checkpoint than when starting from scratch.

The network was trained using the training and validation subsets and evaluated on the testing subset. Performance was quantified by calculating the Average Precision (AP) for inference on the test set. The learning rate and maximum iteration hyperparameters were altered between runs to find the optimal AP (Table 1); the best configuration was found to be 1000 iterations at a learning rate of 0.001.
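A hedged sketch of a Detectron2 configuration matching the optimal run (learning rate 0.001, 1000 iterations) is shown below; the model zoo config file and the dataset registration names are assumptions.

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
# Initialise from COCO pre-trained weights rather than training from scratch
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("flies_train",)   # assumed dataset registration name
cfg.DATASETS.TEST = ("flies_val",)      # assumed dataset registration name
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1     # a single "fly" class
cfg.SOLVER.BASE_LR = 0.001              # optimal learning rate from the Table 1 runs
cfg.SOLVER.MAX_ITER = 1000              # optimal number of iterations

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()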

Video inference was performed with the optimised model on the test video to calculate Mean Fly Count and Absolute Deviation performance metrics.

Table 1. AP values for Faster R-CNN testing. The AP performance metric is a weighted mean of precision values at specific threshold recall values. AP50 and AP75 refer to calculations of AP in which a detection counts as a True Positive when the Intersection-over-Union (IoU) between the predicted and actual bounding boxes exceeds 0.5 or 0.75 respectively. The mean AP (mAP) is the mean of the AP values determined at IoU thresholds between 0.5 and 0.95 in steps of 0.05 [13]

mAP     AP50    AP75
60.4    96.7    70.9

Object Detection - Training YOLOv4

A YOLOv4 model was implemented using the Darknet framework [4]. This framework was chosen because it is one of the few object detection systems that can be implemented with DeepSORT. As with Faster R-CNN, the model was initialised with pre-trained weights from the COCO dataset. The fork of YOLOv4 used to build the network did not contain a built-in method to calculate AP on the test set, and implementing this manually was not feasible within the time constraints. The training loss and mAP curve was therefore used to estimate the optimal number of iterations for the best performing model (Figure 6). Video inference was performed on the test video using the optimised model and the performance metrics (mean fly count and absolute deviation) were calculated.
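Darknet training is launched from the command line; the call below is an illustrative way of invoking it from Python, with file names following Darknet's usual conventions rather than the project's actual paths.

import subprocess

subprocess.run([
    "./darknet", "detector", "train",
    "data/obj.data",          # dataset definition (class names, train/validation lists)
    "cfg/yolov4-flies.cfg",   # network configuration adapted for the single fly class
    "yolov4.conv.137",        # pre-trained convolutional weights used for initialisation
    "-map",                   # periodically compute mAP on the validation set during training
], check=True)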

Figure 6. Loss and mAP curves for the YOLOv4 model. The plot shows both the model loss (blue) and the mAP (red) calculated on the validation set during training of YOLOv4. From this plot, the optimal number of iterations was judged to be around 12,000, as this was the point at which the rate of increase of mAP began to fall (indicating that models trained beyond this point would overfit).

Object Tracking - Performance Metrics

Since a ground truth for individual tracks in the test video was unavailable, the following metrics are used to evaluate tracking performance:

Number of Unique Tracks

The total number of tracks created by the object tracker. An accurate tracking system will produce a number of tracks equal to the number of flies in the video.

Mean Track Length

The mean length (in frames) of all tracks computed by the tracker. A perfect tracking system will have a Mean Track Length equal to the number of frames in the test video, since each fly would be followed by a single track spanning the entire sequence.

A histogram of track lengths was also created for each method to visualise the distribution of track lengths.

Detection Framerate (FPS)

The mean number of frames processed per second by each complete tracking system. An ideal online tracking system will have a high framerate.
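One possible implementation of these metrics is sketched below, assuming tracks maps each track ID to its list of per-frame observations and total_time_s is the wall-clock time taken to process n_frames frames (all names are illustrative).

def tracking_metrics(tracks, n_frames, total_time_s):
    lengths = [len(observations) for observations in tracks.values()]
    return {
        "unique_tracks": len(tracks),                        # Number of Unique Tracks
        "mean_track_length": sum(lengths) / len(lengths),    # Mean Track Length (frames)
        "fps": n_frames / total_time_s,                      # Detection Framerate (FPS)
    }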

Object Tracking - SORT

The SORT object tracking method was selected for its ability to track objects that are not detected in every frame without incurring a large computational overhead (with a sufficiently fast object detection method, it can be run in real time).

SORT was implemented using the SORT package and integrated into the video inference script [5]. Tmin was set to the number of frames in the video to prevent tracks from being prematurely terminated, and IOUmin was set to 0.01 since the erratic motion of the flies makes their positions hard to predict accurately. The individual tracks were recorded and plotted, and the distribution of track lengths was also recorded and plotted.
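A sketch of how SORT could be wired into the per-frame inference loop using the reference SORT implementation [5] is given below; mapping Tmin to max_age and IOUmin to iou_threshold is an interpretation of the parameters described above, and the detection array format is assumed.

import numpy as np
from sort import Sort  # reference SORT implementation [5]

def track_flies(per_frame_detections, n_frames, iou_min=0.01):
    # per_frame_detections: one (N, 5) array of [x1, y1, x2, y2, score] rows per frame
    # max_age is set to the video length so tracks are never prematurely terminated
    tracker = Sort(max_age=n_frames, iou_threshold=iou_min)
    tracks = {}  # track ID -> list of (frame index, bounding box) observations
    for frame_idx, dets in enumerate(per_frame_detections):
        dets = np.asarray(dets, dtype=float).reshape(-1, 5)
        for x1, y1, x2, y2, track_id in tracker.update(dets):
            tracks.setdefault(int(track_id), []).append((frame_idx, (x1, y1, x2, y2)))
    return tracks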

Object Tracking - DeepSORT


The DeepSORT tracking method was selected to assess whether object re-identification could significantly increase tracking performance beyond that of Faster R-CNN and SORT [6].

To perform object tracking using DeepSORT, the Darknet YOLOv4 model was first converted to the TensorFlow framework to decrease inference time. The DeepSORT algorithm was then applied with an IOUmin of 0.01 and a score threshold (governing the required visual similarity between a detected object and a track) of 0.01.
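The sketch below follows the original deep_sort reference implementation [6] rather than the converted TensorFlow pipeline, whose interfaces may differ; mapping the score parameter described above to the cosine matching threshold is an interpretation, and the re-identification features are assumed to come from whichever appearance encoder that pipeline provides.

from deep_sort import nn_matching
from deep_sort.detection import Detection
from deep_sort.tracker import Tracker

# Appearance metric: cosine distance between re-identification features,
# gated at 0.01 so only visually very similar detections can join a track
metric = nn_matching.NearestNeighborDistanceMetric("cosine", matching_threshold=0.01)
tracker = Tracker(metric)

def update_tracker(boxes, scores, features):
    # boxes: one [x, y, w, h] per fly; features: re-ID embeddings from the appearance encoder
    detections = [Detection(b, s, f) for b, s, f in zip(boxes, scores, features)]
    tracker.predict()
    tracker.update(detections)
    return [(t.track_id, t.to_tlbr()) for t in tracker.tracks if t.is_confirmed()]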
