Strengthening Drone Detection with Bio-Inspired Algorithms and Knowledge Distillation

Feb. 29, 2024

In the event no signal is received after two minutes, a timed relay will place the robot plane, or ‘DRONE’, as it will be called hereafter, in a turn.
Radio Control of Aircraft

From behind the headboard slipped a tiny hunter-seeker no more than five centimeters long. Paul recognized it at once — a common assassination weapon….It was a ravening sliver of metal guided by some near-by hand and eye. It could burrow into moving flesh and chew its way up nerve channels to the nearest vital organ.
Dune

Let me be the first
I’m not so innocent
Let me be the one
The one that you choose from above
After all
I’m partly to blame
So drone bomb me
— Anohni

Introduction

Drones hum and hover above every continent on Earth, omnipresent and inconspicuous. A wide and diverse swath of society—militaries and militias, farmers, black market peddlers and private enterprises, beach lifeguards, and consumers of all ages—avails itself of the airborne drone within the largely unregulated “Wild West” of the lower skies. The implications are manifold and well documented.

Computer vision models considerably augment sensor-specific drone detection capabilities, such as those of acoustic, electro-optical, and radio frequency (RF) sensors. The current pain points in real-time drone detection, and those which computer vision models are especially well placed to mitigate, include the following: detecting small drones at great distances and against complex backgrounds, such as dense urban areas; detecting drones that exploit the vulnerabilities of current detection systems by stealthily hiding amongst large obstacles such as trees, mountains, and high-rise buildings; and improving the speed and accuracy of detection in real-time video feeds, such as those from visible- and infrared-spectrum cameras.

In this proposal, we first present a comprehensive review of current computational techniques for drone detection, including biologically-inspired vision (BIV) algorithms and deep Convolutional Neural Networks (CNNs). Second, to address these pain points, we propose experimenting with a combination of BIV algorithms for more advanced signal pre-processing and feature extraction, data augmentation techniques to compensate for scarce training data in existing publicly available datasets, and CNN architectures that are uniquely well suited to detecting drones in real-time video feeds. Through informed experimentation, we aim to produce a novel computer vision model that mitigates the aforementioned pain points and lowers the overall cost barrier for real-time drone detection. Finally, we present the best dataset candidates for model experimentation from amongst those that are publicly available.

Review of Computational Techniques

In this section, we discuss innovative computational techniques that have arisen from the fields of signal processing and artificial intelligence, in particular biologically-inspired vision (BIV) algorithms and deep Convolutional Neural Networks (CNNs). These techniques complement and augment device-specific sensor modalities, ultimately enhancing their capacity for more accurate drone detection.

Biologically-Inspired Vision Algorithms

Biologically-inspired algorithms draw on natural processes and biological systems, replicating principles such as the highly efficient visual processing found in biological organisms (e.g., birds, insects, and primates) to enhance drone detection and localization performance. Biological neural mechanisms that have been applied successfully to artificial visual systems include lateral inhibition, visual attention, contrast sensitivity, and motion processing.

Entomological Physiology

The roots of current advances in computational applications based on entomological visual systems emerged from studies by O’Carroll (1993), Shoemaker (2005), and Wiederman (2007). Wiederman (2008) originally proposed a BIV model based on the neurons of flies (Calliphora) that are specialized for detecting small moving targets, known as small target motion detectors (STMDs). The artificial replication of this model in computer vision is known as the elementary small target motion detector (ESTMD). More recent studies (Wang, 2016; Wiederman, 2022) have proposed extending the ESTMD model to dragonflies (Hemicordulia tau), building on the work of O’Carroll (1993).

Lateral inhibition, as replicated in the original ESTMD model, is typically achieved by inhibiting background motion so that small target motion is enhanced more effectively; however, too much lateral inhibition, particularly in the periphery, can lead to unstable detection performance, as observed in the original model. Building upon the work of Wiederman (2008), Wang (2016) proposes a modified ESTMD model that incorporates a more biologically plausible lateral inhibition mechanism, informed by motion velocity and direction, to better discriminate target motion from background motion, a feature which has a physiological basis in the dragonfly’s higher-order neurons. A more detailed discussion of lateral inhibition in retinal neurons is given in Srinivasan (1982).
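To make the lateral inhibition idea concrete, the minimal sketch below applies a generic center-surround (difference-of-Gaussians) inhibition step to a grayscale frame. It illustrates the principle only; it is not the ESTMD or Wang (2016) formulation, and the kernel widths are arbitrary assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lateral_inhibition(frame: np.ndarray,
                       center_sigma: float = 1.0,
                       surround_sigma: float = 3.0) -> np.ndarray:
    """Generic center-surround (difference-of-Gaussians) inhibition.

    A small excitatory center is suppressed by a broader inhibitory
    surround, which attenuates wide-field background structure while
    preserving small, high-contrast targets. Sigmas are illustrative.
    """
    frame = frame.astype(np.float32)
    center = gaussian_filter(frame, center_sigma)      # local excitation
    surround = gaussian_filter(frame, surround_sigma)  # broad inhibition
    response = center - surround                       # inhibited signal
    return np.clip(response, 0.0, None)                # half-wave rectify

# Example: emphasize a small bright target against a smooth background.
if __name__ == "__main__":
    img = np.zeros((64, 64), dtype=np.float32)
    img += np.linspace(0, 0.5, 64)          # smooth background gradient
    img[30:32, 30:32] = 1.0                 # small "target"
    out = lateral_inhibition(img)
    print(out.max(), out.mean())
```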

Figure 1. Schematic of the dragonfly-inspired computer vision model (Wang, 2016).

Melville-Smith (2022) proposes a novel nonlinear lateral inhibition scheme, leveraging optic-flow signals for dynamic signal conditioning, which significantly improves target detection from moving platforms and suppresses false positives. The proposed approach achieves a 25% improvement in detection accuracy over linear inhibition schemes and 2.33 times the detection performance of conventional BIV models, such as the ESTMD, without inhibition.

Wiederman (2022) builds on their earlier work (Wiederman, 2008) by modifying the ESTMD model, which leverages the visual processing abilities of dragonflies, in particular their ability to detect small moving targets (i.e., their prey) against complex backgrounds in low-resolution (i.e., “blurred”) settings. These abilities are ideally suited to environments where computational resources are limited, such as real-time small-drone detection in the field. The results indicate that combining outputs from light- and dark-contrast model variants, designed to mimic more physiologically realistic values, improves recall, especially at lower resolutions. Performance is influenced by the apparent size and speed of the drone in the image plane, with the model struggling at extreme speeds or smaller apparent sizes. Reducing spatial resolution decreased detection performance but also reduced computational demands.

Recent studies have sought to extend conventional fly-based BIV models to pre-processing techniques for thermal infrared video frames containing small drones against low-contrast backgrounds. Uzair (2019) incorporates adaptive temporal filtering and spatio-temporal adaptive filtering, inspired by the photoreceptor cells and large monopolar cells of small flying insects. Their method significantly improves detection performance by enhancing target contrast against cluttered backgrounds and suppressing noise. Experiments demonstrate that these pre-processing techniques enhance the effectiveness of four standard infrared detection algorithms: baseline multiscale morphological top-hat filtering, the saliency detection method using local regression kernels, the multiscale local contrast measure, and the infrared patch-image model. Detection rates of the best performing model increased by 100%.
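As a rough illustration of this style of pre-processing, the sketch below subtracts an exponential running average of the scene from each incoming frame (a crude temporal high-pass) and then lightly smooths the result. It is a generic stand-in for the photoreceptor/LMC idea, not the actual filters of Uzair (2019), and the time constant and smoothing width are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

class TemporalHighPass:
    """Photoreceptor/LMC-style pre-processing stand-in for thermal frames.

    Keeps a running (exponential) average of the scene and subtracts it
    from each incoming frame, suppressing static clutter so that small
    moving targets stand out. alpha controls the adaptation speed.
    """
    def __init__(self, alpha: float = 0.05, spatial_sigma: float = 1.0):
        self.alpha = alpha
        self.spatial_sigma = spatial_sigma
        self.background = None

    def __call__(self, frame: np.ndarray) -> np.ndarray:
        frame = frame.astype(np.float32)
        if self.background is None:
            self.background = frame.copy()
        # Temporal high-pass: remove the slowly varying background.
        response = frame - self.background
        self.background += self.alpha * (frame - self.background)
        # Light spatial smoothing to suppress pixel-level sensor noise.
        return gaussian_filter(np.abs(response), self.spatial_sigma)

# Usage: feed consecutive thermal frames and threshold the response map.
```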

Uzair (2021) addresses the limitations of detecting small, thermally minimal targets in infrared imagery through a four-stage BIV-based target detector, drawing inspiration from the visual systems of flying insects as proposed in Uzair (2019). Their model overcomes such challenges as sensor noise, minimal target contrast, and cluttered backgrounds, with an improvement of over 25 dB in signal-to-clutter ratio and a 43% higher detection rate than the existing best methods.

Primate Physiology

Broadly speaking, BIV algorithms based on primate physiology fall into the following categories: cognitive; information theoretic; graphical; spectral analysis; pattern classification; Bayesian; and decision theoretic (Borji & Itti, 2013). For the purpose of this paper, we focus on those cognitive, graphical, and spectral models which simulate early visual processing stages in primates, particularly in the context of attention mechanisms and saliency detection.

Itti, Koch, and Niebur (1998) explore the early visual system of primates, which excels at interpreting complex scenes and focusing attention in real time through the bottom-up use of saliency maps. The authors implement a dynamic neural network architecture that mimics this visual system, in particular: saliency-driven focal visual attention for target detection, which utilizes center-surround mechanisms to extract conspicuous features across different scales without requiring top-down guidance; a biologically plausible winner-take-all mechanism, which allows the model to sequentially attend to different locations based on their saliency; and adaptation to scene context, in which the model changes its focus in response to the evolving content of the scene. The results indicate that the cognitive model’s attentional trajectories perform similarly to human eye fixations and that it is therefore a computationally efficient means for real-time target detection.
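A simplified, intensity-only sketch of the center-surround mechanism behind this saliency model is shown below: feature maps from fine and coarse levels of a Gaussian pyramid are differenced and accumulated. It omits the color and orientation channels and the winner-take-all network, and the scale choices are assumptions.

```python
import numpy as np
import cv2

def intensity_saliency(gray: np.ndarray, levels: int = 5) -> np.ndarray:
    """Simplified intensity-only center-surround saliency map.

    Builds a Gaussian pyramid and accumulates absolute differences
    between fine ("center") and coarse ("surround") levels, resized
    back to the original resolution.
    """
    gray = gray.astype(np.float32)
    pyramid = [gray]
    for _ in range(levels):
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    h, w = gray.shape
    saliency = np.zeros((h, w), dtype=np.float32)
    for center in range(2):              # fine "center" levels
        for delta in (2, 3):             # coarser "surround" levels
            surround = center + delta
            if surround >= len(pyramid):
                continue
            c = cv2.resize(pyramid[center], (w, h))
            s = cv2.resize(pyramid[surround], (w, h))
            saliency += np.abs(c - s)    # center-surround contrast
    return cv2.normalize(saliency, None, 0.0, 1.0, cv2.NORM_MINMAX)
```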

Figure 2. The dynamic neural network architecture inspired by the early primate visual system (Itti, Koch, & Niebur, 1998).

Given the human visual system’s strong ability to detect saliency, Hou and Zhang (2007) propose a spectral residual approach which mimics the way the human retina and early visual cortex prioritize regions of interest in the visual field, extracting those features which stand out. By analyzing the spectral residual in an image’s log spectrum, their approach taps into a fundamental aspect of human visual perception. While the spectral residual method achieves the same hit rate (HR) of 0.5076 as Itti, Koch, and Niebur (1998), it does demonstrate a significant improvement in reducing the false alarm rate (FAR) to 0.1688 from Itti, Koch, and Niebur’s (1998) FAR of 0.2931. Additionally, the spectral residual method is substantially more efficient, requiring only 4.014 seconds for computation, as opposed to 61.621 seconds needed by the Itti, Koch, and Niebur (1998) model.
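Because the spectral residual method is compact enough to express directly, a sketch is given below. It follows the published recipe (the log-amplitude spectrum minus its local average, recombined with the original phase), though the filter sizes used here are illustrative choices rather than values from the paper.

```python
import numpy as np
import cv2

def spectral_residual_saliency(gray: np.ndarray) -> np.ndarray:
    """Spectral residual saliency in the spirit of Hou and Zhang (2007)."""
    gray = gray.astype(np.float32)
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)

    # Spectral residual: log amplitude minus its local (3x3) average.
    avg_log_amplitude = cv2.blur(log_amplitude, (3, 3))
    residual = log_amplitude - avg_log_amplitude

    # Back to the image domain, keeping the original phase.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = cv2.GaussianBlur(saliency.astype(np.float32), (9, 9), 2.5)
    return cv2.normalize(saliency, None, 0.0, 1.0, cv2.NORM_MINMAX)
```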

Harel, Koch, and Perona (2007) improve upon Itti, Koch, and Niebur (1998) with a Graph-Based Visual Saliency (GBVS) model for bottom-up visual saliency. GBVS forms activation maps on feature channels and normalizes them with a graph-based method so as to highlight conspicuity and better simulate human visual attention. Their method achieves 98% of the receiver operating characteristic (ROC) area of a human-based control, compared to 84% for the Itti, Koch, and Niebur (1998) model.

Hérault (2010) developed a model of the neural connections within the human retina, highlighting the importance of spatio-temporal filtering for pre-processing images. This work has influenced the human retina model described in the Methodology section, which uses spectral whitening to replicate how the human visual system equalizes different frequency components in order to focus on important features. Hérault (2010) also delves into the retinal circuits for processing motion and color and studies the adaptive nonlinear characteristics of photoreceptors. These aspects demonstrate how the retina functions as a network that cleans and prepares information for further computational analysis.
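OpenCV’s contributed bioinspired module provides a retina model in this lineage, which we return to in the Methodology section. The sketch below shows one way it could serve as a pre-processing stage; it assumes the opencv-contrib-python package, and the API names used here (e.g., cv2.bioinspired.Retina_create) should be verified against the installed OpenCV version.

```python
import cv2

def retina_preprocess(video_path: str):
    """Pre-process video frames with OpenCV's bio-inspired retina model.

    Yields, per frame, the parvocellular channel (detail, local contrast
    enhancement, spectral whitening) and the magnocellular channel
    (transient/motion information). Requires opencv-contrib-python.
    """
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return
    h, w = frame.shape[:2]
    retina = cv2.bioinspired.Retina_create((w, h))  # keeps temporal state
    retina.clearBuffers()
    while ok:
        retina.run(frame)                # feed the frame through the model
        yield retina.getParvo(), retina.getMagno()
        ok, frame = cap.read()
    cap.release()

# Usage sketch: feed the parvo/magno channels to a downstream detector.
# for parvo, magno in retina_preprocess("flight.mp4"):
#     ...
```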

McIntosh and Maheswaranathan (2015) employ CNNs to predict spiking activity in retinal ganglion cells in response to stimuli such as spatio-temporal binary noise, aiming to deepen our understanding of how the retina responds. Owing to their heightened sensitivity amid noisy backgrounds, attention mechanisms inspired by vision algorithms that combine bottom-up and top-down approaches tend to outperform more basic models (Borji & Itti, 2013). Bottom-up models are computationally faster, as they react to stimuli in the visual scene, and are more responsive to visual disturbances. Top-down approaches, driven more by specific task requirements than by sensory stimuli, tend to be slower and less responsive to interference. Therefore, for tasks like drone detection in challenging settings that involve spotting small targets, combining a bottom-up approach (using passive retinal-like inputs) with a top-down approach (utilizing an active cognitive-like guide to maintain focus) could offer promising opportunities for further exploration. This integrated approach is considered as part of our methodology in the Methodology section.

Yang (2023) presents the “motion-guided video tiny object detection” (MG-VTOD) method, a visual detection system inspired by the human visual system and its attention response to motion. Employing a YOLOv5 framework as its foundation, MG-VTOD captures motion cues from moving targets such as drones. A unique motion strength algorithm generates a grayscale map that highlights moving objects against the background when overlaid on video frames. In trials conducted in environments with occlusions such as clouds, buildings, forests, and mountains, the MG-VTOD model has shown better performance than other methods such as the vanilla small variant YOLOv5-s (the later YOLOv8 is detailed in the YOLOv8 section) and FCOS (discussed in the FCOS section).
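The sketch below conveys the general idea of a motion-strength map using plain three-frame differencing; it is a generic stand-in for illustration and is not Yang’s (2023) actual motion strength algorithm.

```python
import cv2
import numpy as np

def motion_strength_map(prev_gray, curr_gray, next_gray):
    """Crude motion map from three consecutive grayscale frames.

    Pixels that change in both frame-to-frame differences are treated
    as moving; the map can be overlaid on the current frame to bias a
    detector toward moving objects such as drones.
    """
    d1 = cv2.absdiff(curr_gray, prev_gray)
    d2 = cv2.absdiff(next_gray, curr_gray)
    motion = cv2.min(d1, d2)                      # require motion in both
    motion = cv2.GaussianBlur(motion, (5, 5), 0)  # suppress pixel noise
    return cv2.normalize(motion, None, 0, 255,
                         cv2.NORM_MINMAX).astype(np.uint8)

# Overlay example: weight the current frame by the motion map.
# overlay = cv2.addWeighted(curr, 0.5, motion_strength_map(p, curr, n), 0.5, 0)
```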

Figure 3. Structural illustration of the MG-VTOD model (Yang, 2023).

Deep Convolutional Neural Networks (CNNs)

Deep convolutional neural networks (CNNs) are well suited to drone detection given their multiple, hierarchical layers, which are capable of identifying and localizing objects against complex backgrounds. Through the process of convolution, spatial relationships in images and videos are preserved. There are two main types of object detection algorithms implemented with deep CNNs: two-stage and one-stage. Two-stage object detectors, such as Region-based CNNs (R-CNNs), generate region proposals, i.e., areas that are likely to contain the target object; they then classify those regions into object categories and refine the bounding boxes (the location and size of the objects). While two-stage detectors are more accurate, they are slower due to the additional proposal-generation step.

In contrast, one-stage object detectors such as YOLO (“You Only Look Once”) and Fully Convolutional Networks (FCNs) bypass the proposal-generation step and directly predict object categories and bounding boxes (often relative to pre-defined anchor boxes) in a single pass. The FCN architecture, a neural network that is fully convolutional, i.e., without any fully connected layers in the final output, has been adapted for single-stage object detection, making predictions on a per-pixel basis across an entire image. One-stage detectors have traditionally sacrificed some accuracy for greater real-time speed, although that gap is closing in the wake of recent innovations. Advanced signal pre-processing and feature extraction techniques are also discussed in this section.
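Both anchor assignment during training and the evaluation of predicted boxes rely on Intersection over Union (IoU); a minimal helper is sketched below for reference, assuming boxes in (x1, y1, x2, y2) format.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union = sum of areas minus intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a drone prediction overlapping its ground-truth box.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14
```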

Advanced Signal Pre-Processing and Feature Extraction Techniques

Advanced signal pre-processing and feature extraction techniques, such as noise reduction, contrast adjustment, attention mechanisms, and normalization, improve a CNN’s ability to learn from more refined sensor data, as measured by greater detection accuracy and fewer false positives. Ding (2023) proposes an acoustic denoising pre-processing module as part of a hybrid Convolutional-LSTM (long short-term memory) neural network, an architecture typically reserved for tasks such as sound separation and speech enhancement. Their model, called DroneFinderNet, improves detection sensitivity by separating drone sounds from background noise.
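As a rough, non-learned stand-in for the kind of pre-processing such a module performs, the sketch below applies simple spectral gating with SciPy: a noise profile estimated from a noise-only clip is used to attenuate low-energy time-frequency bins. It only conveys the denoising idea and is not DroneFinderNet’s learned Conv-LSTM denoiser; the STFT parameters and margin are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(audio, noise_clip, fs=44100, nperseg=1024, margin=1.5):
    """Crude spectral-gating denoiser (stand-in for a learned module).

    Estimates a per-frequency noise floor from a noise-only clip and
    attenuates STFT bins of the input that fall below that floor times
    a safety margin, then reconstructs the waveform.
    """
    _, _, noise_spec = stft(noise_clip, fs=fs, nperseg=nperseg)
    noise_floor = np.mean(np.abs(noise_spec), axis=1, keepdims=True)

    _, _, spec = stft(audio, fs=fs, nperseg=nperseg)
    mask = (np.abs(spec) > margin * noise_floor).astype(float)
    _, cleaned = istft(spec * mask, fs=fs, nperseg=nperseg)
    return cleaned
```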

Figure 4. The proposed drone-sound-localization module within the DroneFinderNet network, including (a) the pipeline of the acoustic denoising and source-localization process and (b) dataset collection and training of the denoising network (Ding, 2023).

To mitigate challenging situations such as environments with high visual noise and the presence of visually similar objects, Han (2024) proposes RANGO, which strengthens the feature extraction of the YOLOv5 object detector. RANGO adds a Preconditioning Operation (PREP) to improve target-background contrast; a parallel convolution kernel in the Cross-Stage and Cross-Feature Fusion of the CSPDarknet53 module; and a Convolutional Block Attention Module and Atrous Spatial Pyramid Pooling for more focused feature attention. Compared to a vanilla YOLOv5 implementation, YOLOv5 with RANGO reduced the drone missed-detection rate by 4.6% and increased average recognition accuracy by 2.2%.
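The PREP operation itself is specific to RANGO; as a generic illustration of contrast preconditioning ahead of a detector, the sketch below applies OpenCV’s CLAHE to the luminance channel. This is an assumed stand-in, not Han’s (2024) method.

```python
import cv2

def precondition_contrast(frame_bgr, clip_limit=2.0, tile_grid=(8, 8)):
    """Boost local target-background contrast before detection.

    Applies CLAHE (contrast-limited adaptive histogram equalization)
    to the L channel of the LAB representation so that small,
    low-contrast targets are easier for the detector to pick up.
    """
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```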

Fully Convolutional One-Stage Object Detector (FCOS)

The novel Fully Convolutional One-Stage Object Detector (FCOS) is given by Tian et al. (2022). It is fully convolutional, one-stage, and anchor-free, i.e., it does not need pre-defined anchor boxes as reference points for predicting the presence of objects, an approach which is far more computationally intensive. FCOS instead predicts the presence and boundaries of objects at the pixel level, without relying on predefined anchor box shapes and sizes. This adjustment simplifies the model by removing the need to calculate Intersection over Union (IoU) scores against anchor boxes during training, along with the additional per-anchor hyperparameters. FCOS achieves recall rates similar to anchor-based methods, such as YOLO, but with improved performance and faster inference speeds. In their research, Nayak (2022) suggests a data augmentation technique to optimize FCOS for drone detection.
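To make the per-pixel formulation concrete, the sketch below computes the regression target and “centerness” score that FCOS assigns to a single location inside a ground-truth box, following the formulas in Tian et al. (2022); the variable names are ours.

```python
import math

def fcos_targets(px, py, box):
    """Per-location FCOS regression target and centerness.

    For a location (px, py) inside a ground-truth box (x1, y1, x2, y2),
    the regression target is the distance to the four box sides, and
    centerness down-weights locations far from the box center.
    """
    x1, y1, x2, y2 = box
    l, t = px - x1, py - y1
    r, b = x2 - px, y2 - py
    centerness = math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return (l, t, r, b), centerness

# A location near the box center gets centerness close to 1,
# while one near an edge gets a value close to 0.
print(fcos_targets(50, 50, (0, 0, 100, 100)))   # ((50, 50, 50, 50), 1.0)
print(fcos_targets(90, 50, (0, 0, 100, 100)))   # ((90, 50, 10, 50), ~0.33)
```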

Figure 5. The network architecture of FCOS (Tian et al., 2022).

YOLOv8

YOLOv8 (Ultralytics, 2023) is a recent iteration of the YOLO family of computer vision object detection models first developed by Redmon (2016). The model detects objects in images or videos and pinpoints their locations using bounding boxes. YOLOv8 operates on a CNN framework that processes an image or video frame so as to predict object classes and locations in a single pass, hence the name “You Only Look Once”. Processing the data in a single pass is much quicker than classical techniques that analyze parts of images separately using sliding windows.
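For reference, a minimal usage sketch with the Ultralytics Python package is shown below; the checkpoint name and video path are placeholders, and the calls should be checked against the installed ultralytics version.

```python
# pip install ultralytics
from ultralytics import YOLO

# Load a small pretrained checkpoint; "yolov8n.pt" is the nano variant.
model = YOLO("yolov8n.pt")

# Run inference on a video (or image) source, one forward pass per frame.
results = model.predict(source="drone_clip.mp4", conf=0.25, stream=True)

for r in results:
    # Each result holds bounding boxes, confidences, and class ids.
    for box in r.boxes:
        print(box.xyxy, box.conf, box.cls)
```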

Figure 6. Computer vision architecture and backbone of the YOLOv7 and YOLOv8 variants (Bennour, 2023).

Kim (2023) proposes an updated version of the YOLOv8 architecture, integrating Multi-Scale Image Fusion (MSIF) and a P2 layer to enhance the recognition and differentiation of objects, such as distinguishing between birds and drones, especially when they are at a great distance from the sensors. With the P2 layer and MSIF, their model achieves a speed of 45.7 fps at a 640-pixel image size, dropping to 17.6 fps at a 1280-pixel image size.

Bennour (2023) presents experiments with YOLOv8 and YOLOv7 variants to optimize for real-time detection speed, accuracy, and computational efficiency. The YOLOv8n model had the best performance in terms of both frames per second and Average Precision (AP50): 107.53 fps with an AP50 score of 99.5%. While YOLOv7 demonstrated similar accuracy, the YOLOv8n model’s inference speed was much faster.

Reis (2023) enhances feature extraction and object detection by combining a Feature Pyramid Network (FPN) with a Path Aggregation Network (PAN). In addition to these modifications to the YOLOv8 framework, they introduce auto-labeling tools to streamline the annotation of model training data. Furthermore, their model incorporates more advanced post-processing techniques such as Soft NMS (non-maximum suppression). Their modified YOLOv8 model maintained inference speed while increasing accuracy.
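Soft NMS replaces hard suppression with score decay for overlapping boxes; a minimal Gaussian-decay variant is sketched below, with sigma and the score threshold as illustrative choices.

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead
    of discarding them outright. boxes is an (N, 4) array of
    (x1, y1, x2, y2); scores is an (N,) array of confidences.
    """
    boxes = boxes.astype(np.float32).copy()
    scores = scores.astype(np.float32).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        # Pick the highest-scoring remaining box.
        best = max(idxs, key=lambda i: scores[i])
        idxs.remove(best)
        keep.append(best)
        for i in idxs:
            overlap = _iou(boxes[best], boxes[i])
            scores[i] *= np.exp(-(overlap ** 2) / sigma)   # Gaussian decay
        idxs = [i for i in idxs if scores[i] > score_thresh]
    return keep, scores

def _iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```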

Methodology

To mitigate the main pain points discussed in the Introduction, we propose a series of at least three experiments. For both the network backbone and the baseline, we propose utilizing variations of YOLOv8. For more advanced pre-processing and feature extraction, we will implement variations of the human retina model from OpenCV (OpenCV, 2023). Finally, given that the experimental models are likely to be quite large, we propose exploring Knowledge Distillation as a means of transferring what is learned to a smaller model while preserving detection accuracy. We believe combining these three approaches would be an important contribution to the field of drone detection.
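Since Knowledge Distillation is central to the third experiment, a minimal sketch of the standard softened-logits distillation loss is given below in PyTorch. The temperature and weighting are illustrative, and distilling a full detector in practice would also involve matching box-regression outputs and intermediate features, which is omitted here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Classic knowledge-distillation loss on classification logits.

    Combines the usual cross-entropy against ground-truth labels with a
    KL term that pushes the student's softened distribution toward the
    teacher's. The T^2 factor keeps gradient magnitudes comparable.
    """
    # Hard-label term: standard supervised cross-entropy.
    hard = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * hard + (1.0 - alpha) * soft

# Usage: the teacher is the large experimental model (frozen); the student
# is the smaller deployment model; apply the loss to matched class heads.
```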

Candidate Datasets

In this section, we present, in tabular form, candidate computer vision datasets drawn from various modalities and sensor technologies for the purpose of drone detection.

Acoustic

| Dataset | Citation | Data Type | Volume | Resolution | Features | Conditions | Format |
|---|---|---|---|---|---|---|---|
| Audio Based Drone Detection and Identification Dataset | Al-Emadi (2019) | Audio clips | Over 1300 | CD quality | Drone sounds, augmented clips | Indoor environment | MPEG-4 audio |

Electro-Optical

| Dataset | Citation | Data Type | Volume | Resolution | Features | Conditions | Format |
|---|---|---|---|---|---|---|---|
| Amateur Unmanned Air Vehicle Detection Dataset | Aksoy (2019) | Images | Over 4000 | - | DJI Phantom series drones, negative objects | Sourced from YouTube and Google | JPEG, text files |
| “An insect-inspired detection algorithm for aerial drone detection” Dataset | James (2018) | Videos | 5 | 1080p and 720p | Phantom 4 Pro drone, various conditions | Urban, parklands, suburbs | - |
| Drone-vs-Bird Detection Challenge Dataset | Coluccia (2017) | Videos | 5 videos | MPEG4 | Annotations for drone frames | Various backgrounds and illumination | Separate annotation files |
| Drone Model Identification by CNN from Video Stream Dataset | Wisniewski (2021) | Images | - | - | 3 DJI models, randomized backgrounds | Synthetic, Blender-generated | PNG |
| Multi-Target Detection and Tracking from a Single Camera in UAVs Dataset | Li (2022) | Videos | 50 sequences, 70250 frames | 1920x1080 / 1280x1060 | Multiple target UAVs, manual annotations | Captured with GoPro 3 | VATIC annotations |
| Segmented Dataset Based on YOLOv7 for Drone vs. Bird Identification | Srivastav (2023) | Images | 20925 | 640x640 | Birds and drones in motion, augmentation | Various conditions | JPEG, plaintext files |
| SUAV-DATA Dataset | Zhao (2023) | Images | 10000 | - | Small, medium, large drones | All-weather, diverse angles | - |
| UAV Traffic Dataset for Learning Based UAV Detection | Enfv (2022) | Packet headers | - | - | Six commercial drones | - | .csv files |
| USC Drone Dataset | Wang (2019) | Videos | 30 clips | 1920x1080 | Model-based augmentation, diverse scenarios | Varying backgrounds, weather | - |
| Unmanned Aerial Vehicles Dataset | Makrigiorgis (2022) | Images | 1535 | - | Multiple angles and lighting conditions | - | COCO, YOLO, VOC formats |

LIDAR

| Dataset | Citation | Data Type | Volume | Resolution | Features | Conditions | Format |
|---|---|---|---|---|---|---|---|
| Drone detection in LIDAR depth maps Dataset | Carrio (2018) | Depth maps | 6000 | - | Various drone models, environments | Indoor and outdoor | - |

Radio Frequency

| Dataset | Citation | Data Type | Volume | Resolution | Features | Conditions | Format |
|---|---|---|---|---|---|---|---|
| Cardinal RF Dataset | Medaiyese (2022) | RF signals | - | - | UAV controllers, UAVs, Bluetooth, Wi-Fi | Visual and beyond-line-of-sight | MATLAB format |
| Drone Remote Controller RF Signal Dataset | Ezuma (2020) | RF signals | ~1000 per RC | - | 17 drone RCs, 2.4 GHz band | - | MATLAB format |
| DroneRF Dataset | Al-Sad (2019) | RF signals | 227 segments | - | Various drone modes, background activities | - | - |
| RF-based Dataset of Drones | Sevic (2020) | RF signals | - | - | Communication signals between drones and control stations, frequency hopping techniques | - | - |
| VTI_DroneSET_FFT Dataset | Sazdic-Jotic (2021) | RF signals | - | - | DJI drones, operational mode changes | - | - |

Multi-Modal: Infrared and Electro-Optical

| Dataset | Citation | Data Type | Volume | Resolution | Features | Conditions | Format |
|---|---|---|---|---|---|---|---|
| Anti-UAV Dataset | Zhao (2023) | Videos | - | Full HD | RGB and IR, dense annotations | Real-world dynamic scenarios | - |
| VisioDECT Dataset | Ajakwe (2022) | Images | 20924 | - | Six drone models, three scenarios | Cloudy, sunny, evening | txt, xml, csv |

Multi-Modal: Other

| Dataset | Citation | Data Type | Volume | Resolution | Features | Conditions | Format |
|---|---|---|---|---|---|---|---|
| “CW coherent detection lidar for micro-Doppler sensing and raster-scan imaging of drones” Dataset | Rodrigo (2023) | Lidar data | - | - | Micro-Doppler signatures, raster-scan images | Up to 500 meters detection | - |
| Multi-view Drone Tracking Dataset | Albl (2023) | Videos | - | - | Multiple cameras, 3D trajectory | Various difficulties | - |
| “Real-Time Drone Detection and Tracking With Visible, Thermal, and Acoustic Sensors” Dataset | Svanström (2020) | Videos, audio clips | 650 videos, 90 audio clips | Infrared: 320x256, Visible: 640x512 | Drones, birds, airplanes, helicopters | Various weather conditions | - |

References
