Two Toyotas recognized with near 100% confidence, taken from www.pyimagesearch.com

A new bridge is being created between Artificial Intelligence and environments by applying computer vision tunneling to machine learning. This bridge is called object detection, with far-reaching applications in self-driving vehicles, automated robot workers in industries, facial recognition software (as a security facet in bank accounts, devices such as phones, computers etc.), VAR’s in sport, and, homeland and international security. Coupled with machine learning, it is like Siri or Alexa with eyes on their surroundings. Object detection or computer vision tunneling can not only be used to detect a single variable in an image but can correctly classify different components in a single image, as in the picture below:

How Does Object Detection Work?

According to the experts from The App Solutions, object detection has evolved from splitting images into regions, classifying those regions and using thousands of probabilistic neural networks to try and assign high scores to each region which would then form a criterion for detection. Now we have the more recent “YOLO” method, which only requires a single neural network to produce bounding boxes, classes and their probabilities. Image classification and object detection work share much of the same core principles but work differently.

While image classification returns the probability of an object within the image being a certain class label, its focus on the most visible or dominant object in the image renders it limited in functionality. Object detection, on the other hand, uses localization of image segments through bounding boxes with x-y coordinates, a class label for each segment, and a probability score tied to each bounding box. The overall effect is that with image classification, you can only get one output per classification, while with object detection, you can process several classes simultaneously. This vastly increases the advantage of object detection over image classification.

Implementing an Object Detection Framework

With this understanding of how object detection works, you may now wonder how the actual rollout happens. Due to the actual processing that takes place behind the scenes, speed is an important factor when choosing neural networks or datasets to integrate within your own frameworks. It is actually possible to integrate image classification systems with object detection frameworks. For example, with ImageNet which is a classification system, you can integrate APIs with a deep-learning object detection network. Other datasets include COCO (Common Objects in COntext) and PASCAL VOC. Each of these datasets has different numbers of images classified with bounding boxes. More traditional methods of detecting and localizing include using image pyramids and sliding windows moving vertically and horizontally across a screen, giving both classification and location.

With datasets integrated into base neural networks such as CNNs (Convolutional Neural Networks), your APIs can then feed this into object detection systems such as YOLO (You Only Look Once), Faster R-CNNs and SSDs (Single Shot Multibox Detectors). This means that you have at your disposal complete end-to-end systems for detection and classification.

Approaches and Applications of Object Detection Systems

YOLO (You Only Look Once) is a product of the Darknet project which is open-source and provides “neural network frameworks for training and testing computer vision models.” In YOLO, a single neural network is used to produce spatially separated bounding boxes with associated class probabilities. This allows it to process images to speeds of as many as 45 FPS, although the Fast YOLO version of the application can handle as many as 155 FPS. With a single neural network doing all the heavy lifting, YOLO is a considerably smaller and faster neural network. The only downside is that it might not produce the desired accuracy. However, with its blazing speeds, it is critically important as a real-time detection system such as in video image detection and self-driving vehicles.

Convoluted Neural Networks (CNNs) use the concept of reaction to stimuli by neurons by subjecting an image to a vector classification system. R-CNNs then use selective image search to extract and classify 2000 region proposals to obtain mean Average Predictions (mAPs). The results for each sector within a dense 4096-dimensional vector are fed into an SVM (Support Vector Machine). The result is a much slower but more accurate deep learning system. Fast R-CNN significantly reduces the downtime associated with R-CNN by eliminating the need to feed thousands of region proposals to the CNN each time, instead simply checking a single image as a vector file and generating a feature map.

SSDs (Single Shot Multibox Detectors) from Google crosses concepts from R-CNNs and YOLO, with both regional convolution proposals and feature extraction combined in a single neural network for faster speeds.

As an emergent field, there are many more approaches to object detection being tested and implemented each day. Neural network environments are being simplified and made more openly accessible to users through utilities such as TensorFlow, which is a deep learning environment that enables integrated APIs for various applications. Choose your approach based on the goals you intend to achieve.