Understanding object detection using YOLO and training for new objects – Part 1

Computer vision for object recognition is developing at a fast pace. Apart from the obvious example of self-driving vehicles, there is a wide range of possible applications, such as predictive maintenance of power grids. Identifying power lines at risk (e.g. trees growing within the safety area of cables) and equipment in need of repair is a time-consuming ordeal. Equipping drones with cameras would facilitate this task, provided that the models are trained on adequate data.

In this post, I am going to show you how easy it is to use object detection with a few lines of code, but also how you can train the available models on custom data, i.e. data relevant to your goals. To avoid writing a far too lengthy post, I have decided to split it into two parts, and the reason is quite simple. Even though the “object detection” and the “labelling” of new data are fairly simple tasks (I did not write that they don’t require work), the technical work required to actually incrementally train a model on new data is painful... at least for those of us who mainly use Windows PCs. I therefore do not want to destroy the joy of newly acquired skills by painfully describing how to solve technical difficulties in Windows.

Hence, this part includes an overview of YOLO and a description of how it works, as well as a guide on how to use it on your own machine and on your own material. The second part will focus on training YOLO on custom data and include a thorough guide to the technical requirements and the solutions needed to succeed with the task.

What is YOLO?

YOLO stands for You Only Look Once and YOLOv3 is the third version of the original model (read this article by its creators: https://pjreddie.com/media/files/papers/yolo.pdf). Now, you may ask why I am presenting work done by someone else. Well, in the best of worlds everyone has the computational capacity and the available data to train a model capable of detecting 80 different objects. Anyone who has trained a neural network to recognize 10 or more objects knows what I am talking about: most of us do not have access to that kind of computing power. So, my aim here is to make YOLO’s inner workings understandable to the layman and to show that it does not require a great deal of effort to use it.

So, YOLO is a network for object detection. The object detection task consists of determining the location of the objects in the image and then classifying these objects. This in itself is not new, as other methods such as R-CNN can do the job. The problem with R-CNN is that the task must be repeated several times for each image. This in turn implies a very slow process which is rather hard to optimize, since each individual component must be trained separately. You might now understand the name YOLO: You Only (Need to) Look Once, because you simply feed the image to a single neural network once!

How does YOLO work?

The input is an image (either a single image or a frame in a video feed), which is fed to a neural network that outputs a vector of bounding boxes and class predictions. The input is divided into an S x S grid of cells. There are potentially a large number of objects present in the image, so for each object, one grid cell is “responsible” for predicting it. An object may, however, stretch over several cells. The responsibility is then awarded to the cell in which the center of the object is situated.

Each grid cell predicts K bounding boxes as well as P class probabilities. The bounding box prediction has 5 components: (x, y, w, h, confidence). The (x, y) coordinates represent the center of the box, relative to the grid cell location. Recall that if the center of the box is not inside the grid cell, then the cell is not responsible for it. Since objects can be of different sizes and can appear to be further away from or nearer to the observer, all coordinates need to be normalized: x and y are expressed as fractions of the cell size, while the width w and height h are expressed as fractions of the whole image. You might want to look at the example below to see how the normalization is done.

Fig 1: Predicted bounding box (yellow), Ground-Truth (red) and coordinate normalization.

x = (196 - 160)/160 = 0.225

y = (226 - 160)/160 = 0.4125

w = 316/480 ≈ 0.658

h = 188/480 ≈ 0.392
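The arithmetic above can be sketched as a small function. This is only an illustration: the function name is made up, and the 480 x 480 image and 3 x 3 grid (i.e. 160-pixel cells) are assumptions read off the numbers in the example.

```python
def normalize_box(cx, cy, w, h, image_size=480, grid_size=3):
    """Return (row, col, x, y, w_norm, h_norm) for a box given in pixels.

    cx, cy: center of the box; w, h: width and height of the box.
    image_size and grid_size are assumptions taken from the example above.
    """
    cell_size = image_size / grid_size        # 160 pixels per cell here
    col = int(cx // cell_size)                # column of the responsible cell
    row = int(cy // cell_size)                # row of the responsible cell
    x = (cx - col * cell_size) / cell_size    # center offset inside the cell
    y = (cy - row * cell_size) / cell_size
    # Width and height are normalized by the size of the whole image.
    return row, col, x, y, w / image_size, h / image_size


row, col, x, y, w, h = normalize_box(196, 226, 316, 188)
print(row, col, round(x, 4), round(y, 4), round(w, 3), round(h, 3))
```

With the numbers from the example, the responsible cell is the one in row 1, column 1 (counting from 0), and the four normalized values match the ones computed by hand.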

These values are simply a rewriting of the coordinates and give no indication of the presence of an object or the lack thereof. We need a measure of how confident we are that there actually is an object. In the paper cited above, the authors define the confidence score as: “Pr(Object) * IOU(pred, truth). If no object exists in that cell, the confidence score should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.”

What is the intersection over union (IoU)? IoU is a metric that measures the accuracy of an object detector on a particular dataset. Simply put, it compares the overlap of what is called the ground-truth, which is a perfectly fitted bounding box around a specific object (the red box around the airplane in fig 1), with a predicted bounding box (the yellow box around the airplane in fig 1). You might ask how the ground-truth is obtained. Well, it is most commonly the bounding box that has been placed around the object during the manual annotation process. Yes! A lot of the work is done by hand, and it is of huge importance that this work be done with high precision. We’ll describe the process further down in the blog.

So, basically, the IoU is simply the ratio between the area of the intersection of the ground-truth bounding box and the predicted bounding box and the area of their union, as can be seen in the picture below (fig 2).


Fig 2: Illustration of the calculation of the confidence score.
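To make the definition concrete, here is a minimal sketch of an IoU computation. It assumes that boxes are given by their corner coordinates (x_min, y_min, x_max, y_max), which is a convention chosen for this illustration and not the (x, y, w, h) format that YOLO predicts.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (0 if the boxes do not overlap).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection   # the overlap is counted only once

    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 3, 3)
print(iou(ground_truth, prediction))  # 1 / 7, roughly 0.143
```

Two identical boxes give an IoU of 1, and two boxes that do not touch give 0.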

Recall that each grid cell predicts K bounding boxes and that we have divided the input into S x S grid cells. This, together with our 5 values per bounding box (x, y, w, h, confidence score), gives us S x S x K x 5 outputs. We also mentioned the existence of P class probabilities (for YOLOv3, the number of classes is 80). These probabilities of course need to be conditional on there actually being an object in the considered cell, that is, Pr(Class(k)|Object). In the original paper they are predicted once per grid cell, which brings the total to S x S x (5K + P) outputs. Basically, the conditioning implies that if a cell does not contain an object, the loss function will not penalize its class predictions.

Let’s look a little deeper into the loss function, as it has quite a lot to say about how YOLO works. As should be evident by now, YOLO predicts multiple bounding boxes per grid cell, but we only want one of these bounding boxes to be responsible for the object. This is easily done. We defined the intersection over union earlier on, and it is a good measure of how well the ground-truth bounding box and the predicted bounding box overlap, so it is reasonable to choose the bounding box with the highest IoU with the ground-truth. Applying this rule over all bounding box predictions leads to specialization: each predictor gets better at predicting certain sizes and aspect ratios.
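As a quick sanity check of the output count, the bookkeeping can be sketched as follows. The function name is made up, and the sketch assumes that the P class probabilities are predicted once per grid cell, as in the original YOLO paper; the example values S = 7, K = 2 and P = 20 are the ones used in that paper, not the YOLOv3 settings.

```python
def output_size(S, K, P):
    """Count the network outputs for an S x S grid, K boxes per cell and P classes."""
    box_values = S * S * K * 5    # (x, y, w, h, confidence) for every box
    class_values = S * S * P      # one set of class probabilities per cell (assumption)
    return box_values, class_values, box_values + class_values


print(output_size(S=7, K=2, P=20))  # (490, 980, 1470)
```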

What is included in the loss function, then? Well, it uses the sum-squared error between the predictions and the ground-truth to calculate the loss. The loss function is composed of three parts: the classification loss, the localization loss (which basically is the error between the predicted bounding box and the ground-truth) and the confidence loss (the objectness of the box).

Classification loss: The classification loss is defined as the squared error of the class probabilities for each class, whenever an object is detected. If we define \hat{p}_{j}(c) as the predicted conditional probability of class c in cell j, p_{j}(c) as the corresponding ground-truth value, and \mathbf{1}_{obj, j} as an indicator with \mathbf{1}_{obj,j}=1 if there exists an object in cell j and 0 otherwise, then the classification loss function is given by

Loss_{Class} = \sum_{j=0}^{S^{2}}\mathbf{1}_{obj,j}\sum_{c\in \{classes\}}(p_{j}(c)-\hat{p}_{j}(c))^2.
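In code, the classification loss could look roughly like the sketch below. It is written in plain Python for readability, not for speed, and the data layout (one list of class probabilities per grid cell, plus a 0/1 flag per cell) is an assumption made for this illustration.

```python
def classification_loss(p_true, p_pred, has_object):
    """Sum-squared error of the class probabilities over the cells that contain an object.

    p_true, p_pred: one list of class probabilities per grid cell.
    has_object:     one 0/1 flag per grid cell (the indicator in the formula).
    """
    total = 0.0
    for truth, pred, obj in zip(p_true, p_pred, has_object):
        if obj:  # cells without an object do not contribute
            total += sum((t - p) ** 2 for t, p in zip(truth, pred))
    return total


# Two cells and three classes; only the first cell contains an object.
p_true = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
p_pred = [[0.5, 0.25, 0.25], [0.9, 0.05, 0.05]]
print(classification_loss(p_true, p_pred, has_object=[1, 0]))  # 0.375
```

Note how the confident but meaningless prediction in the second cell is ignored, because that cell holds no object.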

Localisation loss: As the name indicates, the localisation loss function is related to the coordinates that we defined above, i.e. (x, y, w, h), and measures the errors in the predicted bounding box locations and sizes. If we define \lambda_{coordinates} as a weight for the importance of this loss (in this case the localisation loss), then the localisation loss function is given by

Loss_{loc} = \lambda_{coordinates}\sum_{j=0}^{S^2}\sum_{k=0}^{K}\mathbf{1}_{obj,jk}((x_{j}-\hat{x}_{j})^2+(y_{j}-\hat{y}_{j})^2) + \lambda_{coordinates}\sum_{j=0}^{S^2}\sum_{k=0}^{K}\mathbf{1}_{obj,jk}((\sqrt{w_{j}}-\sqrt{\hat{w}_{j}})^2 +(\sqrt{h_{j}}-\sqrt{\hat{h}_{j}})^2),

where \mathbf{1}_{obj, jk} is 1 if the k-th bounding box in cell j is responsible for detecting the object and 0 otherwise. We can right away define the complement of \mathbf{1}_{obj, jk} as Comp(\mathbf{1}_{obj, jk}) = \mathbf{1}_{Nobj, jk}, which is 1 exactly when \mathbf{1}_{obj, jk} is 0.
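A plain Python sketch of the localisation loss is given below. The flat list layout (one entry per cell/box pair) is an assumption made for readability, and the default weight of 5 is the value used in the paper cited above. The square roots on the width and height are there so that a given error weighs more for a small box than for a large one.

```python
import math


def localization_loss(truth, pred, responsible, lambda_coord=5.0):
    """Weighted sum-squared error on (x, y) and on the square roots of (w, h).

    truth, pred: one (x, y, w, h) tuple per (cell, box) pair, normalized to [0, 1].
    responsible: one 0/1 flag per (cell, box) pair (the indicator in the formula).
    """
    total = 0.0
    for (x, y, w, h), (x_hat, y_hat, w_hat, h_hat), resp in zip(truth, pred, responsible):
        if resp:  # only the responsible box is penalized
            total += (x - x_hat) ** 2 + (y - y_hat) ** 2
            total += (math.sqrt(w) - math.sqrt(w_hat)) ** 2
            total += (math.sqrt(h) - math.sqrt(h_hat)) ** 2
    return lambda_coord * total


truth = [(0.5, 0.5, 0.25, 0.25)]
pred = [(0.25, 0.5, 1.0, 0.25)]
print(localization_loss(truth, pred, responsible=[1]))  # 5 * (0.0625 + 0.25) = 1.5625
```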

Confidence loss: You might remember that the bounding box predictions consist of 5 values: the coordinates (x, y, w, h) and the confidence. The localisation loss deals with the localisation variables, so it is natural that we should also have a measure related to the confidence. As one might easily guess, the two loss functions are very similar, with one little difference.

Loss_{Conf} = \sum_{j=0}^{S^2}\sum_{k=0}^{K}\mathbf{1}_{obj,jk}(C_{j}-\hat{C}_{j})^2 + \lambda_{Nobj}\sum_{j=0}^{S^2}\sum_{k=0}^{K}\mathbf{1}_{Nobj,jk}(C_{j}-\hat{C}_{j})^2.

Now, why the \mathbf{1}_{Nobj, jk} term? Well, think about it. Most cells will not contain any object. This means that we have a strong class imbalance which needs to be dealt with: the many object-free boxes would otherwise dominate the confidence loss. A reasonable way to handle this is to sum over these boxes separately, using the complement of \mathbf{1}_{obj, jk}, and to down-weight that sum with \lambda_{Nobj} (the paper cited above uses 0.5).
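The two-term structure of the confidence loss can be sketched like this. Again the flat list layout (one entry per cell/box pair) is an assumption made for this illustration, and the default weight of 0.5 is the value used in the paper cited above.

```python
def confidence_loss(c_true, c_pred, responsible, lambda_noobj=0.5):
    """Squared confidence error, with the no-object boxes down-weighted.

    c_true:      target confidence per (cell, box) pair
                 (the IoU for a responsible box, 0 where there is no object).
    c_pred:      predicted confidence per (cell, box) pair.
    responsible: one 0/1 flag per (cell, box) pair.
    """
    obj_term = 0.0
    noobj_term = 0.0
    for c, c_hat, resp in zip(c_true, c_pred, responsible):
        error = (c - c_hat) ** 2
        if resp:
            obj_term += error
        else:
            noobj_term += error   # complement of the responsibility indicator
    return obj_term + lambda_noobj * noobj_term


c_true = [0.75, 0.0, 0.0]
c_pred = [0.5, 0.5, 0.0]
print(confidence_loss(c_true, c_pred, responsible=[1, 0, 0]))  # 0.0625 + 0.5 * 0.25 = 0.1875
```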

Hence, the loss function is defined as:

Loss = Loss_{Class} + Loss_{loc} + Loss_{Conf}.

We now understand most of the components of YOLOv3 but still lack knowledge of its architecture. I am not going to go into the details, but I strongly recommend studying this image:

Fig 3: YOLOv3 Architecture

As was mentioned several times in the text, YOLOv3 is trained to recognize 80 classes of objects, which are

person, bicycle, car, motorcycle, airplane,
bus, train, truck, boat, traffic light, fire hydrant, stop sign,
parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra,
giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard,
sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket,
bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange,
broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed,
dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave,
oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair dryer, toothbrush

Using YOLOv3 on your own material

All work and no play makes me a dull boy, as it does most of us. We have worked quite a lot to understand how the predictions are made and it is now time for some fun. What we will use here is the ImageAI module created by Moses Olafenwa. There are a few steps to take care of before getting started, and I’ll assume (here) that not everybody has Python installed on their machine. Even when this has been done, there are some dependencies that need to be dealt with before using YOLO (or any other model for that matter). Here is a brief description of what needs to be done, step by step:

  1. Install Python 3
  2. Install the following dependencies via pip: TensorFlow, NumPy, SciPy, OpenCV (note: pip install opencv-python, not to be confused with OpenCV, which we will deal with in part 2 of this blog), Pillow, Matplotlib, h5py, Keras, ImageAI

ImageAI can be installed with pip install https://github.com/OlafenwaMoses/ImageAI/releases/download/2.0.2/imageai-2.0.2-py3-none-any.whl
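Before going further, it can be worth checking that the dependencies are actually visible to your Python installation. The following is a small sketch using only the standard library; the helper name is made up, and note that a couple of packages are imported under a different name than the one used with pip (opencv-python is imported as cv2, Pillow as PIL).

```python
import importlib.util

# Import names of the dependencies listed above.
REQUIRED_MODULES = ["tensorflow", "numpy", "scipy", "cv2", "PIL",
                    "matplotlib", "h5py", "keras", "imageai"]


def missing_modules(names):
    """Return the module names that cannot be found in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

Anything reported as missing needs another round of pip install before the detection code will run.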

Finally, download the YOLOv3 model file (yolo.h5, the name used in the code below) from this link. As an example, I have chosen a video that shows the “Promenade des Anglais” in Nice (France), as this is my hometown. You are free to choose any video you like and use it in the code below instead. The raw video is the following.



That’s it! We’re good to go! The code is ridiculously short:

from imageai.Detection import VideoObjectDetection
import os
execution_path = os.getcwd()  # folder that holds yolo.h5 and the input video
detector = VideoObjectDetection()
detector.setModelTypeAsYOLOv3()
detector.setModelPath(os.path.join(execution_path, "yolo.h5"))
detector.loadModel()
video_path = detector.detectObjectsFromVideo(
    input_file_path=os.path.join(execution_path, "Promenade des Anglais - Nice - France.mp4"),
    output_file_path=os.path.join(execution_path, "Promenade des Anglais - Nice - France_detect Yolo"),
    frames_per_second=20, log_progress=True)


Some explanation of the code might be in order:

1) Line 4 creates an instance of the VideoObjectDetection class.

2) Line 5: We set the model type to YOLOv3.

3) Line 6: We set the model path (make sure you have saved the model yolo.h5 in the right folder).

4) Line 7: We load the model into the instance of the VideoObjectDetection class.

5) Line 8: We call the detectObjectsFromVideo function and pass the following values into it:

i. input_file_path: File path of the video.

ii. output_file_path: File path to which the detected video will be saved.

iii. frames_per_second: Number of frames per second we want in the output video.

iv. log_progress: States that the detection instance must report the progress of the detection.

6) The detectObjectsFromVideo function returns the file path of the detected video.

Run the code! The result is the following video, which I have posted on YouTube:


Promenade des Anglais, YOLO.


As you can see, very little is required to actually use YOLO. The actual training of the model is a completely different story, partly because it requires loads of data for every single class. As we described, labelled ground-truth data is required, and to this day it can only be produced manually and very precisely. Most of us do not have the time, economic liberty and computational power to train such a model. However, and this will be the subject of Part 2 of this blog, one doesn’t need to start from scratch. The point is that it is possible to incrementally train YOLOv3 with custom objects, objects that are relevant to your goals. It still requires a lot of data and time to annotate the images, but it no longer requires more than the GPU of a fairly simple graphics card (the better it is, the faster the training). A little bump in the road is that if you do not already run Linux, a lot of things need to be done before you even get started. But that is the subject of the next post.

Until then, play!

