Tutorial on implementing YOLO v3 from scratch in PyTorch



Object detection is a domain that has benefitted immensely from the recent developments in deep learning. Recent years have seen people develop many algorithms for object detection, some of which include YOLO, SSD, Mask RCNN and RetinaNet.

This is a companion discussion topic for the original entry at https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch


Cool video from Karol Majek showing his results in YOLO v3:


Hi, it seems detect.py does not work when I choose a batch size other than 1. Does anybody have a solution or ideas about this? Thank you very much!


@DongDong_Chen Hmm interesting. The author is working on a full tutorial on the training component itself. Should be out soon.


@DongDong_Chen It seems as if you’ve cloned my other PyTorch YOLO v3 repo, and not the one linked in this tutorial. That repo had a bug that made the code crash with a batch size greater than 1, and it has since been resolved. The issue doesn’t exist in the repo linked in this post.


Is there a section how to run a training job for YOLO on Paperspace?


Thanks for the details on YOLO v3. I got a great overview of the whole algorithm from your instructions, but I still have a small puzzle.
In the making-predictions part, the computation of bw and bh is still unclear to me. There you said: “tw, th is what the network outputs. pw and ph are anchors dimensions for the box.” About pw and ph: does it mean they are the actual width and height of the anchor box in the original 416×416 image, or something else?


I have the same puzzle. Do you understand it now?


Hello @ayooshkathuria, great tutorial. I need a suggestion: is it possible to detect only one class of object? For example, if I want to detect cars, can it detect only cars and ignore the rest of the objects in the image or video?


@Faizan_Ahmad: Bro, your question indicates that you didn’t go through the tutorial to learn how this stuff actually works. The author spent a lot of time creating this. He spells out exactly where in the code you get the final results, which include which class was found and where it is in the image. Just write some code to erase any detection that you aren’t interested in.

In section 5 of this tutorial, there’s a block of code “for i in range(output.shape[0]):” with two indented lines of code after it. As soon as this code has executed, you have all your detections and where they are. The very last column (output[:,7]) in the “output” Tensor is the object class. It is a number corresponding to what kind of object it is. So if output[0,7] is 0, the first object found is a person. If you’re looking only for people, keep that row and move on. If you’re looking for something else, remove the entire row and keep going down the rows, removing everything you don’t want. If you don’t know how to do this, learn more Python. Then just let the rest of the code execute as provided. Very easy.
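For concreteness, here is a minimal sketch of that filtering step. The column layout (image index, four box corners, objectness, class score, class index in column 7) follows the tutorial’s `output` tensor; the three sample rows below are made-up values for illustration only.

```python
import torch

# Hypothetical sample of the tutorial's "output" tensor: each row is one
# detection -- [image index, x1, y1, x2, y2, objectness, class score, class].
output = torch.tensor([
    [0., 10., 10., 50., 50., 0.90, 0.85,  0.],  # class 0  -> person
    [0., 20., 30., 90., 80., 0.80, 0.75,  2.],  # class 2  -> car
    [0., 15., 25., 60., 70., 0.70, 0.65, 16.],  # class 16 -> dog
])

wanted_class = 2                      # keep only cars
keep = output[:, 7] == wanted_class   # boolean mask over the rows
filtered = output[keep]               # drop every other detection
```

After this, the rest of the drawing code runs unchanged on `filtered` instead of `output`.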


Hey Ayoosh Kathuria,

I really liked your tutorial and I appreciate the work you put into it. However I have one problem, when testing with images that contain larger objects it works really well (like the dog-bicycle-car image), but when I am using images with smaller objects, like images of a traffic jam, so when the 2nd and the 3rd Detection Layer are used, it doesn’t work at all. I just wondered if someone has similar problems or if it is my mistake.



The image with the dog is wrong. Class scores are used once per grid cell, and the bbox attributes consist of x, y, w, h, and objectness. The image should look like (x, y, w, h, objectness) × B + (class scores).



I think the paragraph about

The resultant predictions, bw and bh , are normalised by the height and width of the image. (Training labels are chosen this way). So, if the predictions bx and by for the box containing the dog are (0.3, 0.8), then the actual width and height on 13 x 13 feature map is (13 x 0.3, 13 x 0.8).

is wrong.

The figure in the original paper makes it pretty clear that the sizes of the bounding boxes are normalized by the size of the anchor box, not by the original picture. I think I agree with the people who raised this puzzle above…
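For what it’s worth, here is a minimal sketch of the decoding, assuming (tx, ty, tw, th) are the raw network outputs for one box, (cx, cy) are the top-left offsets of the grid cell, and (pw, ph) are the anchor dimensions already divided by the detection layer’s stride (416 / 13 = 32), i.e. expressed in units of grid cells. The anchor values below (116×90 pixels, divided by 32) are taken from the standard YOLO v3 config for the 13×13 scale.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw outputs to a box on the 13 x 13 feature map (grid units)."""
    bx = sigmoid(tx) + cx      # centre x, offset from the cell's corner
    by = sigmoid(ty) + cy      # centre y
    bw = pw * math.exp(tw)     # width  = anchor width  scaled by e^tw
    bh = ph * math.exp(th)     # height = anchor height scaled by e^th
    return bx, by, bw, bh

# With tw = th = 0 the predicted box is exactly the anchor; multiplying the
# result by the stride (32) maps it back to 416 x 416 pixel coordinates.
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0,
                            cx=6, cy=6, pw=116 / 32, ph=90 / 32)
```

So pw and ph are anchor dimensions measured relative to the feature map, and the e^tw, e^th factors rescale the anchor rather than the whole image.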


Hey Ayoosh Kathuria,
Your tutorial is awesome, and I appreciate the work you put into it. In your tutorial, you describe the forward pass and testing in detail. I want to fine-tune the network on my own dataset, but I don’t know how to train it with my own data.