In brief: Thanks to machine learning, object detection has come a long way in recent years, but most models still perform best on low-resolution video images. Now, researchers at Carnegie Mellon University have developed a new system that uses GPUs to quickly and accurately detect objects in 4K and 8K video.
As explained to TechXplore by researcher Vít Růžička: "While plenty of data sources record in high resolution, current state-of-the-art object detection models, such as YOLO, Faster RCNN, SSD, etc., work with images that have a relatively low resolution of approximately 608 x 608 px."
The majority of current models use these images for three reasons: they are sufficient for the task; processing low-resolution images is more time efficient; and many publicly available datasets used to train the models are made up of low-res images.
The problem with low resolution, of course, is that fine detail is lost. And with the number of 4K and even 8K cameras on the rise, a new model is needed to analyze the footage they produce. That’s where the researchers’ ‘attention pipeline’ comes in.
The method, which is the work of Růžička and his colleague Franz Franchetti, divides the task of object detection into two stages. Both stages subdivide the original image by overlaying it with a regular grid and then apply the YOLO v2 model for fast object detection.
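To make the grid idea concrete, here is a minimal Python sketch of cutting a frame into equal rectangular crops. It is an illustration only, not the researchers' code: the function name grid_crops and the 6 x 4 grid size are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_crops(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into rows x cols rectangular crops.

    Hypothetical helper for illustration; the grid dimensions are not
    taken from the researchers' implementation.
    """
    crops = []
    for r in range(rows):
        for c in range(cols):
            # Integer arithmetic keeps neighbouring crops edge to edge.
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            crops.append((left, top, right, bottom))
    return crops


# A 4K frame (3840 x 2160) cut into a 6 x 4 grid gives 24 crops of 640 x 540.
crops = grid_crops(3840, 2160, rows=4, cols=6)
print(len(crops), crops[0], crops[-1])
```

Each crop could then be handed to a detector independently, which is what makes the work easy to spread across several workers.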
"We create many small rectangular crops, which can be processed by YOLO v2 on several server workers, in a parallel manner," Růžička explained. "The first stage looks at the image downscaled into lower resolution and performs a fast object detection to get rough bounding boxes. The second stage uses these bounding boxes as an attention map to decide where we need to check the image under high-resolution. Therefore, when some areas of the image don't contain any object of interest, we can save on processing them under high resolution."
The researchers' implementation distributes the work across GPUs. They managed to maintain high accuracy while reaching an average performance of three to six frames per second (fps) on 4K videos and two fps on 8K videos. Compared to the YOLO v2 approach of down-scaling images to low resolutions, the method improved the average precision score from 33.6 AP50 to 74.3 AP50.
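As a rough illustration of how the attention stage can save work, the Python sketch below keeps only the high-resolution crops that overlap a rough bounding box from the low-resolution pass. It is a simplified sketch under stated assumptions, not the researchers' implementation: the function names, the 6 x 4 grid, the downscaling factor of 4 and the sample bounding box are all made up for the example, and no real detector is run.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def select_active_crops(crops: List[Box], rough_boxes: List[Box],
                        scale: float) -> List[Box]:
    """Keep only the full-resolution crops touched by a rough detection.

    rough_boxes are in the coordinates of the downscaled image; scale maps
    them back to full resolution (a hypothetical parameter for this sketch).
    """
    scaled = [
        (int(b[0] * scale), int(b[1] * scale), int(b[2] * scale), int(b[3] * scale))
        for b in rough_boxes
    ]
    return [crop for crop in crops if any(overlaps(crop, box) for box in scaled)]


# A 3840 x 2160 frame cut into a 6 x 4 grid of 640 x 540 crops.
crops = [(c * 640, r * 540, (c + 1) * 640, (r + 1) * 540)
         for r in range(4) for c in range(6)]

# One made-up rough detection from a frame downscaled by a factor of 4.
rough_boxes = [(100, 100, 150, 200)]

active = select_active_crops(crops, rough_boxes, scale=4.0)
print(f"{len(active)} of {len(crops)} crops need a high-resolution pass")
```

In this toy case only 2 of the 24 crops would be sent on for high-resolution detection; the rest of the frame is skipped.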
"Our method reduced the time necessary to process high-resolution images by approximately 20 percent, compared to processing every part of the original image under high resolution," Růžička said. "The practical implication of this is that near real-time 4K video processing is feasible. Our method also requires a lower number of server workers to complete this task."
Růžička and Franchetti say they are looking at ways to improve their model further, since overlaying the grid onto the images can sometimes result in objects being cut in half.