Google shows how depth detection works on the Pixel 3

Machine learning gains another foothold in the mobile industry

By Isaiah Mayersen December 1, 2018, 10:12

The big picture: The Pixel 3 is quite possibly the best camera phone on the market, and Google proudly shows off its ability to take "professional-looking" portrait shots with background blur in advertisements and stores. Until now, Google has kept the magic to itself, but it recently revealed many of the details in a blog post. The Pixel 3 takes the Phase-Detection Autofocus (PDAF) and neural network approach from the Pixel 2 and combines them with new machine learning techniques to detect depth much more precisely and reliably than other single-lens phones.

If almost every other phone needs two rear cameras to create realistic background blur, then how can the Google Pixel 3 do it with just one? Good depth detection almost always works by detecting the changes between two slightly different views of a scene.

When comparing two images that were taken side by side, the foreground remains pretty much stationary while the background moves noticeably, parallel to the direction from one viewpoint to the other. This effect, known as parallax, is how human eyes perceive depth, how astronomers measure the distances to nearby stars, and how the iPhone XS creates background blur.
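As a rough illustration of how parallax turns into distance, the textbook triangulation rule for two parallel pinhole cameras is depth = focal length × baseline / disparity. The sketch below assumes that idealized setup; the function name and all numbers are hypothetical and are not taken from Google's pipeline. It measures the shift relative to a point at infinity, so nearer points show the larger shift, whereas the description above takes the foreground subject as the fixed reference.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Distance in metres to a point seen by two parallel pinhole cameras.

    focal_px     -- focal length expressed in pixels (assumed value)
    baseline_m   -- distance between the two viewpoints in metres
    disparity_px -- how far the point shifts between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical numbers: 3000 px focal length, 1 cm between viewpoints.
near = depth_from_disparity(3000.0, 0.01, 30.0)  # large shift, close point
far = depth_from_disparity(3000.0, 0.01, 6.0)    # small shift, distant point
print(near, far)
```

Because depth is inversely proportional to disparity, shrinking the baseline shrinks every shift by the same factor, which is why a very small baseline makes the measurement so sensitive to noise.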

Phase Detection Autofocus (PDAF), sometimes called Dual-Pixel Autofocus, creates a basic depth map by detecting tiny amounts of parallax between two images taken simultaneously by a single camera. In most cameras, this depth map is used for autofocusing, but in the Pixel 2 and 3, it's the foundation for the depth map used for background blur.
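Detecting parallax between two views is commonly done by block matching: take a small patch from one image and slide it across the other until the difference is smallest. The following is a minimal one-dimensional sketch of that general idea, assuming two rows of brightness values; `best_shift` and the sample rows are hypothetical, and this is not a description of Google's actual algorithm.

```python
def best_shift(left_row, right_row, x, half_window=2, max_shift=4):
    """Find the shift that best aligns a patch of left_row with right_row.

    Compares the patch centred at index x in left_row with the patch
    centred at x + shift in right_row, using the sum of absolute
    differences (SAD), and returns the shift with the lowest cost.
    """
    best, best_cost = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        cost = 0
        in_bounds = True
        for k in range(-half_window, half_window + 1):
            i, j = x + k, x + k + shift
            if not (0 <= i < len(left_row) and 0 <= j < len(right_row)):
                in_bounds = False
                break
            cost += abs(left_row[i] - right_row[j])
        if in_bounds and cost < best_cost:
            best, best_cost = shift, cost
    return best


# The same bright feature appears two pixels further right in the second view.
left_row = [0, 0, 0, 0, 0, 9, 5, 1, 0, 0, 0, 0, 0, 0]
right_row = [0, 0, 0, 0, 0, 0, 0, 9, 5, 1, 0, 0, 0, 0]
shift = best_shift(left_row, right_row, x=6)
print(shift)
```

Repeating this search for every pixel gives a map of shifts, which is the raw material for a depth map.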

The Pixel 2's Stereo depth detection fails to separate the horizontal lines from the foreground, something the Pixel 3 has no trouble with.

Unfortunately, because PDAF was never intended to be used so extensively, it comes with a lot of problems that Google has spent several years trying to solve. The first is obvious: the two viewpoints are virtually indistinguishable because they're captured from nearly identical positions, so the parallax is very hard to detect and is easily swamped by artifacts or errors. The second is known as the "aperture problem": when the parallax runs parallel to a line or edge of uniform color, sliding along that line produces no visible change, making the tiny parallax impossible to measure.
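The aperture problem can be seen in a toy example. The sketch below assumes a horizontal baseline for illustration and compares a patch with horizontally displaced copies of itself: if every displacement matches equally well, the shift cannot be recovered. The images and the `shift_costs` helper are hypothetical.

```python
def shift_costs(image, row, col, half=1, max_shift=2):
    """SAD cost of matching the patch centred at (row, col) against the
    same image displaced horizontally by each candidate shift."""
    costs = {}
    for shift in range(-max_shift, max_shift + 1):
        cost = 0
        for dr in range(-half, half + 1):
            for dc in range(-half, half + 1):
                a = image[row + dr][col + dc]
                b = image[row + dr][col + dc + shift]
                cost += abs(a - b)
        costs[shift] = cost
    return costs


# Horizontal stripes: every row is a single uniform brightness.
horizontal_stripes = [[(r % 2) * 9] * 8 for r in range(8)]
# One vertical line: brightness changes along the shift direction.
vertical_line = [[9 if c == 4 else 0 for c in range(8)] for _ in range(8)]

stripe_costs = shift_costs(horizontal_stripes, 4, 4)
line_costs = shift_costs(vertical_line, 4, 4)
print(stripe_costs)  # every shift has the same cost: ambiguous
print(line_costs)    # only shift 0 has zero cost: unambiguous
```

With horizontal stripes every candidate shift looks like a perfect match, so a matcher has nothing to prefer; the vertical line gives a single clear minimum.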

Google's solution for the Pixel 2 was to use a neural network to detect and separate the layers in a frame based on image recognition. The iPhone XR relies on this approach entirely, which is why it can only blur behind faces. Like the XR's, the Pixel 2's solution only worked well in the situations it had been specifically trained for.

In shots where a wall began in the foreground but continued into the background, for example, the neural network couldn't detect a specific layer. The phone then fell back on the PDAF map, which struggles to map objects that extend from the foreground into the background because of the aperture problem.

The parallax between these two PDAF images is barely detectable, which is why the Pixel 2 is more error-prone.

By the time the Pixel 3 began development, the camera team realized they needed to up their game. They used the foundation from the Pixel 2 (literally the same hardware) and built on it by adding a Convolutional Neural Network (ConvNN) between the PDAF depth mapping and neural network layer detection. ConvNNs are image recognition networks loosely inspired by the brain's visual system that learn from massive amounts of training data, rather than relying on hand-engineered rules. Google used the ConvNN to combine image recognition and the PDAF parallax detection with two new methods: defocus and semantic detection.

Once the PDAF has focused the image, defocus detection simply detects which parts of the image are out of focus and by how much. Semantic detection relies on knowledge of the size of everyday objects, such as faces or cars. If a face appears larger than a car in an image, then it's pretty easy to say that the car is a fair distance behind the face.
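The face-versus-car reasoning can be made concrete with the pinhole camera relation distance = focal length × real size / apparent size. This is only a sketch of the geometric intuition: a trained network picks up such cues implicitly rather than through an explicit formula, and the focal length, object heights, and pixel sizes below are assumed values chosen for illustration.

```python
def distance_from_known_size(focal_px, real_height_m, image_height_px):
    """Estimate distance in metres from an object's known real height and
    its apparent height in the image, using the pinhole camera model."""
    if image_height_px <= 0:
        raise ValueError("apparent height must be positive")
    return focal_px * real_height_m / image_height_px


FOCAL_PX = 3000.0        # assumed focal length in pixels
FACE_HEIGHT_M = 0.24     # assumed typical face height
CAR_HEIGHT_M = 1.5       # assumed typical car height

# The face spans more pixels than the car, even though a car is far bigger.
face_distance = distance_from_known_size(FOCAL_PX, FACE_HEIGHT_M, 600.0)
car_distance = distance_from_known_size(FOCAL_PX, CAR_HEIGHT_M, 500.0)
print(face_distance, car_distance)
```

With these numbers the face comes out at roughly a metre from the camera and the car several metres further back, matching the intuition in the paragraph above.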

To train the ConvNN, Google required lots of high-quality images of things that everyday users would photograph, along with matching PDAF maps. To capture all these images, they constructed what they call the "Frankenphone," which is made of five Pixel 3 prototypes synced together.

The five cameras on the Frankenphone meant that there was parallax in multiple directions, which largely avoided the issues that come with using parallax in a single direction for depth detection. There were also enough cameras that nearly every part of an image was captured by more than one of them, enabling the camera team to construct highly accurate depth maps with which to train their neural network.
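Why parallax in more than one direction helps can be shown with a toy matching cost. The sketch below assumes one horizontal and one vertical camera pair and, as a simplification, compares a patch with displaced copies of itself to reveal which displacements are indistinguishable. The `patch_cost` helper and the test image are hypothetical and do not reflect how the Frankenphone data was actually processed.

```python
def patch_cost(image, row, col, d_row, d_col, half=1):
    """SAD cost between the patch centred at (row, col) and the patch
    displaced by (d_row, d_col) in the same image."""
    cost = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            a = image[row + dr][col + dc]
            b = image[row + dr + d_row][col + dc + d_col]
            cost += abs(a - b)
    return cost


# A single horizontal line: no brightness change along the horizontal axis.
horizontal_line = [[9 if r == 4 else 0 for _ in range(9)] for r in range(9)]
shifts = range(-2, 3)

# Horizontal baseline only: every shift matches perfectly, so it is ambiguous.
horizontal_only = {d: patch_cost(horizontal_line, 4, 4, 0, d) for d in shifts}

# Adding a vertical baseline: the summed cost has one clear minimum.
combined = {
    d: patch_cost(horizontal_line, 4, 4, 0, d)
    + patch_cost(horizontal_line, 4, 4, d, 0)
    for d in shifts
}
best = min(combined, key=combined.get)
print(horizontal_only, combined, best)
```

The horizontal pair alone cannot tell the shifts apart, but the vertical pair sees the line move, so the combined cost singles out one answer.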

All this culminated in the superb edge detection on the Pixel 3 today, which is easily as good as that of many dual-lens smartphones. As Google continues to pave the way in smartphone photography, we'll have to wait and see what competitors bring to the table. Triple-camera smartphone, anyone?
