In brief: First published in 2016, Microsoft’s "MS Celeb" data set held over 10 million images of almost 100,000 people. The compendium was used by researchers and private companies to train facial recognition technology, but following an investigation by the Financial Times, Microsoft has now deleted it.
Facial recognition is a hot topic as more companies and law enforcement agencies jump on the bandwagon, seemingly without much thought to personal privacy. The main way to train the algorithms that power the tech is by “showing” them a vast number of pictures from a database. One such database, known as ‘MS Celeb,’ was published by Microsoft in 2016.
The name stems from the purported contents of the data. Microsoft maintains that the photos were scraped from images and videos publicly available on the internet, and together comprised the largest publicly available facial recognition data set in the world. In total, more than 10 million images of almost 100,000 people were included.
According to an investigation by the Financial Times, MS Celeb was used not only by academics, but also by military researchers and private companies to train their own facial recognition solutions. Two firms stand out in particular: SenseTime and Megvii. Both are Chinese companies that are involved in China’s notorious tracking endeavors.
The investigation also revealed that many of the faces included in the data were not those of public figures or celebrities. Indeed, security journalists and privacy advocates were among those included, such as Shoshana Zuboff, author of The Age of Surveillance Capitalism.
Microsoft told the Financial Times, “the site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed.”
But just because Microsoft has taken down its version, it doesn’t mean that MS Celeb is gone. Adam Harvey, who conducted the original investigation, said that following Microsoft’s deletion, MS Celeb “is completely disassociated from any licensing, rules or controls that Microsoft previously had over it. People are posting it on GitHub, hosting the files on Dropbox and Baidu Cloud, so there is no way from stopping them from continuing to post it and use it for their own purposes.”