Microsoft exposed 38 terabytes of sensitive data while working on AI model

Alfonso Maruccia

Facepalm: Training generative AI models requires coordination and cooperation among many developers, and additional security checks should be in place. Microsoft is clearly lacking in this regard, as the corporation has put vast amounts of data at risk for years.

Between July 20, 2020, and June 24, 2023, Microsoft exposed a vast trove of data to the public web through a public GitHub repository. Cloud security company Wiz discovered the issue and reported it to Microsoft on June 22, 2023, and the company invalidated its insecure token two days later. The incident is only being disclosed to the public now, as Wiz detailed its findings on its official blog.

By incorrectly using a feature of the Azure platform known as Shared Access Signature (SAS) tokens, Wiz researchers say Microsoft accidentally exposed 38 terabytes of private data on the robust-models-transfer GitHub repository. The archive was used to host open source code and AI models for image recognition, and Microsoft AI researchers were sharing their files through an excessively permissive SAS token.

SAS tokens are signed URLs that grant granular access to data hosted on Azure Storage. The access level can be customized by the user, and the particular SAS token employed by Microsoft researchers pointed at a misconfigured Azure Storage account containing loads of sensitive data.
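
As a rough illustration of what "access level" means here, a SAS URL carries its grant in the query string: the `sp` field lists permission letters and `se` holds the expiry time. The sketch below uses only Python's standard library to read those two fields and flag anything beyond read and list access. The `describe_sas` helper, the example URL, and the choice of "read-only" as the baseline are assumptions made for illustration; this is not the token Microsoft used, and it does not validate the signature.

```python
from urllib.parse import urlsplit, parse_qs

# Permission letters commonly found in the `sp` field of a SAS URL.
PERMISSION_NAMES = {
    "r": "read", "a": "add", "c": "create",
    "w": "write", "d": "delete", "l": "list",
}
# Assumed baseline for sharing files publicly: read and list only.
READ_ONLY = {"r", "l"}

def describe_sas(url):
    """Summarize the permissions and expiry encoded in a SAS URL."""
    query = parse_qs(urlsplit(url).query)
    letters = query.get("sp", [""])[0]
    return {
        "permissions": [PERMISSION_NAMES.get(ch, f"other({ch})") for ch in letters],
        "expiry": query.get("se", [None])[0],
        "beyond_read_only": sorted(set(letters) - READ_ONLY),
    }

# Hypothetical URL; the account, container, and signature are placeholders.
example = ("https://example.blob.core.windows.net/models"
           "?sv=2020-08-04&sp=racwdl&se=2030-01-01T00:00:00Z&sig=REDACTED")
print(describe_sas(example))
```

A non-empty `beyond_read_only` list is the kind of signal a reviewer might look for before a link like this is committed to a public repository.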

Besides training data for its AI models, Microsoft exposed a disk backup of two employees' workstations, according to Wiz. The backup included "secrets," private cryptographic keys, passwords, and over 30,000 internal Microsoft Teams messages belonging to 359 Microsoft employees. A total of 38 TB of private files could have been accessed by anyone, at least until Microsoft revoked the dangerous SAS token on June 24, 2023.

Despite their usefulness, SAS tokens pose a security risk due to a lack of monitoring and governance. Wiz says that their usage should be "as limited as possible," as the tokens are hard to track because Microsoft doesn't provide a centralized way to manage them through the Azure portal.

Furthermore, SAS tokens can be configured to last "effectively forever," as Wiz explains. The first token that Microsoft committed to its AI GitHub repository was added on July 20, 2020, and it remained valid until October 5, 2021. A second token was subsequently added to GitHub, with an expiration date set for October 6, 2051.
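
To put those expiration dates in perspective, here is a minimal sketch of the sort of check a team could run against a token's expiry date. The `classify_expiry` function and the seven-day limit are hypothetical choices for this example, not an Azure feature or a Wiz recommendation; the dates are the ones reported above, evaluated as of the day Wiz notified Microsoft.

```python
from datetime import date

def classify_expiry(expiry, today, max_days=7):
    """Label a token by how its expiry compares with an assumed lifetime limit."""
    if expiry < today:
        return "expired"
    if (expiry - today).days > max_days:
        return "long-lived"
    return "ok"

report_day = date(2023, 6, 22)  # the day Wiz reported the issue
print(classify_expiry(date(2021, 10, 5), report_day))  # first token
print(classify_expiry(date(2051, 10, 6), report_day))  # second token
```

Under these assumptions the first token had already lapsed by the time of the report, while the second would be flagged as long-lived, with roughly 28 years of validity still ahead of it.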

Microsoft's multi-terabyte incident highlights the risks associated with AI model training, according to Wiz. This emerging technology requires "large sets of data to train on," the researchers explain, with many development teams handling "massive amounts of data," sharing it with their peers, or collaborating on public open-source projects. Instances like Microsoft's are becoming "increasingly hard to monitor and avoid."

I say we should just eliminate passwords altogether and also permit the small AIs in our future Windows operating systems to share our hard disks across the Web and the "cloud", so that the Big AIs could be trained properly, because training AIs is the best thing that ever happened to the human race. After all, what do you have to hide?

PS: and also be charged a monthly fee for the privilege of being allowed to train AIs

So, if this was data for training AI in image generation, does that mean that some of that 38TB could have been other people's copyrighted photos that they scraped off the internet?