Chatbots are surfacing data from GitHub repositories that are set to private

Alfonso Maruccia

Facepalm: Training new and improved AI models requires vast amounts of data, and bots are constantly scanning the internet in search of valuable information to feed the AI systems. However, this largely unregulated approach can pose serious security risks, particularly when dealing with highly sensitive data.

Popular chatbot services like Copilot and ChatGPT could theoretically be exploited to access GitHub repositories that their owners have set to private. According to Israeli security firm Lasso, this vulnerability is very real and affects tens of thousands of organizations, developers, and major technology companies.

Lasso researchers discovered the issue when they found content from their own GitHub repository accessible through Microsoft's Copilot. Company co-founder Ophir Dror revealed that the repository had been mistakenly made public for a short period, during which Bing indexed and cached the data. Even after the repository was switched back to private, Copilot was still able to access and generate responses based on its content.

"If I was to browse the web, I wouldn't see this data. But anyone in the world could ask Copilot the right question and get this data," Dror explained.

After experiencing the breach firsthand, Lasso conducted a deeper investigation. The company found that over 20,000 GitHub repositories that had been set to private in 2024 were still accessible through Copilot.
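
Lasso's numbers aside, the underlying check is simple to reason about: once a repository is flipped back to private (or deleted), GitHub's public API stops returning it, and any surviving copy lives only in third-party caches or in models trained on the crawled data. The Python sketch below illustrates that distinction; it is not Lasso's actual tooling, and the repository names are placeholders.

```python
import requests  # assumes the third-party 'requests' package is installed

# Hypothetical candidate repositories; substitute your own organization's names.
CANDIDATES = ["example-org/example-repo", "example-org/old-prototype"]

for full_name in CANDIDATES:
    # Unauthenticated call to the public GitHub REST API. A 200 means the
    # repository is currently public; GitHub deliberately answers 404 for
    # both private and deleted repositories the caller cannot see.
    resp = requests.get(
        f"https://api.github.com/repos/{full_name}",
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    if resp.status_code == 200:
        print(f"{full_name}: still public on GitHub")
    elif resp.status_code == 404:
        print(f"{full_name}: private or gone now; any remaining exposure would "
              f"come from old search-engine caches or trained models")
    else:
        # Unauthenticated requests are rate-limited (roughly 60 per hour).
        print(f"{full_name}: unexpected status {resp.status_code}")
```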

Lasso reported that more than 16,000 organizations were affected by the exposure. The issue also reached major technology companies, including IBM, Google, PayPal, Tencent, Microsoft, and Amazon Web Services. While Amazon denied being affected, Lasso was reportedly pressured by AWS's legal team to remove any mention of the company from its findings.

Private GitHub repositories that remained accessible through Copilot contained highly sensitive data. Cybercriminals and other threat actors could potentially manipulate the chatbot into revealing confidential information, including intellectual property, corporate data, access keys, and security tokens. Lasso alerted the organizations that were "severely" impacted by the breach, advising them to rotate or revoke any compromised security credentials.
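
For teams on the receiving end of that advice, the immediate practical question is which credentials were ever present in the exposed code. The sketch below is a deliberately minimal example, not Lasso's tooling and no substitute for a dedicated secret scanner (gitleaks, trufflehog, or GitHub's own secret scanning): it walks a local checkout and flags a few well-known token formats so they can be rotated. Anything that was once public may also survive in git history and external caches, so rotation, not deletion, is the fix.

```python
import os
import re

# Illustrative patterns only; real scanners ship hundreds of rules and also
# inspect the full git history, not just the current working tree.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "GitHub personal access token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_tree(root: str) -> None:
    """Walk a local checkout and print strings that look like credentials."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip git metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable file (permissions, broken symlink, etc.)
            for label, pattern in PATTERNS.items():
                for match in pattern.finditer(text):
                    # Print only a prefix so the report itself doesn't leak secrets.
                    print(f"{path}: possible {label}: {match.group()[:12]}...")

if __name__ == "__main__":
    scan_tree(".")  # run from the root of the checked-out repository
```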

The Israeli security team notified Microsoft about the breach in November 2024, but Redmond classified it as a "low-severity" issue. Microsoft described the caching problem as "acceptable behavior," though Bing removed cached search results related to the affected data in December 2024. However, Lasso warned that even after the cache was disabled, Copilot still retains the data within its AI model. The company has now published its research findings.

 
The issue isn’t the chatbots then, but the fact that repositories were accidentally (or purposefully) set to public… that’s on GitHub (or the specific users of GitHub), not the bots…
 
Maybe so in this instance, but how many times have companies decided to use their users' data for training a model with an opt-out consent process instead of opt-in? And regardless of opt-out or opt-in, unlike other kinds of data opt-outs, one cannot simply delete their data from a trained model by deciding to opt out later. One of the points in this instance is that companies affected by this mistake, briefly opening up their code, have to respond to it as an incident where not only has their data leaked, but the leak is not easy to contain. Indeed, models are much more user-friendly than scouring the dark web for leaked data, so you could imagine the average person asking models the right questions about well-known companies and finding things that otherwise would not be discoverable. It's not just about responsibility or who, if anyone, needs to do something about this; it's about implications. This is something that needs to be in the back of the mind of every security professional, and it's why so many companies, at least the ones that take security seriously, are cautious about using AI tools (or tools in general) for anything internal.
 
Many times… but that’s not the issue in this case.
 
Models being useful for querying data that is no longer publicly accessible elsewhere is literally the case here.
Yes… but it WAS publicly accessible… and therefore public domain… had they always been set to private, then the bots couldn’t have used the data.
 
Models being useful for querying data that is no longer publicly accessible elsewhere is literally the case here.

No it isn't. Either the users made a mistake and their stuff was never set to private in the first place or GitHub failed to properly execute their own settings. This is one hundred percent not the bots/scrapers.
 
Yes, the data was publicly available. Yes, it was the users' responsibility to ensure their data wasn't publicly available. Yes, it IS still the case that models are being used to access data that was once publicly available and otherwise would no longer be.

You are conflating responsibility with effect. As I said, the security industry and professionals working with sensitive data and code need to keep in mind that models are yet another way to get at this information. I'm not suggesting that models and bots need to be changed here, except for the scraping of private data.
 
Yes… but it WAS publicly accessible… and therefore public domain… had they always been set to private, then the bots couldn’t have used the data.
Yes, I'm not disputing that (except for the part about public domain, but that's beside the point). What I am saying is that focusing on the mistake here overshadows some of the lessons and implications that models have on the security industry, and why training models on private data has people and companies nervous.
 