One of the many contentious issues with generative AI systems such as ChatGPT and Bard is the way they scrape and use data. The information might be publicly available, but that doesn't stop the plagiarism and privacy concerns, not to mention the possibility of the AI misinterpreting what was said or offering up outdated answers. Even Google has warned employees to be cautious when using chatbots like its own Bard, as they can make undesired code suggestions.
There's also a question of whether this sort of data scraping is even legal. ChatGPT creator OpenAI is facing lawsuits over accusations that it collected personal information from internet users illegally and used the data to create its products.
OpenAI is also dealing with a lawsuit over copyright infringement and privacy violations relating to claims that it used copyrighted books without permission to train its AI systems. The company allegedly copied text from these titles unlawfully by not obtaining consent from the copyright holders and not giving them credit or compensation.
To address extreme levels of data scraping & system manipulation, we've applied the following temporary limits:
- Verified accounts are limited to reading 6000 posts/day
- Unverified accounts to 600 posts/day
- New unverified accounts to 300/day
– Elon Musk (@elonmusk) July 1, 2023
Data scraping seems to be an especially vexing subject for Elon Musk. Over the weekend, Twitter temporarily limited the number of tweets accounts could read per day, saying the move was meant to address "extreme levels" of data scraping and "system manipulation" on the platform, though not everyone agrees this was the reason for the limitation.
Reddit has also faced a slew of troubles since turning off free access to its APIs to stop data harvesting. The move resulted in more than 8,000 subreddits going dark in protest, with some switching to NSFW.