OpenAI has announced a web crawler called GPTBot, whose job will be to scour the internet for public data to improve artificial intelligence (AI) offerings, specifically the ChatGPT maker's large language models GPT-4 and potentially GPT-5.
The name “web crawler” gives away the function: crawling the web.
Web crawler (or spider) bots scan the web for content. “Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results,” technology company Cloudflare explains.
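A well-behaved crawler checks a site's robots.txt file before fetching pages, and this is the mechanism sites use to admit or refuse bots such as GPTBot. The sketch below, using Python's standard-library `urllib.robotparser`, shows the check against a hypothetical robots.txt (the file contents and URLs here are illustrative, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide, allows all other bots.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler consults the parser before requesting a URL.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article")) # True
```

In a real crawler, `parser.set_url(".../robots.txt")` and `parser.read()` would fetch the live file instead of parsing an in-memory string.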
“Web pages crawled with the GPTBot user agent,” says OpenAI, “may potentially be used to improve future models.”
GPTBot will, however, steer clear of “sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”
“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI says.
The AI research and deployment company has given publishers and website owners the option to either fully opt out of GPTBot's crawling or allow partial access; OpenAI's documentation explains how.
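Per OpenAI's published guidance at the time, opting out is done through the site's robots.txt file, addressing the GPTBot user agent directly. A full block looks like this:

```
User-agent: GPTBot
Disallow: /
```

For partial access, OpenAI's documentation describes allowing or disallowing specific directories (the directory names below are placeholders):

```
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
```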
Although the option to opt out of web crawling by GPTBot is welcome and suggests a respect for privacy, it does put the onus of taking steps to disable access upon publishers and website owners.
Instead, an opt-in feature, where one is asked for permission, would have been more respectful.
OpenAI trains its machine learning models on public web data. This choice has led to questions of ethics and legality.
For one, consent for the reuse of information is absent. The source of the information isn't typically highlighted in an ordinary interaction with an AI-powered chatbot, nor is the user redirected to that source, so the source doesn't benefit.
In this scenario, a source of information is forced to compete with a platform that rechannels that same information while also serving as a one-stop shop for anything else a user might need, clearly handing the platform the advantage.
“Why would any producer of free online content let OpenAI scrape its material when that data will be used to train future LLMs that later compete with that creator by pulling users away from their site?” asks Alistair Barr, writing for Business Insider.
In addition, much of the content on the web is copyrighted.
OpenAI's free use of copyrighted material (text, images, sound, video, and so on) to improve its models and grow its revenue therefore becomes a contentious issue, and potential grounds for copyright-infringement claims.
Comedian Sarah Silverman sued OpenAI for copyright infringement in July, one of several authors to have raised legal objections.
On the other hand, OpenAI and the Associated Press struck a deal in July under which the ChatGPT maker will license the New York-based news agency's archive of news stories.