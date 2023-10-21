A handful of major newspapers are in talks with OpenAI, the maker of ChatGPT, over access to a vital resource in the age of generative artificial intelligence: digital news stories.

For years, tech companies like OpenAI have freely used news stories to build data sets that teach their machines how to recognize and respond fluently to human queries about the world. But as the quest to develop cutting-edge AI models has grown increasingly frenzied, newspaper publishers and other data owners are demanding a share of the potentially huge market for generative artificial intelligence, which is projected to reach $1.3 trillion by 2032, according to Bloomberg Intelligence.

Since August, at least 535 news organizations -- including the New York Times, Reuters and The Washington Post -- have installed a blocker that prevents their content from being collected and used to train ChatGPT. Now, discussions are focused on paying publishers so the chatbot can surface links to individual news stories in its responses, a development that would benefit the newspapers in two ways: by providing direct payment and by potentially increasing traffic to their websites.

In July, OpenAI cut a deal to license content from The Associated Press as training data for its artificial intelligence models. The current talks have also addressed that idea but have concentrated more on showing stories in ChatGPT responses, according to two people familiar with the talks who spoke on the condition of anonymity to discuss sensitive matters.

Other sources of useful data are also looking for leverage. Reddit, the popular social message board, has met with top generative artificial intelligence companies about being paid for its data, according to a person familiar with the matter, speaking on the condition of anonymity to discuss private negotiations. If a deal can't be reached, Reddit is considering blocking search crawlers from Google and Bing, which would prevent the forum from being discovered in searches and reduce the number of visitors to the site. But the company believes the trade-off would be worth it, the person said, adding: "Reddit can survive without search."

And in April, Elon Musk began charging $42,000 for bulk access to posts on Twitter -- which previously had been free to researchers -- after he claimed that artificial intelligence companies had illegally used the data to train their models. (Musk has since rebranded Twitter as X.)

The moves mark a growing sense of urgency and uncertainty about who profits from online information. With generative artificial intelligence poised to transform how users interact with the internet, many publishers and other companies see fair payment for their data as an existential issue.

For example, a month after OpenAI launched GPT-4 in March, traffic to the coding community Stack Overflow declined by 15% as programmers turned to artificial intelligence for answers to their coding questions, according to Chief Executive Officer Prashanth Chandrasekar, who also told The Post he thought the artificial intelligence had been trained on Stack Overflow's data.

This week, the company laid off 28% of its staff.

In addition to demands for payment, leading artificial intelligence firms are facing a slew of copyright lawsuits from individual book authors, artists and software coders seeking damages for infringement, as well as a share of profits. Late Wednesday, former Arkansas Gov. Mike Huckabee joined the fray as a plaintiff in a class-action lawsuit against Meta, Microsoft and Bloomberg for allegedly using pirated books to train artificial intelligence systems, Reuters reported. Trade groups, meanwhile, are pushing lawmakers for the right to bargain collectively with tech companies.

OpenAI's decision to negotiate may reflect a desire to strike deals before courts have a chance to weigh in on whether tech companies have a clear legal obligation to license -- and pay for -- content, said James Grimmelmann, a professor of digital and information law at Cornell University, who recently helped organize a workshop on generative artificial intelligence and the law at the International Conference on Machine Learning.

An OpenAI spokesperson confirmed that the company is in talks with the newspapers and that discussions were not focused on prior training data, which it argues was obtained legally. "None of the company's practices have violated copyright law," the spokesperson said. "Any deal would be for future access to content that is otherwise inaccessible or display uses that go beyond fair use."

Nearly $16 billion in venture capital poured into generative artificial intelligence in the first three quarters of 2023, according to the analytics firm PitchBook -- a flood of cash that in part reflects how expensive the technology is to build. Every component is prohibitively pricey or hard to acquire, from hardware to computing power.

Until now, the only free and easy part had been the data. Widely used services like the nonprofit Common Crawl, which crawls the internet in search of troves of online text and archives the information for others to download, charge Google, Meta, OpenAI and others nothing. To assemble the vast quantities of natural language and specialized information needed to train large artificial intelligence systems, tech companies have combined those archives with online data sets, accessing information made available for research purposes and increasingly straying from information clearly in the public domain.