Why is the media industry increasingly wary of AI’s copy-paste ‘journalism’?
A slew of lawsuits against companies developing chatbots is sowing the seeds of regulation for the burgeoning sector. But is it feasible?
The Washington Independent used to be a proper online news outlet that employed full-time journalists before it folded more than a decade ago.
Now, it’s returned from the dead, with regular news coverage and articles attributed to professional-sounding journalists, complete with introductory notes and contact numbers against their names.
However, as pointed out by Pulitzer Prize-winning journalist Spencer Ackerman – who worked for the Washington Independent during its first run – the reincarnated news platform merely regurgitates recently published, article-like content that does not appear to have been produced by human beings.
In other words, whoever runs the once highly respected news website relies on generative artificial intelligence, or GenAI, to produce content that would otherwise be copyrightable had it been created by a human being.
Some of the “new” articles that the website is churning out look a lot like what actual journalists like Ackerman produced in the past.
“Now I have to worry that I have inadvertently trained my robotic replacement,” Ackerman writes.
Ackerman isn’t some lone paranoid writer worried about the impending takeover of GenAI, a technology that can create text – as well as images, videos and music – in response to simple prompts.
He’s one of the journalists, nonfiction writers and newspaper publishers actively campaigning against the “unauthorised” use of GenAI, which, they say, poses an existential threat to their livelihoods.
Within the news industry, GenAI manifests itself in the form of chatbots, which are computer programmes that simulate real-life conversations with human beings.
These chatbots are trained on hundreds of millions of pre-existing articles on the internet and have become efficient enough – in theory, at least – to compete with proper news outlets as a source of information.
The problem has gained such immediacy that The New York Times recently sued the companies that have created popular GenAI platforms like ChatGPT for misusing its copyrighted written works.
It claims that these companies used its articles to train their automated chatbots, which are now competing with the newspaper as a source of reliable information for the general public.
The issue at the root of the tussle between GenAI firms and content producers like The New York Times is whether using copyrighted material to train chatbots falls within or outside the scope of fair use.
“The question is whether AI systems can copy copyrighted material without compensation and for their own profit. We do not think the US copyright law allows it,” Justin A. Nelson, partner at law firm Susman Godfrey LLP, tells TRT World.
“Permission to use creative works is at the heart of any fair licensing system,” says Nelson, whose law firm has filed a class action suit on behalf of several nonfiction authors against OpenAI and Microsoft for allegedly training chatbots on their copyrighted materials without permission.
On their part, firms like Microsoft, which has invested heavily in OpenAI, the maker of ChatGPT, and offers its own GenAI assistant, Copilot, assert that the doctrine of fair use permits the use of copyrighted materials to train AI models.
In comments submitted before the director of the US Copyright Office of the Library of Congress late last year, the software company held that making an “intermediate copy” of a work to train an AI model is “completely different” from copying an expressive work to communicate the copyright holder’s original expression.
“Just as humans are permitted to learn from the ideas, concepts, and patterns in copyrighted materials, copyright law has for decades recognised that fair use principles allow intermediate copying and use of copyrighted materials for the purpose of learning and creating new, transformative works,” it says.
University of Texas School of Law Professor Oren Bracha agrees.
He says the argument that using copyrighted works for training chatbots is itself copyright infringement – irrespective of any similarity with the generated output – is wrong as a matter of both law and information policy.
As a matter of law, copies of copyrighted works made only for extracting information without exposing anyone to the expression of the works do not infringe, he tells TRT World.
“As a matter of information policy, if mere use in training were infringing, the burden on AI development would be very heavy. Copyright is everywhere, AI training must use many works, and licensing would be prohibitively expensive,” he adds.
However, The New York Times insists that many GenAI companies have lifted a “massive amount of content” from its website to create tools that mimic, closely paraphrase, and copy its work.
The newspaper says a “stunning 1.2 percent” of the dataset used to train OpenAI’s GPT-2 came from its content. It was also the fourth-largest source in Google’s C4 dataset, which powers GenAI products like Google’s Bard and its Search Generative Experience.
Lawsuits galore
Bracha says the situation becomes “very different” when a GenAI system trained on copyrighted works produces output that’s “very similar to a specific work” – something that’s relevant to The New York Times litigation.
In those instances, he says, there’s a “very strong case” that the generated output infringes copyright. But he warns that even if the output is indeed infringing, it does not automatically mean every actor involved in every stage of the training should be held responsible for it.
“This is a further complicated question that courts will have to wrestle with,” Bracha says.
The courts are indeed getting inundated with cases brought by content producers against GenAI firms. News organisations The Intercept, Raw Story and AlterNet have sued OpenAI, accusing it of misusing their articles to train its popular chatbot ChatGPT.
News agency Thomson Reuters has also gone to court, against legal tech startup Ross Intelligence, which allegedly used the agency’s copyrighted content to train its AI model without permission.
Getty Images sued Stability AI, while top fiction writers, including George R. R. Martin and John Grisham, went to court against OpenAI. Universal Music has initiated legal proceedings against Anthropic, an AI firm founded by former members of OpenAI, for allegedly using the label’s copyrighted lyrics to train its AI model without permission.
Separately, the lawyer duo of Joseph Saveri and Matthew Butterick has filed lawsuits against GitHub Copilot, an AI coding assistant built by Microsoft, and, on behalf of book authors, against OpenAI’s ChatGPT and Meta’s LLaMA. The pair has also filed initial complaints against Stability AI, DeviantArt and Midjourney on behalf of three artist plaintiffs.
As many as 13 new copyright-related lawsuits were filed against AI companies in 2023 alone. The Thomson Reuters versus Ross Intelligence case is expected to go to trial soon and is likely to be the first precedent-setting case in this field.
The way forward: licensing
Court cases take years to resolve, giving GenAI bots more time and more data on which to train. Questions abound as to what will happen if AI firms continue to have their way indefinitely. Will the consequences for original content producers be as disastrous as they claim?
Alternatively, what does the world stand to lose if the courts err on the side of overregulation, inadvertently killing the up-and-coming business model of GenAI firms?
Emory University Law Professor Matthew Sag doesn’t believe that AI firms pose a threat to traditional content providers.
“I think we are increasingly going to see licensing transactions for large collections of high-quality and current data. We are already seeing this market equilibrium working itself out, and I think now is not the time for knee-jerk regulation,” he tells TRT World.
Acknowledging that GenAI will have “significant disruptive effects” on some white-collar and creative workers, Sag says the likes of The New York Times will be “just fine”.
The US Copyright Office just needs to develop a standard that makes sense as a matter of copyright doctrine and that it can administer at scale, he adds.
Professor Bracha of the University of Texas says the copyright-related costs of securing AI’s social benefits can be reduced by tweaking the relevant law.
For example, there could be a scheme of streamlined statutory licences, in which the legislature or an administrative body sets a standard licence for a standard fee, letting AI firms use vast datasets to train their bots.
Similarly, there can be an opt-out rule, which allows AI developers to use works unless the copyright owner actively objects, he says.
There’s been some progress in this regard. Google recently signed a deal with social news aggregation platform Reddit, reportedly worth $60 million a year, for real-time access to the platform’s content for AI training.
Similarly, media house Axel Springer, which owns journalism outlets like Politico and Business Insider, has announced a formal partnership with OpenAI. As a result, ChatGPT’s answers to queries by general users will now show links to the full articles produced by Axel Springer’s media brands.
Even The New York Times has shown a willingness to enter into “negotiated, fair agreements that permit and govern the use” of its content by GenAI firms.
“We ask the Copyright Office to help protect journalism and shape the development of this technology through guidance and legislative recommendations,” the newspaper said in comments submitted as part of the ongoing study on artificial intelligence.