Authors Sue NVIDIA for Training AI on Pirated Books

Starting last year, various rightsholders have filed lawsuits against companies that develop AI models.

The list of complainants includes record labels, book authors, visual artists, even the New York Times. These rightsholders all object to the presumed use of their work without proper compensation.

“Books3”

Many of the lawsuits filed by book authors come with a clear piracy angle. The cases allege that tech companies, including Meta, Microsoft, and OpenAI, used the controversial ‘Books3’ dataset to train their models.

Books3 was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. The dataset was broadly shared online and added to other databases including ‘The Pile‘, an AI training dataset compiled by EleutherAI.

After pushback from rightsholders and anti-piracy outfits, Books3 was taken offline over copyright concerns. However, for many of the companies that allegedly trained their AI models on it, there are still some legal repercussions to sort out.

Authors Sue NVIDIA for Copyright Infringement

On Friday, American authors Abdi Nazemian, Brian Keene, and Stewart O’Nan joined the barrage of legal action with a copyright infringement lawsuit against NVIDIA. The company, whose market cap exceeds $2 trillion, is mostly known for its GPUs and related software and services, but also has its own AI models.

In a concise class action complaint, filed at a California federal court, the authors allege that NVIDIA used the Books3 dataset to train its NeMo Megatron language models. The models are hosted on Hugging Face where it states that they are trained on EleutherAI’s ‘The Pile’ dataset, which includes the pirated books.

Putting two and two together, the plaintiffs conclude that NVIDIA’s models were trained on pirated books, including theirs, without their permission.

“NVIDIA has admitted training its NeMo Megatron models on a copy of The Pile dataset. Therefore, NVIDIA necessarily also trained its NeMo Megatron models on a copy of Books3, because Books3 is part of The Pile,” the complaint reads.

“Certain books written by Plaintiffs are part of Books3 — including the Infringed Works — and thus NVIDIA necessarily trained its NeMo Megatron models on one or more copies of the Infringed Works, thereby directly infringing the copyrights of the Plaintiffs.”

Direct Infringement Damages

Relying on the same logic, the authors accuse the company of direct copyright infringement, noting that NVIDIA copied their books to use them for AI training purposes. Through the lawsuit, the rightsholders demand compensation in the form of actual or statutory damages.

The class action lawsuit includes three authors thus far, but more may be added to the case as it progresses. NVIDIA has yet to respond to the allegations but in light of similar cases, it will likely oppose the claims and/or argue a fair-use defense.

Last month, OpenAI managed to ‘defeat’ several copyright infringement claims from book authors in a somewhat related “Books3” lawsuit. However, the California federal court didn’t review the direct copyright infringement claims in this case, which have yet to be argued in detail at a later stage.

—

A copy of the class action complaint against NVIDIA, filed by the authors in a California federal court, is available here (pdf)

From: TF, for the latest news on copyright battles, piracy and more.

Source : Authors Sue NVIDIA for Training AI on Pirated Books