US tech giants OpenAI and Microsoft face lawsuit for copyright infringement; experts caution against risks of non-transparent data collection


Liu Caiyu

Global Times reporter on society, cybersecurity, diplomatic relations and topics related to Xinjiang.


Published: Dec 28, 2023 11:10 PM

Photo: VCG

The New York Times has filed a lawsuit against OpenAI and Microsoft, accusing them of copyright infringement. This has once again brought attention to the issue of non-transparent data collection by US tech companies in training their chatbots. Chinese industry observers are warning about the risks of this practice and comparing it to a "ticking time bomb" in terms of its potential impact on artificial intelligence governance.

The newspaper, reportedly the first major US media organization to sue OpenAI, lodged the lawsuit in a federal court in Manhattan. It alleges that OpenAI and Microsoft's large language models (LLMs), ChatGPT and Copilot, are capable of generating content that closely resembles that of the New York Times or of imitating its writing style when summarizing its content.

This approach undermines the relationship between the New York Times and its readers, and impairs the newspaper's ability to generate revenue from subscriptions, copyright licenses, advertisements, and other ancillary sources, the newspaper said.

The lawsuit does not specify the amount of compensation sought, but the New York Times believes that the defendants should be held responsible for billions of dollars in damages related to the illegal replication and use of its unique and valuable works. The newspaper also demands that the two tech companies destroy any AI models and training data that utilize its copyrighted materials.

Creators of those AI tools should be aware of the risk they have brought, and will bring, not only to the media industry but to entire innovative industries, said Li Zonghui, vice president of the Institute of Cyber and Artificial Intelligence Rule of Law affiliated with the Nanjing University of Aeronautics and Astronautics.

The nature of generative AI poses a significant risk of copyright infringement, as it requires collecting a wide range of existing materials, including texts, sounds, images, and other works, even those still protected by copyright, in order to effectively serve market demand, Li Zonghui told the Global Times.

When the results generated by AI do not reproduce the entirety or a substantial portion of a specific work, or merely use an algorithm to imitate the style of that work, they should be considered fair use. But the purpose of generative AI is clearly not just "self-learning," so such use may not constitute fair use, as it is ultimately intended for market promotion and commercial utilization, Li Zonghui noted.

Besides the lawsuit from the New York Times, OpenAI has also been mired in lawsuits filed by two famous authors, Paul Tremblay, author of "The Cabin at the End of the World," and Mona Awad, author of "Bunny" and "13 Ways of Looking at a Fat Girl," who alleged in June that their copyrighted books were used to train ChatGPT without their authorization, according to CNBC.

In July, Sarah Silverman, a comedian and author, along with authors Christopher Golden and Richard Kadrey, filed separate lawsuits against OpenAI and Meta, alleging that both companies had infringed their copyrights.

The Italian Data Protection Authority (GPDP) in March alleged that ChatGPT was unlawfully gathering user data and not effectively preventing minors from accessing inappropriate content, which resulted in OpenAI blocking ChatGPT in Italy.

Additionally, experts have cautioned that US tech companies such as OpenAI, Microsoft, and GitHub can expect to face an increase in lawsuits, with significantly higher penalties, when the EU AI Act, which is aimed at protecting creators and copyright holders, comes into effect, according to The Hill.

Li Baiyang, an assistant professor from the data management innovation research center of Nanjing University, highlighted the urgency of addressing copyright infringements, given that a wide range of parties may be involved. Scraping datasets from the internet without authorization can potentially infringe upon the rights of copyright holders, individuals who utilize the technology to produce original works, as well as other parties with such interests, the expert told the Global Times on Thursday.

The actions of American technology companies also go against the agreement reached by countries worldwide at the UK AI Summit to strengthen regulations for AI safety. It is crucial to address important concerns such as responsible data collection, safeguarding privacy, reducing bias, and being cautious about spreading false information; otherwise, the risks posed by these concerns would be like a "ticking time bomb" for future AI governance, Li Baiyang noted.

Clearly, copyright infringement related to AI highlights a significant crisis for the creative industry brought about by large model developers in all countries, experts said. It raises concerns about machines replacing human workers and the challenges of addressing the conflicting interests arising from this situation, according to Li Zonghui. He said that this dilemma is reminiscent of the early days of the industrial revolution, when workers protested and even attempted to destroy new machines in factories.

To address the concerns, Li Zonghui suggested that associations related to written works, music, and audiovisual works should proactively collaborate with AI companies to establish a comprehensive framework for copyright licensing agreements. Meanwhile, AI companies should also reach out to copyright holders before scraping data to prevent infringement risks.