Sun, Nov 24, 2024
In December 2023, the New York Times filed a suit against OpenAI, the company behind ChatGPT, and Microsoft, the largest investor in the Sam Altman-led firm. The NYT's principal grievance is that OpenAI used its copyrighted material without authorisation to train its large language models (LLMs) and create a product that directly competes with it.
The NYT wants to hold OpenAI and Microsoft liable for billions of dollars in damages. It has even asked for the destruction of ChatGPT and of other LLMs and datasets that incorporate its works. While the dispute may seem unprecedented, it resembles the stand-off between legacy media and digital platforms in key ways.
First, the NYT suit seems like an attempt to extract a payout for failing to innovate technologically. The NYT has made investments in the last few years, but never in any substantial technological innovation. In 2016, it purchased Wirecutter, a product-recommendation service that relies on rigorous testing and research by reporters. In 2022, it purchased The Athletic, a digital sports journalism company.
As both these properties are digitally native, some commentators call the NYT a tech company. But this is an inaccurate characterisation, because both acquisitions are extensions of its core product: subscription-based, human-driven information services. The only tech investment the NYT has made was Wordle, a digital word puzzle. By contrast, generative AI like ChatGPT offers an innovative avenue for monetising information.
The NYT's inability to compete, and its treatment of ChatGPT as a threat, mirror the earlier stand-off between news publishers and social media platforms. Until the systematisation of information online by search engines, e-commerce websites and social media, legacy media enjoyed a virtual monopoly over the creation, distribution, and monetisation of information.
Most media revenue came on the back of advertisements. A newspaper or television channel with a large subscriber base would offer an advertiser space in its pages or a time slot on air for a negotiated sum of money. When digital platforms emerged, they gave advertisers an opportunity to reach many more consumers, in a targeted manner. Advertising online was also more cost-effective than in newspapers or on TV.
Advertisers could now leverage search terms or user networks, where people told millions of their “friends” and followers to buy products. Ads became more organic. Instead of disrupting media consumption, like a jarring ad for men’s underwear in the middle of your favourite TV show, they became part of it.
Unless legacy media organisations innovated and invested in digital distribution, there was no way for them to prevent new platforms from eating their lunch, so to speak. Instead, many such organisations filed lawsuits against digital platforms, alleging that the latter were unfairly profiting off the news and information created by the former.
Governments responded to these legal proceedings by introducing laws that created something akin to a royalty stream for legacy media. Illustratively, Canada’s Online News Act, 2023, aims to ensure that large online platforms remunerate news organisations when their content is used on these services. Platforms responded by taking news off their feeds to avoid making payouts: after the Canadian law was enacted, Meta restricted the availability of news content for Facebook and Instagram users in the country.
Second, just as social media platforms may have relied on news links to attract users, OpenAI relied, in part, on the work of media organisations like the NYT to get where it is. And just like news publishers, who initially saw the wide distribution afforded by platforms as a boon for their business, the NYT was slow to act to safeguard its content.
NYT content is the most highly represented proprietary source in Common Crawl, a free public dataset that OpenAI relied on to develop ChatGPT. The preponderance of its content in ChatGPT’s training datasets is just one of the reasons the NYT is suing OpenAI.
The contents of Common Crawl, as well as other datasets used by OpenAI, have been public knowledge for some time. Illustratively, OpenAI researchers published a paper on fine-tuning language models in 2020 that discusses using WebText, another dataset that heavily features NYT content. Yet it was only in July 2023 that the NYT changed its terms of service to prohibit the use of its content to train AI models.
Third, the NYT-OpenAI legal battle presents the NYT as an institution that serves the public interest, while ChatGPT, much like social media, is portrayed as a purveyor of misinformation. However, the moral high ground the NYT has taken is shaky because, ultimately, it is also a business.
In 2011, it erected a paywall to shore up its declining revenues. In 2017, it reorganised its copy desk, eliminating the jobs of dozens of copy editors. Will it be able to preserve its moral high ground if it lets go of other staffers because it is cheaper to automate their roles?
The broader lesson for the NYT, and for any business challenged by the march of generative AI, is that innovation always wins, no matter the rules. Just as social media deprioritised news links, generative AI companies can move away from news content to other sources of data. In that case, all that actors like the NYT can expect is a one-time settlement fee. Even the deletion of ChatGPT would not mean the deletion of all generative AI.
Resisting innovation on normative grounds may delay its progress, but legacy thinking ultimately perishes. The business world has seen countless such cycles in the past, and not only in the media. The larger question, therefore, is: what are the NYT and other legacy media organisations doing to adapt to the inevitability of AI?
(Meghna Bal is Head of Research at Esya Centre, a think tank focused on technology policy. Views expressed are personal)