Author: Iain Niven-Bowling, Chief Technology Officer
Date published: March 14, 2025

In the generative AI boom, one thing has become crystal clear: those with data hold the power. AI companies are in an arms race for quality data to fuel their models, while data providers are pushing back to protect and profit from their crown jewels. This post explores why AI craves data, the legal battles and landmark deals igniting over it, and how Harbr Data offers enterprises a way to share data safely without losing long-term value.

AI companies are desperate for high-quality data

The success of large language models (LLMs) hinges on the quantity and quality of their training data. The more diverse and rich the dataset, the smarter the AI. Early LLMs gorged on internet text: OpenAI’s GPT-3, for example, was fed 410 billion tokens of text from the public web crawl (Common Crawl), which made up 60% of its training data. But this well of free internet text is not infinite — research suggests AI could use up the supply of high-quality human-written text on the open web as soon as 2026. In other words, the “free lunch” of public data is ending, and AI developers are feeling the squeeze. They need fresh, reliable, domain-specific data (news articles, proprietary research, social media, code, you name it) to improve and stay competitive. It’s become an all-out scramble to secure that data, because better data means better AI — and better AI is a competitive edge.

Training data wars: Legal battles erupt

Naturally, content owners aren’t exactly thrilled about AI firms freely helping themselves to data. We’re now seeing high-profile showdowns between content owners and AI developers. A prime example: Getty Images vs. Stability AI. In early 2023, Getty sued Stability AI, accusing the startup of copying millions of Getty’s licensed photos without permission to train the Stable Diffusion image generator. Getty is seeking up to $1.7 billion in damages for this alleged misuse, underscoring how serious the stakes are. In the literary world, comedian Sarah Silverman and other authors filed class-action lawsuits claiming OpenAI unlawfully scraped their books to train ChatGPT. They argue their copyrighted text was ingested without consent — essentially, that the AI “learned” from their work without paying for it.

Even social media companies are fighting back. Reddit and X (formerly Twitter), home to vast troves of user-generated content, have put up walls against unrestricted scraping. Reddit’s CEO declared that Reddit content has value and shouldn’t be free fodder for AI training, leading the platform to tighten API access and start charging fees. Over at X, Elon Musk imposed aggressive rate limits to block data scraping and even filed lawsuits against entities mining Twitter data without authorization. The message from data owners is clear: unauthorized use of our data will not go unchallenged. These legal battles highlight a growing tension — as AI models vacuum up data, content owners are asserting their rights and fighting for compensation and control.

Striking deals: (Previously) unlikely partnerships for data

Alongside the conflicts, we’re seeing groundbreaking partnerships where AI companies and data providers strike mutually beneficial deals. Rather than litigate, some data owners are choosing to collaborate — for a price. A notable case is the Associated Press (AP) partnering with OpenAI. In 2023, AP agreed to license part of its news archive to OpenAI, giving the AI access to decades of vetted journalism. In return, AP gains access to OpenAI’s technology and expertise, and presumably compensation, marking one of the first major news-and-AI licensing pacts. This trend continued as media giant News Corp (owner of The Wall Street Journal, The Times, etc.) signed a deal with OpenAI to share content. The partnership, reportedly valued at over $250 million, grants OpenAI access to News Corp’s current and archival articles — a huge trove of high-quality text data.

It’s not just news media. Shutterstock, a leading stock image provider, embraced AI by expanding its partnership with OpenAI. The company signed a six-year deal to supply high-quality images and metadata for training AI models. Rather than suing over copyright, Shutterstock chose to profit by being an official data source for generative AI. They even set up a fund to compensate photographers and artists whose works contribute to these models, ensuring creators get a cut of the AI boom.

From resistance to collaboration: the Reddit pivot

And remember Reddit’s stance? It eventually led to a collaboration — OpenAI and Reddit struck a partnership to give ChatGPT access to Reddit’s real-time content through Reddit’s API. This gives OpenAI a firehose of the latest discussions and knowledge on Reddit (a goldmine of human conversation), while Reddit presumably gets fees or strategic support.

These deals show a possible path forward: controlled sharing and data monetization instead of scraping. By negotiating licenses, AI developers get the high-quality data they crave, and data owners get paid while maintaining control. It’s a shift from a “wild west” of data to a structured data economy.

Enter Harbr Data: Governed data sharing and monetization

While legal battles and partnership deals grab headlines, a more systematic solution has emerged. Harbr Data offers a platform that addresses the fundamental challenge at the heart of the AI data dilemma: how to share valuable data assets without surrendering control. The company wasn't born from the AI boom — it's been developing enterprise data marketplace technology since 2017. 

Acting as a private data marketplace, the platform lets data providers share any type of data asset with partners (or internally) without losing control or custody. Every data asset carries granular permissions, usage policies, and tracking. This enables any enterprise to make its data available to an AI company in a range of ways, ensuring compliance with licensing terms and preventing misuse. It gives enterprises the flexibility to work with AI companies on their own terms, from creating secure testing environments to establishing automated delivery pipelines.
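To make the idea of granular permissions and usage policies concrete, here is a minimal, hypothetical sketch of how a governed data asset might be modeled. All names and the record shape are illustrative assumptions, not Harbr’s actual API: the point is simply that every access request is checked against a policy and logged.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical model of a governed data asset. Names are illustrative,
# not Harbr's actual API.
@dataclass
class UsagePolicy:
    allowed_uses: set        # e.g. {"model-training", "evaluation"}
    licensed_parties: set    # organizations permitted to access the asset
    expires: datetime        # license end date (timezone-aware)

@dataclass
class DataAsset:
    name: str
    policy: UsagePolicy
    access_log: list = field(default_factory=list)

    def request_access(self, party: str, purpose: str) -> bool:
        """Grant access only if the party, purpose, and license term all
        match the policy; every request (granted or denied) is recorded."""
        now = datetime.now(timezone.utc)
        granted = (
            party in self.policy.licensed_parties
            and purpose in self.policy.allowed_uses
            and now < self.policy.expires
        )
        self.access_log.append({"party": party, "purpose": purpose,
                                "granted": granted, "at": now.isoformat()})
        return granted

# Example: a licensed AI lab may train on the asset, but not resell it.
asset = DataAsset(
    name="news-archive-2020s",
    policy=UsagePolicy(
        allowed_uses={"model-training"},
        licensed_parties={"acme-ai"},
        expires=datetime(2030, 1, 1, tzinfo=timezone.utc),
    ),
)
ok = asset.request_access("acme-ai", "model-training")  # granted
denied = asset.request_access("acme-ai", "resale")      # denied: not a licensed use
```

In a real platform the policy check would sit in front of the data itself (query gateways, clean rooms, delivery pipelines), but even this toy version shows the shift from "copy the file and hope" to enforceable, auditable terms.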

Harbr’s data marketplace platform was built for this new era of data collaboration. It enables governed access and use of data assets so both sides — providers and consumers — can work together with confidence. For data providers, it transforms what was once a binary choice — give away your data or lock it away — into a spectrum of controlled collaboration options that preserve long-term value. On the consumption side, AI companies are presented with high-quality data assets within a governance framework that lets them test and develop their models on their own or collaboratively. By embedding governance into the technical infrastructure itself, Harbr makes data partnerships sustainable rather than extractive.

Implications for enterprises in the AI era

What does this all mean for businesses and enterprises leveraging AI?

Your data is a strategic asset

Firstly, it’s another reminder that your data can be an extremely valuable asset. If you have large or unique datasets (be it customer behavior data, industry-specific knowledge, sensor data, etc.), AI companies will likely want them. Enterprises should expect more opportunities either to license their data for model training or to collaborate with AI providers — but they must do so carefully. The legal battles show the risk of operating without a clear strategy; equally clear is the risk of eroding your data’s long-term value by giving it away on the wrong terms. Companies need to ensure that if they use third-party data to train AI, they have the rights to do so (to avoid lawsuits down the line). Conversely, if others want to use your data, you’ll need the proper safeguards in place, and a technology like Harbr Data to enforce them.

Governance is non-negotiable

Secondly, for enterprises building or using AI models internally, governance is key. You might train models on your own proprietary data — Harbr can help different departments safely share data with your data science teams. Or you might enhance your models with external data — using a platform approach can ensure you respect usage policies and maintain an audit trail of what data influenced your AI (critical for compliance and even for ethical AI concerns).
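An audit trail of what data influenced a model can be as simple as a provenance record per dataset: a fingerprint of the content plus the licensing context under which it was used. The sketch below is a generic illustration under assumed names and record shapes, not a specific platform’s schema.

```python
import hashlib
from datetime import datetime, timezone

# Illustrative provenance record for training-data auditing. The field
# names and helper are assumptions, not a particular vendor's schema.
def provenance_record(dataset_name: str, content: bytes,
                      license_ref: str, used_for: str) -> dict:
    """Fingerprint a dataset and capture the license it was used under,
    so you can later answer: what data influenced this model, and were
    we allowed to use it?"""
    return {
        "dataset": dataset_name,
        "sha256": hashlib.sha256(content).hexdigest(),
        "license": license_ref,
        "used_for": used_for,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# One record per dataset that feeds a training run.
trail = [
    provenance_record("internal-support-tickets", b"<dataset bytes>",
                      "internal-use-only", "fine-tuning"),
    provenance_record("licensed-news-archive", b"<dataset bytes>",
                      "vendor-license-2025-017", "pre-training"),
]
```

Kept alongside each model version, a trail like this turns "we think we had the rights" into a checkable record for compliance reviews and ethical-AI audits.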

Gaining a competitive edge through proactive data management

Finally, as AI development shifts from a data free-for-all to a more regulated, collaborative model, enterprises that proactively manage their data sharing will have an edge. They can turn what was once a guarded resource into a strategic asset — fueling AI innovations, forming partnerships, or even creating new revenue streams — all while staying on the right side of ethics and law. In a world where AI loves data, Harbr Data helps ensure that love doesn’t break any hearts (or laws).