Author: Anthony Cosgrove, Co-Founder
Date published: 16.12.2024

You don’t need me to tell you that everyone is talking about AI right now. You’ll also be aware of the constant stream of “How to use AI to do X” posts.

So I’ll keep it simple: the biggest obstacle firms face is gaining rapid access to good-quality data, whether it comes from an internal or external source. With recent developments in AI so visible, efficient data access is increasingly pivotal for companies that want to leverage it. By overcoming the challenges of data access, organizations have a chance of unlocking the full potential of their AI initiatives. Without it, they’ll likely move too slowly, or carry too much embedded cost, to generate a meaningful return on investment.

Data acquisition and preparation

The first, and arguably largest, obstacle for most companies is acquiring the necessary data for training and testing AI models. This will likely involve pulling data from various internal and external sources, ensuring that the data is of high quality, and addressing any legal, security, and privacy concerns. Companies will also need to adapt the data, both to evaluate whether it meets their needs and to actually use it, such as labeling the data to provide supervised learning signals for their models. Therefore, they’ll need a strong understanding of the format and structure of the data, and the ability to transform it and maintain that new version on an ongoing basis. This process of discovery, evaluation, customization, and integration will also vary depending on whether you’re accessing internal or external data.
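To make that concrete, here’s a minimal sketch of the kind of preparation step described above: loading a raw extract, standardizing its structure, and deriving a label for supervised learning. The file names, columns, and labeling rule are all hypothetical; the point is that this transformation has to be defined once and then maintained as the source data changes.

```python
import pandas as pd

# Hypothetical raw extract; the column names and labeling rule below are
# illustrative, not taken from any real dataset.
raw = pd.read_csv("customer_transactions.csv")

# Standardize structure: consistent column names and types.
prepared = (
    raw.rename(columns={"CustID": "customer_id", "TxnDate": "transaction_date"})
       .assign(transaction_date=lambda df: pd.to_datetime(df["transaction_date"]))
       .dropna(subset=["customer_id", "amount"])
)

# Derive a supervised learning label, e.g. flagging high-value transactions.
prepared["high_value"] = (prepared["amount"] > 10_000).astype(int)

# Persist the transformed version so it can be maintained and re-run
# whenever the source data is refreshed.
prepared.to_parquet("customer_transactions_prepared.parquet", index=False)
```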

Internal data sources

If you’re looking for internal data, the options typically range from “ask the person most likely to have access to what you want” to “check the data catalog.” In the former case, it really comes down to luck and leveraging relationships within your company, which is neither a strategy nor scalable. In the latter, you may get a view of what data is available, but you’re unlikely to know if it’s useful, or have the ability to quickly access and transform it. The point is, the status quo doesn’t provide a quick and easy way to access data that will scale with your AI ambitions. It also means that the same manual processes are repeated every time anyone needs data.

An emergent solution to this is the idea of the internal data marketplace, such as the one that Aboitiz Data Innovation built on Harbr. With this approach, users can discover and gain controlled access to a variety of data assets that span on-premises systems, cloud storage, and data lakes/warehouses. The idea here isn’t to just dump all the data in one place. It’s to make sure that the right people can access the data they need, wherever it is, in a self-service manner. I’m probably biased here, but a data marketplace is a practical, value-led solution to the age-old problem of how to get value from data silos.

External data sources

To acquire data outside of your company, you have a few options: major data providers like Moody’s Analytics or Bloomberg, public data storefronts like Datarade, public data marketplaces like those from AWS or Snowflake, or matchmaking services like Nomad Data. Side note: I had a good chat with Nomad’s founder, Brad Schneider, about some of the specific challenges of finding the right data. Read it here.

Data providers operate a variety of models as part of a multi-channel approach to data commerce. The problem for the data consumer is that, while many of these technologies help with discovery and integration, it can remain difficult to evaluate and customize the data. This is further exacerbated when the consumer wants to use it to feed an AI model, as providers are even more reluctant to let a consumer trial data because they may obtain the value without buying it. To overcome this, many data providers are increasingly developing a strong core channel – what we call a private data marketplace – with embedded sandbox environments. A perfect example is Moody’s DataHub where, via a secure login, customers can access a storefront containing Moody’s data products. Access and visibility can be controlled per organization or user, meaning that the right people will see the right data products in a self-service manner. This addresses the first challenge of buying data: discovery.

Next, users will need to evaluate the data to see if it’s fit for purpose. This is obviously a massive step, but one that persistently causes issues for both data providers and consumers. Again, a private data marketplace can solve this problem by making the trial and evaluation stages effortless and self-service. For AI-related use cases this is even more important, because data providers are increasingly reluctant to provide trial access due to the risk of value being realized during the evaluation without the data ever being purchased. Accessing samples in a secure, cloud-based sandbox means data providers don’t run that risk, so they can provide full-volume samples and data consumers can really understand the value proposition. Without this understanding, customers won’t have confidence that the data will work in their model.

Every data user has slightly different needs, so data products are rarely one-size-fits-all. In almost all cases there is a level of customization that needs to occur, and when pushing data into a model development process this is even more pronounced. Customization can also be tricky because the data producer knows the data and the data consumer knows the use case, so it’s important that they can work seamlessly together to deliver data that’s ready to use. A great example of a vendor solving this problem is CoreLogic, whose Discovery Platform, powered by Harbr, lets them work directly with their customers to deliver customized data products at scale.

The final aspect that private data marketplaces solve for is integration. Again, user needs vary, so it’s not unusual for integration requirements to be a factor in the speed and viability of external data acquisition. With DataHub, Moody’s Analytics can deliver data through a number of self-service methods to all major clouds, as well as on-premises via SFTP and downloads, avoiding the time-intensive process of coordinating data transfers. When thinking about integrating data into AI models, automated distribution is a major advantage, as updates to the data can automatically be fed into the model on a scheduled or event-driven basis.
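As a rough illustration of that last point, here’s a minimal sketch of a scheduled refresh job, assuming data products are delivered to a cloud storage bucket. The bucket, prefix, and downstream hook are hypothetical; the idea is simply that new deliveries can flow into the model pipeline without anyone coordinating transfers by hand.

```python
import boto3

# Hypothetical delivery location; in practice this is wherever the
# marketplace distributes data products (S3, Azure, GCS, SFTP, etc.).
BUCKET = "example-data-deliveries"
PREFIX = "vendor-feed/"

s3 = boto3.client("s3")


def fetch_new_deliveries(last_seen_key=None):
    """List data product files delivered since the last run."""
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    keys = sorted(obj["Key"] for obj in response.get("Contents", []))
    if last_seen_key is None:
        return keys
    return [k for k in keys if k > last_seen_key]


def refresh_model_inputs(keys):
    """Download new files and hand off to the feature/training pipeline."""
    for key in keys:
        s3.download_file(BUCKET, key, key.split("/")[-1])
    # ...trigger feature rebuilds or retraining in your own pipeline here...


# Run on a schedule (cron, Airflow, etc.) or wire it to an event
# notification so updates feed the model automatically.
new_keys = fetch_new_deliveries()
if new_keys:
    refresh_model_inputs(new_keys)
```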

Model development and training

Once the data is integrated, the next step is to develop and train the AI models. This involves selecting appropriate algorithms, architectures, and frameworks based on the problem you’re trying to solve. Training AI models can be computationally intensive and traditionally requires access to high-performance hardware or cloud infrastructure. The good news is that you might not need to build your own model.

Sebastien Krier, the UK Government’s former Head of Regulation at the Office for Artificial Intelligence, says that “You don’t necessarily have to build your own model from scratch. You can fine-tune an existing AI model through an API by adding your own data and other tweaks.”
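To give a sense of what that looks like in practice, here’s a minimal sketch of fine-tuning a hosted model through an API, using OpenAI’s fine-tuning endpoint as one example; other providers offer similar workflows. The file name and base model are illustrative, and availability varies by provider and account.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# Your own data, formatted as JSONL chat examples, one per line, e.g.:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("my_training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a fine-tuning job on an existing base model (name is illustrative).
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)

print(job.id, job.status)
```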

Regardless of how you build and train your AI models, having a mechanism where the people feeding those models can easily acquire the right data is super important, and a private data marketplace is arguably the best way to achieve that.