The Data Pipeline is the New Secret Sauce


Why Data Pipelines and Inference Are AI Infrastructure’s Biggest Challenges

While there’s still great excitement around AI and machine learning (ML), we’re starting to see some level of organizational maturity as enterprises work to stand up programs internally. About 40% of enterprises surveyed report they have either actively deployed an AI program or are currently exploring one. This means enterprises are now grappling with real-world challenges like building products on top of AI, aligning disparate teams around their programs, and even going to market.

More pointedly, foundational challenges in building and running the infrastructure needed to underpin such programs are beginning to emerge. The ideal future for enterprises is the ability to run successful programs powered by a large language model (LLM) that provides a significant competitive advantage over programs powered by off-the-shelf, open-weight models and the white-labeled OEM dataset that ships with them. Generally speaking, typing a generic prompt into a generic model will only ever produce a generic, low-value response.

Instead, the ideal future for a best-in-class program involves having the appropriate model (or models) humming away in production, running in an affordable, easy-to-manage configuration, and continually running jobs using a pristine, secure, first-party dataset. But getting to our perfect picture is much easier said than done.

As far as I can tell, this is the core of the modern infrastructure problem for enterprises:

The biggest emerging challenge is building and operating two kinds of infrastructure: the data pipelines needed to build, manage, and maintain a robust, secure body of proprietary data for training, fine-tuning, and orchestrating LLM operations; and inference, the actual process of running models against input data.

Let’s unpack the situation.

The Data Pipeline is the New Secret Sauce

The enterprise data pipeline is fundamentally different from what startups can build because of the organizational, regulatory, and customer challenges enterprises face. Data pipelines that work under those constraints represent a competitive advantage for their organizations. They are also, by necessity, slow to evolve, because the risks posed by those very same constraints are commensurate with their value. The similarities between this situation and the emergence of DevOps are one-to-one: these are precisely the conditions I observed when I was at Amazon, watching other organizations try to survive and scale.

What I think is different, and this is the important part, is that in the early days of DevOps, everyone was still on a journey toward CI/CD. Today, with the data pipeline, we are starting from that point.

The Data “DevOps Moment”

The ability to continuously create and improve high-quality, first-party training datasets, and then to develop and fine-tune models on them, is a strategic advantage.

At Data Council this year, I said that data pipelines are having a “DevOps” moment, starting with a cultural and technical shift toward continuous integration/continuous delivery (CI/CD). This shift is now accelerating as AI programs in production run data jobs through their models, producing outputs of increasing quality and power. (I also said that the only reason this hadn’t happened before is that data people seem to be more introverted than we were at the beginning of the DevOps moment.)

Creating and maintaining systems this complex requires a dedicated team and resources; Gartner reports that 87% of “mature organizations” have dedicated AI teams. These projects aren’t accidental, and they cannot be bought off the shelf. Building out a functional data pipeline for AI programs is a valuable advantage now, and it will become more so every day. Each enterprise’s internal dataset is an artifact: the end result of a series of important data management processes run through a complicated data toolchain. Creating this data pipeline is itself a massive barrier that requires significant operational effectiveness, combined with security and privacy practices that prevent any personally identifiable information (PII) from “leaking.”
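
To make that concrete, here is a minimal, illustrative sketch of the kind of CI-style gate a data pipeline might run before new records land in a training corpus. The regex patterns and helper names are hypothetical simplifications; real pipelines typically rely on dedicated PII-detection tooling rather than a handful of regular expressions.

import re

# Illustrative patterns only; production pipelines use dedicated PII-detection tools.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_record(text: str) -> tuple[str, list[str]]:
    """Redact obvious PII and report which pattern types were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text, found

def validate_batch(records: list[str]) -> list[str]:
    """CI-style gate: scrub every record and report how many needed redaction."""
    clean, flagged = [], 0
    for record in records:
        scrubbed, found = scrub_record(record)
        flagged += bool(found)
        clean.append(scrubbed)
    print(f"{flagged}/{len(records)} records contained possible PII and were redacted")
    return clean

if __name__ == "__main__":
    batch = [
        "Customer jane.doe@example.com reported an outage",
        "Order #1234 shipped on time",
    ]
    validate_batch(batch)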

At best, without an appropriate data pipeline, enterprises will simply fail to build their ideal internal dataset. At worst, they incur significant business risk from a variety of data-related factors, including privacy leaks, subpar performance that produces inaccurate results, and the potentially high costs of having to re-train models initially trained on poor-quality data (to name just a few).

Data management for AI is having its DevOps moment.

The data pipeline for internal AI programs will need to be a continuous process that doesn’t “end” when you launch your LLM into production, but rather, begins at that point–and requires continuous iteration, refinement, and monitoring, like any other software product. In order to master the process of managing AI data–which gives them the ability to adapt to change and drive higher performance over time–enterprises will need to invest in cultivating team-wide, organizational capability to both build and maintain their data pipeline.

To clarify, much has been written about MLOps tooling and the exciting shifts happening there–so while we won’t be doing a deep dive here, I do recommend looking into the space if you’re interested.

However, in addition to creating their data pipeline, organizations are also running into another challenge that we’ve seen over and over again: Once you have built a data pipeline, what do you do with it?

You need to be able to run inference on your data processes securely and privately, at scale–and without breaking the bank.

The Inference Hosting Challenge

Inference for LLMs requires substantial computing resources, most commonly provided by AI accelerator hardware such as high-end GPUs. Hardware powerful enough to run inference at scale is, for the time being, both costly and scarce. Given these challenges, we are seeing organizations adopt one of the following inference hosting models:

1. Hosted Inference via API

Many enterprises are choosing to work with third-party API providers such as OpenAI and Anthropic. These providers host the models and carry the inference burden themselves, abstracting away the complexity and cost into token allowances across various pricing tiers. That removes the need to invest significant capital expenditure in acquiring AI accelerator hardware, or operating expenditure in a full-time, on-call team to run a data center.
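
As a rough illustration of how lightweight this configuration can be, here is a minimal sketch using the OpenAI Python SDK. The model name and prompt are placeholders, and it assumes an API key is already set in the environment; the details will vary by provider and contract.

import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment; the provider hosts the model
# and runs inference, so no local accelerator hardware is required.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize last quarter's support tickets."}],
)
print(response.choices[0].message.content)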

2. On-Device “Edge” Hosting

An emerging configuration for smaller teams is edge computing–specifically, bringing LLMs “closer” to data sources by hosting AI models locally on-device. By pairing “smaller” models (in the range of ~3 billion parameters) with high-end laptop computers, organizations can see reduced latency, better bandwidth usage, and improved privacy by keeping data local–though it’s not clear this configuration can scale well for larger teams that may need larger models.
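
For a sense of what this looks like in practice, here is a hedged sketch that sends a prompt to a small model served locally by Ollama on the same machine, so the data never leaves the device. The model name and endpoint reflect Ollama's defaults and are assumptions for illustration only.

import requests

# Ollama serves a local REST API on port 11434 by default; a ~3B-parameter model
# such as "llama3.2" keeps memory requirements within reach of a high-end laptop.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",  # small local model; assumption for illustration
        "prompt": "Classify this support ticket: 'My invoice total looks wrong.'",
        "stream": False,      # return the full response in one JSON payload
    },
    timeout=120,
)
print(resp.json()["response"])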

3. On-Premise Data Center

There was a time when running an on-premise data center was so crucial to operations that many enterprises owned and ran their own. However, as of 2023, the majority of enterprise IT workloads are now hosted off-premise due to the massive total cost of ownership, including hardware, physical real estate, and operational teams on call. It’s likely that enterprises will continue to prefer externally-hosted solutions for inference as well.

4. Off-Premise Cloud Hosting via Third-Party Data Center

Third-party data center hosting for AI inference is increasingly beginning to resemble hosting for traditional cloud computing, and it’s possible that in the future, we’ll see a core group of leading vendors in third-party AI inference hosting that will gain popularity among enterprises. However, dependence on externally-hosted resources could also introduce latency into larger-scale inference jobs, dependencies on external data center performance, and external security and privacy threats.
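
One common pattern here is to run an OpenAI-compatible inference server, such as vLLM, on rented GPU capacity in a third-party data center and point existing client code at it. The sketch below assumes exactly that setup; the endpoint, credential, and model name are placeholders.

from openai import OpenAI

# vLLM (and several other inference servers) expose an OpenAI-compatible API,
# so only the base_url and credentials change when moving between hosts.
client = OpenAI(
    base_url="https://inference.example-datacenter.com/v1",  # placeholder endpoint
    api_key="YOUR_INTERNAL_TOKEN",                            # placeholder credential
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server is hosting
    messages=[{"role": "user", "content": "Draft a status update for the on-call channel."}],
)
print(response.choices[0].message.content)

Because the client interface is unchanged, workloads prototyped against a hosted API can often be repointed at a self-hosted endpoint with a one-line configuration change.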

Getting Enterprise Ready for AI

Given the challenges and costs associated with launching an enterprise-scale AI program, we’re starting to see discrete phases emerge:

Phase 1: Starting a Program With an Off-the-Shelf Cloud Provider

Unless an enterprise is restricted to on-premise infrastructure and already has a mature infrastructure capability, it will most likely start serious experimentation with its existing cloud provider(s) using currently available APIs. For example, Microsoft has reported 53,000 organizations using its AI offerings via Azure.

At this stage, data science and operations teams will focus on testing and implementing valuable use cases and putting them into production via their providers’ models. Inference burdens are abstracted away as part of existing contracts with established providers. Smart enterprises will, of course, do everything they can to maintain as high a level of privacy and security as possible around the data they feed in. However, they may face headwinds around data leakage: there is considerable fear, and no shortage of cautionary narratives, about third-party models being trained on proprietary data.

Phase 2: Scaling the Existing Solution

Over time, enterprises will stand up a functional data pipeline, which may not be perfect, but will be performant enough to zero in on use cases that deliver outsize value (like batch processing jobs for large quantities of data). As they do, they may feel compelled to improve their data privacy posture, at least for certain types of data and certain data processing jobs.

Also, as enterprises begin to realize value from their developing AI programs, they will assess the growing costs of their API provider contracts and tooling. At this stage, enterprises will look to make significant optimizations to their usage and, in the interest of reducing costs, will focus on efficiency.

Phase 3: Cost Shock and Optimization

Specifically, enterprises may find themselves weighing the cost and complexity of standing up an internal LLM against the benefits of maintaining a fully closed-circuit dataset they no longer have to feed into an externally-hosted system, freeing themselves from external model risks such as outages or security breaches experienced by their vendor, and from massive provider bills. As they grow their capabilities in managing an ML program in production, they will also likely grow their knowledge of hosting models internally, and will seriously consider hosting and launching their own internal model, powered by an inference configuration that makes operational and financial sense.

At such a point, enterprises may start investing in areas such as pretraining datasets, data filtering and splitting, and model evaluation to select a model that makes sense for them. They’ll also need to seriously consider inference hosting configurations, potentially opting for edge computing with a small test team to begin with and migrating to an externally-hosted cloud inference provider as their needs grow in scale.
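
As a small, hedged example of the data filtering and splitting step, the sketch below uses the Hugging Face datasets library to drop short records from a hypothetical internal corpus and carve out a held-out evaluation split. The file path and length threshold are purely illustrative.

from datasets import load_dataset

# Load a hypothetical internal corpus exported as JSON Lines; the path is illustrative.
dataset = load_dataset("json", data_files="internal_corpus.jsonl", split="train")

# Filter out records too short to be useful for fine-tuning (threshold is arbitrary).
dataset = dataset.filter(lambda record: len(record["text"]) > 200)

# Hold out 10% of the data for model evaluation and selection.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_set, eval_set = splits["train"], splits["test"]

print(f"train: {len(train_set)} records, eval: {len(eval_set)} records")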

As their familiarity with running AI operations internally grows, they will also begin to look more seriously at the value of the closed-circuit privacy that comes with running their own models internally, without the help of an off-the-shelf provider.

(It’s also worth noting that, with the passage of time, the economics of launching an LLM internally will likely become more favorable. Specifically, while compute resources are currently scarce and relatively costly, increasing commoditization will likely continue to put downward pressure on pricing.)

Phase 4: Specializing, as Mature Enterprises Seek Appropriate ML Infrastructure for Use Case Fit

Will all enterprises eventually “graduate” to a state of hosting their own internal, closed-circuit LLM, through which they run every single data job? Not necessarily.

As enterprises learn how to run their own ML programs, they will find themselves less bound by hardline goals to spend a certain amount of money on AI initiatives or to acquire specific infrastructure configurations, and more focused on finding infrastructure that fits their specific needs and use cases. In some cases, they may stand up their own internal LLM, complete with their own internal infrastructure hosting configuration and costs, to handle high-priority use cases that require high responsiveness with the utmost privacy. In other cases, as they grow more comfortable with hosted solutions and find “good enough” data privacy tooling, they may find better value in letting hosted solutions run massive data jobs that might’ve been too costly or time-consuming to run internally.

Savvy enterprises will understand that, given the incredible speed of change in the AI space, it is a best practice to seek AI solutions that offer optionality over restrictive, long-term commitments with vendor lock-in. Committing to launching an internal LLM may be exactly what an enterprise needs to compete, but it’s a significant investment that will require continuous operational support afterwards. Conversely, signing a contract with a hosted third-party provider eager to win their business may be a much more lightweight alternative.

As enterprises mature, they’ll begin thinking more deliberately about their AI inference spend, particularly given the high cost of compute. Suffice it to say, there are many aspects of AI infrastructure that can lead teams to spend huge amounts of money, like attempting to train, and repeatedly re-train, an open-weight model on an entire dataset. In the same way that many businesses migrated their compute infrastructure from owned data centers to cloud providers, then looked on in increasing horror as their cloud computing bills skyrocketed, smart enterprises will look for ways to control their spend, such as seeking alternatives to full model training like model merging and mixture-of-experts approaches.
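
As a rough sketch of the simplest form of model merging, the snippet below averages the weights of two fine-tuned checkpoints of the same base architecture in PyTorch. The checkpoint paths are placeholders, the files are assumed to contain plain state dicts, and production merging methods are considerably more sophisticated than a uniform average.

import torch

# Placeholder paths to two fine-tuned checkpoints of the same base architecture.
state_a = torch.load("finetune_support.pt", map_location="cpu")
state_b = torch.load("finetune_sales.pt", map_location="cpu")

# Naive merge: element-wise average of every parameter tensor the checkpoints share.
merged = {
    name: (state_a[name] + state_b[name]) / 2
    for name in state_a
    if name in state_b
}

torch.save(merged, "merged_model.pt")
print(f"Merged {len(merged)} parameter tensors")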

How We’ll Address AI’s Infrastructure Challenges Together

In the next 12 months, I expect an explosion of new ideas and innovation from practitioners who are actually doing the work to figure this out at scale. At Heavybit, we’re working to identify and elevate the people in enterprises and startups who are solving these new challenges, creating best practices, and rallying around common and critical problems. We plan to share our learnings in service of two goals: first, to provide a clearer path to successful AI programs at enterprise scale; and second, to clarify the requirements of enterprise-scale programs so that ambitious AI tooling startups can get enterprise-ready.

Stay tuned for more updates on how enterprises are tackling new AI and ML challenges like inference hosting. If you’re an AI startup founder looking to get your product ready for enterprise, feel free to reach out to us.
