Enterprise AI Infrastructure: Privacy, Economics, and Best First Steps
Andrew Park, Editorial Lead, Heavybit
14 min read


The path to perfect AI infrastructure has yet to be paved. Enterprises must consider many important factors, like maintaining data privacy and scaling their AI deployments without overspending. They must kick off and mature their AI initiatives in a way that provides competitive advantage, then properly resource their teams.

To scope out such challenges more clearly, we spoke with Chaoyu Yang, founder and CEO of BentoML. His startup provides a developer platform for enterprise AI teams to build and scale compound AI systems. Chaoyu previously served as a software engineer at AI data leader Databricks. Some of his key observations include:

Specialized AI Systems Will Provide an Edge

Custom AI systems optimized for a specific use case, combined with a high-quality proprietary dataset, may provide a powerful competitive advantage for enterprises that “graduate” from relying on off-the-shelf proprietary AI models.

Data Privacy in AI Will Be Crucial, Especially for Highly Regulated Spaces

As enterprises scale their ML workloads, data privacy will only become more important, particularly as each company’s store of proprietary data grows in size and relevance.

Future Economic Shifts May Lead to an Inflection Point

While there’s a case to be made for every enterprise to run and own its AI/ML operations internally, that’s arguably not a practical goal today, due to factors including operational gaps between data science and operations teams and the sheer economics of owning your own inference estate. In the future, better tooling and cheaper, commoditized compute may tip the scales and make owning AI operations feasible for more enterprises.

Maturity and How to Scale

While the current paradigm for inference hosting boils down to some combination of on-device, hosted API, and data center deployments, each approach has its own nuances, and an ecosystem of dedicated inference platforms is emerging as well. “There are a number of players providing an inference platform. There’s my own team at BentoML and other providers that let developers deploy any open-source model, or their own proprietary custom model, and run inference at scale. Compared to AI API providers, a big differentiator is that we offer dedicated ‘bring your own cloud’-style private deployment, typically in customers’ own secured environment, which is especially appropriate for teams in highly regulated industries.”
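
To make the self-hosting side of that spectrum concrete, here is a minimal, framework-agnostic sketch of what serving an open-source model behind a private endpoint can look like, using FastAPI and a Hugging Face transformers pipeline as stand-ins. An inference platform layers packaging, autoscaling, and bring-your-own-cloud deployment on top of a service like this; the model and route names below are illustrative only, not any specific platform’s API.

```python
# Minimal sketch of self-hosting an open-source model behind a private HTTP
# endpoint. FastAPI and transformers are generic stand-ins; the model name
# and route are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Any open-source or fine-tuned proprietary checkpoint could be loaded here.
generator = pipeline("text-generation", model="gpt2")

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(prompt: Prompt) -> dict:
    # Inference runs inside the team's own environment, so prompts and
    # completions never leave the private deployment.
    result = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": result[0]["generated_text"]}
```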

Chaoyu suggests a maturity curve that begins with hosted options, due to the ease of use of providers like OpenAI and Anthropic. “For now, people mostly get started with an API endpoint provider. There's an argument that self-hosting AI models can provide cost benefits as you scale up, but we found that to be a weak argument. Even teams using GPT-4 at scale, especially enterprise users, don't think of it as ‘expensive,’ especially when compared to the total cost of ownership of building and maintaining your own mission-critical AI systems.”
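
For comparison, the hosted starting point Chaoyu describes is usually just a few lines against a provider’s SDK. A rough sketch with the OpenAI Python client is below; the model name and prompts are placeholders, and the same pattern applies to other API providers.

```python
# Sketch of the typical starting point: calling a hosted API endpoint
# provider rather than running any models yourself. The model name and
# messages are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You summarize support tickets."},
        {"role": "user", "content": "Summarize: customer cannot reset their password."},
    ],
)
print(response.choices[0].message.content)
```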

The CEO suggests that early adopters are starting to look for alternatives. “As AI increasingly powers your business-critical applications, specialized AI systems are going to become more strategically important for every enterprise to compete. Being able to build specialized AI systems, optimize for your specific business use case, and leverage proprietary data and knowledge is going to be how you win in the future.”

“Another important topic is developer efficiency. For a lot of use cases, you will need to quickly iterate on the application code, the model, the inference strategy, or infrastructure decisions to tweak system design and make performance improvements. ‘Performance’ could include factors like latency or accuracy for a specific scenario, cost efficiency, or security metrics. Today, people are still racing to get their product to market, to start evaluating ROI and how AI is contributing to the business. But over time, I think the specialized-AI approach will win in high-impact, business-critical AI applications.”

Data centers require potentially massive total cost of ownership, from real estate to hardware to ongoing maintenance. Image courtesy MIT Technology Review.

From First Principles to Enterprise Adoption

Even with this potential future in mind, Chaoyu would still likely advise newcomers to consider starting with a third-party AI API provider. “That's just the fastest way to explore what AI could do for you. If you're working with sensitive data, try to curate synthetic data that's less sensitive for prototyping and evaluation.”
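
One lightweight way to act on that advice is to generate synthetic records that mirror the shape of the real data without containing anything sensitive. The sketch below uses the Faker library as one option; the field names are hypothetical.

```python
# Generate synthetic stand-ins for sensitive records so prototypes and
# evaluations never touch real customer data. Field names are hypothetical.
from faker import Faker

fake = Faker()

def synthetic_ticket() -> dict:
    """Produce a fake support ticket with realistic-looking, non-sensitive values."""
    return {
        "customer_name": fake.name(),
        "email": fake.email(),
        "account_id": fake.uuid4(),
        "message": f"I was charged twice on {fake.date_this_year()}, please help.",
    }

# A small evaluation set for prompt and model experiments.
eval_set = [synthetic_ticket() for _ in range(100)]
```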

“As enterprises look to build specialized AI systems, however, they should carefully consider TCO, which can be prohibitively high. To mature over time, enterprises require platforms to help scale their models and empower application building. But I do think a couple of trends are pointing toward more adoption of specialized AI systems. The first trend is on-demand GPUs becoming more accessible at cheaper prices. I think this will consolidate to a few cloud vendors that make on-demand GPU access fairly easy in the next 12 months. A second trend is open-source models. You have tons of really good options that open up opportunities for advanced customization, and they're only getting better from here.”

“And the last trend is the growing shift to compound AI systems: systems that utilize multiple interacting components, models, and other tools to accomplish their goals. I think the future won’t be a single model that does everything. More AI applications are compound systems composed of multiple models, pipelines, and components. RAG, voice chat LLMs, and function-calling agents are some of the popular examples. This trend will result in more AI applications being built with a combination of foundation models and specialized models, and needing access to sensitive data or proprietary software systems. We’ll return to this topic shortly.”
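
As a rough illustration of what “compound” means in practice, the sketch below chains a retriever, a reranker, a generator, and a guardrail check in the RAG style. Every component function here is a hypothetical placeholder for whichever model or service a team actually uses.

```python
# Structural sketch of a compound AI system in the RAG style: multiple
# components chained together rather than a single model call. All functions
# are hypothetical placeholders.
def retrieve(query: str, k: int = 20) -> list[str]:
    ...  # vector or keyword search over proprietary documents

def rerank(query: str, docs: list[str], k: int = 5) -> list[str]:
    ...  # smaller specialized model that reorders candidates

def generate(query: str, context: list[str]) -> str:
    ...  # foundation model call grounded in the retrieved context

def check_guardrails(answer: str) -> str:
    ...  # policy / safety filter before anything reaches the user

def answer(query: str) -> str:
    docs = rerank(query, retrieve(query))
    return check_guardrails(generate(query, docs))
```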

Resource Allocation and Efficiency in AI Deployment

The founder notes that resource utilization is a surprising challenge for enterprises across the spectrum. “You noted in a blog post about misaligned AI incentives that only 10% of AI projects go into production. Obviously, things should be aligned in terms of business goals, and there are some similarities [to the pre-DevOps days] where engineers and ML scientists have a very different path and different day-to-day outputs. However, I believe there should be much better tooling designed to close some of those gaps in the future.”

“Today, a lot of AI projects say they are in production and are paying a fortune for GPU resources. But they don’t know their exact resource utilization rate, and oftentimes they’re struggling with GPU utilization of under 10%.” The CEO suggests that resource overprovisioning, and its upfront cost, is another argument in favor of newer AI projects beginning with endpoint providers.
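
Measuring that utilization rate does not require much tooling. A minimal sketch using NVML via the pynvml package is below; sampling over a window matters because, as noted later in this piece, a single point-in-time snapshot can be misleading.

```python
# Sketch of measuring actual GPU utilization over a time window with NVML
# (via the pynvml package), rather than trusting a single snapshot.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this host

samples = []
for _ in range(60):                       # sample once per second for a minute
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)              # percent of time the GPU was busy
    time.sleep(1)

print(f"Average GPU utilization: {sum(samples) / len(samples):.1f}%")
pynvml.nvmlShutdown()
```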

“For example, within an enterprise, I may reserve a limited amount of GPU quota, dynamically shared across dozens, or hundreds, of models, making sure inactive models are scaled down to zero and high-priority models are right-sized to ensure service quality. In my mind, that’s the hard infrastructure problem we aim to solve.” Chaoyu offers the example of serving open-source LLMs, for which teams might use vLLM, Text Generation Inference, or TensorRT-LLM, each of which promises strong inference performance on a single model replica. However, the surrounding infrastructure for fast scaling, cold starts, concurrency control, observability, and common LLM deployment patterns, such as LLM guardrails or a multi-LLM gateway, can still be quite difficult to build and optimize.
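
For context, the engines mentioned above focus on making a single replica fast; the harder platform problems (fast scaling, cold starts, routing, guardrails) sit around code like the vLLM sketch below. The model name is just an example.

```python
# Minimal vLLM sketch: the engine handles single-replica performance
# (continuous batching, KV-cache management), while autoscaling, cold
# starts, and multi-LLM routing live in the surrounding infrastructure.
# The model name is an example only.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain compound AI systems in one paragraph."], params)
print(outputs[0].outputs[0].text)
```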

Traditional cloud-native infrastructure doesn’t really work for GenAI: container images and model files are huge, for instance, leading to delays of 30 minutes or more when scaling from one instance to two. “You can see how GenAI just completely breaks a lot of the traditional DevOps assumptions about the workload being a small-container, single-process type of thing. We need a new type of cloud infrastructure, optimizing every step in the AI stack, to solve the resource efficiency challenge.”

Engineering teams that are new to running AI in production may make assumptions about how metrics work that cause them to unwittingly run up massive usage bills, or hit performance snags as jobs fail or slow to a crawl due to unexpected variations in data payloads. “People who work in DevOps may focus on straightforward CPU/GPU Kubernetes metrics, but there are limitations such as Python’s global interpreter lock, which limits parallel execution and makes CPU utilization less visible. There are also nuances in how vendors like Nvidia represent utilization. And another nuance that we find newer teams can overlook is that resource-based metrics are always retrospective. You’re looking at how usage has been consumed over a certain time period, not something that gets conveyed in a single snapshot in time.”

For example, Chaoyu recommends considering concurrency-based scaling, which can offer more granular information to help teams deduce how many GPUs are needed to support their various workloads. “Concurrency scaling works pretty well with inference where batching could be happening. For example, you can do adaptive batching or continuous batching that groups multiple incoming requests together.”
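
A rough sketch of the two ideas together is below: adaptive batching that groups requests until a size or time limit is hit, and a concurrency-based scaling rule that sizes the fleet from in-flight requests rather than retrospective CPU/GPU metrics. The thresholds and the run_inference callable are hypothetical.

```python
# Adaptive batching plus a concurrency-based scaling rule. The limits and
# the run_inference callable are hypothetical placeholders.
import asyncio
import math

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.02            # how long to wait for a fuller batch
TARGET_CONCURRENCY_PER_REPLICA = 16

queue: asyncio.Queue = asyncio.Queue()

async def batcher(run_inference):
    """Group incoming requests into batches bounded by size and wait time."""
    while True:
        batch = [await queue.get()]                     # block for the first request
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_inference(batch)                      # one forward pass per batch

def desired_replicas(in_flight_requests: int) -> int:
    """Scale on concurrency: zero replicas when idle, more as in-flight load grows."""
    return math.ceil(in_flight_requests / TARGET_CONCURRENCY_PER_REPLICA)
```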

While it can be tempting for longtime infrastructure engineers to toss their AI workloads into Kubernetes and expect everything to work the way cloud workloads typically do, the founder cautions that AI workloads are drastically different. “Without a proper AI infrastructure, you may end up with a larger bill from AWS than you thought because your system gets overprovisioned, or your deployments don't scale up fast enough, so requests fail or respond slowly. Slow performance can be a serious issue for use cases that need low latency, such as an AI phone calling agent. As we covered in one of our blogs, the typical cloud-native stack is not built for AI, and can lead to suboptimal performance or utilization.”

Enterprises like Google run AI programs from massive data centers built to provide compute resources at scale. Image courtesy Forbes.

In the Future, Enterprises Will Increasingly Focus on Compound AI

“As I mentioned earlier, I believe that the best AI products are built with the compound AI approach, and it will become much more common in enterprise AI in the future,” Chaoyu offers. “The complexity in such systems is increasing, which makes it more important for teams to stay agile and ship faster as they unearth more use cases.”

The founder confides that despite the enormous amount of capital expenditure that larger orgs have invested in AI so far, his company’s enterprise customers aren’t fretting about the costs so much as they are concerned about developer efficiency and data privacy. “From our perspective, the main issues are control and customization. In this case, ‘control’ includes things like data privacy and security, avoiding vendor lock-in, and having predictable behavior. ‘Customization’ refers to the flexibility and ease of use in building out advanced compound AI systems with custom requirements.”

“I’ll give you an example–we have a customer building a voice agent application with multiple open-source models, including components such as speech recognition, LLM, function calling, and text-to-speech. Their initial prototype can take over a minute to respond to a user’s question, which is not acceptable in real-time voice assistant use cases.”

“With our platform, they were able to quickly fine-tune their inference setup for faster time-to-first-token latency, parallelize multiple inference calls, and replace a large number of slow LLM calls with faster, domain specific models. We were able to stand up a solution that got them the ultra-low-latency deployment they needed, improving end-to-end latency to less than 1 second. In these cases, optimizations can be highly specific to the use case. That’s what we’re seeing enterprises ask for.”
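
One of the latency levers in an example like this, parallelizing independent inference calls, can be illustrated with plain asyncio. All of the component functions below are hypothetical placeholders for the customer’s actual models and services.

```python
# Illustrative sketch of overlapping independent inference calls in a voice
# agent pipeline instead of running every stage sequentially. All component
# functions are hypothetical placeholders.
import asyncio

async def transcribe(audio: bytes) -> str: ...         # speech-to-text model
async def detect_intent(text: str) -> str: ...         # small domain-specific classifier
async def lookup_account(text: str) -> dict: ...       # proprietary backend call
async def respond(text: str, intent: str, account: dict) -> str: ...  # LLM call
async def synthesize(text: str) -> bytes: ...          # text-to-speech model

async def handle_turn(audio: bytes) -> bytes:
    text = await transcribe(audio)
    # Intent detection and the account lookup don't depend on each other,
    # so they run concurrently instead of adding their latencies together.
    intent, account = await asyncio.gather(detect_intent(text), lookup_account(text))
    reply = await respond(text, intent, account)
    return await synthesize(reply)
```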

“I strongly recommend considering specialized AI with the compound AI systems approach, as this offers the flexibility to quickly improve performance for your specific use case,” says the CEO. “When you’re evaluating the ROI of your AI initiatives, prototypes don’t often focus on details like latency vs. throughput tradeoffs, but when you want to productionize them into a reliable and scalable product, those details become a lot more impactful for customer experience and cost savings.”