The Role of Synthetic Data in AI/ML Programs in Software
Why Synthetic Data Matters for Software
Running AI in production requires a great deal of data to feed to models. Reddit now licenses its content to Google and OpenAI under separate multimillion-dollar training deals. OpenAI transcribed millions of hours of YouTube videos despite having dubious permission to do so. Meta considered buying the publishing house Simon & Schuster.
However, not all data can or should be made freely available. In highly regulated spaces, teams need to protect customer privacy while still using high-quality datasets. In such cases, teams have a few options. They can attempt a self-hosted, closed-circuit model using real data, but incur the risks of leaks and breaches. Alternatively, they can use data that is either partially redacted or synthetic, the latter offering most of the completeness of real data without compromising user privacy.

Adam Kamor is a veteran software developer and engineering lead with more than a decade of experience building software products at Microsoft, Tableau, and the synthetic data startup he co-founded, Tonic. In this article, he explains how synthetic data helps software teams by:
- Protecting Personally Identifiable Information (PII)
- Keeping Teams in Compliance
- Saving Budget on Inference Costs
How to Use Customer Data in Your LLM Product
Kamor suggests that data management for software teams building products with LLMs isn’t as straightforward as data in, data out. “The problem that engineers face is they may want to send customer data to LLMs to provide a certain experience to customers. But they need to be aware of what, contractually and legally, they can and cannot do with customer data.”
The co-founder points out that, for compliance purposes, using customer data requires teams to be aware of these important considerations, particularly if they go the route of using a third-party LLM:
- Your Sub-Processors: The connection between your data and your LLM is not always a straight line. Which companies and products form that supply chain? What do they do with data as it moves along the supply chain?
- Compliance: Many customer contracts require updates when there is a change in how and where data is sent. In highly regulated spaces, there are also regulations such as HIPAA and GDPR.
- What LLMs Do With the Data: Your customers, and the regulations in your space, may partially or completely limit the kind of data you can share with LLM providers, especially with regard to third parties using your customer data to train their models.
In other words, teams that work with customer data, particularly in highly regulated spaces, often start from a problem space of needing relevant, high-quality datasets while remaining compliant with privacy policies. They need to get to the end result of having performant, well-trained models that give accurate, customer-focused responses without privacy risk.
“Engineers may want to send customer data to LLMs to provide a certain experience to customers. But they need to be aware of what they can and cannot do in terms of customer data.” - Adam Kamor, Co-Founder, Tonic
Increasing Data Quality and Model Performance, Decreasing Risk
Kamor suggests that teams looking to work with large amounts of sensitive customer data have a few options:
1. Host Your Own LLM Privately
Self-hosting would let you safeguard, internally, all the data you use to train that model. Ideally, hosting an internal, closed-circuit model helps mitigate security, privacy, and regulatory issues, assuming your own model doesn’t get compromised. It should be noted, however, that in these early days of AI/ML, while deploying and operating your own model may yield a strong competitive advantage, it requires considerable expertise from MLOps teams and can be a costly challenge in itself.
2. Work With Data to De-Identify It, or Use Synthetic Data
For many organizations, particularly resource-strapped startups, self-hosting a model may not be practical. An alternative is to work with your dataset directly, either de-identifying sensitive data or replacing it with synthetic data.
You can de-identify private data by redacting some or all sensitive fields within your dataset. Typically, developers de-identify data by tokenizing (replacing sensitive data with unique, non-sensitive tokens), hashing (using cryptographic functions to convert sensitive data into unique hash values), or masking (obscuring sensitive data with fictitious data).
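As a rough illustration, here is a minimal Python sketch of those three techniques. The record, field names, and helper functions are hypothetical stand-ins; production systems typically rely on dedicated de-identification tooling rather than hand-rolled code.

```python
import hashlib
import secrets

# A hypothetical customer record; field names are illustrative only.
record = {"name": "Jane Doe", "ssn": "123-45-6789", "email": "jane@example.com"}

# Tokenizing: replace the value with an opaque token, keeping a private
# lookup table so authorized systems can reverse the mapping later.
token_vault = {}

def tokenize(value: str) -> str:
    token = f"tok_{secrets.token_hex(8)}"
    token_vault[token] = value  # stored separately, access-controlled
    return token

# Hashing: a one-way transform. Equal inputs map to equal hashes, so joins
# across tables still work, but the original value cannot be recovered.
# A random salt prevents dictionary attacks against common values.
SALT = secrets.token_bytes(16)

def hash_field(value: str) -> str:
    return hashlib.sha256(SALT + value.encode()).hexdigest()

# Masking: swap the real value for a fictitious, format-preserving one.
def mask_ssn(_: str) -> str:
    return f"XXX-XX-{secrets.randbelow(10000):04d}"

de_identified = {
    "name": tokenize(record["name"]),
    "ssn": mask_ssn(record["ssn"]),
    "email": hash_field(record["email"]),
}
print(de_identified)
```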
However, new research shows that de-identifying data is often not enough to protect user privacy. A study found, in an experimental healthcare setting, that “de-identification of real clinical notes does not protect records against a membership inference attack.”
In this form of attack, adversaries get the patient records of a target person and use that data to find out whether that target person’s information was in a given model’s training dataset. If the model is used to identify patients with sexually transmitted infections, for example, then mere inclusion in the training data could be damaging to patients.
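One common variant of this attack is a simple loss-threshold test. This is a general technique, not necessarily the method used in the cited study: because models partially memorize their training data, records the model trained on tend to produce unusually low loss. A minimal sketch:

```python
import numpy as np

def membership_inference(loss_on_target: float,
                         losses_on_known_nonmembers: np.ndarray,
                         percentile: float = 5.0) -> bool:
    """Loss-threshold test: if the model's loss on the target's record
    falls below the low tail of the loss distribution over records known
    NOT to be in training, guess that the target was a training member."""
    threshold = np.percentile(losses_on_known_nonmembers, percentile)
    return loss_on_target < threshold

# Hypothetical usage: the adversary computes the model's loss on the
# target patient's record and on a reference set of non-member records.
nonmember_losses = np.array([2.1, 1.9, 2.4, 2.0, 2.3, 1.8, 2.2])
print(membership_inference(0.4, nonmember_losses))  # True: likely a member
```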
In theory, synthetic data can address this risk while solving all the privacy, security, and regulatory issues already mentioned. With synthetic data, you can build an LLM’s capabilities without revealing anything private to the LLM – removing the possibility that the LLM could ever regurgitate anything private to your users.
The co-founder offers an example: “Let's say you want to train BERT summarization models, which are frequently used to identify the most important text from a specific passage to create a summary. This type of model is relevant to physicians to summarize a patient's medical history: When new patients come in, you can provide a summary of those patients for your records.”
“Next, you’d want to train your model on the medical histories of all the current patients in your medical system, and then use the model for inference on future calculations. But you do not want to train your summarization model on customer protected health information (PHI) because the summarization model might regurgitate that.”
“Any situation where you're training a model on the entire population (so you can use it for inference on individuals) is dangerous because the model might memorize things about individuals in the population. And inevitably, it will regurgitate it to individuals when it's being used.”
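As a rough sketch of the safer workflow Kamor describes, the snippet below ties two BERT checkpoints into a seq2seq summarizer with Hugging Face Transformers and runs one training step on synthetic records. The library choice and the synthetic examples are assumptions for illustration; the interview does not prescribe a specific stack.

```python
# pip install transformers torch
from transformers import BertTokenizerFast, EncoderDecoderModel

# One way to build a "BERT summarization model": tie a BERT encoder and
# a BERT decoder into a single seq2seq model.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Hypothetical synthetic records: statistically realistic histories and
# summaries that contain no real patient's PHI.
synthetic_histories = [
    "Synthetic history: 58-year-old with controlled hypertension, "
    "reports intermittent chest pain over two weeks, denies dyspnea."
]
synthetic_summaries = ["Intermittent chest pain; hypertension controlled."]

inputs = tokenizer(synthetic_histories, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
labels = tokenizer(synthetic_summaries, padding=True, truncation=True,
                   max_length=64, return_tensors="pt").input_ids
# A full run would mask label pad tokens with -100 so the loss ignores them.

# One training step; a real run loops over batches with an optimizer.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
```

Because the model only ever sees synthetic histories, anything it memorizes and later regurgitates is itself synthetic, which is the point of the approach.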
Benefits of Synthetic Data
Synthetic data is extremely useful when the data you want to use to train an LLM needs to remain private. It is the best way to keep customer data private without degrading the utility of the data itself, allowing you to train or fine-tune an LLM without compromising the quality of its outputs or the privacy of its inputs.
Protecting Privacy in Highly Regulated Industries
In highly regulated industries, such as healthcare and financial services, customer data is not only private but guarded by strict regulatory regimes, such as HIPAA for healthcare or the BSA in finance. HIPAA violations can cost organizations upwards of $2M.
Satisfying Legal and Contractual Obligations
The co-founder points out the need for legal and regulatory compliance as well. “So, how do I synthesize my data so I can still train an effective model and not misuse my customers’ sensitive information...and also adhere to my legal contracts? Customers typically don't want you training models on their data, though sometimes there will be exclusions for when the data is synthesized or de-identified.”
Synthetic data lets you build an artificial dataset that reflects customer data without actually containing or revealing any of it. That way, you can build LLM products that address the specific problems customers actually have, all without touching their sensitive information.
Practical Considerations: Code Privacy, Inference Costs
The co-founder points out that synthetic data isn’t a silver bullet that protects proprietary code from being gobbled up by models, and that exposing proprietary code to commercial LLMs might not be that big a deal. “I don’t think code alone is that valuable, or that it’s necessarily a danger to a software business. The efficiency benefits you get from using AI coding assistants probably outweigh any of those risks.”
“If I gave out all the source code to [Tonic] to somebody, I don’t think it would hurt my business. It would take them a long time to figure out how to stand it up and maintain it. And it would take forever to figure out how to add new features. By the time they got all of this done, we’d be a year or so ahead. I’ve never been of the opinion that code itself is so proprietary and sensitive.”
For teams looking to curb inference costs, synthetic data isn’t necessarily cheaper or faster for running ML jobs. Organizations operating their own internal models might be better served using distillation: “If you're concerned about curbing inference costs, then I would use the outputs of LLMs to train more-efficient BERT models, for example. If you properly train your BERT summarization model, you can get super-cheap inference.”
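A minimal sketch of that distillation pattern, assuming a Hugging Face stack: an expensive LLM labels examples once offline (the LLM-produced labels below are hard-coded stand-ins), and a small BERT-family “student” is trained to reproduce them so inference runs cheaply, even on CPU.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical LLM-labeled corpus: an LLM judged each text once, offline,
# so the expensive model never runs at serving time.
texts = ["Refund took three weeks to arrive.", "Checkout was quick and easy."]
llm_labels = [0, 1]  # e.g., 0 = complaint, 1 = praise, as judged by the LLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
student = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

# One gradient step; a real run iterates over a large LLM-labeled corpus.
loss = student(**batch, labels=torch.tensor(llm_labels)).loss
loss.backward()
optimizer.step()
```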
For early-stage startups, the co-founder suggests that rolling your own models, a time-intensive process that can require considerable specialized expertise, might be overkill. “For a startup, there are worse things you could do than just pay for an OpenAI ChatGPT plan and move on with your life and try to get customer revenue. But if you're operating at a certain scale and you're genuinely concerned about inference costs, then go train more-specialized models and run them yourself, and you'll get your costs down without issue.”
“We've started doing that with some of our features. We're not using LLMs for all of our NLP tasks. We’re using traditional BERT models and they're cheap. They run on CPU. They're great. But they take longer to develop and they require, frankly, more skill to develop correctly. Not the kind of thing some junior engineer whips up in 20 minutes with a clever prompt.”
“For a startup, there are worse things you could do than just pay for a GPT plan and move on. But if you're operating at a certain scale and you're genuinely concerned about inference costs, then train specialized models and run them yourself. You'll get your costs down.”
How Modern Data Is Shaping the Future of Models (and Software Teams)
The co-founder notes that there is a great deal of press about the utility of “small” models, specifically that their tight training and tuning on a finite dataset seemingly makes them more capable for limited use cases. But small isn’t necessarily always the right move.
“I think the march to small models can’t just be because they're cheaper and faster. That won't be enough to justify them because the cost of LLMs will just continue to go down. Sometimes a bit at a time, and sometimes it’s massive drops like with DeepSeek R1. That alone is not going to be a strong enough driver.”
“These smaller, faster models need to have other benefits. They need to be able to go places that large language models can't go. Or because we only run on-prem and we can't make calls to clouds. There has to be some other driver to necessitate that, I think, because LLM costs will just keep going down.”
The co-founder concedes that the pace of change in the AI/ML space is incredible, and that standards for hiring software developers and data experts could evolve as AI assistants lower barriers to entry. However, he advises software leads to look for data specialists with fundamental knowledge that isn’t easy to fake on a test.
“When we hire data scientists, I want to see a mathematical seriousness to them. I want to see an understanding of statistics and probability and other complex math topics. The fundamentals are still important. The reality is that some jobs, which used to require PhD-type folks to do, can be done almost as well by an engineer that knows how to write code and good prompts.”
“But if you're doing serious work, you still need serious people that have the mathematical grounding to think about these things properly. Admittedly, maybe that sort of thing is less important than it was previously. It's all changing so fast, and my answer could change tomorrow. Sometimes it seems like there's a game-changer every couple of days.”
Takeaways on Synthetic Data and How to Think About Data Management
To summarize, the co-founder advises software leads to think about synthetic data first and foremost as a safer alternative to using potentially risky real data to train models. “Ultimately, synthetic data isn’t so much about writing safe prompts and chats with your LLM. What you’re trying to do is train your own model. So if you're trying to do that on potentially sensitive customer information, you likely need to remove customer PII first. That's the most important consideration for privacy.”