How to Properly Scope and Evolve Data Pipelines

For Data Pipelines, Planning Matters. So Does Evolution.

A data pipeline is a set of processes that extracts, transforms, and loads data from data sources to destinations in the necessary format. While data pipelines have been in use for years in storage and analytics use cases, they’re becoming increasingly important for AI use cases, and will likely become even more crucial as organizations look to deploy AI agents that are dependent on high-quality, up-to-date data.
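
As a rough illustration, the sketch below shows the extract-transform-load pattern at its smallest: one script that pulls rows from a source, reshapes them, and writes them to a destination. The database path, table, and column names are hypothetical placeholders rather than a reference implementation.

```python
# Minimal ETL sketch: extract rows from a source, reshape them, load them elsewhere.
# The database path, table, and column names are hypothetical placeholders.
import sqlite3

import pandas as pd


def extract(db_path: str) -> pd.DataFrame:
    # Extract: pull raw rows from the source system.
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql("SELECT * FROM orders", conn)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: normalize types and derive the fields downstream consumers need.
    raw["order_date"] = pd.to_datetime(raw["order_date"])
    raw["revenue"] = raw["quantity"] * raw["unit_price"]
    return raw[["order_id", "order_date", "revenue"]]


def load(df: pd.DataFrame, destination: str) -> None:
    # Load: write the result in the format the destination expects.
    df.to_parquet(destination, index=False)


if __name__ == "__main__":
    load(transform(extract("orders.db")), "orders_clean.parquet")
```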

Development teams looking to stand up their first data pipeline will be best served by scoping their initial project against clearly stated business goals. Over time, it’s also a best practice to understand how your tech stack will need to evolve as your team and your data corpus expand, to avoid piling up technical debt. Data expert, DAGWorks co-founder, and Hamilton and Burr co-creator Stefan Krawczyk explains.

Working With Data Means Constantly Shifting Terrain

Some important concepts to keep in mind include:

  • Datasets changing over time: Unlike codebases, datasets shift as inputs change, and in growing companies, data to be managed frequently expands into larger datasets
  • The need for standardization: As organizations and their datasets grow, the potential for changes and anomalies also grows
  • The need for observability and quality checks: Over time, organizations will find it increasingly important to have visibility into data quality and potential issues

Stefan Krawczyk discusses managing data for LLM applications at MLOps World. Image courtesy MLOps World

Krawczyk suggests that developer teams newer to data management should understand that managing a codebase (which remains largely static) differs from managing datasets, which change over time. “This really brings another dimension to understanding what's happening in terms of managing pipelines that feed into data, AI, and ML applications.”

The co-founder notes that many data projects start small, which is the best time to take a first-principles approach. “While there are many ways to author a pipeline, enforcing standardization can definitely help you. And you need to instrument for observability so that you can understand what's going on.”

The co-founder notes that as orgs grow, their internal datasets may power more operations, including machine learning models or customer-facing AI chat. “If you don't understand the connection between [your data and your programs in production], that's where you really run into issues.”

“In general, code and data are those two dimensions that change at different rates. And this is where it becomes important to understand that you need to build different mechanisms for introspection, quality checks, and that sort of thing to make [your data pipeline] work over time.”
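
One lightweight form of that introspection is a batch-level quality check that runs every time the pipeline runs. The sketch below is a minimal example, assuming hypothetical column names and tolerances:

```python
# A lightweight batch-level quality check, with hypothetical column names and thresholds.
import pandas as pd


def check_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable issues found in this batch of data."""
    issues = []
    if df.empty:
        issues.append("batch is empty")
        return issues
    if "user_id" not in df.columns:
        issues.append("user_id column is missing")
    elif df["user_id"].isna().mean() > 0.01:  # assumed tolerance: at most 1% missing IDs
        issues.append(f"user_id null rate too high: {df['user_id'].isna().mean():.1%}")
    if "revenue" in df.columns and (df["revenue"] < 0).any():
        issues.append("negative revenue values present")
    return issues


if __name__ == "__main__":
    batch = pd.DataFrame({"user_id": [1, 2, None], "revenue": [10.0, -5.0, 3.0]})
    problems = check_batch(batch)
    if problems:
        # In a real pipeline this might raise, alert, or quarantine the batch.
        print("data quality issues:", problems)
```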

Beware of Non-Deterministic Outputs and Scaling Pains

Working with AI/ML programs? It’s a good idea to:

  • Set Benchmarks Early: AI models are infamously non-deterministic, so having metrics aligned with business goals from the outset makes it easier to interpret variability
  • Instrument for Scale: Understanding that a data pipeline will become key infrastructure to operate over time, it’s best to not over-engineer beyond your team’s needs
  • Understand and Plan for Change: Smaller startups need to balance shipping fast vs. implementation costs, while enterprises will take compliance into account

The co-founder notes that with the increased use of LLMs, which may give different responses even to seemingly identical data inputs, teams are best served by proactively determining success metrics aligned with their business goals. “A lot of people get caught up in these data changes. Then your boss asks you: ‘Why is this output different?’”

The co-founder notes that upstream changes can affect an entire system, which is why teams should have some measurement of business metrics, a “macro” metric, in place to gauge overall health. Without an overall business metric to guide decision making, trying to instrument for every single possible data discrepancy can lead to alert fatigue because it’s hard to gauge the importance of a particular “micro” metric.

“As long as you ensure you're measuring against business metrics, that’s how you can gauge your program’s health: Regardless of what else is going on, what is actually critical?” The co-founder notes that bringing in observability typically happens only when issues arise, but being successful with data pipelines is about anticipating shifts in data, customers, or even regulations.
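
One way to put that into practice, sketched below with assumed metric names and thresholds, is to treat a single “macro” business metric as the thing that raises alarms, while “micro” data checks only log warnings for later triage:

```python
# Sketch: page on a single "macro" business metric, only log "micro" data checks.
# Metric names, baselines, and thresholds here are hypothetical.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline.health")


def evaluate_health(metrics: dict[str, float]) -> None:
    # Macro metric tied to a business goal: if it degrades, someone gets paged.
    if metrics.get("conversion_rate", 0.0) < 0.02:  # assumed baseline
        raise RuntimeError("conversion rate below baseline: investigate the pipeline")

    # Micro metrics: useful for triage, but they warn rather than page anyone.
    if metrics.get("schema_drift_columns", 0) > 0:
        log.warning("schema drift detected in %s column(s)", metrics["schema_drift_columns"])
    if metrics.get("late_rows_pct", 0.0) > 0.05:
        log.warning("%.1f%% of rows arrived late", metrics["late_rows_pct"] * 100)


if __name__ == "__main__":
    evaluate_health({"conversion_rate": 0.031, "schema_drift_columns": 1, "late_rows_pct": 0.02})
```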

“How confident can you be that when the policy team at your airline company changes something, it won’t impact your chatbot? You don't want it to hallucinate and say things that you’re going to be liable for, the way Air Canada’s chatbot did. And so the connection between that policy data and where it's used becomes critical. At a startup, that's relatively easy, because small teams are all in the same room, but at larger companies, you have different orgs in different places.”

Different companies at different stages will tend to have different problems. Larger orgs will often have more-complex problems that involve governance and customer privacy issues, as well as issues specific to their vertical. The co-founder suggests that the challenge of managing data pipelines has a lot of depth and variation across company size, vertical, and industry policies, with very few one-size-fits-all solutions.

How to Approach Building a Data Pipeline From First Principles

For those starting out, it’s a good idea to:

  • Avoid Too Many Assumptions: It’s important to inventory all data inputs and outputs and understand the impacts of upstream and downstream changes
  • Standardize Like an Engineer: Your pipeline will become another piece of operational infrastructure, so the earlier you get organized, document, and plan, the better
  • Start Simple: As mentioned above, it’s a good idea to avoid the overhead of over-engineered support systems before your pipeline itself matures

“When building a new data pipeline for the first time, the key thing to remember is: You don't know what you don't know. You need to not only understand the inputs to the data, but you also need to understand the outputs. And it will depend on your context and use case. If you’re an e-commerce company recommending clothes to shoppers, you'll look at the output of recommendations vs. the input of what customers actually bought.”

The co-founder suggests that having a software background and an operations mindset can be helpful, and that building a structured process of metrics and checks will be a great jumping-off point on which to build. “If you have that background, you can standardize how changes are made and updated and applied. Which lets you add in specific things in continuous integration (CI) or other quality assurance checks and balances.”

“All the spaces in the industry seem to be moving in this direction, because we simply have more engineering to manage. Maintaining these different moving pieces is ultimately a software engineering problem. So I would recommend starting simple, and don't over-engineer. You should be asking, ‘If I have a system and process that makes it easy to run tests and checks (and not only for the code), does the data also meet the shape that I expect?’”

“A few years ago, there was hype around data contracts, which essentially amount to someone writing tests to make sure that if I'm reading from the customer database and I'm going to consume or transform that data, or use it for recommendations or as input to a chatbot, someone has written some sort of expectation check at some point in the pipeline. In the context of software development processes, this is potentially akin to a CI system, which you can use as a scaffold to add in more checks and balances as your needs grow.”
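
As a minimal sketch of that idea, the consumer-side expectation check below uses plain Python and pandas; the column names, dtypes, and rules are hypothetical, and libraries such as pandera or Great Expectations offer richer versions of the same pattern:

```python
# A plain-Python "data contract": expectations a consumer writes against its input data.
# Column names, dtypes, and rules are hypothetical; libraries like pandera or
# Great Expectations provide richer versions of the same idea.
import pandas as pd

EXPECTED_DTYPES = {
    "customer_id": "int64",
    "email": "object",
    "signup_date": "datetime64[ns]",
}


def check_customer_contract(df: pd.DataFrame) -> None:
    missing = set(EXPECTED_DTYPES) - set(df.columns)
    assert not missing, f"missing columns: {missing}"
    for col, dtype in EXPECTED_DTYPES.items():
        assert str(df[col].dtype) == dtype, f"{col} is {df[col].dtype}, expected {dtype}"
    assert df["customer_id"].is_unique, "customer_id must be unique"
    assert (df["signup_date"] <= pd.Timestamp.now()).all(), "signup_date cannot be in the future"


# In CI, this could run as a pytest test against a fixture or a sampled extract, e.g.:
# def test_customer_contract():
#     check_customer_contract(pd.read_parquet("tests/fixtures/customers_sample.parquet"))
```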

“When building a new data pipeline for the first time, remember: You don't know what you don't know. You need to not only understand the inputs to the data, but you also need to understand the outputs. And it will depend on your context and use case.” -Stefan Krawczyk, Co-Founder/DAGWorks

How Teams Typically Start Their Data Pipeline Projects

Common first steps for data pipeline projects include:

  • Identifying Data Sources and Destinations: Start with your sources, your processing needs, and your destinations and map around them
  • Pipeline Components Will Evolve: You’ll end up versioning components in your pipeline as your needs change (and as those individual products change)
  • Scope Auths and Permissions: A significant portion of operating a successful pipeline is ensuring the parties that need the data can actually access the data

The co-founder notes that while much of the data management landscape is Python-based, since the tooling is primarily written in that language, more teams are using alternatives like TypeScript or JavaScript. Either way, technology choices shouldn’t trump best practices.

“The classic case of implementing a pipeline is starting with something that is ‘offline,’ meaning you have a production database, and you need that data to power some sort of data-driven or AI/ML product. The ‘classic’ way you’d do this is taking a snapshot of that database, then dumping it into a data store like an S3 bucket. But most data these days can fit on a single machine.”

“So then, you need to spin up some machine, which could be Kubernetes or an EC2 instance, or maybe even AWS Lambda. There are many ways to do this, considering the computation may take too long, requiring you to script something in Java or Python or Node to transform the data and output it elsewhere.”

“And that other place could be what we typically called our data warehouse, or it could be another data store, a vector database, or just a regular Postgres database. I think most developers are thinking about that full stack: They have an app, which then talks to a database. But all the data that goes into the database potentially needs to be accessed in a serving context that is not from your web API.”
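
A condensed sketch of that offline flow is shown below. The connection strings, bucket, table, and column names are hypothetical placeholders, and the snapshot step assumes pandas, SQLAlchemy, and boto3 are available; it illustrates the shape of the job rather than a production implementation.

```python
# Sketch of the "classic" offline flow: snapshot a production table, park the raw dump
# in object storage, transform it, and load it into an analytics destination.
# The connection strings, bucket, table, and column names are hypothetical placeholders.
import boto3
import pandas as pd
from sqlalchemy import create_engine

PROD_DB = "postgresql://readonly@prod-db/app"          # production read replica (assumed)
WAREHOUSE = "postgresql://etl@warehouse-db/analytics"  # analytics destination (assumed)
BUCKET, KEY = "example-data-dumps", "snapshots/users.parquet"


def run_snapshot_job() -> None:
    # 1. Snapshot: pull the table from production.
    users = pd.read_sql("SELECT * FROM users", create_engine(PROD_DB))

    # 2. Dump: land the raw snapshot in S3 so reprocessing doesn't hit prod again.
    users.to_parquet("/tmp/users.parquet", index=False)
    boto3.client("s3").upload_file("/tmp/users.parquet", BUCKET, KEY)

    # 3. Transform: keep only what downstream consumers need.
    active = users.loc[users["is_active"], ["user_id", "plan", "signup_date"]]

    # 4. Load: write to the destination store (warehouse, Postgres, vector DB, etc.).
    active.to_sql("active_users", create_engine(WAREHOUSE), if_exists="replace", index=False)


if __name__ == "__main__":
    run_snapshot_job()
```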

Full-stack app development teams need to come to terms with data and schema being a commitment unto themselves, requiring documentation and versioning like an API. They’ll also need to manage encapsulation and authorization issues, as the majority of the time, data management is about enabling the right teams with access to the data they need.

GenAI Projects and Data Freshness Requirements

As more teams pursue GenAI projects, data pipelines are changing by:

  • Being More ‘Online’: Rather than serving data from a standard database, more GenAI products require pipelines to serve live web requests
  • Potentially Requiring Up-to-the-Minute Data: Depending on their use cases and customer needs, some GenAI products require a high degree of data freshness
  • Potentially Requiring Different Data Update Treatments: Teams should scope data freshness needs and not over-engineer if data doesn’t constantly need to be refreshed

“Maybe I'm pulling context from [company] data, or pulling data from a document; I need to transform it, send it to an LLM, and get a response back. Some of this is the ‘what was traditionally more offline’ stuff, which happens more for data with respect to machine learning (ML). With GenAI, it’s turning into more of an online process, within the context of things like web requests.”

The co-founder notes that teams building new GenAI products need to understand their requirements for data freshness to build successfully. For products that don’t urgently require data with up-to-the-minute freshness, teams might consider offline processing systems, which are easier to maintain due to having fewer moving parts. For products that require constantly refreshed data, teams will need to look at alternative implementations to ensure their data flows and is received in a timely manner.

“So there are going to be levels of complexity. First, you’ll want to take that step back to ensure you understand your data freshness requirements. Unless your requirements are very high, you might look into scheduling and cron jobs as an approach to pull, transform, and push your data to its destination.” The co-founder recommends not skipping ahead directly to a high-maintenance online streaming system for data pipelines unless product and customer needs demand it.
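
For teams whose freshness needs are met by periodic batch updates, that scheduled approach can be as small as a script run by cron. The sketch below uses hypothetical helper functions as stand-ins for a team’s own extract, transform, and load steps:

```python
# refresh_docs.py -- a batch refresh job meant to be run on a schedule, e.g. hourly via cron:
#   0 * * * *  /usr/bin/python3 /opt/pipelines/refresh_docs.py
# The helper functions below are hypothetical stand-ins for a team's own ETL steps.
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("refresh")


def fetch_updated_documents(since_hours: int) -> list[str]:
    # Placeholder: a real job would query a CMS or document store for recent changes.
    return ["policy update 2024-06-01", "new baggage rules"]


def chunk_and_clean(docs: list[str]) -> list[str]:
    # Placeholder transform: a real job would clean, chunk, and embed the documents.
    return [doc.strip() for doc in docs]


def upsert_into_serving_store(chunks: list[str]) -> None:
    # Placeholder load: a real job would upsert into a vector DB or search index.
    log.info("would upsert %d chunks", len(chunks))


def refresh() -> None:
    started = datetime.now(timezone.utc)
    docs = fetch_updated_documents(since_hours=1)
    upsert_into_serving_store(chunk_and_clean(docs))
    log.info("refreshed %d documents in %s", len(docs), datetime.now(timezone.utc) - started)


if __name__ == "__main__":
    refresh()
```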

The Most Common Data Pipeline Mistakes Today

Some of the most common mistakes teams make with data pipelines include:

  • No Benchmarks Tied to Business Goals: It’s difficult to succeed without a clear understanding of ‘what matters’
  • No Standardized Authoring: Without standardized authoring, it’s not clear what effect future changes will have
  • Technical Debt: Without standardized authoring, and over time as projects change hands, previous coding and tool choices can hinder future projects

The co-founder suggests that one of the most common mistakes that teams make with their data pipeline projects is not setting up a series of metrics and benchmarks from the start. “If you have no way to measure the impact, you end up in a ‘management limbo’ that leads to overpromise and underdelivery. (And if you don't know what success is, that's not good for any kind of project, regardless.)”

Another common challenge teams face is the need to ship quickly while lacking standardized methods to author data pipelines, leading to painful technical debt over time. “Depending on how you author your project, it can end up becoming very hard to understand. When you make a change to your tools or processes, what’s the impact?”

“So for smaller teams, you’ll see issues along the lines of a product developer changing the production schema upstream, not knowing that this column and the values in it actually are used downstream. But as teams get larger, if there’s an issue with the schema or a range of values, maybe I can go in and fix it, but I don't know whether someone else actually had dependencies related to that change.”

The co-founder suggests that not having an understanding of how data flows is often the cause of trouble. “If you don't have the flexibility for your level of scale and SLA needs, that's where the problems start, since you basically don't know what to change, or how to change it, without knowing the impact of making any changes.”

“When you're building pipelines, you're effectively stitching together computation in various ways as part of various processes. One processor takes some data in, transforms it, puts it into another process, and so on. This is what should happen logically. But, usually the way that teams get into technical debt is that they actually couple how they execute things with logic.”

“Here’s an example: Let’s say you started with Lambda and you set up your pipelines to all assume that you're running on Lambda. That will mean a certain way of authoring things. It’s arguably great for independence, but it brings in other challenges. Alternatively, if you set things up to run on PySpark, all your jobs might be written in PySpark SQL, but you might find yourself having to switch to more of an online process. This is where you can get a lot of technical debt: from coupling the infrastructure you choose to how the logic is implemented.”
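
One hedged illustration of the alternative is to keep transformation logic in plain functions and wrap it with thin, runtime-specific adapters; the function names and event shape below are hypothetical:

```python
# Sketch: keep transformation logic as plain functions, with thin adapters per runtime.
# The event shape and function names are hypothetical.

# --- transforms.py: pure logic, no infrastructure imports ---
def enrich_order(order: dict) -> dict:
    order["total"] = order["quantity"] * order["unit_price"]
    return order


# --- lambda_handler.py: a thin AWS Lambda adapter around the same logic ---
def handler(event, context):
    return [enrich_order(order) for order in event["orders"]]


# --- batch_job.py: the same logic reused in an offline batch or cron context ---
def run_batch(orders: list[dict]) -> list[dict]:
    return [enrich_order(order) for order in orders]


if __name__ == "__main__":
    print(run_batch([{"quantity": 2, "unit_price": 9.99}]))
```

Swapping Lambda for PySpark or an online service then means writing another thin adapter rather than rewriting the transformation logic itself.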

“When you're building pipelines, you're effectively stitching together computation in various ways as part of various processes. The way that teams get into technical debt is that they actually couple how they execute things with logic.”

How Lineage Tracking and Modularity Reduce Tech Debt

Teams can reduce tech debt for their data pipelines as they scale by:

  • Building Traceability and Lineage Into the Process: Knowing the origin of data changes is crucial for triaging downstream issues
  • Simplifying Metrics Into Dashboards Built Around Business Goals: It’s a good idea to set up reporting that filters out operational noise and focuses on business outcomes
  • Using Modular Frameworks That Scale: Modular frameworks that scale well with different dataset sizes and operational contexts incur less debt over time

“My team is working on Hamilton and Burr, two Python-based open source frameworks. The idea behind both of them is that if you can draw a flowchart for your data pipeline, you can basically use one of the frameworks. Burr is about conditional branching and dynamic looping in those flowcharts, particularly for teams working with agents; Hamilton is about transformation processes for time-series forecasting, retrieval-augmented generation (RAG) document processing pipelines, or transforming customer data for ML models.”

“Both of the frameworks make things really easy to use and test, without you having to engineer for it. You can always see the lineage or tracing: how data flows, as a flowchart, so you can then see how things progress. It’s about helping you structure your code in a way that helps you understand the connections that exist between different transformations.”
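
To make that declarative style concrete, here is a minimal sketch based on Hamilton’s documented pattern, in which each function is a node in the dataflow and its parameter names reference upstream nodes or provided inputs. The column names are hypothetical, and exact API details may vary across Hamilton versions:

```python
# A minimal sketch of Hamilton's documented style: each function is a node in the
# dataflow, and its parameter names reference upstream nodes or provided inputs.
# Column names are hypothetical, and exact API details may differ across versions.
import sys

import pandas as pd
from hamilton import driver


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    # Depends on the externally provided "spend" and "signups" inputs.
    return spend / signups


def rolling_spend(spend: pd.Series) -> pd.Series:
    # A second transform over the same upstream input.
    return spend.rolling(3, min_periods=1).mean()


if __name__ == "__main__":
    # The driver crawls this module, builds the DAG, and records how outputs were produced.
    dr = driver.Driver({}, sys.modules[__name__])
    result = dr.execute(
        ["spend_per_signup", "rolling_spend"],
        inputs={
            "spend": pd.Series([10.0, 20.0, 30.0]),
            "signups": pd.Series([1, 2, 3]),
        },
    )
    print(result)
```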

“This means one less thing to engineer for when going to production, because you can understand how this column in this table was created, or what this output meant and what was upstream.” The co-founder notes that going zero-to-one with data pipelines is relatively easy, but zero-to-N can become cumbersome quickly with multiple authors and needs for debugging and observability, which can eventually require a total pipeline rewrite.

“With our frameworks, we’ve built a lot of platform hooks to enable ease of observability and data collection. They also come with self-hostable UIs that provide an out-of-the-box way to see how things operate, with the idea being that we can slowly consolidate into a single pane of glass that is a better experience than juggling several different data quality tools, data monitoring tools, and a separate catalog for lineage and tracing.”

The co-founder notes that his efforts were informed by his own experience managing platform teams to avoid technical debt. “Ideally, people shouldn’t have to rewrite much logic to swap out where code runs, which leads to less technical debt and more ability for your data scientists or ML teams to collaborate because you can define better boundaries. The decoupling makes your projects more modular, which means you can run things in different contexts [like Jupyter notebooks, PySpark jobs, FastAPI].”

The co-founder suggests that as GenAI has seen greater adoption, the market has greater demand for responsive data pipelines that teams can launch quickly and tweak iteratively. Teams that don’t have to re-instrument and can quickly understand where their data issues lie see less operational burden. “But you will need to build up some sort of view of your data in terms of what the distribution is, and whether you’re getting the business outcomes you want.”

How to Think About Data Pipelines for Building AI Agents

Teams considering agentic projects should first consider:

  • Getting Data Management Processes in Order: To run successful agentic programs, teams need to fully understand data inputs and outputs
  • Being Able to Fully Scope Simple Workflows: Before diving into agent building, it’s important to be able to fully sketch out and understand the steps of the human workflow you want to replicate
  • Iterating Gradually: AI agents given the power to make workflow decisions will need to be managed carefully

The co-founder suggests that building agentic products should also begin with a first-principles approach that starts from inputs and outputs. “This is why I think there has been a larger focus on tracing. If you think of distributed tracing, most people are using one of the foundational model providers, so that's an API call.”

“And so, you need to instrument the API call, but then you also need to instrument what's upstream and downstream.” The co-founder suggests that teams that haven’t figured out the basics of data management may struggle if they rush to build agents immediately.

“You really need to understand your data and what's going on with it before you can fully automate it. And so, for agentic-like projects, we’re often referring to using an LLM with tool calling. Here’s an example of a very specific agentic workflow: If a finance person works in accounts payable, they're always getting a certain series of PDFs. They always need to extract certain aspects of those files. Very deterministic stuff.”

“So for a PDF, we’d first target the extraction of various properties you would need from that file. In such cases, I would recommend focusing on explicit, clear workflows, in the same way you’d write out instructions for a human being on what to do.”
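
A hedged sketch of that kind of explicit, deterministic workflow follows. The call_llm function is a hypothetical placeholder for whichever model provider a team actually uses, and the invoice fields are illustrative:

```python
# Sketch of an explicit accounts-payable extraction step, written the way you'd write
# instructions for a person. `call_llm` is a hypothetical placeholder for whichever
# model provider the team actually uses; the field names are illustrative.
import json


def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model provider's API here.
    return json.dumps({"vendor": "ACME Corp", "invoice_number": "INV-123", "total": "1999.00"})


def extract_invoice_fields(pdf_text: str) -> dict:
    # Step 1: ask for exactly the fields the workflow needs, in a structured format.
    prompt = (
        "Extract vendor, invoice_number, and total from this invoice text. "
        "Respond with JSON only.\n\n" + pdf_text
    )
    raw = call_llm(prompt)

    # Step 2: validate the response instead of trusting it blindly.
    fields = json.loads(raw)
    missing = {"vendor", "invoice_number", "total"} - fields.keys()
    if missing:
        raise ValueError(f"extraction incomplete, missing: {missing}")

    # Step 3: deterministic post-processing, e.g. normalizing the amount.
    fields["total"] = round(float(fields["total"]), 2)
    return fields


if __name__ == "__main__":
    print(extract_invoice_fields("ACME Corp Invoice INV-123 ... Total due: 1999.00"))
```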

“And as your process gets stronger, and you understand what’s going on while measuring the business outcomes and metrics, this is where you can start to build in more complexity. And then, maybe you’d start incorporating agents that dynamically make decisions on what action to take next. (I teach a course on topics like this, so feel free to reach out with questions.)”

“As your process gets stronger, and you understand what’s going on while measuring the business outcomes and metrics, this is where you can start to build in more complexity. And then, maybe you’d start incorporating agents.”

Final Thoughts and Best Practices for Data Pipelines

Some overall takeaways for building data pipelines and managing data for agentic programs:

  • Don’t Get Wrapped Around the Axle: During the implementation process, it’s common to overindex on edge cases
  • Don’t Try to Fit Everything Into an Agile Sprint: Data pipelines likely won’t fit into a traditional Agile sprint, and will require additional maintenance and monitoring
  • Get Working Prototypes and Workflows Before Investing in Tools: Before buying the shiny new data product, it’s a good idea to fully scope out your use case first

The co-founder rattles off a list of final best practices for teams to consider: “For any business project, you need to have clear business metrics of the value your project will bring to help orient your first version against expectations. Also, it’s generally a good idea to never promise 100% of all the things we’ve discussed here, as they probably won’t fit into an Agile sprint context.”

The co-founder suggests that a common pitfall is how easy it is to find use cases for which a scoped data pipeline should theoretically work flawlessly. “Where most people spend most of their time in building and bringing these things into production is working on those edge cases and figuring out: How do I transform the data in the correct way? Or, how do I tweak the prompt? Or, what do I need to observe? What are the extra guardrails I need to add to make sure we're providing the customer experience we expect?”

The co-founder reiterates: Data pipeline projects often don’t fit neatly into Agile sprints. “It’s very easy to build a prototype, but harder to productionize it. You need something more of a six-week kind of sprint model, where you allow for not only building a prototype, but also lots of data collection to really understand and measure what the pattern of edge cases is. Basically: What is this product getting wrong most of the time? And, how can we fix it?”

The co-founder notes that these decisions should precede technology choices. “You’ll notice I haven’t mentioned alternatives like vector databases yet. That's because technology is really a means to an end. You need to really understand the failure modes first to know what solutions you need.”

It’s also important to realize that data pipelines will become part of your operational infrastructure. “This type of project is not a one-and-done. In traditional software development, it's very easy, within sprint models, to call a feature ‘done’ and leave it as is. You’ve written your tests, you deployed, it’s done. That's not true here, since you need to continue to monitor things, especially your LLM calls.”

“For LLM calls, you need to ensure that you are providing time in your engineering schedule for maintenance, looking at data, and ensuring that things are working. This is something of a carrying cost that is a little different than a regular software engineering delivery model. Overall, a data pipeline project will probably take you a little longer than a standard Agile sprint.”