AI & Machine Learning, Software Development

Data Annotation Pipeline

Author

Jacob Fenner


The Problem

A client had a large volume of data flowing into their system from mobile app uploads. All of it needed to be reviewed, but the volume scaled too quickly for human review to keep up, so AI had to take on the bulk of the work. The initial pipeline built for this could do the job, but it turned out to be more expensive than anticipated, so we needed a solution that was cost effective while still turning the resulting data around quickly.

As a result, I built this new pipeline: raw data comes in at one end via the mobile app, runs through whichever AI-driven analyses it needs, and structured results drop out the other end. The whole pipeline is serverless, billed per millisecond, and costs roughly 95% less than the legacy workflow it replaced.

Let's define a few terms first

A handful of words in this post are doing a lot of work, so it's worth pinning them down:

  • Experiment — one self-contained AI analysis: one model, one prompt, one purpose. Examples might be "classify this" or "extract key entities." Each experiment is its own small pipeline, deployed and scaled independently. The word "experiment" is deliberate — see Environment below.
  • Step — a single serverless function inside an experiment. An experiment is a short chain of steps (prepare → submit → monitor → route → process). Steps hand off to each other via events, so no step ever sits idle waiting for another.
  • DAG (directed acyclic graph) — the dependency graph between experiments. If Experiment B can only run after Experiment A finishes, that's an edge in the DAG. The graph is stored in a database table, so adding a dependency is a data change, not a code change (a toy sketch of that table follows these definitions).
  • Environment — a complete, independent copy of the pipeline. Not just dev/qa/prod — the number of environments is arbitrary, and which one a record flows through is determined by the GCS (Google Cloud Storage) bucket it was uploaded to. Each environment has its own bucket, its own functions, its own DAG of experiments, its own model and prompt choices. That's why these analyses are called experiments: multiple production environments can run side-by-side, each processing the same incoming data through a different mix of analyses, like an A/B test on real traffic.
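
Because the DAG is just rows in a table, it helps to picture the data. This is a toy, illustrative stand-in, not the real schema or experiment names:

```python
# Illustrative only: a toy stand-in for the experiments table.
# Each row names an experiment and the experiments it depends on.
EXPERIMENTS = [
    {"name": "ai_review",         "depends_on": []},
    {"name": "entity_extraction", "depends_on": []},
    {"name": "final_scoring",     "depends_on": ["ai_review", "entity_extraction"]},
]
```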

Here's the model at the highest level:


Two layers, with control bouncing between them:

  • Outer pipeline — one entry point for all incoming data. It owns the graph: a database-defined DAG of which experiments depend on which. When data arrives, it does some basic verification, then dispatches whichever experiments have no upstream dependencies. When any experiment finishes, control returns here, the DAG is re-evaluated, and any newly-unblocked experiments are triggered. No business logic lives in this layer — it's pure routing.
  • Inner pipeline — each experiment is its own short chain of serverless functions. Prepare the input, call the model, watch the batch, store the result, hand control back to the outer pipeline. Each experiment is independent infrastructure, so adding a new one doesn't touch the others.

The key idea: experiments don't know about each other. They only know how to do their one job and how to report "done." Sequencing, parallelism, and dependency resolution all live in the outer pipeline's evaluation of the DAG.
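
To make "pure routing" concrete, here is a rough sketch of the dispatch check, reusing the toy table from the definitions above. It's illustrative, not the production code:

```python
def unblocked_experiments(experiments, completed, dispatched):
    """Return experiment names whose upstream dependencies have all finished
    and which haven't already been dispatched."""
    ready = []
    for exp in experiments:
        if exp["name"] in completed or exp["name"] in dispatched:
            continue
        if all(dep in completed for dep in exp["depends_on"]):
            ready.append(exp["name"])
    return ready

# Example: once ai_review and entity_extraction both report "done",
# re-evaluating the DAG unblocks final_scoring.
print(unblocked_experiments(
    EXPERIMENTS,
    completed={"ai_review", "entity_extraction"},
    dispatched=set(),
))  # ['final_scoring']
```

Every time an experiment reports completion, the outer pipeline re-runs this check and publishes one trigger message per newly unblocked experiment.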


Why it's fast

The old way: one long-running worker holds onto a machine for the entire job, idle most of the time, waiting on an external API.

The new way: every step is a separate Cloud Function. When a step is waiting on a model response or other action, nothing is running. The next function only spins up when there's actual work to do.

  • Steps chain together via Pub/Sub — no orchestrator polling, no schedulers, no idle compute (a rough sketch of the hand-off follows this list).
  • Independent experiments fan out in parallel automatically — the DAG dispatcher triggers everything whose dependencies are met, in one shot. Three experiments with no shared deps means three pipelines running simultaneously, not three times the wall-clock time.
  • Batch APIs are used wherever the latency budget allows, which collapses thousands of individual calls into a handful of large jobs.
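
Here is roughly what one hand-off looks like, assuming a Pub/Sub-triggered Cloud Function. The project id, topic name, and payload shape are illustrative:

```python
import base64
import json

import functions_framework
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
NEXT_STEP_TOPIC = publisher.topic_path("my-project", "ai-review-submit")

def do_step_work(payload: dict) -> dict:
    # Placeholder for this step's single job.
    return {"record_id": payload["record_id"], "status": "prepared"}

@functions_framework.cloud_event
def prepare_step(cloud_event):
    # Pub/Sub delivers the message body base64-encoded inside the CloudEvent.
    payload = json.loads(base64.b64decode(cloud_event.data["message"]["data"]))

    result = do_step_work(payload)

    # Publish the hand-off and exit: nothing sits running while the next
    # step's work is still pending.
    publisher.publish(NEXT_STEP_TOPIC, json.dumps(result).encode("utf-8"))
```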

End-to-end, a record that used to take minutes of processing time now takes a couple of seconds, and the system scales horizontally without any infrastructure changes. When a large burst of traffic comes in, each step in each experiment scales independently, so no time is lost waiting on another step to finish.


Why it's cheap

The cost win comes from one principle: you don't pay for idle time when there is no idle time.

  • No worker sits waiting on an API call — the function terminates and a new one wakes when the response is ready.
  • Cloud Functions bill per-millisecond, not per-hour.
  • Parallel fan-out happens through a managed Pub/Sub topic — no queue infrastructure to run.
  • Failed calls retry with exponential backoff inside the function rather than re-running the whole pipeline (sketched just after this list).
  • Vertex AI Batch Prediction is used for inference wherever the latency budget allows — roughly half the price of synchronous Vertex calls for the same model. More on this in the AI-review section below.
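
The retry helper is only a few lines. A hedged sketch of the idea, with illustrative defaults:

```python
import random
import time

def retry_with_backoff(call, attempts=5, base_delay=1.0, max_delay=30.0):
    """Run call(), retrying on exception with exponential backoff plus jitter,
    so a transient failure never forces the whole pipeline to re-run."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```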

At a typical input size, the all-in cost per record looks roughly like this:

  • ~$0.003 for the orchestration and compute — every Cloud Function in the chain, the batch manager, the Pub/Sub plumbing, the GCS round-trips. The bulk of that comes from the per-record processing function at the end of the chain; everything upstream is closer to free. Each step outlined below runs for only milliseconds, but the processing function has enough work to do that it takes 2-6 seconds.
  • ~$0.009 for the Vertex AI model call itself, using the smallest available model that can handle our needs. This is trimmed further by cutting the input tokens down to the bare minimum needed to accurately review the given data. With a larger model or fuller token usage, the same call could have cost between $0.018 and $0.036.

So roughly three quarters of the per-record cost is the model, and the entire pipeline around it is the cheap part. That's the goal: drive the orchestration cost down to a rounding error so what you pay for is almost exclusively the work that actually moves the product forward.
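
For concreteness, the arithmetic behind those figures (the legacy number is implied by the roughly-95% saving, not separately measured):

```python
orchestration = 0.003                      # Cloud Functions, Pub/Sub, GCS, batch manager
model_call = 0.009                         # smallest workable model, trimmed input tokens

per_record = orchestration + model_call    # ~$0.012 all-in per record
model_share = model_call / per_record      # ~0.75, i.e. the model is ~3/4 of the cost
implied_legacy = per_record / (1 - 0.95)   # ~$0.24 per record implied for the old workflow
```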


A real example: the AI-review experiment

The whole point of this pipeline is AI review — running a frontier multimodal model over the incoming data to grade it, classify it, and extract structure. So it's worth zooming in on one experiment: the heaviest one, the one that does most of the actual model work.

Following the data through it:

  1. Prepare. When a new record lands in the experiment, a prepare function builds a Gemini request for it and parks that request in a queue.
  2. Submit. Every few minutes, a scheduled function sweeps the queue and checks whether anything is ready to submit. Once X requests have accumulated, or X minutes have passed, it submits them as a single Vertex AI Batch Prediction job. This balances batching enough requests together against making users wait too long for a response.
  3. Monitor. A separate scheduled function polls Vertex for batch status, specifically watching for jobs that have timed out or gotten stuck, and marking them so they can be retried or escalated. Jobs that run to completion, successfully or not, don't need monitoring at all — when a batch finishes, Vertex drops the results into a GCS bucket on its own.
  4. Route. That GCS write fires an event that wakes a routing function. The result file from Vertex can be enormous — one file containing the model output for every record in the batch — so the routing function splits it: it walks the file, breaks it into per-record result files written back to GCS, and emits one completion message per record. Each downstream processor only ever sees its own slice, never the whole batch (this step is sketched just after the list).
  5. Process. Each completion message wakes a per-record processor that parses the model output, validates it, calculates scores, and stores the structured result. Then it phones back to the outer pipeline to advance the DAG.
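
A rough sketch of the route step under stated assumptions: the bucket layout, topic name, and the presence of a record_id on each result line are illustrative, not the real conventions.

```python
import json

import functions_framework
from google.cloud import pubsub_v1, storage

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
COMPLETION_TOPIC = publisher.topic_path("my-project", "ai-review-record-done")

@functions_framework.cloud_event
def route_batch_results(cloud_event):
    """Triggered by the GCS object-finalized event fired when Vertex writes
    the batch output. Splits the big file into per-record slices."""
    event = cloud_event.data
    bucket = storage_client.bucket(event["bucket"])
    batch_blob = bucket.blob(event["name"])

    # Batch output is one JSON object per line; walk it record by record.
    for line in batch_blob.download_as_text().splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        record_id = row["record_id"]  # assumes the id was carried through the request

        # Write this record's slice back to GCS...
        bucket.blob(f"results/{record_id}.json").upload_from_string(json.dumps(row))

        # ...and wake exactly one per-record processor.
        publisher.publish(
            COMPLETION_TOPIC, json.dumps({"record_id": record_id}).encode("utf-8")
        )
```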

The reason this shape matters: Vertex AI Batch Prediction is roughly half the price of synchronous inference. You hand it a big batch, it runs when capacity is available, and you pay the discounted rate. For a workload that doesn't need sub-second response, this is free money — and it stacks on top of every other cost win in the pipeline.


What it looks like to add a new analysis

The whole pipeline — every Cloud Function, every Pub/Sub topic and subscription, every GCS bucket, every IAM binding, every Cloud Scheduler trigger, every monitoring dashboard and alert — is infrastructure-as-code in Terraform. Nothing about this system exists outside of source control. That's what makes the "add a new experiment" workflow tractable: the infrastructure isn't something you set up, it's something you declare.

Adding a new experiment takes four steps:

  1. Write the experiment-specific logic. Every experiment's steps pull from a shared Python library — Gemini client, batch manager, structured logging with trace IDs, retry-with-backoff helpers, GCS utilities, API client, etc. What you actually write is the prompt assembly and the result parser. Most new experiments are one prepare function and one process function; everything else is reuse (sketched after these steps).
  2. Write the config. Which model, which prompt template, which input fields.
  3. Add a row to the experiments table. This declares the experiment's upstream dependencies in the DAG.
  4. terraform apply. The new experiment ships as a Terraform module call; all of its functions, topics, subscriptions, schedulers, IAM, and dashboards get created, wired up, and deployed in one command.
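
To give a feel for step 1, here is roughly the shape of a new experiment's custom code. The shared-library module and function names below are placeholders for illustration, not the real ones:

```python
import json

# Hypothetical names standing in for the shared library described above.
from shared_lib import batch_queue, gemini_client, results_store

PROMPT_TEMPLATE = "Classify the following record and return JSON:\n{record}"

def prepare(record: dict) -> None:
    """Build the model request for one record and park it in the batch queue."""
    request = gemini_client.build_request(
        model="<from-config>",   # which model to use comes from the experiment's config
        prompt=PROMPT_TEMPLATE.format(record=json.dumps(record)),
        record_id=record["id"],
    )
    batch_queue.enqueue("my_new_experiment", request)

def process(result: dict) -> None:
    """Parse, validate, and store one record's model output."""
    parsed = json.loads(result["response_text"])
    results_store.save("my_new_experiment", result["record_id"], parsed)
```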

That's it. The outer pipeline picks up the new node in the DAG automatically — nothing about routing, ordering, or parallelism needs to change. A new analysis goes from idea to running in production in an afternoon, not a sprint.


One pipeline, any number of environments

The conventional environment mental model is three slots: dev, qa, prod. This pipeline doesn't think that way. The number of environments is arbitrary, and which environment a record flows through is decided by which GCS bucket it was uploaded to. The mobile app picks the bucket; everything downstream follows.
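
The routing decision itself is tiny. Conceptually it's something like this on the upload side, with illustrative bucket names:

```python
# Illustrative: the app's only routing decision is which bucket it uploads to.
UPLOAD_BUCKETS = {
    "prod-a": "uploads-prod-a",
    "prod-b": "uploads-prod-b",
    "dev":    "uploads-dev",
}

def upload_bucket_for(environment: str) -> str:
    """Pick the upload bucket; everything downstream follows from this choice."""
    return UPLOAD_BUCKETS[environment]
```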

That sounds like a small detail, but it's the thing that makes real production experimentation possible:

  • Multiple production environments can run in parallel on the same incoming data, each with its own DAG of experiments, its own model choices, its own prompts.
  • Or, instead of having the same data go through different processing, we can also ask different sets of users for different sets of data, and process each set differently according to the requirements. This way, we don't have to paint ourselves into a corner of only requesting one generic form of data.
  • A new environment is a Terraform module call — describe it once, terraform apply, and you have a complete independent copy of the pipeline: its own bucket, its own functions, its own topics, its own dashboards.
  • There's no code branching, no feature flags, no shared state to reason about. Each environment is its own self-contained world.

This is what makes the analyses honest experiments: you can stand up environment B alongside environment A, route a slice of real production traffic to each, and compare the results directly. The pipeline isn't just deployed to production — it's actually run, in production, in several configurations at once.


Reflection

The thing I keep coming back to: most "expensive" data pipelines aren't expensive because of the work — they're expensive because of the waiting. When you decouple the orchestration from the workers, and let the cloud scale each step independently, the cost drops to roughly the cost of the actual API calls. Everything else is just plumbing.

Working on this processing pipeline has been some of the most engaging and challenging development that I've had the opportunity to work on in my career so far. It's pushed my own perception of what I am capable of conceptualizing and building. Instead of feeling stuck in a rut of building simple web or mobile apps, this has felt like the kind of real engineering that made me fall in love with software development in the first place.