In traditional software development, the CI/CD pipeline is our bedrock. It’s the automated process that takes our code, runs it through a gauntlet of tests, and deploys it with confidence. Unit tests, integration tests, and end-to-end tests ensure that every new feature or bug fix doesn't break what's already working.
But what happens when the "code" isn't deterministic? What happens when it's a prompt, a language model configuration, or a complex RAG pipeline? Welcome to the new frontier of building Services-as-Software. A simple change to a prompt might pass a basic syntax check, but it could silently degrade response quality, increase latency, or triple your operational costs. The old testing paradigms are no longer enough.
This is where you close the loop. By integrating data-driven AI experimentation directly into your CI/CD pipeline, you can evolve your AI applications with the same rigor and safety as traditional software. Let's explore how to build this modern AI development workflow using experiments.do.
Testing AI components, especially those powered by LLMs, presents unique challenges that traditional CI pipelines aren't built to handle: outputs are non-deterministic, quality is subjective, and regressions in cost or latency never show up as a failing unit test.
Without a systematic way to test, every AI change is a roll of the dice. You're flying blind, hoping that your latest "improvement" is actually an improvement.
Experiments.do is an agentic testing platform built for the modern AI stack. It allows you to A/B test your AI—from individual prompts and models to complete RAG pipelines—to find the optimal configuration for any workflow.
Instead of relying on guesswork, you define the success metrics that matter to your business: cost per request, response latency, and quality measures like customer sentiment.
By running controlled experiments, you transform subjective prompt engineering into a data-driven science.
Integrating experiments.do into your CI/CD pipeline creates an automated quality gate for your AI. It ensures that only statistically superior changes make it to production. Here’s how it works.
A developer makes a change in a feature branch. This could be anything: a revised prompt, a different model configuration, or a tweak to a RAG pipeline.
They push the change and open a pull request (PR). This is the trigger.
Your CI pipeline (e.g., GitHub Actions, GitLab CI) kicks off. Alongside your usual unit tests, you add a new stage: "AI Evaluation".
In this stage, a script uses the experiments.do SDK to define and run an experiment comparing the "control" (your main branch) against the "variant" (your feature branch).
```typescript
// ci-experiment-script.ts
import { Experiment } from 'experiments.do';
import { getPromptFromBranch } from './utils'; // Helper to read a prompt file from a git branch

// 1. Get the control and variant configurations
const controlPrompt = getPromptFromBranch('main');
const variantPrompt = getPromptFromBranch('feature/empathetic-responses');

// 2. Define the experiment
const exp = new Experiment({
  name: `PR-113-Support-Response-Prompt-Test`,
  variants: {
    'control': { prompt: controlPrompt },
    'variant': { prompt: variantPrompt }
  }
});

// 3. Run the experiment against a golden dataset of test queries
const results = await exp.run({
  dataset: 'golden-support-queries', // A predefined dataset in experiments.do
  metrics: ['cost', 'latency', 'customer_sentiment']
});

// 4. Output the results for the CI pipeline to use
console.log(JSON.stringify(results.summary));
```
This script programmatically sets up an A/B test. The platform now runs both prompts against a predefined set of evaluation queries and collects data on your defined metrics.
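The getPromptFromBranch helper is left to you. A minimal sketch, assuming prompts are stored as plain-text files in the repository (the prompts/support-response.txt path below is hypothetical), might simply shell out to git:

```typescript
// utils.ts — hypothetical helper; assumes prompts live as text files in the repo
import { execSync } from 'node:child_process';

export function getPromptFromBranch(
  branch: string,
  path = 'prompts/support-response.txt' // hypothetical path
): string {
  // `git show <branch>:<path>` prints the file contents as they exist on that branch
  return execSync(`git show ${branch}:${path}`, { encoding: 'utf-8' }).trim();
}
```

Note that in CI you may need to fetch the relevant branches first, since shallow clones won't have them available locally.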
Once experiments.do completes the run, the statistical results are available via the API. Your CI script fetches this summary and acts as an intelligent gatekeeper.
You can configure rules based on the outcome: fail the check if cost or latency regresses beyond a set threshold, require the variant to show a statistically significant win on a key quality metric, or simply surface the numbers for a human reviewer. A minimal gate is sketched below.
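What that gate looks like depends on your thresholds, but here is a minimal sketch. It assumes the summary printed by the experiment script was saved to a file, and the per-metric shape shown here (control/variant values) is an assumption:

```typescript
// ci-gate.ts — a sketch of a CI quality gate; summary shape and file path are assumptions
import { readFileSync } from 'node:fs';

interface MetricSummary {
  control: number;
  variant: number;
}

// Assumes a previous CI step wrote the printed summary to experiment-summary.json
const summary: Record<string, MetricSummary> = JSON.parse(
  readFileSync('experiment-summary.json', 'utf-8')
);

const cost = summary['cost'];
const sentiment = summary['customer_sentiment'];

// Example rules: block if cost rises more than 5%, or if sentiment drops at all
const costRegressed = cost ? cost.variant > cost.control * 1.05 : false;
const sentimentRegressed = sentiment ? sentiment.variant < sentiment.control : false;

if (costRegressed || sentimentRegressed) {
  console.error('AI Evaluation failed: the variant regresses on cost or sentiment.');
  process.exit(1); // A non-zero exit fails the CI stage and blocks the merge
}

console.log('AI Evaluation passed: the variant meets or beats the control.');
```

The same pattern extends to whatever rules matter to you: block on latency, require statistical significance, or only warn and let a reviewer decide.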
The most powerful pattern is to have the CI job post the experiment results as a comment on the pull request.
CI Bot Comment on PR #113:
🧪 Experiment "PR-113-Support-Response-Prompt-Test" complete!
| Metric | Control | Variant | Change | Winner |
| --- | --- | --- | --- | --- |
| cost | $0.021 | $0.018 | -14.3% | ✅ variant |
| latency | 450ms | 430ms | -4.4% | variant |
| customer_sentiment | 0.85 | 0.92 | +8.2% | ✅ variant |

Result: The variant is a statistically significant winner on cost and sentiment. Ready to merge.
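Posting a comment like the one above is a small piece of glue code. A hedged sketch using Octokit (building the table text from results.summary is elided; the owner, repo, and PR number below are hypothetical and would normally come from your CI environment):

```typescript
// post-comment.ts — sketch of posting experiment results to the PR via Octokit
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// The body would be rendered from results.summary; a placeholder string is shown here
const body =
  '🧪 Experiment "PR-113-Support-Response-Prompt-Test" complete!\n\n<metrics table here>';

await octokit.issues.createComment({
  owner: 'your-org',   // hypothetical
  repo: 'your-repo',   // hypothetical
  issue_number: 113,   // the PR number; PR comments go through the issues API
  body,
});
```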
With this data in hand, your team can merge the pull request with absolute confidence. They aren't just merging code; they are merging a proven improvement.
Once merged, your CD pipeline takes over as usual, deploying the new, optimized AI component to production. You have successfully closed the loop, creating a repeatable, safe, and data-driven system for continuous AI improvement.
Integrating AI A/B testing into your CI/CD pipeline offers transformative benefits: regressions in quality, cost, and latency are caught before they reach production, prompt engineering becomes a data-driven discipline, and every merge is a measured improvement rather than a hopeful guess.
The era of "prompt-and-pray" is over. By treating AI development as a rigorous engineering discipline, you can build and iterate on AI-powered services faster, safer, and more effectively than ever before.
Ready to build your AI safety net? Head to experiments.do to start A/B testing and optimizing your AI components.
Q: What kind of AI components can I test with Experiments.do?
A: You can A/B test anything from individual prompts and LLM models to entire RAG pipelines and agentic workflows. It's designed for both granular component testing and end-to-end evaluation.
Q: Can I test different Large Language Models (LLMs)?
A: Yes. Experiments.do is model-agnostic. You can easily set up experiments to compare the performance and cost of models from OpenAI, Anthropic, Google, open-source providers, and more.
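For example, reusing the Experiment API from the CI script above, a model comparison might look like the sketch below. The model field on each variant is an assumption about the configuration shape, and the model names are placeholders:

```typescript
// Hedged sketch: comparing two models with the same prompt
import { Experiment } from 'experiments.do';

const supportPrompt = 'You are a helpful, empathetic support agent...'; // placeholder prompt

const modelTest = new Experiment({
  name: 'Support-Response-Model-Comparison',
  variants: {
    'control': { prompt: supportPrompt, model: 'openai/gpt-4o' },        // assumed field
    'variant': { prompt: supportPrompt, model: 'anthropic/claude-sonnet' } // assumed field
  }
});

const results = await modelTest.run({
  dataset: 'golden-support-queries',
  metrics: ['cost', 'latency', 'customer_sentiment']
});

console.log(JSON.stringify(results.summary));
```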
Q: How does Experiments.do integrate into my existing workflow?
A: Experiments.do is an API-first platform. You can trigger and manage experiments directly from your codebase using our simple SDK, making it easy to embed continuous, data-driven improvement into your development lifecycle, including CI/CD pipelines.