CI/CD for ML Models: GitHub Actions, Docker, and Kubernetes

Introduction

You have trained a model. It performs well on your laptop. Now you need to put it into production, where real users send requests to it, where it must stay running reliably day and night, and where it will need to be updated regularly as the model improves or data changes.

This is where most ML practitioners hit a wall. Training a model is a single event you can run manually. Deploying and maintaining it is an ongoing engineering discipline. The gap between those two realities is larger than most people expect when they first encounter it.

The core challenge is that ML models are not deployed once. They are deployed repeatedly, because new data arrives and the model is retrained, because feature engineering logic changes, because a newer model version performs better, because dependency upgrades are needed, and because bugs are discovered and fixed. Without a structured process, each of those events becomes a manual, stressful, error-prone operation.

The solution is a CI/CD pipeline: an automated system that tests, packages, and deploys your model every time a change is made, reproducibly and safely. This post builds the conceptual and practical foundation for a complete pipeline using GitHub Actions for automation, Docker for packaging, and Kubernetes for scalable deployment.

Figure: A Continuous Delivery pipeline illustrating how code changes flow through automated build, test, and deployment stages with feedback loops at each checkpoint, the same pattern applied to ML models in a CI/CD pipeline. Source: Grégoire Détrez (original by Jez Humble) / Wikimedia Commons (CC BY-SA 4.0)

Problem Statement

Traditional software deployment is already challenging, but ML deployment is uniquely difficult because multiple kinds of artifacts must all work together in exactly the right combination. In regular software, you deploy code. In ML, you deploy code, a trained model file, preprocessing logic, and a specific set of library versions, all of which must be consistent with each other.

Without automation, teams commonly encounter a situation where nobody is sure which model version is actually running in production, where a library upgrade on the server causes silent prediction errors, or where rolling back to a previous model version requires reconstructing the original environment from memory. These are not edge cases. They are the default outcome of manual deployment practices.

Existing general-purpose deployment tools do not address ML-specific concerns such as validating that a model artifact loads correctly, that predictions pass sanity checks, or that accuracy has not regressed below a minimum threshold. CI/CD for ML must handle all of these.

Core Concepts and Terminology

Term	Definition
Continuous Integration (CI)	Automatically running tests and validation checks every time code is pushed to a repository, before changes reach production.
Continuous Deployment (CD)	Automatically packaging and deploying code after CI passes, with no manual steps required.
Docker	A tool that bundles code, dependencies, and a model artifact into a single portable container image that runs identically on any machine.
Container Image	A read-only snapshot of a container, including all files, libraries, and configuration needed to run an application.
Container Registry	A storage service for container images. GitHub Container Registry (GHCR) is one example. Kubernetes pulls images from here to run them.
Kubernetes	A system that manages how containers are run, scaled, and updated across a cluster of servers.
Pod	The smallest deployable unit in Kubernetes. A pod runs one or more containers together on the same node.
Deployment	A Kubernetes resource that defines how many pods to run, which image to use, and how to handle updates.
Rolling Update	A deployment strategy where old pods are replaced gradually by new pods, so the service remains available throughout the update.
MLOps	Machine Learning Operations, the discipline of keeping ML systems reliable, reproducible, and maintainable in production over time.
Inference API	A web service that receives input data, passes it through the model, and returns a prediction. The interface through which a deployed model is called.

How It Works

Think of a CI/CD pipeline for ML as a factory production line. Raw materials enter at one end (your code and model changes), pass through a series of quality checks and packaging stations, and emerge at the other end as a finished product running in production. Each station must pass inspection before the next one begins.

The pipeline runs automatically every time a developer pushes code to the main branch. No one needs to remember to trigger it. No one needs to manually run tests or copy files to a server. The entire sequence is defined in code and runs in a clean, controlled environment every time.

Check out the repository. The CI system obtains the current version of all files, including the inference code, the model artifact, the Dockerfile, and the Kubernetes configuration files. This ensures the pipeline works with the exact state of the repository at the moment of the push.
Install dependencies and run validation. Python dependencies are installed in a fresh environment. The model artifact is loaded to confirm it exists and is not corrupted. Basic prediction tests are run to confirm the model produces sensible output. If any of these fail, the pipeline stops here and the team is notified. Nothing is deployed.
Build the Docker image. If validation passes, Docker packages the inference code, the model file, and all dependencies into a single container image. This image will run identically whether it is run on a developer laptop, a CI server, or a Kubernetes cluster in the cloud.
Tag and push the image to the registry. The image is tagged with a unique identifier, typically the Git commit hash, so it can be traced back to the exact code that produced it. The image is then pushed to a container registry where Kubernetes can access it.
Update the Kubernetes deployment. Kubernetes is told to use the new image. It begins a rolling update, gradually replacing old pods with new ones. During this process, the service remains available. If the new pods fail their readiness checks, the rollout does not proceed, and the remaining old pods continue serving traffic.
Monitor health probes. Kubernetes continuously checks whether each pod is alive (liveness probe) and ready to accept traffic (readiness probe). Pods that fail these checks are automatically restarted or removed from the load balancer until they recover.
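The gating behavior of the steps above, where each stage must pass before the next one begins, can be sketched in a few lines of Python. This is only an illustration of the control flow, not how GitHub Actions is implemented; the stage names and the lambda stand-ins are hypothetical.

```python
from typing import Callable, List, Tuple

# Each stage is a (name, function) pair; the function returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            print(f"Stage '{name}' failed; later stages are skipped.")
            break
        completed.append(name)
    return completed


# Hypothetical stand-ins for the real steps; validation fails on purpose here.
example_stages: List[Stage] = [
    ("checkout", lambda: True),
    ("validate_model", lambda: False),
    ("build_image", lambda: True),
    ("push_image", lambda: True),
    ("update_deployment", lambda: True),
]

print(run_pipeline(example_stages))
```

Because validation fails in this example, only the checkout stage completes, and nothing is built, pushed, or deployed.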

Practical Example

Suppose a data science team maintains a credit scoring model that classifies loan applications as approved or rejected. The model is retrained every two weeks as new loan data becomes available. Before CI/CD, each retraining cycle required a data scientist to manually copy the new model file to a shared server, update a configuration file, restart the API service, and hope nothing broke.

After setting up a CI/CD pipeline, the process changes entirely. When retraining completes, the new model file is committed to the repository. GitHub Actions immediately triggers. It loads the model, runs a set of prediction tests on held-out examples, checks that accuracy exceeds a minimum threshold, builds a fresh Docker image tagged with the commit hash, pushes it to the registry, and updates the Kubernetes deployment. The whole process completes in under ten minutes with no human involvement beyond the initial commit.

If a bad model version is deployed anyway, for example because a training bug caused a regression that the validation checks did not catch, the team uses a Kubernetes rollback to revert to the previous deployment in seconds. They then fix the training bug, push the corrected model, and the pipeline runs again automatically.

The key components of this system are a GitHub repository that stores all code, model artifacts, and Kubernetes configuration files; GitHub Actions workflow files that define the pipeline steps; a Dockerfile that describes how to package the inference service; a container registry such as GitHub Container Registry where built images are stored; and Kubernetes deployment and service configuration files that define how the application runs in the cluster.
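To make the Kubernetes deployment configuration more concrete, here is a minimal sketch that builds a Deployment manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the apps/v1 Deployment schema. The service name, image reference, port, replica count, and probe paths (/ready and /healthz) are assumptions chosen for illustration.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3, port: int = 8000) -> dict:
    """Return a Kubernetes apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Replace pods gradually so some replicas always serve traffic.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            # Probe paths are assumptions about the inference service.
                            "readinessProbe": {
                                "httpGet": {"path": "/ready", "port": port},
                                "periodSeconds": 5,
                            },
                            "livenessProbe": {
                                "httpGet": {"path": "/healthz", "port": port},
                                "periodSeconds": 10,
                            },
                        }
                    ]
                },
            },
        },
    }


# Hypothetical image reference tagged with a short commit hash.
manifest = build_deployment("credit-scoring", "ghcr.io/example-org/credit-scoring:3f2a9c1")
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code is one option among several; many teams keep a static YAML file in the repository and only substitute the image tag during the pipeline run.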

Advantages

Reproducibility. Every deployment is built from the same recipe in a clean environment. The container image that runs in production is identical to what was built in CI. Environment drift, the silent cause of countless production bugs, is largely eliminated.
Speed and consistency. A process that previously took an hour of careful manual work now runs in minutes without human involvement. Every team member follows the same process because the process is automated.
Full auditability. Every deployed image is tagged with a Git commit hash. You can always look at what is running in Kubernetes and trace it back to the exact code, model, and data that produced it. This is essential for compliance and debugging.
Automatic quality gates. The pipeline can enforce standards that are easy to forget under deadline pressure: model accuracy must exceed a threshold, the model must load without errors, and prediction outputs must be in a valid range. These gates cannot be skipped.
Safe rollbacks. Because every deployment is tied to a versioned image, reverting to a previous version is a single command. This dramatically reduces the risk of each deployment, making teams more willing to deploy frequently rather than accumulating large, risky batches of changes.
Zero-downtime updates. Rolling deployments in Kubernetes mean users are never aware of an update happening. Old pods serve traffic until new pods are confirmed healthy, then old pods are retired.

Limitations and Trade-offs

Infrastructure complexity. Running Kubernetes in production requires meaningful expertise. Configuring clusters, managing access credentials, understanding networking, and handling storage for model artifacts all require engineering effort that a small team may struggle to staff.
Pipeline maintenance overhead. CI/CD workflows are code, and like all code, they require maintenance. As dependencies evolve and tools change, workflows must be updated. A broken pipeline can block all deployments until it is fixed.
Model validation is hard to automate fully. Simple checks such as loading the model and confirming predictions are in range are straightforward. Detecting subtle model degradation, concept drift, or performance regressions on real production traffic requires additional monitoring beyond what CI/CD alone provides.
Large model artifacts complicate CI. Storing large model files in a Git repository is impractical. Large models are typically stored in object storage such as S3 or GCS and pulled during the Docker build, which adds complexity and build time.
Cold start latency. When Kubernetes starts a new pod, the model must be loaded into memory before the pod can serve requests. For large models, this can take several seconds. Readiness probes handle this, but users may notice delays during high-traffic scaling events.
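The interaction between cold starts and readiness probes can be sketched with Python's standard library. The idea is that the readiness endpoint answers 503 until the model is in memory, while the liveness endpoint answers 200 as soon as the process is up. The paths /healthz and /ready, the port, and the function names are assumptions for illustration; a real inference service would usually expose these routes through its web framework.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once the model has finished loading; checked by the readiness endpoint.
model_ready = threading.Event()


def probe_response(path: str, ready: bool) -> tuple:
    """Map a probe path to an (HTTP status, body) pair."""
    if path == "/healthz":
        # Liveness: the process is up and can answer requests.
        return 200, b"alive"
    if path == "/ready":
        # Readiness: answer 503 until the model is in memory.
        return (200, b"ready") if ready else (503, b"loading")
    return 404, b"not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, model_ready.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve probes in the foreground until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

In this sketch the model would be loaded in a background thread that calls model_ready.set() when it finishes, so the pod only starts receiving traffic after loading completes.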

Common Mistakes

Using the "latest" image tag in production. The latest tag is mutable: it is reassigned every time a new image is pushed under it. If something goes wrong, you cannot tell which version is actually running, and rolling back is difficult because there is no specific image to roll back to. Always tag images with a unique identifier such as the Git commit hash.
Skipping model validation in CI. Running code tests without testing the model artifact itself is a critical gap. The model file could be corrupted, incompatible with the current inference code, or silently producing wrong predictions. At minimum, the pipeline should load the model and run predictions on a small set of known examples.
Storing secrets in workflow files or code. Kubernetes credentials, registry access tokens, and API keys must be stored as GitHub Secrets and injected into the workflow at runtime. Never hardcode them in workflow files or commit them to the repository.
Not testing locally before relying on CI. CI runners are not debugging environments. Building and running the Docker image locally before pushing saves significant time when something is wrong with the Dockerfile or application code.
Ignoring health probes. Deploying to Kubernetes without readiness and liveness probes means Kubernetes cannot detect a pod that loaded the wrong model, exhausted memory, or became stuck. Health probes are not optional for production ML services.
One massive pipeline job. When all steps run in a single job, it is hard to see which stage failed, and re-running after a deployment failure repeats validation and the image build. Splitting CI validation and CD deployment into separate jobs makes it clearer where failures occur and avoids wasted work.
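As an illustration of the minimum model check described above (load the artifact, then run predictions on a few known examples), here is a sketch using only the standard library. It assumes a hypothetical JSON artifact holding logistic-regression weights; real projects more often use formats such as joblib, pickle, or ONNX, and the same pattern applies with the loader swapped out.

```python
import json
import math
import tempfile
from pathlib import Path


def load_model(path: Path) -> dict:
    """Load a hypothetical JSON artifact holding logistic-regression weights."""
    model = json.loads(path.read_text())
    if not {"weights", "bias"} <= set(model):
        raise ValueError("artifact is missing 'weights' or 'bias'")
    return model


def predict_proba(model: dict, features: list) -> float:
    """Return the approval probability for one feature vector."""
    if len(features) != len(model["weights"]):
        raise ValueError("feature count does not match the model")
    z = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def smoke_test(path: Path, known_examples: list) -> None:
    """Raise if the artifact fails to load or a known example is misclassified."""
    model = load_model(path)
    for features, expected_label in known_examples:
        p = predict_proba(model, features)
        if not 0.0 <= p <= 1.0:
            raise AssertionError(f"probability out of range: {p}")
        if (p >= 0.5) != expected_label:
            raise AssertionError(f"unexpected prediction for {features}")


# Usage with a throwaway artifact and two hand-picked examples.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.json"
    artifact.write_text(json.dumps({"weights": [2.0, -1.0], "bias": 0.0}))
    smoke_test(artifact, [([1.0, 0.0], True), ([0.0, 1.0], False)])
    print("smoke test passed")
```

A check like this catches corrupted files and mismatches between the artifact and the inference code, but it says little about overall accuracy, which needs a separate threshold check.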

Best Practices

Tag every image with the Git commit SHA. This creates an unambiguous link between what is running in production and the exact code and model that produced it, enabling instant rollback and precise debugging.
Enforce accuracy thresholds as a deployment gate. Before allowing a deployment to proceed, validate the model on a holdout set and fail the pipeline if performance falls below the established baseline. This prevents silent regressions from reaching users.
Use readiness and liveness probes on every pod. Configure a readiness probe so Kubernetes waits for the model to fully load before routing traffic, and a liveness probe so Kubernetes restarts pods that become unresponsive.
Store model artifacts in object storage, not in Git. Use a tool such as DVC or simply store model files in S3 or GCS. Reference the artifact location in configuration rather than committing large binary files to the repository.
Test the full pipeline on a staging environment first. Before enabling automatic deployment to production, run the full pipeline against a staging cluster with a representative sample of requests. Only promote to production after staging passes.
Keep workflow files under version control and review them like code. Changes to the CI/CD pipeline have as much impact as changes to application code. Require pull request reviews for workflow file changes.
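The first practice in this list, tagging every image with the Git commit SHA, can be enforced with a small helper that builds the image reference and rejects anything that does not look like a commit hash. The function name and the example registry and repository are hypothetical. In GitHub Actions the commit hash is available in the GITHUB_SHA environment variable; the 7 to 40 hexadecimal character range assumes SHA-1 commit hashes, either abbreviated or full.

```python
import re

# Abbreviated (7+) or full (40) hexadecimal SHA-1 commit hashes.
_SHA_PATTERN = re.compile(r"^[0-9a-f]{7,40}$")


def image_reference(registry: str, repository: str, commit_sha: str) -> str:
    """Build an image reference tagged with a Git commit SHA."""
    sha = commit_sha.strip().lower()
    if not _SHA_PATTERN.match(sha):
        raise ValueError(f"not a Git commit SHA: {commit_sha!r}")
    # Image repository names must be lowercase.
    return f"{registry}/{repository.lower()}:{sha}"


print(image_reference("ghcr.io", "Example-Org/credit-scoring", "3f2a9c1"))
```

Passing a mutable tag such as "latest" raises a ValueError, which turns the tagging rule into something the pipeline checks rather than something the team has to remember.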

Comparison: Deployment Approaches for ML Models

Approach	Setup Effort	Reproducibility	Scalability	Rollback	Best For
Manual deployment	Low	Poor	Manual scaling only	Difficult	Prototypes, single developer projects
Script-based automation	Medium	Moderate	Limited	Partial	Small teams without Kubernetes access
GitHub Actions + Docker (no Kubernetes)	Medium	Good	Limited to single server	Moderate	Small-scale production with low traffic
GitHub Actions + Docker + Kubernetes	High	Excellent	Horizontal scaling	One command	Production ML services at meaningful scale
Managed ML platforms (SageMaker, Vertex AI)	Medium	Excellent	Automatic	Built-in	Teams already on AWS or GCP wanting minimal infrastructure work

FAQ

Do I need Kubernetes, or can I just use Docker?

For a single-server deployment with low traffic, Docker alone is sufficient. You can run the container directly on a server without Kubernetes. Kubernetes becomes valuable when you need multiple replicas for redundancy, automatic scaling to handle traffic spikes, rolling deployments with zero downtime, and self-healing restarts. Start with Docker if Kubernetes feels like too much for your current situation, and add Kubernetes when you need what it provides.

What happens if the CI pipeline fails? Does it block all deployments?

Yes, that is exactly the point. A failure in CI means a broken change was caught before it reached production. The existing production deployment continues running undisturbed. The developer who pushed the change is notified of the failure and must fix it before the pipeline will proceed. This is a feature, not a limitation.

How do I handle model files that are too large for Git?

Store model artifacts in object storage such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. During the Docker build step in GitHub Actions, download the model file from object storage before copying it into the container image. Tools like DVC (Data Version Control) can help manage this workflow and maintain the link between code versions and model artifact versions.
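One simple way to keep the link between a code version and a model artifact, sketched here under the assumption that the expected SHA-256 digest of the artifact is committed alongside the code, is to verify the downloaded file before it is copied into the image. This is an illustration only and is not a description of how DVC works internally; the function names are hypothetical, and the download itself is left out.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise if the downloaded artifact does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"digest mismatch: expected {expected_sha256}, got {actual}")


# Usage with a throwaway file standing in for the downloaded model.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.bin"
    artifact.write_bytes(b"example model bytes")
    pinned = sha256_of(artifact)  # in practice, read from a committed config file
    verify_artifact(artifact, pinned)
    print("artifact verified")
```

If the digest does not match, the build fails before an image containing the wrong model can be produced.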

Is GitHub Actions the only option, or can I use other CI systems?

The same pipeline concepts apply to any CI system. GitLab CI, CircleCI, Jenkins, and Buildkite all support the same pattern: run tests, build a Docker image, push it, and update a Kubernetes deployment. The specific configuration syntax differs, but the architecture is identical. GitHub Actions is a natural starting point because it requires no additional infrastructure if you already host code on GitHub.

How do I prevent a bad model from being deployed if it passes all code tests?

Add a model-specific validation step to CI that runs predictions on a held-out labeled dataset and compares the results to a known baseline. If accuracy, precision, recall, or another relevant metric falls below a defined threshold, the pipeline fails and deployment is blocked. This threshold should be set based on the minimum acceptable performance for your application and updated over time as the model improves.
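A minimal sketch of such a gate is shown below. It computes accuracy on held-out labels and returns a process exit code, relying on the fact that a CI step fails when its command exits with a non-zero status. The function names, the example data, and the 0.75 threshold are assumptions for illustration; the threshold should come from your own baseline.

```python
def accuracy(predictions: list, labels: list) -> float:
    """Fraction of predictions that match the held-out labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def gate_exit_code(predictions: list, labels: list, threshold: float) -> int:
    """Return 0 if accuracy meets the threshold, 1 otherwise."""
    score = accuracy(predictions, labels)
    print(f"accuracy={score:.3f} threshold={threshold:.3f}")
    return 0 if score >= threshold else 1


# Hypothetical held-out labels and model predictions (8 of 10 match).
holdout_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_predictions = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

exit_code = gate_exit_code(model_predictions, holdout_labels, threshold=0.75)
```

In a pipeline, the validation script would end with sys.exit(exit_code) so that a score below the threshold stops the workflow before the image is built.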

References

GitHub Actions Documentation
Docker Documentation
Kubernetes Documentation
Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley.
Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015.

Key Takeaways

CI/CD for ML automates three things that would otherwise be done manually and inconsistently: testing the model artifact, packaging it with exact dependencies, and deploying it to the serving infrastructure.
Docker eliminates environment mismatch by bundling code, the model file, and library versions into a single container image that runs identically everywhere it is deployed.
Kubernetes provides rolling updates, automatic health monitoring, self-healing restarts, and one-command rollback, all of which are essential for running ML services reliably in production.
Tagging images with Git commit hashes instead of the "latest" tag creates a traceable audit trail and makes rollback straightforward when problems occur in production.
Model-specific validation in CI, checking that the artifact loads, that predictions are sensible, and that accuracy exceeds a minimum threshold, is what separates an ML CI pipeline from a generic software one.
The investment in pipeline setup pays compounding returns: every future deployment is safer, faster, and requires no manual work.