# Advanced Git & CI/CD Engineering --- ## ADVANCED GIT ENGINEERING --- ## 1. Git Internals & Object Model Most engineers use Git every day without understanding how it actually works internally. This is fine for basic usage, but when things go wrong — a rebase goes sideways, a reset loses commits, a merge produces unexpected results — understanding Git's internals is the difference between recovering confidently and panicking. More importantly, once you understand the object model, everything Git does makes intuitive sense. Git is fundamentally a content-addressable filesystem with a version control interface built on top. At its core, Git stores data as objects in a key-value store. The key is a SHA-1 hash of the content. The value is the content itself. Everything in Git — every file, every directory, every commit, every tag — is one of four types of objects stored in the `.git/objects` directory. **Blob objects** store file content. When you add a file to Git, Git takes the file's content, computes its SHA-1 hash, and stores the content under that hash. Critically, Git stores content not filenames. Two different files with identical content share a single blob. If you rename a file without changing its content, no new blob is created. The blob is purely the raw file bytes. **Tree objects** represent directories. A tree object contains a list of entries, each being either a blob reference (with a filename and permissions) or another tree reference (a subdirectory). A tree is essentially a directory snapshot — it records what files (blobs) and subdirectories (trees) existed and what their names and permissions were at a point in time. **Commit objects** are what you create when you run `git commit`. A commit object contains: a reference to a tree object (the root directory snapshot at the time of commit), references to parent commit objects (zero parents for the initial commit, one for a normal commit, two for a merge commit), the author name and email, the committer name and email, timestamps, and the commit message. The commit points to the complete state of the repository at that moment. **Tag objects** are annotated tags. They contain a reference to another object (usually a commit), a tag name, tagger identity, timestamp, and message. Lightweight tags (created with `git tag v1.0`) are just references, not objects. Annotated tags (`git tag -a v1.0 -m "Release"`) are full objects with metadata. The SHA-1 hash of each object is determined by its content. If you create the exact same commit twice on different machines, you get the same SHA-1. If any byte changes, the hash changes. This is what makes Git's integrity guarantee work — you can verify any object by recomputing its hash. **References** are how Git gives human-readable names to SHA-1 hashes. A branch like `main` is just a file in `.git/refs/heads/main` containing the SHA-1 of the commit it points to. When you commit on `main`, Git updates that file to point to the new commit. HEAD is a special reference that usually points to a branch (meaning HEAD points to the commit that branch points to). When you're in detached HEAD state, HEAD points directly to a commit SHA rather than to a branch. Understanding this object model explains things that otherwise seem magical. Why is Git fast at branching? Because creating a branch just creates a new file containing a SHA-1 — it doesn't copy anything. Why can't you change a commit's content without changing its SHA? Because the SHA is derived from the content. Why does git rebase create new commits rather than moving existing ones? Because changing a commit's parent would change its SHA, making it a different object. The original commits still exist in the object store, unreachable until garbage collected. To explore Git internals hands-on, `git cat-file -t ` shows the type of any object, `git cat-file -p ` shows its content, and `git log --graph --oneline` shows the commit graph structure. --- ## 2. Branching Strategies How you organize branches in a repository is a strategic decision that affects how your team collaborates, how frequently you integrate code, how you manage releases, and how much overhead your development process has. Different branching strategies make different tradeoffs. **Git Flow** was one of the first formalized branching strategies, introduced by Vincent Driessen in 2010. It has specific long-lived branches with specific purposes. `main` always contains production-ready code. `develop` is the integration branch where features are merged before release. Feature branches are created from `develop`, developed in isolation, then merged back to `develop`. Release branches are cut from `develop` when you're ready to release — only bug fixes go into the release branch. When the release is ready, it's merged to both `main` and `develop`. Hotfix branches are created from `main` to fix critical production bugs and are merged back to both `main` and `develop`. Git Flow made sense when software was released in discrete versions with long release cycles. It's heavily used in mobile app development, packaged software, and any context where you ship distinct versions and must maintain multiple versions simultaneously. The downside is significant complexity — multiple long-lived branches create merge conflicts, require discipline to maintain, and create distance between developers' work and the main branch. For teams doing continuous delivery, Git Flow creates unnecessary friction. **Trunk-Based Development (TBD)** is at the opposite extreme. There is exactly one long-lived branch — `main` (the trunk). Developers either commit directly to main (for small teams) or use very short-lived feature branches (hours to a couple of days, not weeks). Everyone integrates into main constantly. Releases are made from main. Feature flags control what's visible to users, not branches. There are no release branches, no develop branches, no hotfix branches. TBD is the branching strategy used by Google, Facebook, and most high-functioning engineering organizations. It eliminates merge hell by ensuring code is integrated so frequently that conflicts are small and caught early. It requires strong CI/CD infrastructure (because main must always be deployable), good test coverage, and feature flags for incomplete features. The overhead is low but the quality bar for what goes into main is high. **GitHub Flow** sits between Git Flow and Trunk-Based Development. There's one long-lived branch (`main`) and short-lived feature branches. Feature branches are created from main, developed, then merged back to main via pull request. Main is always deployable — you deploy from main. There are no develop branches or release branches. The simplicity makes it accessible to most teams, and the PR-based workflow enables code review. It works well for teams doing continuous delivery but doesn't handle the "maintain multiple released versions simultaneously" use case well. **Choosing a strategy** — the right strategy depends on your release model. If you ship SaaS software with continuous deployment and one production version, Trunk-Based Development or GitHub Flow are appropriate. If you ship packaged software or mobile apps with versioned releases that require long-term support, Git Flow provides the structure you need. For most modern teams, GitHub Flow with short-lived branches and strong CI/CD is the right balance of simplicity and discipline. For ML systems specifically, Trunk-Based Development works well for application and infrastructure code. Model development may use feature branches longer because experiments take time, but the principle of integrating frequently and using feature flags for incomplete features still applies. --- ## 3. Rebase vs Merge Rebase and merge are both ways to integrate changes from one branch into another, but they produce fundamentally different histories and have different implications for collaboration. Understanding when to use each is one of the marks of Git maturity. **Merge** combines two branches by creating a new merge commit that has two parents — one from each branch. The history shows exactly what happened: branch A diverged from main at point X, evolved independently, and was merged back at point Y. The original branch commits remain in the history unchanged. If `feature-branch` was created from `main` at commit C3, and `main` has since advanced to C5 while `feature-branch` has commits C4' and C5': ``` main: C1 - C2 - C3 - C4 - C5 \ feature-branch: C4' - C5' ``` After merge: ``` main: C1 - C2 - C3 - C4 - C5 - M1 \ / feature-branch: C4' - C5' ``` M1 is the merge commit with two parents (C5 and C5'). The history is accurate but can become complex — many active branches produce a graph that looks like a subway map and is hard to read linearly. **Rebase** rewrites the feature branch commits to appear as if they were created on top of the current main, rather than at the point where the branch originally diverged. Rebase takes each commit on the feature branch, replays it on top of the target branch, and creates new commits with the same changes but different parent commits (and therefore different SHA-1s). After rebasing `feature-branch` onto `main`: ``` main: C1 - C2 - C3 - C4 - C5 \ feature-branch (rebased): C4'' - C5'' ``` C4'' and C5'' have the same changes as C4' and C5' but are new commits (different SHAs) based on C5. Now you can fast-forward merge — since feature-branch is ahead of main in a straight line, merging just moves the main pointer forward without creating a merge commit: ``` main: C1 - C2 - C3 - C4 - C5 - C4'' - C5'' ``` The history is linear and clean. Reading git log is like reading a story without branching detours. **The golden rule of rebase** — never rebase commits that have been pushed to a shared remote branch. Rebase creates new commits (new SHAs) and abandons the old ones. If other developers have pulled your old commits and you rebase and force-push, their local repositories diverge from the remote in confusing ways. Rebase is safe for local work before sharing and for rebasing your local feature branch onto the latest main before creating a PR. **When to use merge**: preserving exact history of how work happened, merging to shared long-lived branches (never rebase onto shared branches), and when you want an explicit record of when features were integrated. **When to use rebase**: cleaning up local commits before pushing, integrating main's latest changes into your feature branch to resolve conflicts before merging, and producing a clean linear history in your repository. **Merge strategies** — `git merge --no-ff` forces a merge commit even when fast-forward is possible, preserving the existence of the branch in history. `git merge --squash` squashes all changes from the feature branch into the working tree without committing, letting you create a single clean commit incorporating all the branch's changes. Squash merging is popular on GitHub for keeping main history clean while feature branches can have messy work-in-progress commits. --- ## 4. Interactive Rebase & History Rewriting Interactive rebase (`git rebase -i`) is one of Git's most powerful features. It lets you rewrite the history of a series of commits — reordering them, combining them, splitting them, editing messages, removing them entirely — before sharing them with others. This is how you turn a messy series of work-in-progress commits into a clean, coherent history that tells a clear story. To start an interactive rebase on the last 5 commits: ```bash git rebase -i HEAD~5 ``` Git opens an editor showing the last 5 commits in order (oldest first, opposite of `git log`): ``` pick a1b2c3d Fix model loading race condition pick d4e5f6g Add feature store integration pick h7i8j9k WIP: feature store pick l0m1n2o Fix typo in config pick p3q4r5s Add tests for feature store ``` You change the verb before each commit to control what happens to it: **pick** — keep the commit as-is, unchanged. **reword** — keep the commit's changes but edit the commit message. Git pauses the rebase and opens an editor for you to write a new message. **edit** — keep the commit but pause the rebase at this point, letting you amend the commit (add files, change content) before continuing. Used for splitting a commit into multiple commits. **squash** — combine this commit with the previous commit. Both commits' messages are combined and you edit the resulting message. **fixup** — like squash, but discards this commit's message entirely, keeping only the previous commit's message. Used for commits like "fix typo" that don't deserve their own entry in history. **drop** — delete this commit entirely and its changes. **reorder** — just move the lines in the editor to change the order commits are applied. In the example above, a clean history would squash the WIP commit and the typo fix into meaningful commits: ``` pick a1b2c3d Fix model loading race condition pick d4e5f6g Add feature store integration squash h7i8j9k WIP: feature store pick l0m1n2o Fix typo in config # could fixup into p3q4r5s fixup p3q4r5s Add tests for feature store ``` Wait — "Fix typo in config" should be combined with the tests commit, so moving it: ``` pick a1b2c3d Fix model loading race condition pick d4e5f6g Add feature store integration squash h7i8j9k WIP: feature store reword p3q4r5s Add tests for feature store fixup l0m1n2o Fix typo in config ``` The result is three clean commits: the race condition fix, the feature store integration with all WIP work combined, and tests with the typo fix incorporated. **Splitting a commit** — if a single commit does too many things, use `edit` to pause at that commit, `git reset HEAD~` to unstage all the commit's changes (but keep them in working tree), then stage and commit them in logical groups. Then `git rebase --continue` to proceed. **Force pushing after history rewriting** — after interactive rebase, your local branch has different commits (different SHAs) than the remote. You must force push: `git push --force-with-lease origin feature-branch`. The `--force-with-lease` variant is safer than `--force` — it checks that nobody else has pushed to the branch since you last fetched. If someone has, the push is rejected instead of overwriting their work. **When to rewrite history** — only before sharing. Once commits are pushed to a shared branch that others have pulled, rewriting creates confusion. For personal feature branches where you're the only one working, clean up aggressively before creating the PR. The person reviewing your PR shouldn't need to wade through "WIP", "fix", "another fix", "typo", "actually fix" commits. --- ## 5. Cherry-Pick, Revert & Reset Strategies These three commands give you surgical control over Git history and the working state, each serving a distinct purpose that's important to understand clearly. **Cherry-pick** applies the changes introduced by a specific commit onto the current branch. It creates a new commit with the same changes but a different SHA (different parent, possibly different timestamp). Cherry-pick is useful when you want to apply a specific fix from one branch to another without merging the entire branch. ```bash git cherry-pick a1b2c3d ``` This takes whatever changes commit `a1b2c3d` introduced and applies them to your current branch. If you're on a release branch and need a specific bug fix from main, cherry-pick that fix's commit onto the release branch. Cherry-pick becomes complicated when the commit depends on changes that don't exist on the target branch — the patch may not apply cleanly and conflicts arise. Also, if you cherry-pick many commits from a branch and later merge that branch, Git sees duplicate changes (same diff applied twice) and creates messy conflicts. Cherry-pick is best used sparingly for isolated, independent changes. **Revert** creates a new commit that undoes the changes of a specific previous commit. Unlike reset (which actually removes commits), revert adds a new commit to the history. This is the safe way to undo a change that has already been pushed to a shared branch, because it preserves the history. ```bash git revert a1b2c3d ``` This creates a new commit applying the inverse of what `a1b2c3d` did — if that commit added a line, the revert removes it. The original commit remains in history. If you deployed a bad model configuration and need to undo it immediately, revert the commit and push — this creates a clean, auditable record that the change was made and then undone, rather than erasing it from history. Reverting merge commits requires the `-m` flag specifying which parent to revert to: `git revert -m 1 `. The `1` specifies the first parent (the branch that was merged into), effectively undoing the merge. **Reset** moves the current branch pointer and optionally changes the working tree and staging area. It has three modes that are critically different: `git reset --soft HEAD~3` — moves the branch pointer back 3 commits but leaves the changes from those commits staged in the index. The commits disappear but their changes are ready to be recommitted. Use this to squash recent commits into one: reset soft to the commit before the ones you want to squash, then `git commit` to create one commit with all the changes. `git reset --mixed HEAD~3` — moves the branch pointer back 3 commits and unstages the changes (but leaves them in the working tree). This is the default mode. The changes are still there but unstaged — you can selectively stage and recommit them. Use this when you want to restructure how changes are split across commits. `git reset --hard HEAD~3` — moves the branch pointer back 3 commits and throws away all changes from those commits. The working tree and staging area are reset to the state of the target commit. This is destructive — the changes are gone (though recoverable via `git reflog` for a while). Use this when you genuinely want to abandon work. **Recovery with reflog** — Git maintains a reference log (`git reflog`) that records every time HEAD moves, even for operations like reset --hard. Each reflog entry has a SHA. If you accidentally reset --hard and lose work, `git reflog` shows you what HEAD pointed to before the reset, and you can `git checkout ` or `git reset --hard ` to recover. The reflog is local and not pushed to remotes, but it persists for 90 days by default. --- ## 6. Git Bisect for Debugging Git bisect is a tool for finding which commit introduced a bug using binary search. If you have a repository with 1000 commits and you know the software worked at commit A but doesn't work at commit Z, git bisect finds the exact commit that broke it in about 10 steps (log₂(1000) ≈ 10) instead of manually checking hundreds of commits. The process is straightforward. You tell Git where the problem was definitely not present (a good commit) and where it definitely exists (a bad commit), and Git automatically checks out commits in the middle, asking you to classify each one. Binary search halves the search space with each classification. ```bash # Start bisecting git bisect start # Tell Git the current commit is bad git bisect bad # Tell Git where you know things worked git bisect good v2.0.0 # Git checks out the midpoint commit # Test your application... # If this commit has the bug: git bisect bad # If this commit doesn't have the bug: git bisect good # Git keeps halving the range until it identifies the first bad commit # When done: git bisect reset ``` **Automated bisect** is where bisect becomes truly powerful. If you have a test or script that automatically determines whether a commit is good or bad, you can automate the entire process: ```bash git bisect start git bisect bad HEAD git bisect good v2.0.0 git bisect run python test_model_accuracy.py ``` Git runs `test_model_accuracy.py` at each midpoint commit. The script should exit 0 if the commit is good, exit 1 if bad, and exit 125 to skip a commit (if the commit doesn't compile or can't be tested for unrelated reasons). Git runs this automatically until it finds the first bad commit, potentially finding it while you're in a meeting. For ML systems, this is valuable for finding which code commit caused a model metric regression, which configuration change introduced a performance degradation, or which dependency update broke an inference path. The test script can be as sophisticated as needed — train a small model, run inference on a test set, check that accuracy is above a threshold. After bisect identifies the culprit commit, you read that commit's diff with `git show ` to understand exactly what change caused the issue. --- ## 7. Git Hooks Git hooks are scripts that Git executes automatically before or after specific events — committing, pushing, receiving a push, merging. They're how you enforce standards, run checks, and automate processes that should happen at specific points in the Git workflow. Hooks live in the `.git/hooks/` directory. Creating an executable file with the right name makes it a hook. The hooks are just scripts — they can be bash, Python, Ruby, or any other executable. **Client-side hooks** run on the developer's machine: `pre-commit` runs before a commit is created, after you run `git commit` but before the commit message editor opens. If this hook exits with a non-zero code, the commit is aborted. This is where you enforce code formatting (run black for Python, prettier for JavaScript), static analysis (run flake8, mypy), check for accidentally committed secrets (scan for API keys, passwords), and validate that test files exist for new features. ```bash #!/bin/bash # pre-commit hook example echo "Running pre-commit checks..." # Check for Python formatting black --check src/ || { echo "Run 'black src/' to fix formatting"; exit 1; } # Check for type errors mypy src/ --ignore-missing-imports || exit 1 # Check for secrets accidentally committed if grep -r "AWS_SECRET_ACCESS_KEY\|password\|api_key" --include="*.py" src/; then echo "Potential secret detected in code!" exit 1 fi echo "Pre-commit checks passed" ``` `commit-msg` receives the commit message as a file argument and can validate or modify it. Use this to enforce commit message conventions — Conventional Commits format (`feat:`, `fix:`, `docs:`), Jira ticket references, minimum length requirements. `pre-push` runs before git push sends commits to the remote. This is where you run the full test suite (too slow for pre-commit). Aborting a push that would break CI saves time and keeps the shared branch green. **Server-side hooks** run on the Git server (GitHub, GitLab, or self-hosted Git): `pre-receive` runs on the server when it receives a push, before accepting it. It can reject pushes that don't meet requirements — commits without GPG signatures, pushes to protected branches from unauthorized users, changes that would exceed file size limits. This enforces standards that can't be bypassed by modifying local hooks. `post-receive` runs after the push is accepted and is typically used to trigger deployments, send notifications, or update CI systems. This was how simple CD pipelines worked before dedicated CI/CD tools became standard. **Managing hooks in teams** — the `.git/hooks/` directory is not tracked by Git (it's inside `.git`, not in the working tree). This means hooks aren't shared automatically with team members. Solutions: store hook scripts in a tracked directory (like `.githooks/`) and set the hook directory with `git config core.hooksPath .githooks`. Or use a hooks management tool like pre-commit (the Python tool), which defines hooks in `.pre-commit-config.yaml` that is tracked in the repository, automatically runs hooks in isolated environments, and manages hook versions consistently across all developers. ```yaml # .pre-commit-config.yaml repos: - repo: https://github.com/psf/black rev: 23.9.1 hooks: - id: black - repo: https://github.com/PyCQA/flake8 rev: 6.1.0 hooks: - id: flake8 - repo: https://github.com/Yelp/detect-secrets rev: v1.4.0 hooks: - id: detect-secrets ``` `pre-commit install` sets up the hooks for any developer who runs it. --- ## 8. Git Submodules & Monorepo Strategy As organizations grow, they face a fundamental repository organization question: should all code live in one repository (monorepo), or should each project/service have its own repository (polyrepo)? And if multiple repositories exist, how do you handle shared code and dependencies? **Git Submodules** are Git's built-in mechanism for embedding one repository inside another. A submodule is a reference to a specific commit in another repository. If your ML platform depends on a shared internal library, you can include that library's repository as a submodule: ```bash git submodule add https://github.com/myorg/ml-utils.git libs/ml-utils ``` This creates a `.gitmodules` file recording the submodule's URL and path, and records the specific commit of ml-utils that the parent repository currently uses. When someone clones your repository, they need `git clone --recurse-submodules` to also pull the submodule content, or run `git submodule update --init` after a regular clone. Updating a submodule means navigating into the submodule directory, pulling the new commits, navigating back to the parent repository, staging the submodule change (which updates the recorded commit pointer), and committing. The parent repository records a specific commit of the submodule, not a branch — you're pinning the exact version used. Submodules solve a real problem but introduce real friction. The workflow is confusing for developers unfamiliar with them. Forgetting to initialize submodules after cloning produces mysterious missing-file errors. Updating submodules across many repositories is tedious. They work well for small organizations with a few clear shared libraries but don't scale to complex dependency graphs. **Monorepo** strategy puts all code — every service, every library, every tool — in a single repository. Google, Meta, Microsoft, Twitter, and Uber all use monorepos. The advantages are significant: atomic cross-service changes (one PR changes the shared library and all services that use it simultaneously), consistent tooling and standards across all code, easy discovery of all code in the organization, simplified dependency management (no versioning of internal dependencies), and large-scale refactoring that spans multiple services. The challenges are also real: the repository becomes enormous (Google's is over 80GB), standard Git operations that scan the entire repository become slow, CI/CD must be sophisticated enough to only run tests for what changed (running all tests in a monorepo on every commit is impractical), and access control is more complex when everything is in one place. **Monorepo tools** address the scale challenges. Nx (primarily for JavaScript/TypeScript), Bazel (Google's build tool, language-agnostic), Turborepo, and Pants are build systems designed for monorepos. They understand the dependency graph of your codebase and can determine exactly which services are affected by a given change, running only the relevant tests. This is called affected-based CI and is what makes monorepo CI tractable at scale. **For ML systems**, a practical approach is a small number of repositories organized by concern rather than one extreme or the other. Infrastructure code in one repo (Terraform, Kubernetes configs), ML platform code in another (the training and serving platform), individual ML project code in their own repos or as directories in a shared ML repo. Internal shared libraries (data utilities, model evaluation frameworks) can be managed as Python packages published to a private PyPI registry, which avoids the submodule complexity while solving the shared code problem. --- ## 9. Semantic Versioning Semantic versioning (SemVer) is a versioning convention that communicates meaning about the nature of changes between releases through the version number itself. Instead of arbitrary version numbers, SemVer version numbers tell you whether a new release contains bug fixes, new features, or breaking changes. A SemVer version has the format `MAJOR.MINOR.PATCH` — for example `2.4.1`. Each component has a specific meaning: **MAJOR** version increments when you make incompatible API changes — changes that break existing users of your software. If you rename a function, remove a parameter, change a return type, or restructure an API in any way that requires callers to update their code, that's a major version bump. `1.5.3` → `2.0.0`. **MINOR** version increments when you add new functionality in a backward-compatible manner. New functions, new parameters with defaults, new configuration options — existing users don't need to change anything, but new functionality is available. `2.4.1` → `2.5.0`. **PATCH** version increments when you make backward-compatible bug fixes. No new features, no API changes — just fixes to existing behavior. `2.4.1` → `2.4.2`. Additional labels are allowed for pre-release versions (`2.5.0-alpha.1`, `2.5.0-beta.3`, `2.5.0-rc.1`) and build metadata (`2.5.0+20231015`). **Why SemVer matters** — when you specify dependencies using SemVer-aware version constraints, you can reason about what updates are safe. In Python with pip, `torch>=2.0.0,<3.0.0` means "any PyTorch version from 2.0.0 up to but not including 3.0.0 is compatible" — new minor and patch versions will be used automatically, but major version 3.0 (which might have breaking changes) won't be automatically adopted. This is the contract SemVer establishes between library maintainers and library users. **For ML systems**, versioning carries additional meaning beyond the code API. Your model serving API might be versioned separately from the model itself. `GET /v1/predict` and `GET /v2/predict` can coexist, letting you introduce a new prediction format without immediately breaking all clients. Model weights themselves should be versioned — training a new model and serving it doesn't break the API, but it changes model behavior in ways that need version tracking for reproducibility. --- ## 10. Tagging & Release Management Tags in Git are references to specific commits that, unlike branches, don't move when new commits are added. They're typically used to mark release points — this specific commit is version 2.4.1 of the software. **Lightweight tags** are just a named pointer to a commit: ```bash git tag v2.4.1 git tag v2.4.1 a1b2c3d # Tag a specific commit ``` **Annotated tags** are full objects with metadata — tagger identity, date, and message — and can be GPG-signed: ```bash git tag -a v2.4.1 -m "Release 2.4.1: fixes model loading race condition" git tag -a v2.4.1 -s # -s for GPG signing ``` Annotated tags are preferred for releases because they include the who and why of the release, and they can be verified if signed. Tags are not pushed automatically with `git push`. Push tags explicitly: ```bash git push origin v2.4.1 # Push a specific tag git push origin --tags # Push all tags ``` **Release management workflow** — a mature release process looks like this. Development happens on main (or feature branches merged to main). When ready to release, a release branch is cut from main (`release/2.4.0`). The release branch gets final testing, bug fixes cherry-picked from main if needed, and version number updates in code files. When the release is ready, the release commit is tagged (`v2.4.0`). If using a release branch, it's merged to main. The tag is pushed to trigger the release pipeline. **Automated release notes** — tools like `git-chglog` and `release-please` automatically generate changelogs and release notes from commit messages, provided you follow Conventional Commits format (`feat:`, `fix:`, `chore:`, `docs:` prefixes). `release-please` runs as a CI job, creates a "release PR" that bumps version numbers and generates a changelog, and when merged, creates the tag and GitHub release automatically. **GitHub Releases** — creating a GitHub Release (distinct from a Git tag) provides a user-facing release page with assets (binary downloads, source archives), rendered release notes, and integration with tools that watch for releases. --- ## 11. Secure Git Workflows Git security is often overlooked but important. Code repositories contain your entire history, potentially including accidentally committed secrets, and the ability to push code is the ability to affect what runs in production. **Commit signing with GPG** — by default, anyone can create a commit claiming any author email. Someone could create a commit claiming to be authored by your CEO. GPG commit signing cryptographically proves that the person who made the commit actually has the private key associated with the author's identity. GitHub shows a "Verified" badge on signed commits. Requiring signed commits on protected branches ensures all commits can be cryptographically traced to a specific developer's key. ```bash # Set your signing key git config --global user.signingkey YOUR_GPG_KEY_ID # Sign commits by default git config --global commit.gpgsign true ``` **Protected branches** — in GitHub, GitLab, and similar platforms, branch protection rules prevent force pushes to specific branches, require pull request reviews before merging, require CI checks to pass before merging, and require signed commits. Main and release branches should always be protected. Nobody, including administrators, should be able to push directly to main in a production repository. **Secret detection** — accidentally committed secrets (API keys, database passwords, cloud credentials embedded in config files or test data) are a leading cause of security incidents. Pre-commit hooks with detect-secrets or truffleHog scan staged files for patterns matching common secret formats before each commit. Server-side `pre-receive` hooks (or GitHub's secret scanning feature) scan pushes server-side as a second line of defense. Neither is perfect — some secrets don't match known patterns — but together they catch the majority of accidental commits. **When a secret is committed** — if a secret is committed, treat the secret as fully compromised immediately, even if you delete it in the next commit. Git history is preserved and the secret is visible in the old commit. You must rotate the credential immediately (generate a new API key, change the password), then optionally use `git filter-branch` or BFG Repo Cleaner to remove the secret from history and force-push all branches. This is disruptive and not always worth the effort once the credential is rotated, but may be required for compliance. **Access control** — follow least privilege for repository access. Developers get write access to feature branches but not direct push to main. Automated systems (CI/CD) get the minimum access needed — a deploy pipeline doesn't need write access to the repository. Use deploy keys (single-repository SSH keys) for CI/CD rather than personal access tokens that could access all repositories. Regularly audit who has access and revoke access promptly when someone leaves the organization. --- ## ADVANCED CI/CD ENGINEERING --- ## 12. CI/CD Architecture & Pipeline Design CI/CD (Continuous Integration and Continuous Delivery/Deployment) is the practice of automating the path from code change to production deployment. CI catches problems early by building and testing every change. CD automates the delivery of validated changes to production. Together, they make deploying software fast, reliable, and low-risk. Before designing pipelines, understand the vocabulary. **Continuous Integration** means every code change is automatically built, tested, and validated — catching integration problems early when they're cheap to fix. **Continuous Delivery** means the software is always in a state where it could be deployed to production — the pipeline validates every change to the point of production readiness, but actual deployment is a manual decision. **Continuous Deployment** takes this further — every change that passes all automated validation is automatically deployed to production without human intervention. Most mature engineering organizations practice Continuous Delivery at minimum and Continuous Deployment for non-critical services. **Pipeline architecture principles** that separate good pipelines from bad ones: **Fast feedback first** — the most common reason developers disable or ignore CI is because it takes too long. A 45-minute pipeline means developers switch to something else, context-switch back when it finishes, and fix problems with cold context. Structure your pipeline so the fastest, most common failure modes run first. Linting and type checking in seconds. Unit tests in a few minutes. Integration tests after those pass. Full end-to-end tests last. Developers get feedback on the most common problems within 2-3 minutes. **Fail fast** — don't continue running expensive stages if cheap stages have already found a problem. If linting fails, don't start building Docker images. If unit tests fail, don't run integration tests. Exit the pipeline at the earliest possible point. **Reproducibility** — pipelines should produce the same result every time they run on the same input. This means pinning tool versions, using dependency lock files, using hermetic build environments. A pipeline that fails intermittently because it pulls `latest` of a tool that changed is harder to debug than application bugs. **Parallelism** — many checks are independent and can run simultaneously. Lint Python code, lint YAML, check Terraform format, and check Dockerfile best practices can all run in parallel. Running them sequentially wastes time. **Visibility** — every pipeline run should be observable. Clear stage names, structured logs, artifact storage with retention, metrics on pipeline duration and failure rate. When a pipeline fails at 2am, the on-call engineer shouldn't have to decode cryptic output to understand what happened. For ML systems, CI/CD has additional stages beyond standard software pipelines: model training runs (at least on a sample dataset to validate training code), model evaluation (does the trained model meet minimum metric thresholds?), model registration (pushing validated models to the model registry), and model deployment stages for each environment. --- ## 13. Pipeline as Code Pipeline as Code means your CI/CD pipeline definition is stored in version control as code, not configured through a web UI. This seems obvious but represents a significant shift from how CI was done historically — Jenkins was configured through GUI, pipeline definitions lived in the CI server's database and couldn't be version-controlled, reviewed, or easily replicated. Modern CI/CD systems (GitHub Actions, GitLab CI, CircleCI, Tekton, Argo Workflows) define pipelines as YAML files stored in the repository. This gives you: version history of pipeline changes — who changed what when, code review for pipeline changes through PR review, the ability to run different pipeline versions for different branches, easy replication across repositories, and local development and testing of pipeline logic. A GitHub Actions workflow file: ```yaml # .github/workflows/ci.yml name: CI Pipeline on: push: branches: [main] pull_request: branches: [main] jobs: lint: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v4 with: python-version: "3.11" - run: pip install black flake8 mypy - run: black --check src/ - run: flake8 src/ - run: mypy src/ test: needs: lint runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v4 with: python-version: "3.11" - run: pip install -r requirements.txt - run: pytest tests/ --junitxml=test-results.xml - uses: actions/upload-artifact@v3 with: name: test-results path: test-results.xml ``` **The `needs` keyword** creates dependencies between jobs — `test` only runs after `lint` passes. This implements the fast-fail principle at the job level. **Pipeline as Code best practices**: store pipeline files in a discoverable location (`.github/workflows/` for GitHub Actions, `.gitlab-ci.yml` for GitLab CI), name pipelines and jobs descriptively, comment complex logic, treat pipeline code with the same review rigor as application code, and don't put sensitive logic (hardcoded credentials, internal hostnames) in pipeline files — use secrets and environment variables. --- ## 14. Multi-Stage Pipeline Design A multi-stage pipeline organizes the build process into distinct stages that run in sequence, with each stage building on the success of the previous. This structure makes pipelines readable, enables selective re-running of failed stages, and enforces the principle that later, more expensive stages only run when earlier, cheaper stages pass. A mature multi-stage ML CI/CD pipeline looks like this: **Stage 1: Validate (30-60 seconds)** — fast checks that catch the most common problems immediately. Code formatting (black, prettier), style (flake8, eslint), type checking (mypy), YAML linting, Dockerfile linting (hadolint), security static analysis (bandit for Python). Every developer gets this feedback within a minute. **Stage 2: Test (2-10 minutes)** — unit tests for all application code, property-based tests for data transformations, mocked tests for ML pipeline components. Tests here must be hermetic — no network calls, no real databases, everything mocked. **Stage 3: Build (5-15 minutes)** — build Docker images, compile any native extensions, generate documentation. Tag images with the commit SHA. Scan images for vulnerabilities. Sign images. **Stage 4: Integration Test (10-30 minutes)** — spin up real services using Docker Compose or ephemeral Kubernetes namespaces. Run tests against real databases, real message queues, real feature stores. Validate that the application works with real infrastructure. **Stage 5: ML Validation (15-60 minutes)** — run a training job on a sample dataset. Evaluate the trained model against baseline metrics. Validate data schemas and feature distributions. Register the model in the staging model registry if metrics pass. **Stage 6: Deploy to Staging (5-10 minutes)** — deploy the built images and validated models to the staging environment. Run smoke tests against staging. Validate critical paths are functional. **Stage 7: Deploy to Production** — either automated (Continuous Deployment) with automated smoke tests and monitoring, or requires a manual approval gate before proceeding. Canary deployment with automated metric checks before full rollout. Each stage only runs if all previous stages passed. A failure in Stage 2 (unit tests) stops the pipeline — you don't build images or run integration tests for code that has failing unit tests. This saves significant compute time and keeps feedback loops tight. --- ## 15. Matrix Builds & Parallel Execution Matrix builds run the same job multiple times with different configurations simultaneously. Instead of testing against only one Python version or one OS, you test against all supported combinations in parallel, getting comprehensive coverage in the same time as a single configuration test. GitHub Actions matrix strategy: ```yaml jobs: test: runs-on: ${{ matrix.os }} strategy: matrix: os: [ubuntu-latest, windows-latest, macos-latest] python-version: ["3.9", "3.10", "3.11"] exclude: - os: windows-latest python-version: "3.9" # Skip this specific combination fail-fast: false # Don't cancel other matrix jobs if one fails steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v4 with: python-version: ${{ matrix.python-version }} - run: pip install -r requirements.txt - run: pytest tests/ ``` This creates 8 jobs (3 OS × 3 Python versions, minus the 1 excluded combination) that run simultaneously. The whole matrix completes in the time of a single test run rather than 8 times as long. For ML systems, matrices are useful for testing model serving on different hardware configurations, testing data pipeline code against different database versions, validating that training code works with different framework versions (torch 2.0 and torch 2.1), and testing infrastructure code against different Kubernetes versions. **Parallel execution without matrices** — jobs that are independent should always run in parallel. Linting Python code and linting Terraform code are completely independent — run them simultaneously. Building multiple Docker images for different services can run in parallel. GitHub Actions, GitLab CI, and most CI systems run independent jobs in parallel by default when multiple runners are available. **Controlling parallelism** — too much parallelism can overwhelm dependent services (like a test database) with concurrent connections. The `concurrency` setting in GitHub Actions limits how many runs of a workflow can run simultaneously, and the `max-parallel` in matrix strategy limits how many matrix jobs run at once. --- ## 16. Reusable Workflows & Composite Actions As you build pipelines for multiple repositories or multiple services within a monorepo, you'll notice significant duplication. Every repository has the same Python setup steps, the same Docker build and push steps, the same deployment steps. Copy-pasting pipeline code across repositories is the pipeline equivalent of copy-pasting application code — it works until you need to change something, at which point you need to update dozens of pipeline files. Reusable workflows and composite actions are GitHub Actions' mechanisms for DRY pipeline code. **Composite Actions** package a series of steps into a reusable action that can be called from any workflow. You define the action in a repository (can be the same repository or a shared actions repository), and call it with parameters. Creating a composite action at `.github/actions/build-and-push/action.yml`: ```yaml name: Build and Push Docker Image description: Builds a Docker image and pushes to ECR inputs: image-name: required: true description: Name of the Docker image dockerfile: required: false default: Dockerfile description: Path to Dockerfile aws-region: required: true description: AWS region for ECR outputs: image-uri: description: Full ECR image URI with digest value: ${{ steps.push.outputs.image-uri }} runs: using: composite steps: - name: Configure AWS credentials uses: aws-actions/configure-aws-credentials@v4 with: role-to-assume: ${{ env.ECR_PUSH_ROLE }} aws-region: ${{ inputs.aws-region }} - name: Login to ECR uses: aws-actions/amazon-ecr-login@v2 - name: Build image shell: bash run: | docker build -f ${{ inputs.dockerfile }} \ -t ${{ inputs.image-name }}:${{ github.sha }} . - name: Push and get digest id: push shell: bash run: | docker push ${{ inputs.image-name }}:${{ github.sha }} DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' \ ${{ inputs.image-name }}:${{ github.sha }}) echo "image-uri=$DIGEST" >> $GITHUB_OUTPUT ``` Using this action: ```yaml - uses: ./.github/actions/build-and-push with: image-name: myregistry/model-server aws-region: us-east-1 ``` **Reusable Workflows** are entire workflow files that can be called from other workflows, enabling sharing of complete pipeline stages across repositories: ```yaml # .github/workflows/reusable-deploy.yml on: workflow_call: inputs: environment: required: true type: string image-tag: required: true type: string secrets: kube-config: required: true jobs: deploy: runs-on: ubuntu-latest environment: ${{ inputs.environment }} steps: - name: Deploy to Kubernetes run: | kubectl set image deployment/model-server \ model-server=myregistry/model-server:${{ inputs.image-tag }} ``` Calling this from another workflow: ```yaml deploy-staging: uses: myorg/.github/.github/workflows/reusable-deploy.yml@main with: environment: staging image-tag: ${{ github.sha }} secrets: kube-config: ${{ secrets.STAGING_KUBE_CONFIG }} ``` Storing reusable workflows and composite actions in a dedicated `.github` repository in your organization makes them available to all repositories. Changes to shared actions propagate to all consumers when they update their references. --- ## 17. Environment-Based Deployments Environments in CI/CD represent the stages infrastructure passes through from development to production — typically development, staging, and production. Each environment has different infrastructure, different secrets, different approval requirements, and different deployment strategies. GitHub Actions has native environment support with protection rules: ```yaml deploy-production: needs: [deploy-staging, integration-tests] runs-on: ubuntu-latest environment: name: production url: https://api.company.com steps: - name: Deploy to production run: kubectl apply -f k8s/production/ ``` You configure the `production` environment in GitHub's Settings → Environments with: required reviewers (specific people who must approve before the job runs), wait timer (mandatory delay between staging deployment and production deployment), deployment branch rules (only the main branch can deploy to production), and environment secrets (secrets specific to this environment, not shared with other environments). When a workflow reaches a job with an environment that has required reviewers, the workflow pauses. The designated reviewers get a notification, review what's being deployed (the PR it came from, the test results), and click Approve or Reject. This human gate is critical for production deployments. **Environment-specific configuration** — each environment needs different secrets (different database URLs, different API keys, different cloud credentials). GitHub Environments let you store secrets scoped to specific environments. A `DATABASE_URL` secret in the `staging` environment contains the staging database URL, while the same-named secret in `production` contains the production URL. The workflow code is identical — the environment determines which secret value is used. **Deployment tracking** — when a workflow deploys using an environment, GitHub records the deployment event, links it to the commit and PR, and shows it in the repository's "Deployments" view. This gives you a complete history of what was deployed to each environment and when, directly connected to the code changes. --- ## 18. Secrets Management in CI/CD CI/CD pipelines need credentials to do their work — cloud provider credentials to deploy infrastructure, registry credentials to push images, Kubernetes credentials to deploy services, database passwords to run integration tests, API tokens to communicate with external services. Managing these credentials securely is one of the most critical aspects of CI/CD security. **Never store secrets in pipeline files** — this seems obvious but the temptation to hardcode a credential "just temporarily" is constant. Pipeline files are code that goes into version control. Secrets in version control are exposed to everyone with repository access and are preserved permanently in history even if removed. Always use the secret storage mechanism provided by your CI/CD platform. **CI/CD platform secrets** — GitHub Actions Secrets, GitLab CI Variables, CircleCI Contexts all provide encrypted storage for secrets that are injected into pipeline jobs as environment variables. They're not visible in logs, not accessible in pull requests from forks, and can be scoped to specific environments. ```yaml steps: - name: Deploy to EKS env: AWS_ROLE_ARN: ${{ secrets.PROD_AWS_ROLE_ARN }} run: | aws eks update-kubeconfig --name production-cluster kubectl apply -f k8s/ ``` **OIDC federation — eliminating static credentials** — the most secure approach for cloud deployments is eliminating long-lived credentials entirely using OpenID Connect (OIDC) federation. GitHub Actions, GitLab CI, and CircleCI all support OIDC. Your cloud provider (AWS, GCP, Azure) trusts the CI system's OIDC issuer. The CI job requests a short-lived token from the OIDC issuer, presents it to the cloud provider, and receives temporary cloud credentials valid only for the duration of the job. There are no long-lived access keys stored anywhere — no credential to leak, no credential to rotate. ```yaml - name: Configure AWS credentials via OIDC uses: aws-actions/configure-aws-credentials@v4 with: role-to-assume: arn:aws:iam::123456789:role/GitHubActionsDeployRole aws-region: us-east-1 ``` AWS verifies that this job is running in the specific repository and specific branch configured in the IAM role's trust policy. If the job is running in a different repository (indicating a compromise), authentication fails. **Vault integration** — for organizations using HashiCorp Vault, CI jobs authenticate to Vault using their OIDC identity and retrieve secrets dynamically. Secrets are never stored in the CI platform — they're fetched from the authoritative secret store at the time the job needs them. **Secret rotation** — CI secrets should be rotated regularly. With OIDC-based authentication, there's nothing to rotate — credentials are ephemeral by design. For static secrets that must be stored, establish a rotation schedule and automate rotation where possible. --- ## 19. Dependency Caching Strategies Package installation is often the slowest part of a CI pipeline. Installing PyTorch with its CUDA dependencies might take 5-10 minutes. Running this on every pipeline run when the dependencies haven't changed is pure waste. Dependency caching stores the installed packages from one pipeline run and restores them in future runs. If the cache is valid (dependencies haven't changed), the cache restore takes seconds instead of minutes. The key to effective caching is the cache key — the value that determines whether a cache hit or miss occurs. The cache key must uniquely identify the combination of dependencies. For Python, the right key is a hash of the requirements file — if `requirements.txt` changes, the hash changes, the cache misses, and packages are freshly installed. If requirements haven't changed, the hash matches, the cache hits, and packages are restored instantly. ```yaml - name: Cache Python dependencies uses: actions/cache@v3 id: pip-cache with: path: ~/.cache/pip key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt', 'requirements-dev.txt') }} restore-keys: | ${{ runner.os }}-pip- - name: Install dependencies if: steps.pip-cache.outputs.cache-hit != 'true' run: pip install -r requirements.txt -r requirements-dev.txt ``` The `restore-keys` provides fallback cache keys — if no exact match exists, restore the most recent cache with the prefix. This gives you a partial cache (most packages installed, perhaps missing the newest additions) which is still faster than installing from scratch. **Docker layer caching** — Docker builds can also use caching in CI. BuildKit supports using a registry as a cache backend: ```bash docker buildx build \ --cache-from type=registry,ref=myregistry/model-server:buildcache \ --cache-to type=registry,ref=myregistry/model-server:buildcache,mode=max \ -t myregistry/model-server:$SHA \ . ``` Unchanged layers are pulled from the registry cache and not rebuilt, dramatically accelerating image builds that have many layers. **Cache invalidation** — the hardest problem in CI caching is knowing when caches are stale. Hash-based keys handle this for dependency files. But for caches keyed on other things (branch name, date), you may occasionally need to manually invalidate caches by changing the key prefix. Some teams append a cache version number to keys and increment it when they need to force invalidation across all branches. --- ## 20. Artifact Management Artifacts are the outputs of CI/CD pipeline stages that need to be stored and potentially passed between stages or retained for later reference. Test reports, compiled binaries, Docker images, trained models, Terraform plans, coverage reports — all of these are artifacts. **Artifact storage in CI platforms** — GitHub Actions, GitLab CI, and CircleCI all have built-in artifact storage. You upload artifacts at the end of a job, and they're retained for a configurable period (7-90 days typically). Subsequent jobs in the same pipeline can download artifacts from previous jobs, enabling inter-job communication without a shared filesystem. ```yaml build: runs-on: ubuntu-latest steps: - name: Build application run: python -m build - name: Upload build artifacts uses: actions/upload-artifact@v3 with: name: dist-packages path: dist/ retention-days: 7 deploy: needs: build runs-on: ubuntu-latest steps: - name: Download build artifacts uses: actions/download-artifact@v3 with: name: dist-packages path: dist/ - name: Deploy run: ./deploy.sh dist/ ``` **Artifact naming and versioning** — artifacts should be named descriptively and consistently. Include the pipeline run ID, commit SHA, or version number in artifact names to ensure uniqueness and traceability. `model-server-v2.4.1-linux-amd64.tar.gz` is better than `build.tar.gz`. **External artifact registries** — for important artifacts with long retention requirements, store them in purpose-built registries rather than relying on CI platform artifact storage which has limited retention. Docker images go to container registries (ECR, GCR). Python packages go to private PyPI (Artifactory, AWS CodeArtifact). Helm charts go to chart repositories. Terraform modules go to the Terraform registry or an S3 bucket. ML models go to the model registry (MLflow, Weights & Biases). **Artifact provenance** — knowing where an artifact came from is essential for security and debugging. Your artifacts should be traceable back to the exact source code commit and pipeline run that produced them. Attaching SBOM, SLSA provenance attestations, and build logs to artifacts creates a complete chain of custody from source code to deployed artifact. --- ## 21. Docker Build & Push Automation Automating Docker image builds and pushes is a foundational CI/CD capability for any containerized system. Every code change that affects deployed services should trigger an automated build, producing a versioned, scanned, signed image pushed to the registry. A complete Docker build and push workflow: ```yaml name: Build and Push on: push: branches: [main] pull_request: branches: [main] permissions: id-token: write # For OIDC contents: read security-events: write # For Sarif upload jobs: build-push: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Set up Docker Buildx uses: docker/setup-buildx-action@v3 - name: Configure AWS credentials uses: aws-actions/configure-aws-credentials@v4 with: role-to-assume: ${{ secrets.ECR_PUSH_ROLE }} aws-region: us-east-1 - name: Login to ECR id: ecr-login uses: aws-actions/amazon-ecr-login@v2 - name: Extract metadata id: meta uses: docker/metadata-action@v5 with: images: ${{ steps.ecr-login.outputs.registry }}/model-server tags: | type=sha,prefix=sha- type=semver,pattern={{version}} type=ref,event=branch - name: Build and push id: build uses: docker/build-push-action@v5 with: context: . push: ${{ github.event_name != 'pull_request' }} tags: ${{ steps.meta.outputs.tags }} labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha cache-to: type=gha,mode=max provenance: true sbom: true - name: Scan image uses: aquasecurity/trivy-action@master with: image-ref: ${{ steps.ecr-login.outputs.registry }}/model-server:sha-${{ github.sha }} format: sarif output: trivy-results.sarif severity: CRITICAL,HIGH exit-code: 1 - name: Upload scan results uses: github/codeql-action/upload-sarif@v2 with: sarif_file: trivy-results.sarif - name: Sign image if: github.event_name != 'pull_request' run: | cosign sign --yes \ ${{ steps.ecr-login.outputs.registry }}/model-server@${{ steps.build.outputs.digest }} ``` This workflow builds with BuildKit layer caching, pushes with multiple tags, generates SBOM and provenance attestations, scans for vulnerabilities, uploads scan results to GitHub Security, and signs the image — all in a single automated pipeline. **On PR** — the image is built but not pushed (`push: ${{ github.event_name != 'pull_request' }}`). This validates that the Dockerfile is correct without pushing potentially unstable images. **On merge to main** — the image is built, pushed with the commit SHA as a tag, scanned, and signed. The commit SHA tag means every commit produces a uniquely identified, traceable image. --- ## 22. Infrastructure Deployment Automation Automating infrastructure deployments — applying Terraform, deploying Helm charts, updating Kubernetes manifests — closes the loop between code changes and actual running systems. Manual infrastructure changes are slow, error-prone, and leave no audit trail. **Terraform automation in CI/CD:** ```yaml terraform-plan: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: hashicorp/setup-terraform@v3 with: terraform_version: 1.6.0 - name: Configure AWS credentials uses: aws-actions/configure-aws-credentials@v4 with: role-to-assume: ${{ secrets.TF_PLAN_ROLE }} aws-region: us-east-1 - name: Terraform init working-directory: infrastructure/environments/staging run: terraform init - name: Terraform plan working-directory: infrastructure/environments/staging run: terraform plan -out=plan.tfplan - name: Upload plan uses: actions/upload-artifact@v3 with: name: terraform-plan path: infrastructure/environments/staging/plan.tfplan terraform-apply: needs: terraform-plan environment: staging # Requires approval runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: hashicorp/setup-terraform@v3 - name: Download plan uses: actions/download-artifact@v3 with: name: terraform-plan path: infrastructure/environments/staging/ - name: Terraform apply working-directory: infrastructure/environments/staging run: terraform apply plan.tfplan ``` The pattern of separating plan and apply — with human approval between — ensures that the plan output is reviewed before infrastructure changes are actually made. The plan artifact is generated in the plan job and consumed in the apply job, guaranteeing that what was reviewed is exactly what's applied. **GitOps-based deployment** — for Kubernetes application deployments, GitOps (ArgoCD or Flux watching a Git repository) is often superior to pipeline-driven deployments. The pipeline updates image tags or Helm values in the GitOps repository. ArgoCD detects the change and applies it to the cluster. The pipeline doesn't need Kubernetes credentials — ArgoCD running inside the cluster handles the actual deployment. ```yaml update-deployment: needs: build-push runs-on: ubuntu-latest steps: - name: Checkout GitOps repo uses: actions/checkout@v4 with: repository: myorg/gitops-config token: ${{ secrets.GITOPS_PAT }} - name: Update image tag run: | sed -i 's|image: myregistry/model-server:.*|image: myregistry/model-server:sha-${{ github.sha }}|' \ environments/staging/model-server/deployment.yaml - name: Commit and push run: | git config user.email "ci@company.com" git config user.name "CI Bot" git add -A git commit -m "Update model-server to sha-${{ github.sha }}" git push ``` ArgoCD detects the Git commit and syncs the cluster to match. --- ## 23. Automated Versioning & Release Pipelines Manual versioning — a developer decides when to bump versions, manually edits version files, creates a tag, writes release notes — doesn't scale and produces inconsistent results. Automated versioning derives version numbers from commit history and creates releases without human intervention. **Conventional Commits** is the foundation. When your team commits using structured commit messages — `feat:`, `fix:`, `chore:`, `docs:`, `breaking change:` — automated tooling can analyze the commit history and determine what the next semantic version should be. A commit with `feat:` since the last release indicates a minor version bump. A commit with `fix:` indicates a patch bump. A commit with `BREAKING CHANGE:` in the footer indicates a major version bump. **Release Please** from Google automates the entire release workflow as a GitHub Action: ```yaml release-please: runs-on: ubuntu-latest if: github.ref == 'refs/heads/main' steps: - uses: google-github-actions/release-please-action@v4 with: release-type: python package-name: model-server ``` When this runs after a merge to main, Release Please analyzes commits since the last release, determines the appropriate version bump, updates version files and CHANGELOG.md, and creates a "Release PR" with all these changes. When the Release PR is merged, Release Please creates the Git tag and GitHub Release automatically. The changelog is generated from commit messages — `feat:` entries become "Features" sections, `fix:` entries become "Bug Fixes", breaking changes get prominent documentation. This creates accurate, consistently formatted release notes without manual writing. **Semantic Release** is another popular tool for the same purpose, with plugins for different languages and ecosystems: ```yaml # .releaserc.yml branches: - main plugins: - "@semantic-release/commit-analyzer" - "@semantic-release/release-notes-generator" - "@semantic-release/changelog" - "@semantic-release/github" - "@semantic-release/git" ``` **Version propagation** — when a release is created, the version needs to propagate to all the places it's referenced. Python's `pyproject.toml`, Docker image tags, Helm chart versions, Kubernetes deployment labels. Automated release pipelines handle all of this through post-release jobs triggered by the tag creation event. --- ## 24. Blue-Green & Canary Deployment Automation We covered Blue-Green and Canary strategies conceptually in the Deployment Strategies section. Here we focus on automating these strategies within CI/CD pipelines. **Automated Blue-Green with Kubernetes:** ```yaml blue-green-deploy: runs-on: ubuntu-latest environment: production steps: - name: Deploy to green slot run: | # Create green deployment with new version kubectl set image deployment/model-server-green \ model-server=myregistry/model-server:${{ github.sha }} \ -n production # Wait for green to be ready kubectl rollout status deployment/model-server-green \ -n production --timeout=10m - name: Run smoke tests against green run: | GREEN_IP=$(kubectl get svc model-server-green -o jsonpath='{.status.loadBalancer.ingress[0].ip}') python tests/smoke_test.py --host $GREEN_IP - name: Switch traffic to green run: | # Update the main service selector to point to green pods kubectl patch service model-server \ -p '{"spec":{"selector":{"slot":"green"}}}' - name: Wait and validate run: | sleep 60 python tests/production_validation.py - name: Update blue slot for next deployment run: | kubectl set image deployment/model-server-blue \ model-server=myregistry/model-server:${{ github.sha }} \ -n production ``` **Automated Canary with Argo Rollouts:** ```yaml canary-deploy: runs-on: ubuntu-latest steps: - name: Update rollout image run: | kubectl argo rollouts set image model-server \ model-server=myregistry/model-server:${{ github.sha }} - name: Monitor rollout run: | kubectl argo rollouts status model-server --timeout=30m ``` Argo Rollouts handles the canary progression automatically based on its configured analysis — it promotes traffic from 5% to 20% to 50% to 100% as long as error rate and latency metrics stay within thresholds, and rolls back automatically if they don't. The pipeline just updates the image and monitors the outcome. --- ## 25. Rollback Strategies Automated rollback is the safety net for when deployments go wrong. In high-quality CI/CD pipelines, rollback is not a manual panic response — it's an automated process that fires based on monitoring signals. **Automated deployment rollback** — in your deployment job, after applying the change, watch key metrics for a defined window: ```yaml deploy-and-validate: runs-on: ubuntu-latest steps: - name: Deploy new version run: kubectl set image deployment/model-server model-server=myregistry/model-server:${{ github.sha }} - name: Wait for rollout run: kubectl rollout status deployment/model-server --timeout=10m - name: Validate deployment health run: | # Wait 5 minutes and check error rate sleep 300 ERROR_RATE=$(curl -s "http://prometheus/api/v1/query?query=rate(http_errors_total[5m])" | jq '.data.result[0].value[1]') if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then echo "Error rate too high: $ERROR_RATE. Rolling back." kubectl rollout undo deployment/model-server exit 1 fi echo "Deployment healthy. Error rate: $ERROR_RATE" ``` **Kubernetes-native rollback** — for Kubernetes deployments, `kubectl rollout undo` reverts to the previous ReplicaSet and is very fast. The previous image is still cached on nodes, so rollback typically completes in under a minute. **GitOps rollback** — with GitOps deployments, rollback means reverting the Git commit that updated the image tag. ArgoCD detects the revert and rolls the cluster back to the previous state. This creates a clean audit trail — the rollback is visible in Git history. **Database migration rollback** — the hardest rollback scenario involves database schema changes. If a deployment includes a migration that is not backward-compatible (removes a column the old code reads), rolling back the application code while the migration has run leaves the old code broken against the new schema. This is why migrations should always be backward-compatible — add columns as nullable, don't remove columns in the same release as the code change that removes references to them (use the expand-contract pattern from the Terraform section). --- ## 26. CI/CD Security Best Practices CI/CD pipelines are one of the highest-value targets in your infrastructure. A compromised pipeline can deploy malicious code to production, exfiltrate secrets, or destroy infrastructure. Securing pipelines is as important as securing production systems. **Least privilege for pipeline credentials** — the IAM role or service account used by a pipeline job should have exactly the permissions needed for that job. The linting job needs no cloud access. The build job needs registry push access. The staging deployment job needs staging cluster access. The production deployment job needs production cluster access. Separate roles for each, with no cross-environment access. **Protect secrets from pull request workflows** — GitHub Actions has an important security boundary: secrets are not available to workflows triggered by pull requests from forks, because a malicious PR author could modify the workflow to exfiltrate secrets. This is correct behavior. For your own organization's PRs, configure your secrets and environments appropriately, but be aware of this boundary. **Pin action versions with SHA** — using `uses: actions/checkout@v4` is convenient but the `v4` tag can be moved to point to a different commit (supply chain attack). Using the full commit SHA `uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29` ensures you're using exactly the code you audited. **OIDC instead of static credentials** — as covered in secrets management, replacing long-lived static credentials with OIDC-based temporary credentials eliminates the most common credential leak vector in CI/CD. **Limit permissions on GitHub token** — GitHub Actions automatically provides a `GITHUB_TOKEN` with broad repository permissions. Restrict it to the minimum needed: ```yaml permissions: contents: read # Only read repository contents packages: write # Write to GitHub Packages if needed id-token: write # OIDC token generation ``` **Audit pipeline changes** — pipeline files are code and should go through PR review. Changes to pipeline files — especially to how secrets are used, what environments they deploy to, and what approval requirements are enforced — should require review from senior engineers or security team members. Consider code owners rules that require specific approvers for `.github/workflows/` changes. --- ## 27. Observability in CI/CD A CI/CD pipeline is infrastructure that you depend on for your entire development velocity. Without observability, you can't answer basic questions: how long do pipelines take? What's the failure rate? Where is time being spent? Which steps are getting slower over time? **Pipeline metrics** — collect and track key metrics for every pipeline run. Total duration, duration per stage, success/failure rate, failure rate by stage (which stages fail most often?), time to first feedback (how long until a developer knows if their commit passed?), and frequency (how many runs per day?). These metrics, tracked over time, reveal degrading performance, flaky tests, and bottlenecks. GitHub Actions doesn't expose metrics natively, but tools like Datadog CI Visibility, BuildPulse, and Grafana's GitHub Actions integration collect and visualize this data. **Test result tracking** — upload test results in standard formats (JUnit XML) and track them in a dedicated system. Tools like BuildPulse and Datadog CI Visibility track test results across runs and identify flaky tests (tests that fail intermittently without code changes), slow tests (tests taking significantly longer than average), and newly failing tests (tests that were passing and just started failing). Flaky tests are particularly insidious — they erode trust in CI. When developers see a red pipeline and assume it's a flaky test rather than a real failure, they start merging code without validating CI. Identifying and fixing flaky tests is a high-priority maintenance task for any team that depends on CI. **Pipeline failure alerts** — when pipelines fail on main (not on a PR — PR failures are expected), that's a signal that something merged that broke the build. Alert the team immediately — a broken main pipeline means nobody can ship until it's fixed. Paging or Slack alerts for main branch pipeline failures are appropriate. **Cost observability** — CI/CD compute costs money. Tracking cost per pipeline run, cost per repository, and cost trends over time helps identify expensive pipelines that might benefit from optimization. A matrix build running on 20 large runners for every PR might be appropriate for a critical library but overkill for a small service. **Trace individual pipeline runs** — when a pipeline fails, diagnosing the cause should be fast. Pipeline logs should be structured (machine-parseable as well as human-readable), retain for long enough to investigate incidents (at least 90 days), and be searchable. When an on-call engineer gets paged at 2am because the production deployment pipeline failed, they should be able to find the relevant log lines within seconds, not minutes. --- The thread connecting all of them is this: both Git discipline and CI/CD engineering are about creating systems that make the path from idea to production fast, safe, auditable, and reversible. Git gives you the history and collaboration foundation. CI/CD automates the quality gates and deployment mechanics. Together, done well, they create an engineering culture where shipping is easy and safe, and where mistakes are caught early and recovered from quickly.