Case Study: Implementing a Quality Framework for AI Code
After eight articles of theory, metrics, and tools, it is time to see everything in action. This case study documents the end-to-end implementation of a quality framework for AI-generated code at a fintech startup with a team of 12 developers, where 65% of new code is generated with AI assistants. We will walk through the implementation timeline, before-and-after metrics, challenges encountered, and concrete results.
The project took place over a period of 8 weeks with a dedicated team of 2 people (a quality engineer and a senior developer), with the goal of reducing the AI code defect rate by 40% and production incidents by 50%.
What You Will Learn
- How to plan and implement a quality framework for AI code in 8 weeks
- Baseline metrics and how to measure them before intervention
- The technical architecture of the framework: SAST, test intelligence, CI/CD guardrails
- Real challenges encountered and how they were overcome
- Measurable results achieved: defect reduction, incidents, costs
- Lessons learned and recommendations for other teams
Context: The FinPay Startup
FinPay (fictitious name) is a fintech startup developing a B2B payments platform. The team of 12 developers works on a Python/FastAPI codebase with approximately 180,000 lines of code. The adoption of GitHub Copilot and Claude occurred 8 months before the intervention, and the team had already noticed a concerning increase in production incidents.
Team and Project Profile
| Characteristic | Detail |
|---|---|
| Team size | 12 developers (4 senior, 5 mid, 3 junior) |
| Codebase | Python/FastAPI, 180K LOC |
| AI code ratio | 65% of new code |
| AI tools | GitHub Copilot, Claude (via API) |
| CI/CD | GitHub Actions, Docker, AWS ECS |
| Deployment frequency | 3-4 deploys/week |
| Main problem | 2x increase in incidents over 6 months |
Phase 1: Assessment and Baseline (Weeks 1-2)
Before implementing any solution, we conducted a thorough assessment to establish baseline metrics. Without starting data, it would be impossible to measure the impact of the intervention.
Baseline Data Collection
We analyzed the last 3 months of data from the Git repository, CI/CD pipeline, incident management system and existing SonarQube metrics (configured with standard thresholds, not optimized for AI code).
```python
# Baseline metrics collection script
class BaselineCollector:
    """Collects baseline metrics for the quality framework"""

    def __init__(self, git_client, sonar_client, incident_db):
        self.git = git_client
        self.sonar = sonar_client
        self.incidents = incident_db

    def collect_baseline(self, months=3):
        """Collects all baseline metrics"""
        return {
            "code_quality": self._collect_code_quality(),
            "defect_metrics": self._collect_defect_metrics(months),
            "incident_metrics": self._collect_incident_metrics(months),
            "review_metrics": self._collect_review_metrics(months),
            "coverage_metrics": self._collect_coverage_metrics(),
        }

    def _collect_defect_metrics(self, months):
        """Defect metrics separated by AI vs human"""
        commits = self.git.get_commits(months=months)
        ai_commits = [c for c in commits if self._is_ai_generated(c)]
        human_commits = [c for c in commits if not self._is_ai_generated(c)]
        return {
            "total_commits": len(commits),
            "ai_commits": len(ai_commits),
            "ai_ratio": round(len(ai_commits) / len(commits), 2),
            "ai_defect_rate": self._calculate_defect_rate(ai_commits),
            "human_defect_rate": self._calculate_defect_rate(human_commits),
            "ai_bugs_total": self._count_bugs(ai_commits),
            "human_bugs_total": self._count_bugs(human_commits),
        }

    def _collect_incident_metrics(self, months):
        """Production incident metrics"""
        incidents = self.incidents.get_recent(months=months)
        return {
            "total_incidents": len(incidents),
            "ai_related": len([i for i in incidents if i.ai_code_related]),
            "avg_mttr_hours": self._avg_mttr(incidents),
            "severity_distribution": self._severity_dist(incidents),
            "monthly_trend": self._monthly_trend(incidents),
        }
```
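The `_is_ai_generated` helper the collector relies on is not shown in the excerpt. A minimal sketch, assuming the team marks AI-assisted commits with a co-author trailer or an explicit `[ai]` tag in the commit message (both conventions are illustrative assumptions, not FinPay's documented practice):

```python
import re

# Hypothetical heuristic: the case study does not show FinPay's actual
# detection logic. This sketch assumes AI-assisted commits carry either a
# Copilot/Claude co-author trailer or an explicit "[ai]" marker.
AI_TRAILER = re.compile(r"Co-authored-by:.*(copilot|claude)", re.IGNORECASE)

def is_ai_generated(commit_message: str) -> bool:
    """Return True if the commit message suggests AI-generated code."""
    if "[ai]" in commit_message.lower():
        return True
    return bool(AI_TRAILER.search(commit_message))
```

Trailer-based detection only works if developers apply the convention consistently, which is why the assessment phase also cross-checked IDE telemetry where available.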
Baseline Metrics (Before Intervention)
| Metric | Baseline Value | Target |
|---|---|---|
| Defect rate (AI code, per 1000 LOC) | 5.8 | <3.5 |
| Production incidents/month | 8.3 | <3.5 |
| MTTR (hours) | 4.2 | <2.5 |
| Code coverage (new AI code) | 38% | >75% |
| Change Failure Rate | 22% | <10% |
| New code duplication | 19% | <5% |
| Average cognitive complexity | 18.4 | <10 |
| PR review time (hours) | 6.8 | <4 |
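The defect rate in the first row is normalized per 1000 lines of changed code, so teams of different sizes can be compared. A sketch of the calculation (the raw bug and LOC counts below are illustrative, chosen only to land near the 5.8 baseline figure; they are not FinPay's actual data):

```python
def defect_rate_per_kloc(bug_count: int, lines_changed: int) -> float:
    """Defects per 1000 lines of code, as reported in the baseline table."""
    if lines_changed == 0:
        return 0.0
    return round(bug_count / lines_changed * 1000, 1)

# Illustrative: 203 bugs traced to ~35,000 lines of new AI code
# gives a rate in the ballpark of the 5.8 baseline.
```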
Phase 2: Quick Wins (Weeks 3-4)
In the second phase we implemented immediate-impact changes: pre-commit secret detection, custom SonarQube quality gate and AI code review checklist. The goal was to achieve visible results quickly to generate momentum within the team.
Secret Detection Implementation
The first intervention was implementing secret detection as a pre-commit hook. In the first 3 days, the tool intercepted 7 hardcoded credentials that would have ended up in the repository, including 2 production API keys.
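The case study names TruffleHog only for the CI stage and does not specify which tool ran locally; a minimal pre-commit setup along these lines, assuming the standard `pre-commit` framework with Gitleaks, would achieve the same interception:

```yaml
# .pre-commit-config.yaml -- illustrative sketch; the article does not
# name the exact local tool, only TruffleHog in the CI pipeline.
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks
```

Running the scan before the commit is created means credentials never enter Git history, which is far cheaper than rotating keys after a push.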
Custom SonarQube Quality Gate
We created an "AI Code Strict" Quality Gate with thresholds calibrated for AI code, based on the baseline metrics collected in phase 1. Thresholds were set to block 20-25% of PRs on first attempt, balancing enforcement and productivity.
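Custom quality gates can be created through the SonarQube Web API; a sketch of how the "AI Code Strict" gate's conditions could be set up, with thresholds mirroring the targets in the baseline table (`$SONAR_HOST` and `$SONAR_TOKEN` are placeholders for your instance URL and an admin token):

```shell
# Create the gate, then attach conditions on new code.
curl -s -u "$SONAR_TOKEN:" -X POST \
  "$SONAR_HOST/api/qualitygates/create" -d "name=AI Code Strict"

# Fail if coverage on new code drops below 75%
curl -s -u "$SONAR_TOKEN:" -X POST \
  "$SONAR_HOST/api/qualitygates/create_condition" \
  -d "gateName=AI Code Strict" -d "metric=new_coverage" \
  -d "op=LT" -d "error=75"

# Fail if duplication on new code exceeds 5%
curl -s -u "$SONAR_TOKEN:" -X POST \
  "$SONAR_HOST/api/qualitygates/create_condition" \
  -d "gateName=AI Code Strict" -d "metric=new_duplicated_lines_density" \
  -d "op=GT" -d "error=5"
```

Scripting the gate definition keeps it reproducible across SonarQube instances instead of depending on manual UI configuration.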
Structured Review Checklist
We introduced a mandatory checklist for AI-generated code review, integrated as a PR template in GitHub. The checklist covers the 10 critical areas identified in the assessment and includes specific questions the reviewer must answer before approving.
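The actual 10 checklist areas are not listed in this section; an illustrative excerpt of what such a PR template might contain:

```markdown
<!-- .github/PULL_REQUEST_TEMPLATE.md (illustrative excerpt) -->
## AI Code Review Checklist
- [ ] Was any part of this change AI-generated? If so, which files?
- [ ] Error handling: are failure paths explicit, not just happy-path code?
- [ ] Security: no hardcoded secrets, injection-prone string building,
      or overly broad permissions?
- [ ] Tests: does new code meet the coverage gate with meaningful asserts?
```

Because GitHub renders the template into every new PR description, reviewers cannot approve without at least scrolling past the questions.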
Phase 3: Advanced Automation (Weeks 5-6)
The third phase introduced advanced automation into the pipeline: SAST with custom rules for AI patterns, test intelligence with mutation testing and integrated security scanning.
```yaml
# CI/CD Pipeline implemented for FinPay
# .github/workflows/quality-framework.yml
name: AI Code Quality Framework

on:
  pull_request:
    branches: [main, develop]

jobs:
  secret-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: TruffleHog Secret Scan
        uses: trufflesecurity/trufflehog@main
        with:
          extra_args: --only-verified

  quality-analysis:
    needs: secret-scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Run Tests with Coverage
        run: |
          pip install -r requirements.txt
          pytest tests/ --cov=src --cov-report=xml \
            --cov-fail-under=75 -v
      - name: Semgrep SAST
        uses: returntocorp/semgrep-action@v1
        with:
          config: >-
            p/python
            p/security-audit
            .semgrep/ai-rules.yml
      - name: SonarQube Analysis
        uses: sonarsource/sonarqube-scan-action@master
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
```
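The mutation-testing part of the test-intelligence layer runs outside the PR pipeline because of its runtime cost. A minimal setup, assuming mutmut (the article does not name which mutation framework FinPay adopted), might look like:

```shell
# Illustrative mutation-testing setup with mutmut (assumed tool).
pip install mutmut

# Mutate the source tree and re-run the test suite against each mutant.
mutmut run --paths-to-mutate src/

# List surviving mutants: code changes no test caught, i.e. weak spots
# in the suite that coverage numbers alone would not reveal.
mutmut results
```

Surviving mutants are a sharper signal than line coverage for AI-generated tests, which often execute code without asserting anything meaningful about it.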