Quality Metrics for AI-Generated Code
Measuring code quality is the first step toward improving it. When dealing with AI-generated code, traditional metrics remain valid but require calibrated thresholds and specific interpretations. AI produces code with complexity patterns different from human code, making an adapted measurement framework necessary.
In this article we will analyze the fundamental metrics for evaluating AI-generated code quality: cyclomatic complexity, code coverage, maintainability index, and duplication, and how to integrate them into a continuous monitoring system with tools like SonarQube and CodeFactor.
What You Will Learn
- How cyclomatic and cognitive complexity metrics work in the AI context
- Strategies to measure and improve code coverage on AI-generated code
- The Maintainability Index and how to interpret it for generated code
- Duplication detection techniques specific to AI output
- How to configure SonarQube with adapted thresholds for AI code
- DORA metrics adapted to the AI development context
Cyclomatic Complexity: Measuring Execution Paths
Cyclomatic complexity, introduced by Thomas McCabe in 1976, measures the number of independent execution paths through source code. For AI-generated code, this metric assumes particular importance: AI tends to generate functions with more branches than necessary, often duplicating checks or adding redundant conditions.
The basic formula is simple: V(G) = E - N + 2P, where E is the number of edges of the flow graph, N the number of nodes and P the number of connected components. In practice, each if, for, while, case and ternary operator adds 1 to the complexity.
```python
# Example: cyclomatic complexity calculation
import ast

class CyclomaticComplexityVisitor(ast.NodeVisitor):
    """Calculates the cyclomatic complexity of Python functions."""

    def __init__(self):
        self.results = []

    def visit_FunctionDef(self, node):
        complexity = 1  # base path
        for child in ast.walk(node):
            if isinstance(child, (ast.If, ast.IfExp)):
                complexity += 1
            elif isinstance(child, ast.For):
                complexity += 1
            elif isinstance(child, ast.While):
                complexity += 1
            elif isinstance(child, ast.ExceptHandler):
                complexity += 1
            elif isinstance(child, ast.BoolOp):
                # and/or chains add one path per extra operand
                complexity += len(child.values) - 1
        self.results.append({
            "function": node.name,
            "line": node.lineno,
            "complexity": complexity,
            "risk": self._classify_risk(complexity),
        })
        self.generic_visit(node)

    # async functions are counted the same way
    visit_AsyncFunctionDef = visit_FunctionDef

    def _classify_risk(self, complexity):
        if complexity <= 5:
            return "LOW"
        elif complexity <= 10:
            return "MODERATE"
        elif complexity <= 20:
            return "HIGH"
        return "VERY_HIGH"
```
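For quick checks in a CI script, the same counting logic fits into a single standalone function. The function name `cyclomatic_complexity` and its dict output are our choice here, and this is a sketch rather than a complete McCabe implementation: comprehension conditions and match statements, for instance, are not counted.

```python
import ast

def cyclomatic_complexity(source: str) -> dict:
    """Return {function_name: complexity} for every function in `source`."""
    tree = ast.parse(source)
    results = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            complexity = 1  # base path
            for child in ast.walk(node):
                if isinstance(child, (ast.If, ast.IfExp, ast.For, ast.While,
                                      ast.ExceptHandler)):
                    complexity += 1
                elif isinstance(child, ast.BoolOp):
                    # and/or chains add one path per extra operand
                    complexity += len(child.values) - 1
            results[node.name] = complexity
    return results

snippet = '''
def classify(x):
    if x is None:
        return "none"
    if x < 0 and x > -10:
        return "small negative"
    return "other"
'''
print(cyclomatic_complexity(snippet))  # {'classify': 4}
```

The two `if` statements contribute 2 and the `and` contributes 1, so the function scores 4: two points of which, the redundant null check and the compound condition, are exactly the patterns AI assistants tend to over-produce.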
Cyclomatic Complexity Thresholds for AI Code
| Complexity | Human Code Risk | AI Code Risk | Recommended Action |
|---|---|---|---|
| 1-5 | Low | Low | Acceptable, standard review |
| 6-10 | Moderate | High | Thorough review, mandatory tests |
| 11-20 | High | Critical | Mandatory refactoring before merge |
| 21+ | Very high | Unacceptable | Merge blocked, decomposition required |
Cognitive Complexity: Beyond Numbers
Cognitive complexity, developed by SonarSource, goes beyond cyclomatic complexity by measuring how difficult the code is to understand for a human being. It particularly penalizes deep nesting, breaks in linear flow and recursive structures, all common patterns in AI code.
Unlike cyclomatic complexity, cognitive complexity assigns progressive increments to nesting levels: an if inside a for inside a try has a much higher weight than three sequential conditions.
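The progressive-nesting rule can be illustrated with a toy calculator. This is a deliberately simplified sketch of SonarSource's specification (it only handles a few statement types and ignores the special cases for `else`, logical operators and recursion):

```python
import ast

def cognitive_complexity(func_source: str) -> int:
    """Simplified cognitive complexity: each branching construct costs
    1 plus its current nesting depth (a subset of SonarSource's rules)."""
    tree = ast.parse(func_source)
    func = tree.body[0]

    def walk(node, nesting):
        score = 0
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.If, ast.For, ast.While, ast.Try)):
                score += 1 + nesting          # progressive nesting penalty
                score += walk(child, nesting + 1)
            else:
                score += walk(child, nesting)
        return score

    return walk(func, 0)

# Three sequential ifs: 1 + 1 + 1 = 3
flat = "def f(x):\n    if x: pass\n    if x: pass\n    if x: pass"
# Three nested ifs: 1 + 2 + 3 = 6
nested = "def g(x):\n    if x:\n        if x:\n            if x: pass"
print(cognitive_complexity(flat), cognitive_complexity(nested))  # 3 6
```

The same three conditions cost twice as much when nested, which is why deeply indented AI output scores badly even when its cyclomatic complexity looks acceptable.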
Code Coverage: Beyond the Percentage
Code coverage measures the percentage of code exercised by automated tests. For AI-generated code, the problem is not just the low percentage, but the quality of coverage. AI often generates tests that only verify the happy path, systematically ignoring edge cases and error conditions.
```python
# Framework for AI-specific code coverage analysis
class AICoverageAnalyzer:
    """Analyzes coverage quality on AI-generated code."""

    def __init__(self, coverage_data, source_metadata):
        self.coverage = coverage_data
        self.metadata = source_metadata

    def analyze_coverage_quality(self):
        """Multi-dimensional coverage analysis."""
        return {
            "line_coverage": self._line_coverage(),
            "branch_coverage": self._branch_coverage(),
            "path_coverage": self._path_coverage(),
            "error_path_coverage": self._error_path_coverage(),
            "edge_case_coverage": self._edge_case_score(),
            "overall_quality": self._quality_score(),
        }

    def _error_path_coverage(self):
        """Measures coverage specific to error paths."""
        error_handlers = self.metadata.get("error_handlers", [])
        if not error_handlers:
            return 0.0
        covered = [h for h in error_handlers
                   if self.coverage.is_covered(h["line"])]
        return len(covered) / len(error_handlers)

    def _edge_case_score(self):
        """Evaluates whether tests cover typical AI edge cases."""
        checks = [
            self._has_null_tests(),
            self._has_empty_input_tests(),
            self._has_boundary_tests(),
            self._has_type_error_tests(),
            self._has_concurrent_tests(),
        ]
        return sum(checks) / len(checks)

    def _quality_score(self):
        """Composite score: not just how much, but WHAT is covered."""
        line = self._line_coverage()
        branch = self._branch_coverage()
        error = self._error_path_coverage()
        edge = self._edge_case_score()
        # Higher weight on error paths and edge cases for AI code
        return (line * 0.2 + branch * 0.2 +
                error * 0.35 + edge * 0.25)

    # _line_coverage, _branch_coverage, _path_coverage and the _has_* checks
    # are assumed to be implemented against your coverage tool's report format.
```
Coverage Strategies for AI Code
For AI-generated code, 80% coverage that only covers happy paths is less useful than 60% coverage that includes error handling and edge cases. Recommended strategies include mandatory branch coverage, specific tests for error conditions and property-based testing to discover unexpected edge cases.
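In practice you would use a library such as Hypothesis for property-based testing; the core idea, however, is dependency-free: generate many random inputs and assert properties that must hold for all of them, rather than a handful of hand-picked examples. A minimal sketch (the function under test, `normalize_whitespace`, stands in for a typical AI-generated utility):

```python
import random

def normalize_whitespace(s: str) -> str:
    """Function under test (e.g. AI-generated): collapse whitespace runs."""
    return " ".join(s.split())

def check_properties(trials: int = 1000) -> None:
    """Properties: the output never contains double spaces, and the
    function is idempotent (normalizing twice changes nothing)."""
    rng = random.Random(42)  # fixed seed for reproducibility
    alphabet = "ab \t\n"
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 30)))
        out = normalize_whitespace(s)
        assert "  " not in out, f"double space for input {s!r}"
        assert normalize_whitespace(out) == out, f"not idempotent for {s!r}"

check_properties()
print("all properties held")
```

A thousand random inputs routinely hit the empty strings, lone tabs and leading/trailing whitespace that AI-written example tests skip.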
Maintainability Index
The Maintainability Index (MI) is a composite metric that combines Halstead volume, cyclomatic complexity and lines of code to produce a single value representing how easy the code is to maintain. The scale ranges from 0 (impossible to maintain) to 100 (perfectly maintainable).
```python
import math

def maintainability_index(halstead_volume, cyclomatic_complexity, loc):
    """
    Calculates the Maintainability Index using the Microsoft formula.

    MI = max(0, (171 - 5.2 * ln(V) - 0.23 * G - 16.2 * ln(LOC)) * 100 / 171)

    Args:
        halstead_volume: Halstead volume of the module
        cyclomatic_complexity: Average cyclomatic complexity
        loc: Lines of code (SLOC)

    Returns:
        float: Maintainability Index (0-100)
    """
    if halstead_volume <= 0 or loc <= 0:
        return 0.0
    mi = (171
          - 5.2 * math.log(halstead_volume)
          - 0.23 * cyclomatic_complexity
          - 16.2 * math.log(loc))
    # Normalization to the 0-100 range
    mi = max(0, mi * 100 / 171)
    return round(mi, 2)

# Interpretation for AI-generated code:
# 85-100: Excellent - acceptable without modifications
# 65-84:  Good - review recommended
# 40-64:  Moderate - refactoring recommended
# 0-39:   Poor - mandatory refactoring
```
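Plugging illustrative numbers into the formula shows how quickly verbosity erodes the index. The helper `mi` below is just a local restatement of the formula above, and both input triples are invented for the comparison:

```python
import math

# MI = max(0, (171 - 5.2*ln(V) - 0.23*G - 16.2*ln(LOC)) * 100 / 171)
def mi(volume, complexity, loc):
    raw = (171 - 5.2 * math.log(volume)
           - 0.23 * complexity - 16.2 * math.log(loc))
    return max(0.0, raw * 100 / 171)

# A compact helper: small Halstead volume, low complexity, few lines
compact = mi(50, 2, 15)
# The same logic generated verbosely by an assistant (illustrative numbers)
verbose = mi(400, 6, 90)
print(round(compact, 1), round(verbose, 1))  # 62.2 38.3
```

The verbose variant drops from "Moderate" to "Poor" even though it implements identical behavior: the logarithmic LOC and volume terms punish exactly the padding AI tends to add.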
Why Maintainability Index Is Critical for AI Code
- AI generates verbose code that increases LOC without adding semantic value
- AI functions tend to have high Halstead volume due to operator variety
- Average cyclomatic complexity is higher in automatically generated code
- A low MI in AI code indicates nobody has optimized the generator's output
Duplication Detection
AI-generated code exhibits significantly higher duplication rates than average. This happens because each prompt generates an isolated solution, without awareness of code already existing in the project. The result is duplicate utility functions, identically copied error handling patterns and repeated boilerplate.
Traditional duplicate detection tools (CPD, jscpd) work but require reduced sensitivity thresholds to catch the near-miss duplication typical of AI, where functions are nearly identical but with different variable names or slight structural variations.
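One way to catch near-miss duplication is to compare functions after masking identifiers, so renamed variables cannot hide the copy. A sketch using only the standard library (the masking strategy and the use of `difflib.SequenceMatcher` are our choices, and a production detector would also normalize constants and statement order):

```python
import ast
import difflib

def normalized_tokens(func_source: str) -> list:
    """Dump the AST with function, argument and variable names masked."""
    tree = ast.parse(func_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.FunctionDef):
            node.name = "_"
    return ast.dump(tree).split()

def similarity(a: str, b: str) -> float:
    """Structural similarity in [0, 1] between two function sources."""
    return difflib.SequenceMatcher(
        None, normalized_tokens(a), normalized_tokens(b)).ratio()

f1 = "def total(items):\n    s = 0\n    for it in items:\n        s += it\n    return s"
f2 = "def accumulate(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc"
print(similarity(f1, f2))  # 1.0: identical once names are masked
```

Text-level diffing would rate these two functions as quite different; structurally they are the same AI-generated accumulator pasted twice.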
Technical Debt Scoring
The Technical Debt Score aggregates all previous metrics into a single indicator representing the accumulated "cost" of technical debt. For AI code, this scoring must place greater weight on metrics where AI tends to perform worse.
Recommended Weights for AI Technical Debt Score
| Metric | Human Code Weight | AI Code Weight | Rationale |
|---|---|---|---|
| Cyclomatic complexity | 20% | 15% | Already penalized in MI |
| Duplication | 15% | 25% | Dominant problem in AI code |
| Missing coverage | 25% | 30% | AI generates fewer tests |
| Vulnerabilities | 30% | 20% | Treated separately in security |
| Code smells | 10% | 10% | Weight unchanged |
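The table's AI-code weights reduce to a simple weighted sum over normalized sub-scores. A sketch (sub-score names, the 0-to-1 normalization convention and the example values are ours):

```python
# Weights from the "AI Code Weight" column above
AI_WEIGHTS = {
    "complexity": 0.15,
    "duplication": 0.25,
    "missing_coverage": 0.30,
    "vulnerabilities": 0.20,
    "code_smells": 0.10,
}

def technical_debt_score(subscores: dict, weights: dict = AI_WEIGHTS) -> float:
    """Each subscore is normalized to 0..1 (1 = worst). Returns 0..100."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return round(100 * sum(subscores[k] * w for k, w in weights.items()), 1)

example = {
    "complexity": 0.3,        # moderate average complexity
    "duplication": 0.6,       # high: typical for unreviewed AI output
    "missing_coverage": 0.5,  # half the new code is untested
    "vulnerabilities": 0.1,
    "code_smells": 0.2,
}
print(technical_debt_score(example))  # 38.5
```

With human-code weights the same sub-scores would yield a lower total, because duplication and missing coverage, the two AI weak spots, would count for less.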
DORA Metrics Adapted for AI Development
DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, Time to Restore Service) are the reference standard for measuring engineering team performance. With the adoption of AI coding tools, these metrics need specific reinterpretation.
- Deployment Frequency: may increase with AI, but without quality gates risks becoming a misleading indicator
- Lead Time for Changes: decreases drastically in the coding phase but may increase in review and fix
- Change Failure Rate: the key metric to monitor, tends to increase with unvalidated AI code
- Time to Restore Service: may worsen if AI code is harder to debug
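The two metrics most affected by AI tooling, lead time and change failure rate, are straightforward to derive from deployment records. A sketch with illustrative field names and data:

```python
from datetime import datetime

# Hypothetical deployment log: commit time, deploy time, failure flag
deployments = [
    {"committed": datetime(2025, 3, 1, 9), "deployed": datetime(2025, 3, 1, 15), "failed": False},
    {"committed": datetime(2025, 3, 2, 10), "deployed": datetime(2025, 3, 3, 10), "failed": True},
    {"committed": datetime(2025, 3, 3, 8), "deployed": datetime(2025, 3, 3, 12), "failed": False},
    {"committed": datetime(2025, 3, 4, 9), "deployed": datetime(2025, 3, 4, 11), "failed": True},
]

def lead_time_hours(deps):
    """Mean Lead Time for Changes, commit to deploy, in hours."""
    total = sum((d["deployed"] - d["committed"]).total_seconds() for d in deps)
    return total / len(deps) / 3600

def change_failure_rate(deps):
    """Share of deployments that caused a failure in production."""
    return sum(d["failed"] for d in deps) / len(deps)

print(round(lead_time_hours(deployments), 1))  # 9.0
print(change_failure_rate(deployments))        # 0.5
```

Tracking these per code origin (AI vs human commits) is what turns generic DORA numbers into an AI-adoption signal.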
SonarQube Configuration for AI Code
SonarQube is the most widely used static analysis tool and natively supports all the metrics discussed. For AI-generated code, the standard configuration is not sufficient: custom quality profiles with more restrictive thresholds and additional rules are needed.
```properties
# sonar-project.properties - Configuration for AI code
sonar.projectKey=my-project
sonar.sources=src
sonar.tests=tests

# Custom Quality Gate for AI code: more restrictive thresholds on the
# critical KPIs. Note that quality gate conditions are defined on the
# SonarQube server (UI or Web API, api/qualitygates), not in this file;
# the intended gate for AI code is:
#   new_coverage >= 75
#   new_duplicated_lines_density <= 5
#   new_maintainability_rating = A
#   new_reliability_rating = A
#   new_security_rating = A

# Per-function cognitive complexity (<= 8 for AI code) is enforced via the
# threshold parameter of rule python:S3776 in a custom quality profile.
# Rules targeting common AI patterns, such as error handling (python:S1181)
# and input validation (python:S5659), are activated in that profile too;
# this file can only *exclude* issues, via sonar.issue.ignore.multicriteria.
```
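Because quality gate conditions live on the server, teams often script them against the Web API so the AI gate is reproducible across instances. A sketch that only builds the payloads; the endpoint and parameter names (`api/qualitygates/create_condition`, `gateName`, `metric`, `op`, `error`) follow recent SonarQube versions and should be verified against your server's own API documentation:

```python
# Conditions for the AI gate: metric key, operator, threshold.
# In SonarQube, op=LT fails the gate when the value is below the
# threshold, op=GT when it is above.
AI_GATE_CONDITIONS = [
    ("new_coverage", "LT", "75"),
    ("new_duplicated_lines_density", "GT", "5"),
]

def condition_payloads(gate_name: str, conditions):
    """One POST payload per condition for api/qualitygates/create_condition."""
    return [
        {"gateName": gate_name, "metric": metric, "op": op, "error": error}
        for metric, op, error in conditions
    ]

payloads = condition_payloads("ai-code-gate", AI_GATE_CONDITIONS)
print(payloads[0])
# Each payload would then be POSTed with an authenticated client, e.g.:
#   requests.post(f"{base_url}/api/qualitygates/create_condition",
#                 params=payload, auth=(token, ""))
```

Scripting the gate also makes threshold changes reviewable in version control, instead of living only in the server UI.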
Monitoring Dashboard
An effective dashboard for monitoring AI code quality must visualize metrics so that trends are immediately visible. Key metrics to track include the ratio of AI to human code over time, defect rate trends by origin, coverage quality score and technical debt trends.
Essential Metrics for the AI Code Quality Dashboard
- AI Code Ratio: percentage of AI-generated code out of total
- Defect Density by Origin: defects per 1000 LOC, separated by AI vs human
- Coverage Quality Score: composite metric evaluating both quality and quantity of coverage
- Complexity Trend: average complexity trends over time
- Duplication Delta: duplication variation commit by commit
- MTTR by Code Origin: mean time to resolution for bugs in AI vs human code
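Defect Density by Origin, for example, needs only per-file (or per-commit) attribution of LOC and linked defects. A sketch; how a line gets attributed to AI in the first place, e.g. via commit trailers from your coding assistant, is left to your tooling:

```python
def defect_density_by_origin(files):
    """Defects per 1000 LOC, split by code origin ('ai' vs 'human')."""
    totals = {"ai": {"loc": 0, "defects": 0}, "human": {"loc": 0, "defects": 0}}
    for f in files:
        bucket = totals[f["origin"]]
        bucket["loc"] += f["loc"]
        bucket["defects"] += f["defects"]
    return {origin: round(t["defects"] / t["loc"] * 1000, 2) if t["loc"] else 0.0
            for origin, t in totals.items()}

# Hypothetical per-file attribution data
files = [
    {"origin": "ai", "loc": 1200, "defects": 9},
    {"origin": "human", "loc": 3000, "defects": 6},
    {"origin": "ai", "loc": 800, "defects": 5},
]
print(defect_density_by_origin(files))  # {'ai': 7.0, 'human': 2.0}
```

A persistent gap between the two densities is the clearest dashboard signal that AI output needs stricter gates before merge.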
Conclusions
Quality metrics for AI-generated code are not fundamentally different from traditional ones, but they require calibrated thresholds and specific interpretations. Cyclomatic and cognitive complexity, code coverage, maintainability index and duplication detection remain the pillars of measurement, but must be adapted to the characteristic patterns of AI output.
In the next article we will focus on security detection in AI-generated code, analyzing the most common vulnerabilities introduced by AI assistants, specific OWASP patterns and how to implement automated scanning to intercept security issues before they reach production.
The fundamental principle is clear: you cannot improve what you do not measure. And for AI code, measurement must be more careful, more frequent and more specific than ever.