🎯 Overfit Detection Guide
Understand how the engine evaluates whether your strategy has genuine predictive power or is just curve-fitting historical data
When you backtest a trading strategy, you're essentially asking: "Would this strategy have worked in the past?" But there's a trap: with enough parameters and tweaking, you can make almost any strategy look profitable on historical data, even if it has no real predictive power. This is called overfitting. The Overfit Detection Engine helps you distinguish genuine strategies from illusory performance that will likely disappear in live trading.
How the Engine Works
The engine calculates an Overfit Probability (0-100%) by combining five independent signals, each measuring a different aspect of potential overfitting.
The Logistic Transformation
The engine uses a logistic (S-curve) function to convert composite scores to probabilities:
- Composite score of 0 → ~2% overfit probability
- Composite score of 50 → 50% overfit probability
- Composite score of 100 → ~98% overfit probability
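The three anchor points above pin down a logistic curve. A minimal sketch of that mapping, assuming a steepness constant chosen so the endpoints land at exactly 2% and 98% (the engine's actual constant may differ):

```python
import math

def overfit_probability(composite_score: float) -> float:
    """Map a 0-100 composite score to an overfit probability via a logistic curve.

    The steepness k is a hypothetical choice solved from the anchor points:
    1 / (1 + e^(50k)) = 0.02  =>  k = ln(49) / 50.
    """
    k = math.log(49) / 50
    return 1.0 / (1.0 + math.exp(-k * (composite_score - 50)))
```

With this constant, a score of 50 maps to exactly 50%, and scores of 0 and 100 map to 2% and 98% respectively.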
The Five Detection Signals
1. Parameter Complexity
The relationship between how many adjustable parameters your strategy has versus how many trades it generated
More parameters = more ways to accidentally fit noise. A strategy with 20 parameters and only 50 trades is almost certainly overfit. A strategy with 3 parameters and 500 trades is much more likely to be genuine.
Parameter-to-Trade Ratio Interpretation
Rule of thumb: Aim for at least 50 trades per parameter. If you have 10 parameters, aim for 500+ trades.
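The rule of thumb above can be expressed as a simple check. The band boundaries here are assumptions: "ok" follows the 50-trades-per-parameter rule (ratio ≤ 0.02), and "critical" follows the red-flag threshold of 0.05 listed later in this guide; the engine's internal scoring is more granular.

```python
def parameter_trade_check(n_params: int, n_trades: int) -> str:
    """Illustrative parameter-to-trade ratio check (not the engine's API)."""
    if n_trades == 0:
        return "no trades"
    ratio = n_params / n_trades
    if ratio <= 0.02:    # at least 50 trades per parameter
        return "ok"
    if ratio <= 0.05:    # 20-50 trades per parameter
        return "warning"
    return "critical"    # fewer than 20 trades per parameter
```

For example, 3 parameters over 500 trades (ratio 0.006) passes, while 20 parameters over 50 trades (ratio 0.4) is firmly in red-flag territory.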
2. Sample Adequacy
Whether you have enough trades for statistically reliable conclusions
With too few trades, even random strategies can look profitable by chance. More trades = more confidence that results aren't just luck.
| Trades | Confidence Level | Score Range |
|---|---|---|
| < 20 | Critical | 90-100 |
| 20-50 | Low | 40-70 |
| 50-100 | Medium | 20-40 |
| 100-200 | High | 5-20 |
| > 200 | Very High | < 5 |
Warning: Results from fewer than 20 trades are essentially meaningless. The engine will flag these as "Critical" regardless of other signals.
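The confidence bands in the table translate directly into code. How the engine handles the exact boundary values (e.g. a count of exactly 50 or 100) is not specified, so the cutoffs below are an assumption:

```python
def sample_confidence(n_trades: int) -> str:
    """Map a trade count to the confidence bands from the table above."""
    if n_trades < 20:
        return "Critical"
    if n_trades < 50:
        return "Low"
    if n_trades < 100:
        return "Medium"
    if n_trades <= 200:
        return "High"
    return "Very High"
```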
3. Return Normality
Whether your trade returns follow a normal (bell-curve) distribution
Genuine trading strategies typically produce returns that are roughly normally distributed. Highly non-normal distributions (extreme skew, fat tails) can indicate data issues, outlier-dependent performance, or curve-fitting to specific events.
Shapiro-Wilk W Statistic
Skewness
Measures asymmetry of the distribution. Positive skew means more extreme gains than losses; negative skew means more extreme losses than gains.
Kurtosis
Measures "tailedness". Positive (fat tails) means more extreme outcomes than normal; negative (thin tails) means fewer extreme outcomes.
This signal requires at least 5 trades to calculate. Some non-normality is expected in trading returns—focus on extreme cases (W < 0.80).
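Skewness and excess kurtosis can be computed from raw trade returns with the standard moment formulas. The sketch below uses the population (biased) estimators; the engine may use bias-corrected versions, and the Shapiro-Wilk W itself is typically taken from a stats library such as `scipy.stats.shapiro` rather than computed by hand.

```python
def moments(returns):
    """Sample skewness and excess kurtosis of a list of trade returns.

    Excess kurtosis is reported relative to the normal distribution,
    so a perfectly normal sample would score 0.
    """
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n  # variance
    m3 = sum((r - mean) ** 3 for r in returns) / n  # third central moment
    m4 = sum((r - mean) ** 4 for r in returns) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0  # subtract 3: excess kurtosis
    return skew, kurt
```

A symmetric sample yields zero skew; a flat (thin-tailed) sample yields negative excess kurtosis.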
4. Trade Clustering
Whether winning and losing trades are randomly distributed or clustered together
In a genuine strategy, wins and losses should be relatively random. If all your wins happened in one period and all losses in another, the strategy may have been fit to specific market conditions.
Wald-Wolfowitz Runs Test
A "run" is a sequence of consecutive wins or consecutive losses. The engine compares observed runs to expected runs under randomness.
Warning: Severe clustering (runs ratio < 0.5) is a strong overfit indicator. Your strategy may only work in specific market conditions.
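The runs ratio can be sketched directly from the Wald-Wolfowitz formula: expected runs under randomness is 1 + 2·n₁·n₂/(n₁+n₂), where n₁ and n₂ are the win and loss counts. This is the textbook statistic; the engine's exact handling of edge cases is assumed.

```python
def runs_ratio(wins):
    """Observed-to-expected runs for a win/loss sequence (Wald-Wolfowitz).

    `wins` is a sequence of booleans (True = winning trade). A ratio well
    below 1 means wins and losses cluster more than randomness predicts.
    """
    n1 = sum(wins)
    n2 = len(wins) - n1
    if n1 == 0 or n2 == 0:
        return None  # test undefined with only wins or only losses
    observed = 1 + sum(1 for a, b in zip(wins, wins[1:]) if a != b)
    expected = 1 + 2 * n1 * n2 / (n1 + n2)
    return observed / expected
```

A perfectly alternating sequence scores above 1 (more runs than expected), while a fully clustered sequence like WWWLLL scores 0.5, right at the severe-clustering threshold.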
5. Regime Sensitivity
How consistently your strategy performs across different market regimes (bull, bear, sideways, volatile)
An overfit strategy often performs brilliantly in one regime and terribly in others. A robust strategy should show reasonable consistency across different market conditions.
Coefficient of Variation (CV)
CV = (Standard Deviation of regime returns) / |Mean of regime returns| × 100%
This signal needs at least 2 regimes with 5+ trades each. If regime data isn't available, consider running regime analysis first.
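The CV formula and the minimum-data rule above can be sketched together. The choice of the population standard deviation over the mean returns per regime is an assumption; the engine's exact estimator may differ.

```python
def regime_cv(regime_returns):
    """Coefficient of variation of mean returns across regimes, in percent.

    `regime_returns` maps regime name -> list of trade returns. Regimes with
    fewer than 5 trades are ignored, and at least 2 qualifying regimes are
    required, per the rule above.
    """
    means = [sum(r) / len(r) for r in regime_returns.values() if len(r) >= 5]
    if len(means) < 2:
        return None  # not enough regime data
    mean_of_means = sum(means) / len(means)
    if mean_of_means == 0:
        return float("inf")
    sd = (sum((m - mean_of_means) ** 2 for m in means) / len(means)) ** 0.5
    return sd / abs(mean_of_means) * 100
```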
Risk Level Thresholds
The final overfit probability maps to four risk levels, each with specific recommended actions.
🟢 Low Risk
Strategy shows strong signs of genuine predictive power
🔵 Moderate Risk
Some overfit signals present, proceed with caution
🟡 High Risk
Multiple overfit indicators suggest curve-fitting
🔴 Critical Risk
Strong evidence of overfitting—strategy likely unreliable
Bayes Factor
The Bayes Factor (BF) provides an alternative perspective by comparing two hypotheses: that the strategy is genuine vs. that it's overfit.
| Bayes Factor | Evidence Strength |
|---|---|
| < 1 | Evidence favors overfitting |
| 1-3 | Anecdotal evidence for genuine strategy |
| 3-10 | Moderate evidence for genuine strategy |
| 10-30 | Strong evidence for genuine strategy |
| 30-100 | Very strong evidence for genuine strategy |
| > 100 | Extreme evidence for genuine strategy |
Tip: Use Bayes Factor alongside the overfit probability. If they disagree significantly, investigate the individual signals more closely.
Assessment Confidence
The engine also reports how confident it is in its own assessment based on your sample size.
- Low confidence: results may not be reliable
- Medium confidence: consider running more trades
- High confidence: robust assessment
Important: A "Low Risk" assessment with "Low Confidence" is not as reassuring as "Low Risk" with "High Confidence". Always consider both together.
Practical Examples
Example 1: The Simple Winner
- Moving average crossover with 2 parameters
- 350 trades
- Parameter Complexity: 8 ✅
- Sample Adequacy: 12 ✅
- Return Normality: 18 ✅
- Trade Clustering: 25 ✅
- Regime Sensitivity: 35 ⚠️
Example 2: The Over-Optimized Mess
- Machine learning model with 45 parameters
- 80 trades
- Parameter Complexity: 78 ❌
- Sample Adequacy: 72 ❌
- Return Normality: 55 ⚠️
- Trade Clustering: 68 ❌
- Regime Sensitivity: 82 ❌
Example 3: The Promising but Unproven
- Momentum strategy with 5 parameters
- 45 trades
- Parameter Complexity: 22 ✅
- Sample Adequacy: 58 ⚠️
- Return Normality: 30 ✅
- Trade Clustering: 42 ⚠️
- Regime Sensitivity: 50 (no data)
Tips and Best Practices
Before Running Analysis
- Ensure enough trades: Aim for 100+ trades minimum, 200+ for high confidence
- Count your parameters honestly: Include all tunable values, even "obvious" ones
- Run regime analysis first: This provides crucial data for the regime sensitivity signal
Reducing Overfit Risk
- Simplify your strategy: Fewer parameters = less room for overfitting
- Get more data: More trades provide more statistical power
- Use walk-forward optimization: Validates that parameters work out-of-sample
- Test across regimes: Ensure your strategy works in different market conditions
- Reserve holdout data: Never optimize on all your data—keep some for final validation
Red Flags to Watch For
- Parameter-to-trade ratio > 0.05
- Fewer than 50 trades
- Runs ratio < 0.6 (severe clustering)
- Regime CV > 150%
- Bayes Factor < 1
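The checklist above lends itself to an automated pre-flight check. The function below is illustrative (not the engine's API); the thresholds are exactly those listed.

```python
def red_flags(n_params, n_trades, runs_ratio, regime_cv, bayes_factor):
    """Return the list of red flags triggered by a backtest's summary stats.

    runs_ratio and regime_cv may be None when the underlying signal
    could not be computed.
    """
    flags = []
    if n_trades and n_params / n_trades > 0.05:
        flags.append("parameter-to-trade ratio > 0.05")
    if n_trades < 50:
        flags.append("fewer than 50 trades")
    if runs_ratio is not None and runs_ratio < 0.6:
        flags.append("runs ratio < 0.6 (severe clustering)")
    if regime_cv is not None and regime_cv > 150:
        flags.append("regime CV > 150%")
    if bayes_factor < 1:
        flags.append("Bayes Factor < 1")
    return flags
```

A clean strategy (few parameters, many trades, random-looking run structure) returns an empty list; the over-optimized strategy from Example 2 would trigger several flags.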
Quick Reference Table
| Signal | Weight | Good | Warning | Critical |
|---|---|---|---|---|
| Parameter Complexity | 25% | < 25 | 25-50 | > 50 |
| Sample Adequacy | 15% | < 20 | 20-60 | > 60 |
| Return Normality | 20% | < 20 | 20-50 | > 50 |
| Trade Clustering | 20% | < 30 | 30-60 | > 60 |
| Regime Sensitivity | 20% | < 30 | 30-60 | > 60 |
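The signal weights in the table combine into the composite score that feeds the logistic transformation described at the top of this guide. A sketch of that pipeline, assuming a logistic steepness solved from the stated 2%/50%/98% anchor points (the engine's exact constant is not documented here):

```python
import math

def composite_score(signals):
    """Weighted composite of the five signal scores (each 0-100), plus the
    resulting overfit probability. Weights follow the table above."""
    weights = {
        "parameter_complexity": 0.25,
        "sample_adequacy": 0.15,
        "return_normality": 0.20,
        "trade_clustering": 0.20,
        "regime_sensitivity": 0.20,
    }
    score = sum(weights[name] * signals[name] for name in weights)
    k = math.log(49) / 50  # assumed steepness from the anchor points
    probability = 1.0 / (1.0 + math.exp(-k * (score - 50)))
    return score, probability
```

Plugging in the signal scores from Example 1 gives a composite of 19.4 and a low overfit probability, consistent with its "Low Risk" profile.
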

| Metric | Good | Concerning | Action |
|---|---|---|---|
| Overfit Probability | < 25% | > 50% | Simplify or get more data |
| Bayes Factor | > 10 | < 1 | Investigate signals |
| Assessment Confidence | High (100+ trades) | Low (< 30) | Run longer backtest |
Putting It All Together
The Overfit Detection Engine is your first line of defense against deploying strategies that look good on paper but fail in reality. Use it as part of a comprehensive validation workflow:
1. Develop your strategy with simplicity in mind
2. Backtest with sufficient data
3. Analyze with the Overfit Detection Engine
4. Validate with walk-forward optimization
5. Deploy only strategies that pass all checks
Remember: A strategy that shows "Low Risk" with "High Confidence" and a Bayes Factor > 10 is a strong candidate for live trading. Anything less deserves more scrutiny.
