🎯 Overfit Detection Guide
Understand how the engine evaluates whether your strategy has genuine predictive power or is just curve-fitting historical data
When you backtest a trading strategy, you're essentially asking: "Would this strategy have worked in the past?" But there's a trap: with enough parameters and tweaking, you can make almost any strategy look profitable on historical data, even if it has no real predictive power. This is called overfitting. The Overfit Detection Engine helps you distinguish genuine strategies from illusory performance that will likely disappear in live trading.
How the Engine Works
The engine calculates an Overfit Probability (0-100%) by combining five independent signals, each measuring a different aspect of potential overfitting.
The Logistic Transformation
The engine uses a logistic (S-curve) function to convert composite scores to probabilities:
- Composite score of 0 → ~2% overfit probability
- Composite score of 50 → 50% overfit probability
- Composite score of 100 → ~98% overfit probability
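The three anchor points above pin down a logistic curve. A minimal sketch of that mapping, assuming a steepness constant chosen so the endpoints land at exactly 2% and 98% (the engine's actual constant may differ):

```python
import math

def overfit_probability(composite_score: float) -> float:
    """Map a 0-100 composite score to an overfit probability via a logistic curve.

    The steepness k is a hypothetical choice solved from the anchor points:
    1 / (1 + e^(50k)) = 0.02  =>  k = ln(49) / 50.
    """
    k = math.log(49) / 50
    return 1.0 / (1.0 + math.exp(-k * (composite_score - 50)))
```

With this constant, a score of 50 maps to exactly 50%, and scores of 0 and 100 map to 2% and 98% respectively.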
The Five Detection Signals
1. Parameter Complexity
The relationship between how many adjustable parameters your strategy has versus how many trades it generated
More parameters = more ways to accidentally fit noise. A strategy with 20 parameters and only 50 trades is almost certainly overfit. A strategy with 3 parameters and 500 trades is much more likely to be genuine.
Parameter-to-Trade Ratio Interpretation
Rule of thumb: Aim for at least 50 trades per parameter. If you have 10 parameters, aim for 500+ trades.
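The rule of thumb above can be expressed as a simple check. The band boundaries here are assumptions: "ok" follows the 50-trades-per-parameter rule (ratio ≤ 0.02), and "critical" follows the red-flag threshold of 0.05 listed later in this guide; the engine's internal scoring is more granular.

```python
def parameter_trade_check(n_params: int, n_trades: int) -> str:
    """Illustrative parameter-to-trade ratio check (not the engine's API)."""
    if n_trades == 0:
        return "no trades"
    ratio = n_params / n_trades
    if ratio <= 0.02:    # at least 50 trades per parameter
        return "ok"
    if ratio <= 0.05:    # 20-50 trades per parameter
        return "warning"
    return "critical"    # fewer than 20 trades per parameter
```

For example, 3 parameters over 500 trades (ratio 0.006) passes, while 20 parameters over 50 trades (ratio 0.4) is firmly in red-flag territory.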
2. Sample Adequacy
Whether you have enough trades for statistically reliable conclusions
With too few trades, even random strategies can look profitable by chance. More trades = more confidence that results aren't just luck.
| Trades | Confidence Level | Score Range |
|---|---|---|
| < 20 | Critical | 90-100 |
| 20-50 | Low | 40-70 |
| 50-100 | Medium | 20-40 |
| 100-200 | High | 5-20 |
| > 200 | Very High | < 5 |
Warning: Results from fewer than 20 trades are essentially meaningless. The engine will flag these as "Critical" regardless of other signals.
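The confidence bands in the table translate directly into code. How the engine handles the exact boundary values (e.g. a count of exactly 50 or 100) is not specified, so the cutoffs below are an assumption:

```python
def sample_confidence(n_trades: int) -> str:
    """Map a trade count to the confidence bands from the table above."""
    if n_trades < 20:
        return "Critical"
    if n_trades < 50:
        return "Low"
    if n_trades < 100:
        return "Medium"
    if n_trades <= 200:
        return "High"
    return "Very High"
```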
3. Return Normality
Whether your trade returns follow a normal (bell-curve) distribution
Genuine trading strategies typically produce returns that are roughly normally distributed. Highly non-normal distributions (extreme skew, fat tails) can indicate data issues, outlier-dependent performance, or curve-fitting to specific events.
Shapiro-Wilk W Statistic
Skewness
Measures asymmetry of the distribution. Positive skew means more extreme gains than losses; negative skew means more extreme losses than gains.
Kurtosis
Measures "tailedness". Positive (fat tails) means more extreme outcomes than normal; negative (thin tails) means fewer extreme outcomes.
This signal requires at least 5 trades to calculate. Some non-normality is expected in trading returns—focus on extreme cases (W < 0.80).
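Skewness and excess kurtosis can be computed from raw trade returns with the standard moment formulas. The sketch below uses the population (biased) estimators; the engine may use bias-corrected versions, and the Shapiro-Wilk W itself is typically taken from a stats library such as `scipy.stats.shapiro` rather than computed by hand.

```python
def moments(returns):
    """Sample skewness and excess kurtosis of a list of trade returns.

    Excess kurtosis is reported relative to the normal distribution,
    so a perfectly normal sample would score 0.
    """
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n  # variance
    m3 = sum((r - mean) ** 3 for r in returns) / n  # third central moment
    m4 = sum((r - mean) ** 4 for r in returns) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0  # subtract 3: excess kurtosis
    return skew, kurt
```

A symmetric sample yields zero skew; a flat (thin-tailed) sample yields negative excess kurtosis.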
4. Trade Clustering
Whether winning and losing trades are randomly distributed or clustered together
In a genuine strategy, wins and losses should be relatively random. If all your wins happened in one period and all losses in another, the strategy may have been fit to specific market conditions.
Wald-Wolfowitz Runs Test
A "run" is a sequence of consecutive wins or consecutive losses. The engine compares observed runs to expected runs under randomness.
Warning: Severe clustering (runs ratio < 0.5) is a strong overfit indicator. Your strategy may only work in specific market conditions.
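The runs ratio can be sketched directly from the Wald-Wolfowitz formula: expected runs under randomness is 1 + 2·n₁·n₂/(n₁+n₂), where n₁ and n₂ are the win and loss counts. This is the textbook statistic; the engine's exact handling of edge cases is assumed.

```python
def runs_ratio(wins):
    """Observed-to-expected runs for a win/loss sequence (Wald-Wolfowitz).

    `wins` is a sequence of booleans (True = winning trade). A ratio well
    below 1 means wins and losses cluster more than randomness predicts.
    """
    n1 = sum(wins)
    n2 = len(wins) - n1
    if n1 == 0 or n2 == 0:
        return None  # test undefined with only wins or only losses
    observed = 1 + sum(1 for a, b in zip(wins, wins[1:]) if a != b)
    expected = 1 + 2 * n1 * n2 / (n1 + n2)
    return observed / expected
```

A perfectly alternating sequence scores above 1 (more runs than expected), while a fully clustered sequence like WWWLLL scores 0.5, right at the severe-clustering threshold.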
5. Regime Sensitivity
How consistently your strategy performs across different market regimes (bull, bear, sideways, volatile)
An overfit strategy often performs brilliantly in one regime and terribly in others. A robust strategy should show reasonable consistency across different market conditions.
Coefficient of Variation (CV)
CV = (Standard Deviation of regime returns) / |Mean of regime returns| × 100%
This signal needs at least 2 regimes with 5+ trades each. If regime data isn't available, consider running regime analysis first.
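The CV formula and the minimum-data rule above can be sketched together. The choice of the population standard deviation over the mean returns per regime is an assumption; the engine's exact estimator may differ.

```python
def regime_cv(regime_returns):
    """Coefficient of variation of mean returns across regimes, in percent.

    `regime_returns` maps regime name -> list of trade returns. Regimes with
    fewer than 5 trades are ignored, and at least 2 qualifying regimes are
    required, per the rule above.
    """
    means = [sum(r) / len(r) for r in regime_returns.values() if len(r) >= 5]
    if len(means) < 2:
        return None  # not enough regime data
    mean_of_means = sum(means) / len(means)
    if mean_of_means == 0:
        return float("inf")
    sd = (sum((m - mean_of_means) ** 2 for m in means) / len(means)) ** 0.5
    return sd / abs(mean_of_means) * 100
```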
Risk Level Thresholds
The final overfit probability maps to four risk levels, each with specific recommended actions.
🟢 Low Risk
Strategy shows strong signs of genuine predictive power
🔵 Moderate Risk
Some overfit signals present, proceed with caution
🟡 High Risk
Multiple overfit indicators suggest curve-fitting
🔴 Critical Risk
Strong evidence of overfitting—strategy likely unreliable
Bayes Factor
The Bayes Factor (BF) provides an alternative perspective by comparing two hypotheses: that the strategy is genuine vs. that it's overfit.
| Bayes Factor | Evidence Strength |
|---|---|
| < 1 | Evidence favors overfitting |
| 1-3 | Anecdotal evidence for genuine strategy |
| 3-10 | Moderate evidence for genuine strategy |
| 10-30 | Strong evidence for genuine strategy |
| 30-100 | Very strong evidence for genuine strategy |
| > 100 | Extreme evidence for genuine strategy |
Tip: Use Bayes Factor alongside the overfit probability. If they disagree significantly, investigate the individual signals more closely.
Assessment Confidence
The engine also reports how confident it is in its own assessment based on your sample size.
- Low confidence: results may not be reliable
- Medium confidence: consider running more trades
- High confidence: robust assessment
Important: A "Low Risk" assessment with "Low Confidence" is not as reassuring as "Low Risk" with "High Confidence". Always consider both together.
Practical Examples
Example 1: The Simple Winner
- Moving average crossover with 2 parameters
- 350 trades
- Parameter Complexity: 8 ✅
- Sample Adequacy: 12 ✅
- Return Normality: 18 ✅
- Trade Clustering: 25 ✅
- Regime Sensitivity: 35 ⚠️
Example 2: The Over-Optimized Mess
- Machine learning model with 45 parameters
- 80 trades
- Parameter Complexity: 78 ❌
- Sample Adequacy: 72 ❌
- Return Normality: 55 ⚠️
- Trade Clustering: 68 ❌
- Regime Sensitivity: 82 ❌
Example 3: The Promising but Unproven
- Momentum strategy with 5 parameters
- 45 trades
- Parameter Complexity: 22 ✅
- Sample Adequacy: 58 ⚠️
- Return Normality: 30 ✅
- Trade Clustering: 42 ⚠️
- Regime Sensitivity: 50 (no data)
Tips and Best Practices
Before Running Analysis
- Ensure enough trades: Aim for 100+ trades minimum, 200+ for high confidence
- Count your parameters honestly: Include all tunable values, even "obvious" ones
- Run regime analysis first: This provides crucial data for the regime sensitivity signal
Reducing Overfit Risk
- Simplify your strategy: Fewer parameters = less room for overfitting
- Get more data: More trades provide more statistical power
- Use walk-forward optimization: Validates that parameters work out-of-sample
- Test across regimes: Ensure your strategy works in different market conditions
- Reserve holdout data: Never optimize on all your data—keep some for final validation
Red Flags to Watch For
- Parameter-to-trade ratio > 0.05
- Fewer than 50 trades
- Runs ratio < 0.6 (severe clustering)
- Regime CV > 150%
- Bayes Factor < 1
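The checklist above lends itself to an automated pre-flight check. The function below is illustrative (not the engine's API); the thresholds are exactly those listed.

```python
def red_flags(n_params, n_trades, runs_ratio, regime_cv, bayes_factor):
    """Return the list of red flags triggered by a backtest's summary stats.

    runs_ratio and regime_cv may be None when the underlying signal
    could not be computed.
    """
    flags = []
    if n_trades and n_params / n_trades > 0.05:
        flags.append("parameter-to-trade ratio > 0.05")
    if n_trades < 50:
        flags.append("fewer than 50 trades")
    if runs_ratio is not None and runs_ratio < 0.6:
        flags.append("runs ratio < 0.6 (severe clustering)")
    if regime_cv is not None and regime_cv > 150:
        flags.append("regime CV > 150%")
    if bayes_factor < 1:
        flags.append("Bayes Factor < 1")
    return flags
```

A clean strategy (few parameters, many trades, random-looking run structure) returns an empty list; the over-optimized strategy from Example 2 would trigger several flags.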
Quick Reference Table
| Signal | Weight | Good | Warning | Critical |
|---|---|---|---|---|
| Parameter Complexity | 25% | < 25 | 25-50 | > 50 |
| Sample Adequacy | 15% | < 20 | 20-60 | > 60 |
| Return Normality | 20% | < 20 | 20-50 | > 50 |
| Trade Clustering | 20% | < 30 | 30-60 | > 60 |
| Regime Sensitivity | 20% | < 30 | 30-60 | > 60 |
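The signal weights in the table combine into the composite score that feeds the logistic transformation described at the top of this guide. A sketch of that pipeline, assuming a logistic steepness solved from the stated 2%/50%/98% anchor points (the engine's exact constant is not documented here):

```python
import math

def composite_score(signals):
    """Weighted composite of the five signal scores (each 0-100), plus the
    resulting overfit probability. Weights follow the table above."""
    weights = {
        "parameter_complexity": 0.25,
        "sample_adequacy": 0.15,
        "return_normality": 0.20,
        "trade_clustering": 0.20,
        "regime_sensitivity": 0.20,
    }
    score = sum(weights[name] * signals[name] for name in weights)
    k = math.log(49) / 50  # assumed steepness from the anchor points
    probability = 1.0 / (1.0 + math.exp(-k * (score - 50)))
    return score, probability
```

Plugging in the signal scores from Example 1 gives a composite of 19.4 and a low overfit probability, consistent with its "Low Risk" profile.
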

| Metric | Good | Concerning | Action |
|---|---|---|---|
| Overfit Probability | < 25% | > 50% | Simplify or get more data |
| Bayes Factor | > 10 | < 1 | Investigate signals |
| Assessment Confidence | High (100+ trades) | Low (< 30) | Run longer backtest |
Putting It All Together
The Overfit Detection Engine is your first line of defense against deploying strategies that look good on paper but fail in reality. Use it as part of a comprehensive validation workflow:
1. Develop your strategy with simplicity in mind
2. Backtest with sufficient data
3. Analyze with the Overfit Detection Engine
4. Validate with walk-forward optimization
5. Deploy only strategies that pass all checks
Remember: A strategy that shows "Low Risk" with "High Confidence" and a Bayes Factor > 10 is a strong candidate for live trading. Anything less deserves more scrutiny.
