Research: Scoring Theory
Chrono Score (CS) is a soulbound on-chain forecasting reputation built on calibration mathematics. This page covers the theory behind the scoring rule choices.
The Core Problem: Measuring Forecasting Skill
A forecasting reputation system needs to answer: is this person actually skilled, or did they get lucky?
This is harder than it looks. Consider two forecasters:
- Forecaster A: Made 10 predictions, got 9 right. CS based on accuracy: 90%.
- Forecaster B: Made 200 predictions with calibrated probabilities averaging 65%, resolving YES 65% of the time. CS based on calibration: high.
Forecaster A looks better on accuracy. But Forecaster A might have made easy predictions (things that resolved 95%+ of the time). Forecaster B’s track record shows something deeper: their stated probabilities match reality.
Calibration is the right metric. Accuracy alone is a noisy signal: it depends as much on how easy the questions were as on the forecaster's skill.
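The contrast between the two forecasters can be sketched in a few lines of Python. The flat 0.95 probability assigned to Forecaster A is an assumption made for illustration (the text only says A's questions may have been easy), and `accuracy` and `calibration_table` are hypothetical helper names, not part of any Chronomancy API.

```python
from collections import defaultdict

def accuracy(forecasts):
    """Share of forecasts where the side given more than 50% actually happened."""
    hits = sum((p > 0.5) == bool(outcome) for p, outcome in forecasts)
    return hits / len(forecasts)

def calibration_table(forecasts, n_bins=10):
    """Bin (probability, outcome) pairs and compare the mean stated
    probability with the observed YES frequency in each bin."""
    bins = defaultdict(list)
    for p, outcome in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[idx].append((p, outcome))
    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_p = sum(p for p, _ in rows) / len(rows)
        yes_freq = sum(o for _, o in rows) / len(rows)
        table.append((mean_p, yes_freq, len(rows)))
    return table

# Assumed data: A states 0.95 on ten easy questions, B states 0.65 on 200.
forecaster_a = [(0.95, 1)] * 9 + [(0.95, 0)]
forecaster_b = [(0.65, 1)] * 130 + [(0.65, 0)] * 70

for name, data in (("A", forecaster_a), ("B", forecaster_b)):
    print(name, "accuracy:", accuracy(data))
    for mean_p, yes_freq, n in calibration_table(data):
        print(f"  stated {mean_p:.2f}, observed {yes_freq:.2f}, n={n}")
```

Under these assumptions A wins on accuracy, but A's single bin rests on only ten questions, while B's stated and observed frequencies agree over 200.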
Three Scoring Rules Compared
Brier Score
$$
\text{Brier} = (p - o)^2
$$
where $p$ is the stated probability and $o$ is the outcome (0 or 1).
Range: 0 (perfect) to 1 (worst). Lower is better. For a binary outcome:
- State 0.7, outcome is YES (1): Brier = (0.7 - 1)² = 0.09
- State 0.7, outcome is NO (0): Brier = (0.7 - 0)² = 0.49
Properties: Bounded, intuitive, commonly reported. The Brier score is strictly proper, but its quadratic penalty is capped at 1, so a confidently wrong forecast at an extreme probability costs far less than it does under the log score.
Best for: Public display. Human-readable. Superforecaster benchmarks (ForecastBench) use Brier.
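As a minimal sketch, the two worked examples above can be reproduced directly. The function names `brier` and `mean_brier` are illustrative, and averaging over forecasts is the usual way a track record is summarized rather than something this page specifies.

```python
def brier(p, outcome):
    """Brier score for one binary forecast: squared error between the
    stated probability p and the outcome (1 for YES, 0 for NO)."""
    return (p - outcome) ** 2

def mean_brier(forecasts):
    """Average Brier score over (probability, outcome) pairs."""
    return sum(brier(p, o) for p, o in forecasts) / len(forecasts)

print(round(brier(0.7, 1), 2))  # stated 0.7, resolved YES
print(round(brier(0.7, 0), 2))  # stated 0.7, resolved NO

# A forecaster who always says 50% scores 0.25 regardless of outcomes,
# which makes 0.25 a natural "no information" reference point.
print(mean_brier([(0.5, 1), (0.5, 0), (0.5, 1)]))
```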
Log Score (Proper Logarithmic Scoring Rule)
$$
\text{LogScore} =
\begin{cases}
\log(p) & \text{if the outcome is YES} \\
\log(1 - p) & \text{if the outcome is NO}
\end{cases}
$$
Unbounded below (stating 0% when the outcome is YES gives −∞). Proper: the theoretically optimal strategy is to state your true belief.
Critical property: The log score’s strict properness means a forecaster maximizes expected score only by stating their true probability. Brier score also has this property, but log score penalizes overconfident wrong predictions more severely — relevant when stakes are financial.
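The properness claim can be checked with a short derivation. This is a sketch in which $q$ denotes the forecaster's true belief that the outcome is YES (a symbol introduced here, not used elsewhere on this page) and $p$ is the stated probability:

$$
\mathbb{E}[\text{LogScore}] = q \log p + (1 - q)\log(1 - p),
\qquad
\frac{d}{dp}\,\mathbb{E}[\text{LogScore}] = \frac{q}{p} - \frac{1 - q}{1 - p} = 0
\;\Longrightarrow\; p = q
$$

$$
\mathbb{E}[\text{Brier}] = q\,(p - 1)^2 + (1 - q)\,p^2,
\qquad
\frac{d}{dp}\,\mathbb{E}[\text{Brier}] = 2(p - q) = 0
\;\Longrightarrow\; p = q
$$

Both expected scores have their optimum at $p = q$ (a maximum for the log score, a minimum for Brier). The difference is in the tails: when the outcome is YES and $p \to 0$, the log score diverges to $-\infty$ while the Brier score only approaches 1.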
Modified log score with floor:
$$
\text{CS}_{\text{raw}} =
\begin{cases}
\max(\log(p),\, C) & \text{if YES} \\
\max(\log(1 - p),\, C) & \text{if NO}
\end{cases}
$$
where $C$ is the floor constant (Chronomancy uses $C = -3.0$, approximately $\log(0.05)$).
The floor prevents a single prediction stated at 1% that resolves YES from catastrophically destroying a reputation: it scores −3.0 instead of log(0.01) ≈ −4.6. This makes the system resistant to a specific manipulation: deliberately stating near-0% probabilities as a hedge.
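A minimal sketch of the floored log score follows. It assumes natural logarithms and a negative floor of −3.0 (log scores are never positive, and log(0.05) ≈ −2.996); the name `floored_log_score` is illustrative.

```python
import math

FLOOR = -3.0  # floor constant C, roughly log(0.05); assumed negative

def floored_log_score(p, outcome, floor=FLOOR):
    """Log score of one binary forecast, clipped from below at `floor`.

    p is the stated probability of YES; outcome is 1 (YES) or 0 (NO).
    """
    prob_of_what_happened = p if outcome else 1.0 - p
    if prob_of_what_happened <= 0.0:
        return floor  # log(0) is -inf, so the floor applies directly
    return max(math.log(prob_of_what_happened), floor)

print(floored_log_score(0.70, 1))  # ordinary case, above the floor
print(floored_log_score(0.01, 1))  # log(0.01) is about -4.6, clipped to the floor
print(floored_log_score(0.00, 1))  # would be -inf without the floor
```

One design consequence worth noting: below the floor (stated probabilities under about 5% for the side that occurred) the score no longer distinguishes between forecasts, so the rule is only strictly proper above that threshold.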
Elo / Glicko-2
Relative ranking systems. Forecaster A beats Forecaster B if A is closer to the right answer on the same market.
Not used as CS base: Elo measures relative performance, not absolute calibration. A forecaster can have high Elo by being less wrong than their opponents while being systematically miscalibrated. Chronomancy uses Elo separately for competitive ranking (the Elo system page) — it does not conflate the two.
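To illustrate why a relative ranking says nothing about calibration, here is a sketch of a head-to-head comparison followed by the standard Elo update. The K-factor of 32, the starting rating of 1500, and the function names are assumptions for illustration; this page does not specify how Chronomancy's Elo system is parameterized.

```python
def head_to_head(p_a, p_b, outcome):
    """Return A's result against B on one market: 1.0 win, 0.5 draw, 0.0 loss.
    The forecaster whose stated probability is closer to the outcome wins."""
    err_a, err_b = abs(p_a - outcome), abs(p_b - outcome)
    if err_a < err_b:
        return 1.0
    if err_a > err_b:
        return 0.0
    return 0.5

def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Return the new (rating_a, rating_b) after one comparison."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# A says 70%, B says 60%, the market resolves YES: A wins the comparison
# whether or not either forecaster is well calibrated overall.
result = head_to_head(0.7, 0.6, 1)
print(elo_update(1500, 1500, result))
```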
Chrono Score Calculation
Base Score
CS uses a modified log score with floor as the base calculation, displayed publicly as a Brier-score equivalent for readability.
$$
\text{CS}_{\text{raw}}(\text{prediction}) =
\begin{cases}
\max(\log(p),\, -3.0) & \text{if YES} \\
\max(\log(1 - p),\, -3.0) & \text{if NO}
\end{cases}
$$
Bayesian Shrinkage
New forecasters have small sample sizes. Raw scores from 10 predictions are highly variable. Bayesian shrinkage (κ-blending) pulls a forecaster’s score toward the population mean:
$$
\text{CS}_{\text{adjusted}} = \frac{n \times \text{CS}_{\text{raw}} + \kappa \times \text{CS}_{\text{mean}}}{n + \kappa}
$$
where $n$ is the number of resolved predictions and $\kappa$ is the shrinkage constant.
Chronomancy uses κ = 150, calibrated so that a forecaster needs approximately 150 predictions before their personal CS dominates the population prior: at n = 150 the personal score and the prior carry equal weight. Below that, the system is appropriately uncertain.
This prevents the cold-start manipulation: making 5 perfect predictions to claim a 100% score.
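The effect on a short track record can be sketched as follows. The example treats `cs_raw` as the forecaster's mean raw score over their `n` resolved predictions and uses a population mean of −0.6; both the interpretation and the number are assumptions for illustration.

```python
KAPPA = 150  # shrinkage constant from the text

def shrink(cs_raw, n, cs_mean, kappa=KAPPA):
    """Blend a forecaster's mean raw score with the population mean.

    With n resolved predictions the personal score gets weight n / (n + kappa)
    and the population mean gets weight kappa / (n + kappa).
    """
    return (n * cs_raw + kappa * cs_mean) / (n + kappa)

POPULATION_MEAN = -0.6  # hypothetical value, for illustration only

# Five perfect predictions (raw log score 0.0) barely move the estimate.
print(shrink(0.0, 5, POPULATION_MEAN))
# At n = kappa the personal score and the prior are weighted equally.
print(shrink(0.0, 150, POPULATION_MEAN))
# With a long track record the personal score dominates.
print(shrink(0.0, 1500, POPULATION_MEAN))
```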
Time Decay
Historical predictions decay in relevance. Market dynamics change; a forecaster who was excellent in 2020 may be miscalibrated in 2026.
$$
\text{Weight}(t) = \lambda^{t}
$$
where $t$ is the age of the prediction in months and $\lambda = 0.95$ is the monthly decay factor.
At λ = 0.95:
- 6 months ago: weight = 0.74
- 12 months ago: weight = 0.54
- 24 months ago: weight = 0.29
- 36 months ago: weight = 0.16
This gives each prediction's weight a half-life of roughly 13.5 months (0.95^13.5 ≈ 0.5), so recent performance matters more than historical performance. Forecasters who haven’t made predictions recently see their CS drift toward the population mean via a separate activity adjustment.
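The weights in the list above, and the half-life implied by λ alone, can be computed with a short sketch. Using the weights in a weighted mean of per-prediction scores is an assumption about how decay is applied; the page does not spell out the aggregation, and the function names are illustrative.

```python
import math

LAMBDA = 0.95  # monthly decay factor from the text

def decay_weight(months_ago, lam=LAMBDA):
    """Weight of a prediction that resolved `months_ago` months in the past."""
    return lam ** months_ago

def half_life_months(lam=LAMBDA):
    """Age at which a prediction's weight has fallen to 0.5."""
    return math.log(0.5) / math.log(lam)

def decayed_mean(scored_predictions, lam=LAMBDA):
    """Decay-weighted mean of (score, months_ago) pairs (assumed aggregation)."""
    total_weight = sum(decay_weight(m, lam) for _, m in scored_predictions)
    weighted_sum = sum(s * decay_weight(m, lam) for s, m in scored_predictions)
    return weighted_sum / total_weight

for months in (6, 12, 24, 36):
    print(months, round(decay_weight(months), 2))

print(round(half_life_months(), 1))  # months until a weight halves
```

With λ = 0.95 the half-life from this formula comes out near 13.5 months; any longer effective figure would have to come from other parts of the system.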
Glicko-2 Uncertainty
Every CS value has an uncertainty parameter (RD, Rating Deviation, from the Glicko-2 system). A forecaster with 10 predictions has high RD, so their CS is imprecise. A forecaster with 500 predictions has low RD, so their CS is reliable.
Products that rely on CS (FF vault pricing, REWIND premiums) use both the CS value and its RD:
- High CS + low RD → strong confidence, best pricing
- High CS + high RD → uncertain, priced conservatively until more predictions accumulate
- Low CS + any RD → full price, no discount
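The three cases above amount to a small decision rule. The sketch below makes it explicit; the cutoffs (`cs_threshold=7000` on the 0–10000 scale, `rd_threshold=100`) and the tier labels are hypothetical placeholders, since this page does not state where "high CS" or "low RD" begin.

```python
def pricing_tier(cs, rd, cs_threshold=7000, rd_threshold=100):
    """Map a (CS, RD) pair to one of the three pricing outcomes.

    cs_threshold and rd_threshold are illustrative cutoffs, not
    documented Chronomancy parameters.
    """
    if cs < cs_threshold:
        return "full_price"      # low CS: no discount, whatever the RD
    if rd <= rd_threshold:
        return "best_pricing"    # high CS and low RD: strong confidence
    return "conservative"        # high CS but still uncertain

print(pricing_tier(cs=8200, rd=60))
print(pricing_tier(cs=8200, rd=250))
print(pricing_tier(cs=4000, rd=60))
```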
Empirical Alpha Evidence
The scoring system is only useful if forecasting skill actually exists and is predictable. The research evidence:
| Source | Finding |
|---|---|
| Good Judgment Project (Tetlock) | Superforecasters (top 2%) beat crowd by 7–12 percentage points, beat intelligence analysts by 25–30% |
| Metaculus pro forecasters | Top 1% estimated at 4–8pp above community aggregate |
| ForecastBench (ICLR 2025) | Superforecaster aggregate Brier 0.081 vs. crowd 0.149 |
| Polymarket on-chain analysis | Top 0.51% of wallets had >$1K cumulative P&L; 70% of volume from 1% of traders |
Alpha distribution across forecasters:
| Percentile | Alpha range | Description |
|---|---|---|
| Top 0.1% | 12–18pp | Elite superforecasters; extremely rare |
| Top 1% | 7–12pp | Consistent edge; Chronomancy’s Premium tier target |
| Top 10% | 3–6pp | Meaningful edge; Standard tier |
| Top 25% | 1–3pp | Marginal edge |
| Median | ~0pp | No edge |
| Bottom 50% | Negative | Systematically miscalibrated |
Persistence: Alpha persists with a 2–4 year half-life (lambda ~0.62/year for skill persistence). Skill is real but degrades without continued practice and market exposure.
Domain variation: Skill is domain-specific. The same forecaster who has a 12pp edge on geopolitical events might have a 2pp edge on US elections (which are highly efficient markets) and no edge at all on crypto price movements.
Cold-Start Protocol
New users have no track record. The cold-start protocol moves them through phases:
| Phase | Trigger | CS Status |
|---|---|---|
| Precognition | 0–30 predictions | Unscored (no CS displayed) |
| Qualification | 30–150 predictions | CS shown with high RD; not yet gateable |
| Graduated | 150+ predictions | Full CS with RD; gates FF pricing |
| Full Access | 300+ predictions + time filter | Low RD; gates REWIND premium tiers |
The minimum track record of 150 predictions is calibrated from empirical research: below 50–100 predictions, individual CS estimates are too noisy to be useful. 150 provides a meaningful sample while remaining achievable within 3–6 months of regular forecasting activity.
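The phase table can be read as a simple lookup on prediction count. In this sketch the boundary counts (30, 150, 300) are assigned to the higher phase, and the "time filter" is reduced to a boolean flag because the page does not define it; both choices, and the function name, are assumptions.

```python
def cold_start_phase(n_predictions, passes_time_filter=False):
    """Return the cold-start phase for a given number of resolved predictions.

    Boundary values are assumed to belong to the higher phase, and
    `passes_time_filter` stands in for the unspecified time filter.
    """
    if n_predictions < 0:
        raise ValueError("n_predictions must be non-negative")
    if n_predictions < 30:
        return "Precognition"    # unscored, no CS displayed
    if n_predictions < 150:
        return "Qualification"   # CS shown with high RD, not yet gateable
    if n_predictions >= 300 and passes_time_filter:
        return "Full Access"     # low RD, gates REWIND premium tiers
    return "Graduated"           # full CS with RD, gates FF pricing

for n in (10, 80, 200, 400):
    print(n, cold_start_phase(n, passes_time_filter=True))
```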
EAS Integration
Chrono Score is attested on-chain using the Ethereum Attestation Service (EAS), which is deployed on Ethereum mainnet and major L2s.
Each CS update creates an on-chain attestation:

    attestation {
      subject: wallet_address
      schema: chrono_score_v1
      data: {
        cs_value: uint16,     // scaled 0-10000
        rd: uint16,           // rating deviation
        predictions: uint32,  // prediction count
        domains: bytes32[],   // domain tags
        timestamp: uint64
      }
      signer: chronomancy_attester
    }

This makes CS:
- Verifiable by any contract or dapp without trusting Chronomancy’s API
- Composable — other protocols can use CS attestations natively
- Non-custodial — the user owns their score’s attestation; Chronomancy can update but not delete
The EAS design is the foundation for CS becoming a cross-platform forecasting standard rather than a Chronomancy-specific metric.
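A client that prepares the `data` payload might validate field ranges before submitting it. The sketch below mirrors the field names and integer widths of the schema shown earlier; the class name `ChronoScoreData`, the helper `domain_tag`, and the choice to encode a domain as zero-padded UTF-8 are all assumptions for illustration, and nothing here calls the EAS contracts or SDK.

```python
from dataclasses import dataclass
from typing import Tuple

UINT16_MAX = 2**16 - 1
UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1

def domain_tag(name: str) -> bytes:
    """Encode a domain name as a 32-byte tag (assumed: zero-padded UTF-8)."""
    raw = name.encode("utf-8")
    if len(raw) > 32:
        raise ValueError("domain tag must fit in 32 bytes")
    return raw.ljust(32, b"\x00")

@dataclass(frozen=True)
class ChronoScoreData:
    cs_value: int               # uint16, scaled 0-10000
    rd: int                     # uint16, rating deviation
    predictions: int            # uint32, prediction count
    domains: Tuple[bytes, ...]  # each entry is one bytes32 tag
    timestamp: int              # uint64

    def __post_init__(self):
        if not 0 <= self.cs_value <= 10_000:
            raise ValueError("cs_value must be in 0..10000")
        if not 0 <= self.rd <= UINT16_MAX:
            raise ValueError("rd must fit in uint16")
        if not 0 <= self.predictions <= UINT32_MAX:
            raise ValueError("predictions must fit in uint32")
        if not 0 <= self.timestamp <= UINT64_MAX:
            raise ValueError("timestamp must fit in uint64")
        if any(len(tag) != 32 for tag in self.domains):
            raise ValueError("every domain tag must be exactly 32 bytes")

payload = ChronoScoreData(
    cs_value=7421,
    rd=85,
    predictions=212,
    domains=(domain_tag("geopolitics"),),
    timestamp=1_700_000_000,
)
print(payload.cs_value / 10_000)  # score as a fraction of the 0-10000 scale
```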
Related:
- Chrono Score — the scoring system in product context
- Elo System — the separate competitive ranking system
- Retail Failure — why reputation matters for retention