
Research: Scoring Theory

Chrono Score (CS) is a soulbound on-chain forecasting reputation built on calibration mathematics. This page covers the theory behind the scoring rule choices.


The Core Problem: Measuring Forecasting Skill


A forecasting reputation system needs to answer: is this person actually skilled, or did they get lucky?

This is harder than it looks. Consider two forecasters:

  • Forecaster A: Made 10 predictions, got 9 right. CS based on accuracy: 90%.
  • Forecaster B: Made 200 predictions with calibrated probabilities averaging 65%, resolving YES 65% of the time. CS based on calibration: high.

Forecaster A looks better on accuracy. But Forecaster A might have made easy predictions (things that resolved 95%+ of the time). Forecaster B’s track record shows something deeper: their stated probabilities match reality.

Calibration is the right metric. Accuracy alone is a noisy signal.


Brier Score (Quadratic Scoring Rule)

Brier = (p - o)²
where p = stated probability, o = outcome (0 or 1)

Range: 0 (perfect) to 1 (worst). Lower is better. For a binary outcome:

  • State 0.7, outcome is YES (1): Brier = (0.7 - 1)² = 0.09
  • State 0.7, outcome is NO (0): Brier = (0.7 - 0)² = 0.49

Properties: Bounded, intuitive, commonly reported. Quadratic penalty — not as sharply incentive-compatible as log score for extreme probabilities.

Best for: Public display. Human-readable. Superforecaster benchmarks (ForecastBench) use Brier.
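The Brier formula translates directly into code. A minimal sketch, reproducing the two worked examples above:

```python
def brier(p: float, outcome: int) -> float:
    """Quadratic score for a binary prediction: (p - o)^2. Lower is better."""
    return (p - outcome) ** 2

# The two worked examples from the text:
print(brier(0.7, 1))  # stated 0.7, resolved YES -> 0.09
print(brier(0.7, 0))  # stated 0.7, resolved NO  -> 0.49
```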

Log Score (Proper Logarithmic Scoring Rule)

LogScore = log(p) if outcome YES, log(1-p) if outcome NO

Unbounded below (stating 0% when outcome is YES → −∞). Proper — the theoretically optimal strategy is to state your true belief.

Critical property: The log score’s strict properness means a forecaster maximizes expected score only by stating their true probability. Brier score also has this property, but log score penalizes overconfident wrong predictions more severely — relevant when stakes are financial.

Modified log score with floor:

CS_raw = max(log(p), C) if YES; max(log(1-p), C) if NO
where C = floor constant (Chronomancy uses C = -3.0, approximately ln(0.05))

The floor prevents a single 1%-stated prediction that resolves YES from catastrophically destroying a reputation. This makes the system resistant to a specific manipulation: deliberately stating near-0% probabilities as a hedge.
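A sketch of the floored rule, taking the floor as -3.0 (log probabilities are non-positive, so the floor must be negative; ln(0.05) ≈ -3.0, assuming natural log):

```python
import math

FLOOR = -3.0  # approximately ln(0.05); assumed natural log

def cs_raw(p: float, outcome: int) -> float:
    """Floored log score: max(ln(p), FLOOR) if YES, max(ln(1-p), FLOOR) if NO."""
    prob_assigned = p if outcome == 1 else 1.0 - p
    if prob_assigned <= 0.0:
        return FLOOR  # log(0) = -inf, clamped by the floor
    return max(math.log(prob_assigned), FLOOR)

# A 1%-stated prediction that resolves YES hits the floor instead of -4.6:
print(cs_raw(0.01, 1))  # -3.0
```

Without the floor, that single prediction would score ln(0.01) ≈ -4.6, roughly thirteen well-calibrated 70% predictions' worth of damage.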

Elo (Relative Ranking)

Elo-style systems produce relative rankings: Forecaster A beats Forecaster B if A is closer to the right answer on the same market.

Not used as CS base: Elo measures relative performance, not absolute calibration. A forecaster can have high Elo by being less wrong than their opponents while being systematically miscalibrated. Chronomancy uses Elo separately for competitive ranking (the Elo system page) — it does not conflate the two.


CS uses a modified log score with a floor as the base calculation, displayed publicly as a Brier-score equivalent for readability.

CS_raw(prediction) = max(log(p), -3.0) if YES
                   = max(log(1-p), -3.0) if NO

New forecasters have small sample sizes. Raw scores from 10 predictions are highly variable. Bayesian shrinkage (κ-blending) pulls a forecaster’s score toward the population mean:

CS_adjusted = (n × CS_raw + κ × CS_mean) / (n + κ)
where n = number of resolved predictions, κ = shrinkage constant

Chronomancy uses κ = 150 — calibrated so that a forecaster needs approximately 150 predictions before their personal CS dominates the population prior. Below that, the system is appropriately uncertain.

This prevents the cold-start manipulation: making 5 perfect predictions to claim a 100% score.
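The κ-blend can be sketched directly from the formula. The population mean of -0.7 below is illustrative, not a documented value:

```python
def cs_adjusted(cs_raw_mean: float, n: int,
                population_mean: float, kappa: float = 150.0) -> float:
    """Bayesian shrinkage: blend personal score toward the population prior.
    Personal score gets weight n, the prior gets weight kappa."""
    return (n * cs_raw_mean + kappa * population_mean) / (n + kappa)

# Five "perfect" predictions barely move the score off an illustrative
# population mean of -0.7:
print(cs_adjusted(-0.05, 5, -0.7))    # ~ -0.679, still near the prior
# At n = 150 the personal score and the prior carry equal weight:
print(cs_adjusted(-0.05, 150, -0.7))  # -0.375, the midpoint
```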

Historical predictions decay in relevance. Market dynamics change; a forecaster who was excellent in 2020 may be miscalibrated in 2026.

Weight(t) = λ^(months_ago)
where λ = 0.95 (monthly decay factor)

At λ = 0.95:

  • 6 months ago: weight = 0.74
  • 12 months ago: weight = 0.54
  • 24 months ago: weight = 0.29
  • 36 months ago: weight = 0.16

This gives CS a half-life of roughly 13.5 months (0.95^13.5 ≈ 0.5): recent performance matters more than historical performance. Forecasters who haven’t made predictions recently see their CS drift toward population mean via a separate activity adjustment.
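The decay weights and the implied half-life follow directly from λ:

```python
import math

LAM = 0.95  # monthly decay factor

def weight(months_ago: float, lam: float = LAM) -> float:
    """Relevance weight of a prediction resolved `months_ago` months ago."""
    return lam ** months_ago

# The weights from the list above:
for m in (6, 12, 24, 36):
    print(m, round(weight(m), 2))

# Half-life implied by the decay factor: solve lam**t == 0.5 for t.
half_life = math.log(0.5) / math.log(LAM)  # ~13.5 months
```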

Every CS value has an uncertainty parameter (RD, Rating Deviation, from the Glicko-2 system). A forecaster with 10 predictions has high RD — their CS is imprecise. A forecaster with 500 predictions has low RD — their CS is reliable.

Products that rely on CS (FF vault pricing, REWIND premiums) use both the CS value and its RD:

  • High CS + low RD → strong confidence, best pricing
  • High CS + high RD → uncertain, priced conservatively until more predictions accumulate
  • Low CS + any RD → full price, no discount
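A hypothetical sketch of that gating logic. The cutoff values `cs_cutoff` and `rd_cutoff` are invented for illustration, not documented thresholds:

```python
def pricing_tier(cs: float, rd: float,
                 cs_cutoff: float = 0.6, rd_cutoff: float = 0.15) -> str:
    """Map a (CS, RD) pair to a pricing bucket. Cutoffs are hypothetical."""
    if cs < cs_cutoff:
        return "full_price"           # low CS: no discount, regardless of RD
    if rd <= rd_cutoff:
        return "best_pricing"         # high CS, low RD: strong confidence
    return "conservative_pricing"     # high CS, high RD: wait for more data

print(pricing_tier(0.8, 0.1))  # best_pricing
print(pricing_tier(0.8, 0.3))  # conservative_pricing
print(pricing_tier(0.2, 0.1))  # full_price
```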

The scoring system is only useful if forecasting skill actually exists and is predictable. The research evidence:

  • Good Judgment Project (Tetlock): Superforecasters (top 2%) beat crowd by 7–12 percentage points, beat intelligence analysts by 25–30%
  • Metaculus pro forecasters: Top 1% estimated at 4–8pp above community aggregate
  • ForecastBench (ICLR 2025): Superforecaster aggregate Brier 0.081 vs. crowd 0.149
  • Polymarket on-chain analysis: Top 0.51% of wallets had >$1K cumulative P&L; 70% of volume from 1% of traders

Alpha distribution across forecasters:

  • Top 0.1% (12–18pp alpha): Elite superforecasters; extremely rare
  • Top 1% (7–12pp): Consistent edge; Chronomancy’s Premium tier target
  • Top 10% (3–6pp): Meaningful edge; Standard tier
  • Top 25% (1–3pp): Marginal edge
  • Median (~0pp): No edge
  • Bottom 50% (negative): Systematically miscalibrated

Persistence: Alpha survives with a 2–4 year half-life (λ ≈ 0.62/year for skill persistence). Skill is real but degrades without continued practice and market exposure.

Domain variation: Skill is domain-specific. The same forecaster who has 12pp edge on geopolitical events might have 2pp edge on US elections (which are highly efficient markets) and 0pp edge on crypto price movements.


New users have no track record. The cold-start protocol moves them through phases:

  • Precognition (0–30 predictions): Unscored (no CS displayed)
  • Qualification (30–150 predictions): CS shown with high RD; not yet gateable
  • Graduated (150+ predictions): Full CS with RD; gates FF pricing
  • Full Access (300+ predictions + time filter): Low RD; gates REWIND premium tiers

The minimum track record of 150 predictions is calibrated from empirical research: below 50–100 predictions, individual CS estimates are too noisy to be useful. 150 provides a meaningful sample while remaining achievable within 3–6 months of regular forecasting activity.
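The phase boundaries read off as a simple lookup. This sketch models only the prediction-count trigger; the Full Access time filter is not modeled here:

```python
def cold_start_phase(resolved: int) -> str:
    """Cold-start phase from resolved-prediction count.
    Boundaries follow the phase table; the Full Access time filter is omitted."""
    if resolved < 30:
        return "Precognition"
    if resolved < 150:
        return "Qualification"
    if resolved < 300:
        return "Graduated"
    return "Full Access"

print(cold_start_phase(10))   # Precognition
print(cold_start_phase(150))  # Graduated
```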


Chrono Score is attested on-chain using Ethereum Attestation Service (EAS) — deployed on Ethereum mainnet and major L2s.

Each CS update creates an on-chain attestation:

attestation {
  subject: wallet_address
  schema: chrono_score_v1
  data: {
    cs_value:    uint16,    // scaled 0-10000
    rd:          uint16,    // rating deviation
    predictions: uint32,    // prediction count
    domains:     bytes32[], // domain tags
    timestamp:   uint64
  }
  signer: chronomancy_attester
}
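An illustrative mirror of the schema fields with the range checks the integer widths imply. This is not the EAS encoding itself, just a sanity-check container; the class name and checks are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ChronoScoreAttestation:
    """Illustrative container for chrono_score_v1 data fields (hypothetical)."""
    cs_value: int          # uint16, CS scaled to 0-10000
    rd: int                # uint16, rating deviation
    predictions: int       # uint32, prediction count
    domains: list[bytes]   # bytes32[] domain tags
    timestamp: int         # uint64

    def __post_init__(self):
        # Range checks implied by the schema's documented field widths.
        if not 0 <= self.cs_value <= 10_000:
            raise ValueError("cs_value must fit the documented 0-10000 scale")
        if not 0 <= self.rd < 2**16:
            raise ValueError("rd must fit uint16")
        if not 0 <= self.predictions < 2**32:
            raise ValueError("predictions must fit uint32")

att = ChronoScoreAttestation(cs_value=7200, rd=85, predictions=340,
                             domains=[b"\x00" * 32], timestamp=1_700_000_000)
```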

This makes CS:

  1. Verifiable by any contract or dapp without trusting Chronomancy’s API
  2. Composable — other protocols can use CS attestations natively
  3. Non-custodial — the user owns their score’s attestation; Chronomancy can update but not delete

The EAS design is the foundation for CS becoming a cross-platform forecasting standard rather than a Chronomancy-specific metric.

Related: