
Research: Scoring Theory

Chrono Score (CS) is a soulbound on-chain forecasting reputation built on calibration mathematics. This page covers the theory behind the scoring rule choices.


The Core Problem: Measuring Forecasting Skill


A forecasting reputation system needs to answer: is this person actually skilled, or did they get lucky?

This is harder than it looks. Consider two forecasters:

  • Forecaster A: Made 10 predictions, got 9 right. CS based on accuracy: 90%.
  • Forecaster B: Made 200 predictions with calibrated probabilities averaging 65%, resolving YES 65% of the time. CS based on calibration: high.

Forecaster A looks better on accuracy. But Forecaster A might have made easy predictions (things that resolved 95%+ of the time). Forecaster B’s track record shows something deeper: their stated probabilities match reality.

Calibration is the right metric. Accuracy alone is a noisy signal.


Brier Score (Quadratic Scoring Rule)

Brier = (p - o)²
where p = stated probability, o = outcome (0 or 1)

Range: 0 (perfect) to 1 (worst). Lower is better. For a binary outcome:

  • State 0.7, outcome is YES (1): Brier = (0.7 - 1)² = 0.09
  • State 0.7, outcome is NO (0): Brier = (0.7 - 0)² = 0.49

Properties: Bounded, intuitive, commonly reported. Quadratic penalty — not as sharply incentive-compatible as log score for extreme probabilities.

Best for: Public display. Human-readable. Superforecaster benchmarks (ForecastBench) use Brier.
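The Brier formula translates directly into code. A minimal sketch, reproducing the two worked examples above:

```python
def brier(p: float, outcome: int) -> float:
    """Quadratic score for a binary prediction: (p - o)^2. Lower is better."""
    return (p - outcome) ** 2

# The two worked examples from the text:
print(brier(0.7, 1))  # stated 0.7, resolved YES -> 0.09
print(brier(0.7, 0))  # stated 0.7, resolved NO  -> 0.49
```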

Log Score (Proper Logarithmic Scoring Rule)

LogScore = log(p) if outcome YES, log(1-p) if outcome NO

Unbounded below (stating 0% when outcome is YES → −∞). Proper — the theoretically optimal strategy is to state your true belief.

Critical property: The log score’s strict properness means a forecaster maximizes expected score only by stating their true probability. Brier score also has this property, but log score penalizes overconfident wrong predictions more severely — relevant when stakes are financial.

Modified log score with floor:

CS_raw = max(log(p), C) if YES; max(log(1-p), C) if NO
where C = floor constant (Chronomancy uses C = -3.0, approximately ln(0.05))

The floor prevents a single 1%-stated prediction that resolves YES from catastrophically destroying a reputation. This makes the system resistant to a specific manipulation: deliberately stating near-0% probabilities as a hedge.
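A sketch of the floored rule, taking the floor as -3.0 (log probabilities are non-positive, so the floor must be negative; ln(0.05) ≈ -3.0, assuming natural log):

```python
import math

FLOOR = -3.0  # approximately ln(0.05); assumed natural log

def cs_raw(p: float, outcome: int) -> float:
    """Floored log score: max(ln(p), FLOOR) if YES, max(ln(1-p), FLOOR) if NO."""
    prob_assigned = p if outcome == 1 else 1.0 - p
    if prob_assigned <= 0.0:
        return FLOOR  # log(0) = -inf, clamped by the floor
    return max(math.log(prob_assigned), FLOOR)

# A 1%-stated prediction that resolves YES hits the floor instead of -4.6:
print(cs_raw(0.01, 1))  # -3.0
```

Without the floor, that single prediction would score ln(0.01) ≈ -4.6, roughly thirteen well-calibrated 70% predictions' worth of damage.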

Elo (Relative Ranking)

Elo-style systems produce relative rankings: Forecaster A beats Forecaster B if A is closer to the right answer on the same market.

Not used as CS base: Elo measures relative performance, not absolute calibration. A forecaster can have high Elo by being less wrong than their opponents while being systematically miscalibrated. Chronomancy uses Elo separately for competitive ranking (the Elo system page) — it does not conflate the two.


CS uses a modified log score with a floor as the base calculation, displayed publicly as a Brier-score equivalent for readability.

CS_raw(prediction) = max(log(p), -3.0) if YES
                   = max(log(1-p), -3.0) if NO

New forecasters have small sample sizes. Raw scores from 10 predictions are highly variable. Bayesian shrinkage (κ-blending) pulls a forecaster’s score toward the population mean:

CS_adjusted = (n × CS_raw + κ × CS_mean) / (n + κ)
where n = number of resolved predictions, κ = shrinkage constant

Chronomancy uses κ = 150 — calibrated so that a forecaster needs approximately 150 predictions before their personal CS dominates the population prior. Below that, the system is appropriately uncertain.

This prevents the cold-start manipulation: making 5 perfect predictions to claim a 100% score.
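The κ-blend can be sketched directly from the formula. The population mean of -0.7 below is illustrative, not a documented value:

```python
def cs_adjusted(cs_raw_mean: float, n: int,
                population_mean: float, kappa: float = 150.0) -> float:
    """Bayesian shrinkage: blend personal score toward the population prior.
    Personal score gets weight n, the prior gets weight kappa."""
    return (n * cs_raw_mean + kappa * population_mean) / (n + kappa)

# Five "perfect" predictions barely move the score off an illustrative
# population mean of -0.7:
print(cs_adjusted(-0.05, 5, -0.7))    # ~ -0.679, still near the prior
# At n = 150 the personal score and the prior carry equal weight:
print(cs_adjusted(-0.05, 150, -0.7))  # -0.375, the midpoint
```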

Historical predictions decay in relevance. Market dynamics change; a forecaster who was excellent in 2020 may be miscalibrated in 2026.

Weight(t) = λ^(months_ago)
where λ = 0.95 (monthly decay factor)

At λ = 0.95:

  • 6 months ago: weight = 0.74
  • 12 months ago: weight = 0.54
  • 24 months ago: weight = 0.29
  • 36 months ago: weight = 0.16

This gives CS a half-life of roughly 13.5 months (0.95^13.5 ≈ 0.5): recent performance matters more than historical performance. Forecasters who haven’t made predictions recently see their CS drift toward population mean via a separate activity adjustment.
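The decay weights and the implied half-life follow directly from λ:

```python
import math

LAM = 0.95  # monthly decay factor

def weight(months_ago: float, lam: float = LAM) -> float:
    """Relevance weight of a prediction resolved `months_ago` months ago."""
    return lam ** months_ago

# The weights from the list above:
for m in (6, 12, 24, 36):
    print(m, round(weight(m), 2))

# Half-life implied by the decay factor: solve lam**t == 0.5 for t.
half_life = math.log(0.5) / math.log(LAM)  # ~13.5 months
```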

Every CS value has an uncertainty parameter (RD, Rating Deviation, from the Glicko-2 system). A forecaster with 10 predictions has high RD — their CS is imprecise. A forecaster with 500 predictions has low RD — their CS is reliable.

Products that rely on CS (FF vault pricing, REWIND premiums) use both the CS value and its RD:

  • High CS + low RD → strong confidence, best pricing
  • High CS + high RD → uncertain, priced conservatively until more predictions accumulate
  • Low CS + any RD → full price, no discount
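A hypothetical sketch of that gating logic. The cutoff values `cs_cutoff` and `rd_cutoff` are invented for illustration, not documented thresholds:

```python
def pricing_tier(cs: float, rd: float,
                 cs_cutoff: float = 0.6, rd_cutoff: float = 0.15) -> str:
    """Map a (CS, RD) pair to a pricing bucket. Cutoffs are hypothetical."""
    if cs < cs_cutoff:
        return "full_price"           # low CS: no discount, regardless of RD
    if rd <= rd_cutoff:
        return "best_pricing"         # high CS, low RD: strong confidence
    return "conservative_pricing"     # high CS, high RD: wait for more data

print(pricing_tier(0.8, 0.1))  # best_pricing
print(pricing_tier(0.8, 0.3))  # conservative_pricing
print(pricing_tier(0.2, 0.1))  # full_price
```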

The scoring system is only useful if forecasting skill actually exists and is predictable. The research evidence:

  • Good Judgment Project (Tetlock): Superforecasters (top 2%) beat crowd by 7–12 percentage points, beat intelligence analysts by 25–30%
  • Metaculus pro forecasters: Top 1% estimated at 4–8pp above community aggregate
  • ForecastBench (ICLR 2025): Superforecaster aggregate Brier 0.081 vs. crowd 0.149
  • Polymarket on-chain analysis: Top 0.51% of wallets had >$1K cumulative P&L; 70% of volume from 1% of traders

Alpha distribution across forecasters:

  • Top 0.1% (12–18pp alpha): Elite superforecasters; extremely rare
  • Top 1% (7–12pp): Consistent edge; Chronomancy’s Premium tier target
  • Top 10% (3–6pp): Meaningful edge; Standard tier
  • Top 25% (1–3pp): Marginal edge
  • Median (~0pp): No edge
  • Bottom 50% (negative): Systematically miscalibrated

Persistence: Alpha survives with a 2–4 year half-life (λ ≈ 0.62/year for skill persistence). Skill is real but degrades without continued practice and market exposure.

Domain variation: Skill is domain-specific. The same forecaster who has 12pp edge on geopolitical events might have 2pp edge on US elections (which are highly efficient markets) and 0pp edge on crypto price movements.


New users have no track record. The cold-start protocol moves them through phases:

  • Precognition (0–30 predictions): Unscored (no CS displayed)
  • Qualification (30–150 predictions): CS shown with high RD; not yet gateable
  • Graduated (150+ predictions): Full CS with RD; gates FF pricing
  • Full Access (300+ predictions + time filter): Low RD; gates REWIND premium tiers

The minimum track record of 150 predictions is calibrated from empirical research: below 50–100 predictions, individual CS estimates are too noisy to be useful. 150 provides a meaningful sample while remaining achievable within 3–6 months of regular forecasting activity.
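The phase boundaries read off as a simple lookup. This sketch models only the prediction-count trigger; the Full Access time filter is not modeled here:

```python
def cold_start_phase(resolved: int) -> str:
    """Cold-start phase from resolved-prediction count.
    Boundaries follow the phase table; the Full Access time filter is omitted."""
    if resolved < 30:
        return "Precognition"
    if resolved < 150:
        return "Qualification"
    if resolved < 300:
        return "Graduated"
    return "Full Access"

print(cold_start_phase(10))   # Precognition
print(cold_start_phase(150))  # Graduated
```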


Chrono Score is attested on-chain using Ethereum Attestation Service (EAS) — deployed on Ethereum mainnet and major L2s.

Each CS update creates an on-chain attestation:

attestation {
  subject: wallet_address
  schema: chrono_score_v1
  data: {
    cs_value:    uint16,    // scaled 0-10000
    rd:          uint16,    // rating deviation
    predictions: uint32,    // prediction count
    domains:     bytes32[], // domain tags
    timestamp:   uint64
  }
  signer: chronomancy_attester
}
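An illustrative mirror of the schema fields with the range checks the integer widths imply. This is not the EAS encoding itself, just a sanity-check container; the class name and checks are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ChronoScoreAttestation:
    """Illustrative container for chrono_score_v1 data fields (hypothetical)."""
    cs_value: int          # uint16, CS scaled to 0-10000
    rd: int                # uint16, rating deviation
    predictions: int       # uint32, prediction count
    domains: list[bytes]   # bytes32[] domain tags
    timestamp: int         # uint64

    def __post_init__(self):
        # Range checks implied by the schema's documented field widths.
        if not 0 <= self.cs_value <= 10_000:
            raise ValueError("cs_value must fit the documented 0-10000 scale")
        if not 0 <= self.rd < 2**16:
            raise ValueError("rd must fit uint16")
        if not 0 <= self.predictions < 2**32:
            raise ValueError("predictions must fit uint32")

att = ChronoScoreAttestation(cs_value=7200, rd=85, predictions=340,
                             domains=[b"\x00" * 32], timestamp=1_700_000_000)
```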

This makes CS:

  1. Verifiable by any contract or dapp without trusting Chronomancy’s API
  2. Composable — other protocols can use CS attestations natively
  3. Non-custodial — the user owns their score’s attestation; Chronomancy can update but not delete

The EAS design is the foundation for CS becoming a cross-platform forecasting standard rather than a Chronomancy-specific metric.

Related: