where s_ij are elements of a scoring matrix given by (i = j, diagonal), (i ≠ j, off-diagonal), and with the sample probabilities (observed frequencies) given by p_i = N(O_i) / N.
Answers the question: What was the accuracy of the forecast in predicting the correct category, relative to that of random chance?
Range: -1 to 1, 0 indicates no skill. Perfect score: 1
Characteristics: Uses all entries in the contingency table, does not depend on the forecast distribution, and is equitable (i.e., random and constant forecasts score a value of 0). Unlike HSS and HK, GS does not reward conservative forecasting; instead it rewards forecasts for correctly predicting the less likely categories. Smaller forecast errors are penalized less than larger ones, which is achieved through the scoring matrix. A more detailed discussion and examples for 3-category forecasts can be found in Jolliffe and Stephenson (2012).
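The equitability property described above can be checked numerically. The following is a minimal sketch, assuming the scoring matrix in its commonly published form (Gerrity 1992, as reproduced in Jolliffe and Stephenson 2012), built from the odds ratios a_r of the cumulative observed probabilities. The function name `gerrity_score` and the table layout (rows = forecast category, columns = observed category) are illustrative assumptions, not part of the original text.

```python
def gerrity_score(table):
    """Gerrity score from a K x K contingency table of counts.

    Assumed layout: table[i][j] = number of cases with forecast category i
    and observed category j. Every category except possibly the last must
    be observed at least once, otherwise an odds ratio is undefined.
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Sample probabilities of the observed categories: p_j = N(O_j) / N
    p_obs = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Odds ratios a_r = (1 - cumulative p) / (cumulative p), r = 1 .. K-1
    a = []
    cumulative = 0.0
    for r in range(k - 1):
        cumulative += p_obs[r]
        a.append((1.0 - cumulative) / cumulative)

    # Symmetric scoring matrix; the (j - i) term makes larger category
    # errors cost more than smaller ones.
    s = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            below = sum(1.0 / a[r] for r in range(i))
            above = sum(a[r] for r in range(j, k - 1))
            s[i][j] = s[j][i] = (below - (j - i) + above) / (k - 1)

    # GS = sum over all cells of joint relative frequency times score
    return sum(table[i][j] / n * s[i][j] for i in range(k) for j in range(k))


# Hypothetical 3-category tables: a perfect forecast and a constant forecast.
perfect = [[50, 0, 0], [0, 30, 0], [0, 0, 20]]
constant = [[50, 30, 20], [0, 0, 0], [0, 0, 0]]
gs_perfect = gerrity_score(perfect)
gs_constant = gerrity_score(constant)
```

Under these assumptions the perfect table scores 1 and the constant forecast scores 0, matching the stated range and equitability.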
Relative operating characteristic - Plot hit rate (POD) vs false alarm rate (POFD), using a set of increasing probability thresholds (for example, 0.05, 0.15, 0.25, etc.) to make the yes/no decision. The area under the ROC curve is frequently used as a score.
Answers the question: What is the ability of the forecast to discriminate between events and non-events?
ROC: Perfect: Curve travels from bottom left to top left of diagram, then across to top right of diagram. Diagonal line indicates no skill. ROC area: Range: 0 to 1, 0.5 indicates no skill. Perfect score: 1
Characteristics: ROC measures the ability of the forecast to discriminate between two alternative outcomes, thus measuring resolution. It is not sensitive to bias in the forecast, so says nothing about reliability. A biased forecast may still have good resolution and produce a good ROC curve, which means that it may be possible to improve the forecast through calibration. The ROC can thus be considered as a measure of potential usefulness. The ROC is conditioned on the observations (i.e., given that Y occurred, what was the corresponding forecast?). It is therefore a good companion to the reliability diagram, which is conditioned on the forecasts. More information on ROC can be found in Mason 1982, Jolliffe and Stephenson 2012 (ch.3), and the WISE site.
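As an illustration of how the ROC points and area are obtained, here is a minimal sketch. It assumes binary observations coded 0/1, a "yes" decision whenever the forecast probability is at or above the threshold, and trapezoidal integration for the area. The function names and the sample data are hypothetical.

```python
def roc_points(probs, obs, thresholds):
    """Return (POFD, POD) pairs, one per probability threshold.

    Assumes obs contains at least one event (1) and one non-event (0).
    """
    points = []
    for t in thresholds:
        hits = sum(1 for p, o in zip(probs, obs) if p >= t and o == 1)
        misses = sum(1 for p, o in zip(probs, obs) if p < t and o == 1)
        false_alarms = sum(1 for p, o in zip(probs, obs) if p >= t and o == 0)
        correct_negs = sum(1 for p, o in zip(probs, obs) if p < t and o == 0)
        pod = hits / (hits + misses)                          # hit rate
        pofd = false_alarms / (false_alarms + correct_negs)   # false alarm rate
        points.append((pofd, pod))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule.

    The end points (0, 0) and (1, 1) are added so the curve spans the diagram.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical forecasts that separate events from non-events perfectly.
thresholds = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
points = roc_points([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0], thresholds)
area = roc_area(points)
```

Because only the ordering of forecast probabilities relative to the thresholds matters, rescaling the probabilities monotonically leaves the curve's shape unchanged, which is why the ROC is insensitive to bias.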
Reliability diagram - (called "attributes diagram" when the no-resolution and no-skill w.r.t. climatology lines are included).
The reliability diagram plots the observed frequency against the forecast probability, where the range of forecast probabilities is divided into K bins (for example, 0-5%, 5-15%, 15-25%, etc.). The sample size in each bin is often included as a histogram or values beside the data points.
Answers the question: How well do the predicted probabilities of an event correspond to their observed frequencies?
Characteristics: Reliability is indicated by the proximity of the plotted curve to the diagonal. The deviation from the diagonal gives the conditional bias. Points below the diagonal indicate overforecasting (probabilities too high); points above the diagonal indicate underforecasting (probabilities too low). The flatter the curve in the reliability diagram, the less resolution it has. A forecast of climatology does not discriminate at all between events and non-events, and thus has no resolution. Points between the "no skill" line and the diagonal contribute positively to the Brier skill score. The frequency of forecasts in each probability bin (shown in the histogram) shows the sharpness of the forecast. The reliability diagram is conditioned on the forecasts (i.e., given that X was predicted, what was the outcome?), and can be expected to give information on the real meaning of the forecast. It is a good partner to the ROC, which is conditioned on the observations.
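The binning behind a reliability diagram can be sketched as follows. This is a minimal illustration, assuming binary observations coded 0/1 and half-open bins [lo, hi) with the final bin closed at its upper edge. The function name `reliability_table` and the sample data are hypothetical.

```python
def reliability_table(probs, obs, edges):
    """Return one (mean forecast probability, observed frequency, count)
    tuple per non-empty probability bin.

    edges is an increasing list of bin edges, e.g. [0.0, 0.05, 0.15, ..., 1.0].
    """
    rows = []
    last = len(edges) - 2
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        members = [
            (p, o)
            for p, o in zip(probs, obs)
            if lo <= p < hi or (k == last and p == hi)
        ]
        if not members:
            continue  # empty bins contribute no point to the diagram
        count = len(members)
        mean_prob = sum(p for p, _ in members) / count
        obs_freq = sum(o for _, o in members) / count
        rows.append((mean_prob, obs_freq, count))
    return rows


# Hypothetical sample: low probabilities verify as non-events, high as events.
rows = reliability_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], [0.0, 0.5, 1.0])
```

Plotting the second element of each tuple against the first gives the reliability curve, and the counts give the sharpness histogram.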
Brier score -
Answers the question: What is the magnitude of the probability forecast errors?
Measures the mean squared probability error. Murphy (1973) showed that it could be partitioned into three terms: (1) reliability, (2) resolution, and (3) uncertainty.
Range: 0 to 1. Perfect score: 0.
Characteristics: Sensitive to climatological frequency of the event: the rarer an event, the easier it is to get a good BS without having any real skill. Negative orientation (smaller score is better) - can "fix" by subtracting BS from 1.
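The mean squared probability error and its three-term partition can be illustrated with a short sketch. It assumes binary observations coded 0/1 and groups cases by distinct forecast value, in which case the identity BS = reliability - resolution + uncertainty holds exactly. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def brier_score(probs, obs):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, obs)) / len(probs)


def brier_decomposition(probs, obs):
    """Return (reliability, resolution, uncertainty).

    Cases are grouped by distinct forecast value; with coarser binning the
    decomposition is only approximate.
    """
    n = len(probs)
    base_rate = sum(obs) / n
    groups = defaultdict(list)
    for p, o in zip(probs, obs):
        groups[p].append(o)

    reliability = sum(
        len(g) * (p - sum(g) / len(g)) ** 2 for p, g in groups.items()
    ) / n
    resolution = sum(
        len(g) * (sum(g) / len(g) - base_rate) ** 2 for g in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Hypothetical forecasts and outcomes.
probs = [0.2, 0.2, 0.8, 0.8, 0.8]
obs = [0, 1, 1, 1, 0]
bs = brier_score(probs, obs)
rel, res, unc = brier_decomposition(probs, obs)
```

The uncertainty term depends only on the observed base rate, which is one way to see why BS is sensitive to the climatological frequency of the event.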
- - - - - - - - - - -
Brier skill score -
Answers the question: What is the relative skill of the probabilistic forecast over that of climatology, in terms of predicting whether or not an event occurred?
Range: -∞ to 1, 0 indicates no skill when compared to the reference forecast. Perfect score: 1.
Characteristics: Measures the improvement of the probabilistic forecast relative to a reference forecast (usually the long-term or sample climatology), thus taking climatological frequency into account. Not strictly proper. Unstable when applied to small data sets; the rarer the event, the larger the number of samples needed.
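A minimal sketch of the skill score relative to a reference forecast follows, assuming the usual form BSS = 1 - BS / BS_reference and, when no reference is supplied, the sample climatology (the observed base rate) as the reference. The function name and sample data are hypothetical.

```python
def brier_skill_score(probs, obs, ref_probs=None):
    """Skill of a probability forecast relative to a reference forecast.

    If ref_probs is None, the sample climatology is used as the reference.
    The score is undefined when the reference Brier score is zero, e.g.
    when every observation in the sample is identical.
    """
    n = len(obs)
    if ref_probs is None:
        climatology = sum(obs) / n
        ref_probs = [climatology] * n

    bs = sum((p - o) ** 2 for p, o in zip(probs, obs)) / n
    bs_ref = sum((r - o) ** 2 for r, o in zip(ref_probs, obs)) / n
    return 1.0 - bs / bs_ref


# Hypothetical outcomes with a perfect forecast and a climatological forecast.
obs = [1, 0, 0, 1, 0]
bss_perfect = brier_skill_score([1.0, 0.0, 0.0, 1.0, 0.0], obs)
bss_climatology = brier_skill_score([0.4] * 5, obs)
```

With a rare event the reference Brier score in the denominator becomes small, so a few cases can swing the result strongly, which is consistent with the instability on small data sets noted above.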