Model reliability
The reliability of ECASA's models
During ECASA we tried to assess the reliability of models in a standard manner. Results are given in detail by model, in reports available from most model pages, and in summary by the final three columns in the model summary table. The table headings are reproduced here:
| model name | scale | brief description | partner | N | r2 | performance |
We compared model predictions with observations in two ways.
- First, the observations show variation (for example, in the amount of nutrient present in the sea, or the size of an oyster). We estimated the proportion of that variation that can be predicted by a model. This proportion is given by a statistic called r2, which ranges from 0 (no variation explained) to 1 (all variation explained).
- Second, we asked whether the predictions of the model were biased - did they persistently overestimate or underestimate the observations? The results are in the 'performance' column.
- An EXCELLENT model has no biases;
- GOOD and FAIR models have increasing amounts of bias.
- In a POOR model there is no distinguishable relationship between observations and model predictions.
The POOR category corresponds to a value of r2 that is indistinguishable from zero, and which could have been obtained by comparing observations with numbers picked at random from those predicted by the model.
The column headed 'N'
Mathematical models contain equations for 'state variables' and the numerical solutions of these equations are used to predict (or hindcast) the values of these state variables. Examples of state variables are the amount of nutrient present in the sea, or the size of an oyster. More complicated models have more state variables. The column headed 'N' gives a number that is the result of multiplying:
- the number of model state variables that have been tested, by
- the number of years of observations for which they have been tested, by
- the number of independent sites at which they have been tested
For example, in the case of the LESV model, comparisons with observations have been made for two years (1975 and 2003) at one site (Loch Creran) for three state variables (chlorophyll, DAIN and DIP). Thus 'N' is 3 x 2 x 1 = 6.
The higher the value of 'N', the more reliable the model - depending on the value of r2 and the category of fit.
Detailed explanation of the calculations of 'r2' and 'fit'
Results from comparing chlorophyll concentrations observed in 1975 in Loch Creran with those predicted by the LESV model are used as an example. The first diagram shows the two sets of data plotted as time series.
[Figure: time series of observed and LESV-predicted chlorophyll concentrations, Loch Creran, 1975]
The next step is to take pairs of comparable values: each pair is an observation and the simulated value for the same day. These are plotted with simulations on the horizontal axis and observations on the vertical axis, as shown in the next diagram.
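[Figure: observations (vertical axis) plotted against simulations (horizontal axis), with a diagonal line indicating perfect agreement]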
In this case it has been necessary to transform the data (the chlorophyll values) to correct for statistically undesirable skewed distributions of individual values about means, and a statistically undesirable tendency for variation to increase as values get larger. The transformation used here was that obtained by taking log10(X+1) of values X.
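A minimal sketch of this transformation in Python is given below. The function name `log_transform` and the sample values are hypothetical, chosen only to show how the transformation compresses large values; they are not ECASA data.

```python
import math


def log_transform(values):
    """Apply log10(X + 1) to each value X (X must be greater than -1)."""
    return [math.log10(x + 1.0) for x in values]


# Hypothetical chlorophyll values, for illustration only.
chlorophyll = [0.0, 0.5, 2.0, 9.0, 99.0]
transformed = log_transform(chlorophyll)
print([round(t, 3) for t in transformed])
```

Adding 1 before taking the logarithm keeps a value of zero defined (it maps to zero), which matters for variables such as chlorophyll that can be very small.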
A diagonal line plotted in this diagram indicates perfect agreement between observed values and model-predicted values. Clearly, agreement between the actual observations and predictions is not perfect, as shown by the scatter of points. A regression line:
observation = a * prediction + b
has been fitted to these points, and three statistics calculated:
- r2 - the proportion of variance explained by the regression; if there was perfect agreement, this proportion would be 1;
- a - the slope of the regression; if there was perfect agreement, this would be exactly 1; the difference (1-a) must be tested to decide if it is significant (i.e., if the slope is significantly different from 1);
- b - the intercept of the regression; if there was perfect agreement, this would be exactly 0; the value must be tested to decide if the intercept is significantly different from 0.
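The three statistics and their tests might be computed as in the following sketch. It assumes ordinary least-squares regression and two-sided t-tests with n - 2 degrees of freedom; the text does not state which test or significance level was used, so that choice, the function names `regress` and `differs`, the caller-supplied critical value `t_crit`, and the sample data are all assumptions made for illustration.

```python
import math


def differs(estimate, target, std_err, t_crit):
    """True if estimate differs significantly from target (assumed two-sided t-test)."""
    if std_err == 0.0:  # exact fit: no scatter, so compare the values directly
        return estimate != target
    return abs(estimate - target) / std_err > t_crit


def regress(predictions, observations, t_crit):
    """Fit observation = a * prediction + b and test a against 1 and 0, b against 0.

    Requires at least three pairs and some variation in both series.
    """
    n = len(predictions)
    if n < 3 or n != len(observations):
        raise ValueError("need at least three matched pairs")
    mean_x = sum(predictions) / n
    mean_y = sum(observations) / n
    sxx = sum((x - mean_x) ** 2 for x in predictions)
    syy = sum((y - mean_y) ** 2 for y in observations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictions, observations))
    a = sxy / sxx                      # slope
    b = mean_y - a * mean_x            # intercept
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(predictions, observations))
    r2 = 1.0 - sse / syy               # proportion of variance explained
    s2 = sse / (n - 2)                 # residual variance
    se_a = math.sqrt(s2 / sxx)
    se_b = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    return {
        "r2": r2, "a": a, "b": b,
        "slope_differs_from_1": differs(a, 1.0, se_a, t_crit),
        "slope_differs_from_0": differs(a, 0.0, se_a, t_crit),
        "intercept_differs_from_0": differs(b, 0.0, se_b, t_crit),
    }


# Hypothetical (already transformed) pairs; 3.182 is the 5% two-sided
# critical value of Student's t for 5 - 2 = 3 degrees of freedom.
result = regress([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], t_crit=3.182)
print(result)
```

The critical value is passed in rather than computed because the Python standard library has no Student's t distribution; a statistics package would normally supply it.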
The results of these tests on the slope and intercept can be used to categorize the performance of the model, following Mesple et al. (1996) and Oreskes et al. (1994):
| category | is slope significantly different from 1? | is slope significantly different from 0? | is intercept significantly different from 0? |
| EXCELLENT | NO | YES | NO |
| GOOD | YES | YES | NO |
| GOOD | NO | YES | YES |
| FAIR | YES | YES | YES |
| POOR | YES | NO | YES |
| POOR | NO | NO | YES |
| POOR | YES | NO | NO |
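The table can be read as a lookup keyed on the three test results, as in this sketch. The names `CATEGORY_TABLE` and `categorize` are hypothetical. The combination NO / NO / NO does not appear in the table; the sketch treats it as POOR on the assumption that any slope indistinguishable from zero means POOR, which matches the description of that category given earlier.

```python
# Keys: (slope differs from 1?, slope differs from 0?, intercept differs from 0?)
CATEGORY_TABLE = {
    (False, True, False): "EXCELLENT",
    (True, True, False): "GOOD",
    (False, True, True): "GOOD",
    (True, True, True): "FAIR",
    (True, False, True): "POOR",
    (False, False, True): "POOR",
    (True, False, False): "POOR",
}


def categorize(slope_differs_from_1, slope_differs_from_0, intercept_differs_from_0):
    """Return the performance category for one set of test results."""
    key = (bool(slope_differs_from_1),
           bool(slope_differs_from_0),
           bool(intercept_differs_from_0))
    # Assumption: the unlisted combination (False, False, False) is POOR,
    # because its slope is not significantly different from 0.
    return CATEGORY_TABLE.get(key, "POOR")


print(categorize(False, True, False))  # EXCELLENT
print(categorize(True, True, True))    # FAIR
```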
Combining statistics for several state variables or sites or years
The procedure given in the preceding section applies to a single model state variable tested for a single site and time-period. In general, a model was subject to N tests, with N defined above. Values reported in the model list were obtained as follows:
- r2 was the average of the individual values of r2 over N;
- performance categories were ranked in the order POOR, FAIR, GOOD, EXCELLENT, and the median rank was selected, except that a final rating of EXCELLENT was downgraded to GOOD if it related to an r2 of less than 0.5.
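The two combining rules might be sketched as follows. Two points are assumptions because the text does not settle them: when N is even the sketch takes the lower of the two middle ranks, and the 0.5 threshold is applied to the averaged r2. The names `RANK_ORDER` and `combine` are hypothetical.

```python
import statistics

RANK_ORDER = ["POOR", "FAIR", "GOOD", "EXCELLENT"]


def combine(r2_values, categories):
    """Combine N individual test results into one r2 and one performance category."""
    mean_r2 = statistics.fmean(r2_values)
    ranks = [RANK_ORDER.index(c) for c in categories]
    # Assumption: with an even number of tests, take the lower middle rank.
    overall = RANK_ORDER[statistics.median_low(ranks)]
    # Assumption: the downgrade rule is applied to the averaged r2.
    if overall == "EXCELLENT" and mean_r2 < 0.5:
        overall = "GOOD"
    return mean_r2, overall


# Hypothetical results from three tests of one model.
print(combine([0.3, 0.4, 0.5], ["EXCELLENT", "EXCELLENT", "GOOD"]))
```

In this hypothetical case the median rank is EXCELLENT, but the averaged r2 is below 0.5, so the reported category is GOOD.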
Interpreting the reliability data
In this assessment of model reliability we are not primarily concerned with the internal complexity of models or the costs of using them, but only with the extent to which they can be relied on to make useful predictions.
- N - a model with a high value of N has been more extensively tested than a model with a low value of N; however, a more complex model has more state variables than a simple model, and so requires more variables to be tested;
- r2 - a model that has shown a high value of r2 over many state variables, years or sites, can be considered robust;
- performance categories - these should be read in conjunction with the value of r2. A model may be deemed not POOR (because the slope of the regression is significantly greater than zero) and yet explain only a small part of the observed variance. In such cases, the errors in the estimates of slope and intercept are large, and performance may be deemed GOOD or EXCELLENT only because the slope is not significantly different from 1 and the intercept is not significantly different from 0.
References
- Mesple, F., Troussellier, M., Casellas, C., Legendre, P., 1996. Evaluation of simple statistical criteria to qualify a simulation. Ecological Modelling, 88 (1), 9-18.
- Oreskes, N., Shrader-Frechette, K., Belitz, K., 1994. Verification, validation, and confirmation of numerical models in the earth sciences. Science, 263, 641-646.



