Some thoughts on assessing the skill of large scale (global and regional) coupled hydrodynamic ecosystem models.

 

J. Icarus Allen#

 

With contributions from amongst others , S Harrison, N Mountford, T Platt, R Rivkin, S Sathendrayath, B Sinha, P Somerfield, M Vichi.

 

The ideas and suggestions proposed in this report were influenced by the discussions at the following meetings, AMEMR global plankton model skill assessment workshop Plymouth Feb 2008, 2nd Model Skill Assessment workshop, Chapel Hill NC USA March 2007; Mercator Vert project meeting Paris March 2007 and the 4th Green Ocean Model workshop, Villefranche France May 2007.

 

# Plymouth Marine Laboratory, Prospect Place, West Hoe, Plymouth, UK; email jia@pml.ac.uk

 


1. Introduction

A numerical model is a temporally dynamic cartoon of a system; ideally it should capture the essential bulk properties and feedbacks of the system we are interested in at the level of the data available to support it. It should also produce emergent properties (i.e. outcomes which are not a direct extrapolation of the processes and switches placed in the model). Marine ecosystem models are increasingly being used to investigate how marine biota interacts with its environment and it they may respond to future climate scenario’s and in turn this information is increasingly being used to guide and implement environmental policy. All of the aforementioned uses require an answer to the fundamental questions, what is the skill of the model in question and how much can we trust it?

Historically there has been lack of effort which can partly be attributed to a lack of data (or ease of access to data) and partly to a cultural acceptance that a subjective visual comparison between model and data is acceptable. A systematic analysis of the performance of 153 biological models that include plankton demonstrated that the efforts over the last decade to increase the level of biological detail and spatial complexity, and to explore longer simulation periods, have not led to a systematic or demonstrable improvement in model performance (Arhonditsis and Brett, 2004). They found that only 47% of the models assessed had any validation and only 30% determined some measure of goodness of fit. Arhonditsis et al. (2006) reported no relationship between the level of skill assessment presented or the accuracy of planktonic models, and the subsequent citation rate of the published paper. Allen et al. (2007a) demonstrated that there is no scientific and objective consensus as to what constitutes a “good fit” when model results and observations are visually compared.

Any assessment of model skill needs to be in the context of the science or policy question being addressed. In a broad-brush generic sense the majority of questions being asked of marine ecosystem models involve some form of quantification of the response of the system (including feedbacks) to an external driver or combinations of drivers (e.g. climate, eutrophication, acidification, fisheries, invasive species).

Model development is an iterative process: only by a quantitative benchmarking of model uncertainty can we reduce subjectivity when assessing changes to a model. This is important whether the models are used in heuristic or forecast mode. However, we must be cautious, as Flynn (2005) points out ‘just because a model gives a fit to a particular data set, it does not guarantee the structure is not dysfunctional’. Consequently, there is a balance to be achieved between statistical fit and intuitive understanding of system function.

 

The relationship between model and data

Generally when we assess skill we are asking: How well does the model represent truth over a specified range of conditions? However, because truth cannot be measured, we use observations as a surrogate and ask instead: How well does the model fit the data? Both our model predictions and the observations reside in a halo of uncertainty and the true state of the system is assumed to be unknown, but lie within the observational uncertainty (Figure 1a). A model starts to have skill when the observational and predictive uncertainty halos overlap, in the ideal case the halos overlap completely (Figure1b). Thus, skill assessment requires a set of quantitative metrics and procedures for comparing model output with observational data in a manner appropriate to the particular application.  The residual (or misfit) is defined as the difference between the observation and the prediction, and most of the metrics described in this paper are some function of this quantity.


 

 


Figure 1. Schematic diagram of the relationships between, model prediction (P), observations (O) and the true state of the system (T). Both P and O are assumed to have a halo of uncertainty. Fig 1a) shows the case for a model with no skill and b) shows the case for the ideal mode (Stow et al JMS submitted).

 

2. Data

At the risk of stating the obvious, data is the life blood of model skill assessment. The process requires not only the data (and ideally lots of it), but also an understanding of both what the data tells use about the system and what it doesn’t. There are a number of issues we need to consider including

 

What exactly has been measured and how does it relate to what has been modelled?

In the simplest cases such as temperature or nitrate the model variable is maps exactly onto what has been measured. However biological variable are much more complicated. The first issue is that of aggregation. With reference to bulk biomass functional type models, most modelled planktonic variables describe the bulk properties of a size class and or a particular biogeochemical function, which is quantified as a concentration of a chemical component e.g. nitrogen or carbon). This has to be compared against biological measurement which are often qualitative, e.g. species type and abundance.

 

What are the assumptions and uncertainties of the observations?

This includes measurement errors, replicates, and the uncertainties in any assumptions made. This is particularly pertinent where the measurement has been ‘modelled’ in order to produce the final product (e.g. satellite chlorophyll).

 

How representative is the observation at the time and space scales of the model?

This is a crucial question particularly when we are considering biological variables which in reality often display considerable variability and sub model grid scales.  This is particularly important when considering short term forecast skill for biological variables.  

 

The available in-situ data types, include

 

Satellite earth observation provides extensive spatial and temporal coverage of the surface of the ocean, primarily for temperature, sea surface height and chlorophyll. A number of new products including inherent optical properties, dominant phytoplankton type and primary production estimates are currently being developed and evaluated.

 

3. Metrics

An assessment of the confidence we can place on model results (known as model validation) must take into account the complex combination of model and observational uncertainties. Model errors derive from inaccuracies in process descriptions, parameterisation, initialisation and forcing functions. Errors in observations arise from basic measurement error, inappropriate scales of sample distribution (for example data that are over influenced by small-scale processes) or lack of replication in highly heterogeneous systems and issues of methodology. A crucial issue is balancing precision (how well does the model fit each data point) with trend (i.e. how well the model reproduces the observed seasonal cycles). For example, even when the trend is well reproduced small differences in the timing of an event can lead to large errors in precision. The choice of error statistic is crucial and a comprehensive validation process must consider several. This section outlines some metrics that may be of use in evaluating model skill.

Our general philosophy of model data comparison is to start with as best as is possible to extract the observations from the model (i.e. reconstruct the data set using the model). This model data set can then be subsequently aggregated in space and or time depending on the timescale of comparison required. For example if we are considering short term forecast (e.g. an operational model) then the focus is on precision which requires the rather unforgiving comparison of a direct like-with-like comparison of model and data in space and time; this is appropriate because we are seeking the reproduce the short term high frequency variability of the system. On the other hand if the focus is on seasonal or longer timescales, then model and data need to be matched to the timescales in question.

A variety of uni-variate and multivariate methods can be sued to assess model skill, the choice of which is dependent on the questions being asked and the data available to confront the model with. Appendix A contains some suggestions for possible metrics which have been used in recent publications.

 

4. Evaluation Strategy

In this section we propose a model evaluation strategy. The focus is on models to be used in an earth systems context, thus defining the temporal and spatial scales of interest (i.e. seasonal, inter-annual and decadal timescales, global and regional spatial scales).

The aim is to move beyond subjective visual analysis to qualitative analysis which captures patterns are trends to quantitative analysis. The focus is on PFT models and the challenge is to demonstrate where possible that the individual PFT’s are fulfilling there climate feedback role(s). The ideas are based on the notion that a good model should be descriptive (represent the available data), integrative (demonstrate how elements interact) and explanatory (provide biological insight).

 

The aim is to establish a hierarchy of skill assessment which distinguishes between patterns (qualitative) and magnitude (quantitative)

 

         Statistical demo of qualitative patterns

        Structure of the ocean: Does the model exhibit plausible biogeography (e.g. define provinces by Longhurst provinces, SOM, PCA, use MDS and ANOSIM to test for independence in the province structure)?

        Seasonal cycle: Does the model at a qualitative level exhibit the seasonal cycle of key variables (e.g. Chl, O2, pCO2,)? Look at trends e.g. rank spearman correlation or comparison of model and observed PCA.

        Inter annual and 20th C trend e.g. Does the model at a qualitative level exhibit the long term trends of key variables (pCO2, ).

 

         Magnitude

        Global bulk properties (P-R, export etc…) – these are all poorly constrained so not sure how much use.

        Seasonal and inter-annual global bulk properties (Chl, O2 etc) Use univatiate metrics, Taylor diagrams or MDS and relate test.

        Seasonal and inter-annual region properties: Repeat above using biogeography and appropriate regional datasets)

 

         Emergent properties:

        At a global or regional scale does the model exhibit known responses which are not explicitly linearly parameterised. (E.g. oxygen which integrates community P-R balance with temperature and air sea exchange, plankton biomass/temperature relationships, distribution of plankton functional types (either qualitative or quantitative).

 

5. Prospectus: (making this happen)

In addition to the outline methodology we need the following information to be available. The user requires a subset of tests which can be automated such that a rapid assessment of model performance can be made and benchmarks compared. This requires;

 

a)      An agreed methodology which the community subscribes to.

b)      Error quantified standard data sets (global and regional)

c)      A library of scripts which merge model and observations, thus automating the comparisons.

d)      A healthy degree of scepticism. This should be an iterative process, evaluation is not a static process and should evolve in the context of model development and

 

A Starting point could be to illustrate the strategy with examples from existing models (e.g. PISCES, PLANKTOM, PELAGOS etc) and publish as a position paper. The second stage might be an inter-comparison exercise, with a view towards an ensemble simulation of the global carbon cycle.  

 

Coda

The thoughts and discussion presented above are not intended to be prescriptive; rather this is intended as a starting point for a constructive dialogue within the community. If you think the approach (or a specific metric) is poor tell us why and how to fix it, if it cannot be fixed, offer an alternative.

 

Acknowledgements:

The workshop activities which have contributed to this report have been funded by, the AMEMR IOF grant, Oceans2025, NERC QUEST, the Eur-Oceans Network of excellence and NOAA.

 


APPENDIX A: METHODS and METRICS

Thsi a quick overview of some of the approaches avialble to evaluate models. Much of the text is directly extracted from other documents, and therefore this section has not been carefully editied.

 

A.1 Univariate metrics

There follows a few simple univariate metrics which allow the direct comparison of the skill of a single model variable

 to reproduce observations. These provide a useful reference point for benchmarking models.

 

The Nash Sutcliffe Model Efficiency (Nash and Sutcliffe 1970) of a model variable is a measure of the ratio of the model error to the variability of the data. It was developed to assess the performance of river catchment models, which exhibit a similar temporal variability to phytoplankton (rapid increases and decreases). 

 

                                                                                                                                    A1.

 

 

Where D is the data, M the corresponding model estimate and the overbar indicates the mean of the data set for the chosen variable, N is the total number of model data matches and n is the nth comparison. The squaring of the error rewards a good fit and punishes a poor fit. Performance levels are categorised as follows >0.65 excellent, 0.65-0.5 very good, 0.5-0.2 good, < 0.2 poor and are taken from Marechal (2004).

 

The Percentage Model Bias (the sum of model error normalized by the data) is given by

 

 


                                                                                                                                    A2.

 

 

And gives measure of whether the model is systematically underestimating or over estimating the observations. The close the value is to zero the better the model. Performance levels are categorised as follows |Pbias| < 10 excellent, 10-20 very good, 20-40 good, > 40 poor (Marechal 2004)

 

The Pearson correlation coefficient (R) is defined by

R =                                                                 A3

 

It expresses the quality of a least squares fitting between two model and data ( R=0 no relationship, R =1 perfect fit). The square of the correlation coefficient (R2) expresses the percentage of the variability in data that can be accounted for by the model.

 

Spearman's rank correlation coefficient (r) is a non-parametric measure of correlation; it assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. Unlike the Pearson product-moment correlation coefficient, it does not require the assumption that the relationship between the variables is strictly linear, nor does it require the variables to be measured on interval scales; it can be used for variables measured at the ordinal level.

 

The Root Mean Squared Error (RMSE) of n model data comparisons is defined as

 

                                                                    A4

 

And provides a useful measure of the goodness of fit between two data sets; the closer the RMSE is to zero the better the fit. It needs to be evaluated in the context of the variance of the data, if the RMSE is greater than the variance of the observations the model skill is poor.

 

Cost functions are a measure of model data mismatch and are primarily used in data assimilation, usually taking the form of the difference between model and observation, scaled by some measure of the variance of the data. They can be both uni-variate and multivariate. The cost function gives a non-dimensional value which is indicative of the ‘‘goodness of fit’’ between two sets of data; it quantifies the difference between model results and measurement data.

The simplest example of a cost function involves scaling the RMSE by the variance of the data (e.g. Holt et al 2005),

                                                                            A5

where M and D are the model and observed variable, N is the number of observations and so  is the standard deviation of the observations. It allows the accuracy of the model variables to be compared in a systematic fashion. We might expect values of c<1.0 to be required if the model is to have any chance of predictive skill; i.e. if the cost function is less than 1 the model data mismatch is less than the variance of the data.

 

A.2 Taylor diagrams

Taylor diagrams (Taylor, 2001, Fig A1) provide a way of graphically summarizing how closely patterns (or a set of patterns) match observations. The similarity between two patterns is quantified in terms of their correlation, their centred root-mean-square difference and the amplitude of their variations (represented by their standard deviations). These diagrams are especially useful in evaluating multiple aspects of complex models or in gauging the relative skill of many different models

Figure A1. Example Taylor diagram.

 

In general, the Taylor diagram characterizes the statistical relationship between two fields, a "test" field (often representing a field simulated by a model) and a "reference" field (usually representing “truth”, based on observations). Note that the means of the fields are subtracted out before computing their second-order statistics, so the diagram does not provide information about overall biases, but solely characterizes the centred pattern error.

The reason that each point in the two-dimensional space of the Taylor diagram can represent three different statistics simultaneously (i.e., the centred RMS difference, the correlation, and the standard deviation) is that these statistics are related by the following formula:

 

E R f r f r 2 = σ 2 +σ 2 − 2σ σ ,                                                                                         A6

 

where R is the correlation coefficient between the test and reference fields, E' is the centred RMS difference between the fields, and σf2 and σr 2 are the variances of the test and reference fields, respectively. The construction of the diagram (with the correlation given by the cosine of the azimuthal angle) is based on the similarity of the above equation and the Law of Cosines:

 

c2 = a2 + b2 −2ab cosφ                                                                                                  A7

 

A.3 Discrimination analysis

This is a class of tests which assess the predictive power of a binary classification system to evaluate how useful a model is in a decision-making process. These tests can reveal the following about a model: a) whether or not the fit between model and observations is better or worse than we would obtain if the model was replaced with a random number generator, and b) how well it quantifies skill as a function of threshold using a binary discriminator, i.e. if an algal bloom is defined as being above a certain concentration of chlorophyll, what is the probability that our model predicts a bloom?


Figure A2 Schematic diagrams of the discrimination analysis.

The best known example is the Receiver Operator Characteristic (ROC) devised during the Second World War for radar operators to correctly differentiate hostile and friendly aircraft. These techniques are now widely used in a number of fields, particularly medical research. Brown and Davis (2006) provide a detailed and accessible tutorial of the use of ROC curves and related metrics. We outline the methods below, following the nomenclature of Brown and Davis (2006).

At the heart of the test is a simple yes/no decision, based on the comparison of two independent information sets (in our case observations and model) with respect to a threshold value. Each trial has four possible outcomes, either correctly positive (CP), correctly negative (CN), incorrectly positive (IP) and incorrectly negative (IN), these are also known as Type 1 and Type 2 errors (Figure A2). We can use this approach to make an analysis of similarity of how well model the fits the data. The perfect model is one where all the points in a scatter diagram of model vs. data lie on the x = y line (Figure A2). 

 


Figure A3 Schematic diagrams of the binary discrimination skill assessment curves.

If we set a threshold criterion (TD) dividing the data into two sets and then compare it with the model using the same threshold (TM, Figure A2) we can assess model data similarity at that threshold, effectively assessing the model ability to discriminate that threshold. The perfect model will only give CP and CN outcomes; the more scatter there is in the model-data relationship the more IP and IN conditions will occur and the worse the model performance. Because we are interested in model performance we want to assess how well the model resolves the data across the whole range of data. By allowing TD to co-vary with TM, we obtain a non-parametric measure of the model’s ability to simulate a given variable, which can be compared directly with other simulated variables. The decision process can be further assessed by calculating the correct negative fraction (CNF) and the correct positive fraction (CPF).

                    A8

 

                    A9

CNF and CPF are independent of the actual numbers of positive and negative events in the trials and express the fraction of negative and positive events, which are correctly determined. A curve which illustrates model performance can then be calculated by plotting CPFi on the vertical axis and 1-CNFi on the horizontal axis for i=1, k threshold values (Figure A3).  These values are sometimes referred to as the sensitivity and specificity, where the sensitivity (CPF) is the probability that case X classified correctly as above the threshold and the specificity (1-CNF) is the probability that X classified correctly as below the threshold. The perfect model corresponds to a point in the top left hand corner of the Y axis (i.e. CNF = 1 and CPF = 1), the top right (CPF=1, CNF=0) and bottom left (CPF = 0 and CNF = 1) of the diagram correspond to the extremes of the decision process where every trial is always deemed either positive or negative. A completely random predictor (by definition CP = IP and CN = IN) gives a straight line at an angle of 45˚ from the horizontal. This is because as the threshold rises equal numbers of true and false positives occur. Results below this line suggest the model gives consistently incorrect results. 

Decisions based on CPF and CNF are estimators of probabilities of decisions contingent on events: if a positive event has occurred what is the probability I will make the correct decision. While these probabilities are useful they do not address the fundamental question, if I make a positive decision what’s the probability that the decision is correct. The positive predictive value (PPV) and negative predictive value (NPV) can be expressed as (see Brown and Davis (2006) for the theoretical background and derivation).

 

                     A10

                   A11

 

Values of PPV and NPV can range between 0 and 1, reflecting the intrinsic power of the decision; high values indicating a decision can be trusted, low values suggesting the decision should be regarded with scepticism.

 

A.4 Qualitative approaches

Most modelled planktonic variables describe the bulk properties of a size class and or a particular biogeochemical function, which is quantified as a concentration of a chemical component e.g. nitrogen or carbon). This has to be compared against biological measurement which are often qualitative, e.g. species type and abundance. One approach is to normalise both model and observation to a mean of zero and a standard deviation of 1 and compare the trends. Cumulative sum analysis (e.g. Taylor et al 2002) allows us to assess when the models have deviate form each other. This ideas are explored in Lewis et al 2006, in which CPR tows are reconstructed form the model and compared in space and time with the plankton survey data. 

 

A.5 Reducing the dimensionality of the problem

Big models produce big data sets, and one of the biggest challenges is to reduce the dimensionality of the problem in order to make the analysis tractable. The are a number of techniques which can be used to do this including invoking biogeography (e.g. Longhurst provinces), using cluster analysis to define model provinces and principle component analysis to assess whether the model and data have the same major models of variability. . The combination of these techniques can simplify complex comparisons of model outputs with real data, and analysis of error distributions.  The major limitation of this approach is access to large multivariate spatio- temporal data sets.

 

Multivariate approaches

Univariate and multivariate metrics are useful measures to summarize model skill, but considerable information can be lost when complex multivariate information is reduced to a single numerical index.  Multivariate approaches that allow the simultaneous examination of the ways in which numerous variables vary in relation to each other spatially and temporally are also helpful to evaluate model skill.  Marine ecologists commonly use these approaches to interpret complex data sets and marine ecosystem modellers are beginning to use them to investigate patterns and modes of variability in model outputs (Allen et al., 2002; Blackford et al., 2004, Schrum et al., 2006, Allen et al., 2007a, Allen and Clark, 2007).

If we have a set of multivariate observations available for model validation we can subject them to multivariate analysis. If we then reconstruct a data set from the model by taking the nearest equivalents in space and time, we can subject them to the same analysis and compare the results. By definition, if the observations are the truth then the perfect model should exactly reproduce the observed multivariate patterns. Multivariate analysis allows us to explore complex relationships by reducing the dimensionality of the problem. Useful techniques include;

The purpose of Principle Component Analysis PCA is to display, in 2-dimensional space, the appropriate relationship between different environmental and biotic variables over a seasonal cycle. Provided the first two principal axes (PC1 and PC2) account for much of the total variance in the full variable set, which can happen when there are strong inter-correlations between the modelled variables, then points which are close together on the PCA plot indicate conditions which are similar. Furthermore, the PC axes are simple linear combinations of the model variables, so that the sign of a linear coefficient and its magnitude indicate the direction and strength of the change in that variable across (PC1) or up (PC2) the plot.  PCA has been determined separately for both the model and observational data. The perfect model simulation will exactly reproduce the modes of variability observed in the data. We compare the first and second PC’s of the model and data to see whether the magnitude and dominant variables are similar.

A non-metric Multi Dimensional Scaling (MDS) algorithm constructs MDS plots iteratively by as closely as is possible satisfying the dissimilarity between samples; dissimilarities between pairs of samples, derived from normalised Euclidean-distance matrices, are turned into distances between sample locations on a map.

A test of ‘no relationship between distance matrices’, essentially a test for concordance in multivariate pattern, was carried out using a non-parametric Mantel test. This provides a multivariate validation statistic which indicates how well the model reproduces the observed multivariate pattern. A correlation (r) between corresponding elements in each distance matrix was calculated using Spearman’s rank correlation, adjusted for ties (Kendall, 1970). The significance of the correlation was determined by a Monte Carlo permutation procedure, using the PRIMER program RELATE. A permutation procedure is essential because the lack of independence between elements of a resemblance matrix make the standard tables for testing r invalid. For the ideal model r =1.

Differences between clustered groups can be tested for statistical significance using analysis of similarities (ANOSIM, Clarke and Green, 1988). The ANOSIM statistic (R) compares the similarity of ranked variables within a province with the average rank of different provinces and it is scaled to vary between −1 and +1. A value of +1 indicates that the similarity between all samples within one province is higher than all similarities between groups.

Allen and Somerfield. (submitted) have demonstrated the applicability of a range of techniques (Principal Component Analysis (PCA, e.g. Chatfield and Collins 1980), Multi Dimensional Scaling (MDS e.g. Clark 1993) and cluster analysis e.g., Clark and Gorley 2006) and shown that the dimensions of the problem can be reduced and multivariate and univariate goodness of fit measures, in terms of both magnitude and trend determined.

 

The Self Organising Map (SOM)is a clustering technique which allows use to define biogeomes (regions of self consistent biogeochemical properties), the boundaries of which will vary in time. The SOM analysis could be used to define seasonally-varying boundaries which may in turn inform our understanding of model performance. Additionally we could use SOMs to define parameter sets for each region to optimize model performance in each biogeome. Potentially the methodology allows the systematic analysis of model error, determination of the extent to which we can extrapolate data, and may inform new data collection strategies and consequently we recommend the use of this combination of techniques.

The SOM is an unsupervised neural network which is particularly adept at pattern recognition and classification (Kohonen, 1995), which has been used to visualise and interpret large high-dimensional data sets in many diverse areas of research (over 5300 papers) including model validation (Hewitson and Crane, 1994).  The key to the ability of the SOM to extract patterns is that it learns by an iterative process.  The SOM defines a mapping from the input data space onto a regular two-dimensional array of nodes. The map attempts to represent all the available observations with optimal accuracy using a restricted set of models. At the same time the models become ordered on the grid so that similar models are close to each other and dissimilar models far from each other. In essence the algorithm groups large volumes of data into a number of categories, the number (and arrangement) of which are selected by the investigator. Its application in marine science is described in Richardson et al., (2003), who recommend it as a tool for the analysis of large volumes of model data.

A 2 stage SOM (Allen et al 2007) allows us to condense information about complex spatio-temporal patterns into a simple 2D plot, (Fig. A4). This is required to allow us to classify multivariate data in space and time. Firstly all grid cells from all time-steps within the model domain are classified, using SOM, into a pre-selected number of categories on the basis of values of a set of variables. In effect grid cells with similar properties are classified together independent of the temporal dimension.  In the second stage each grid cell is again classified into a pre selected number of categories by a SOM, this time on the basis of its category membership (from stage 1) through time.  The result is a 2 dimensional plot in which each box is assigned to a category on the basis of its complex spatio-temporal dynamics (Fig. A4).

 


 


Figure. A4. Schematic representation of the 2 stage SOM

 

A.6 Testbeds

The systematic inter-comparison of models to assess which model structures are best able to reproduce observations across a range of regions (e.g. Freidrichs et al 2006, 2007). Crucial to this is the notion of portability; the ability of the model the reproduce data in multiple regions without retuning the model. Freidrichs et al 2007 found that increasing the number of phytoplankton functional types generally increases the portability of models. They also found that increasing the number of zooplankton functional types does not always improve model portability.

 

 

References

Allen J.I., T.J. Smyth, J.R. Siddorn, and M. Holt. 2007. How well can we forecast high biomass algal bloom events in a eutrophic coastal sea? Harmful Algae, In press.

Allen, J. I., J. T. Holt, J. Blackford, and R. Proctor. 2007b. Error quantification of a high-resolution coupled hydrodynamic-ecosystem coastal-ocean model: Part 2. Chlorophyll-a, nutrients and SPM. Journal of Marine Systems, 68 381-404.

Allen, J. I., P. J. Somerfield, and F. J. Gilbert. 2007a. Quantifying uncertainty in high-resolution coupled hydrodynamic-ecosystem models, Journal of Marine Systems, 64, 3-14.

Allen, J. I., P. J. Somerfield, and J. Siddorn. 2002. Primary and bacterial production in the Mediterranean Sea: a modelling study. Journal of Marine Systems, 33-34: 473-495.

Allen, J.I., Clarke K.R., 2007. Effects of demersal trawling on ecosystem functioning in the North Sea: a modelling study. Marine Ecology Progress Series, 336: 63-75.

Allen JI and P. J. Somerfield. 2008.  A multivariate approach to model skill assessment, Journal of Marine Systems, submitted.

Arhonditsis, G.B., and M.T. Brett. 2004. Evaluation of the current state of mechanistic aquatic biogeochemical modeling. Marine Ecology Progress Series, 271: 13-26.

Arhonditsis, G.B., B.A. Adams-VanHarn, L. Nielsen, C.A. Stow, and K.H. Reckhow. 2006. Evaluation of the current state of mechanistic aquatic biogeochemical models: citation analysis and future perspectives. Environmental Science & Technology, 40: 6547-6554

Brown C.D. and H.T. Davis, 2006. Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems, 80; 24-38.

Chatfield, C. and A.J. Collins. 1980. Introduction to Multivariate Analysis. Chapman and Hall, London.

Clarke, K.R. and R.N. Gorley. 2006. PRIMER v6: User manual/tutorial. PRIMER-E Ltd, Plymouth, 192pp

Clarke, K.R., 1993. Nonparameteric multivariate analyses of changes in community structure. Australian Journal of Ecology, 18: 117-143.

Friedrichs, M.A.M., J.A. Dusenberry, L.A. Anderson, R. Armstrong, F. Chai, J.R. Christian, S.C. Doney, J. Dunne, M. Fujii, R. Hood, D. McGillicuddy, J.K. Moore, M. Schartau, Y.H. Spitz, and J.D. Wiggert. 2007. Assessment of skill and portability in regional marine biogeochemical models: the role of multiple planktonic groups, J. Geophys. Res. Oceans, doi:10.1029/2006JC003852, in press.

Friedrichs, M.A.M., R. Hood, and J. Wiggert. 2006. Ecosystem model complexity versus physical forcing: Quantification of their relative impact with assimilated Arabian Sea data. Deep-Sea Research II, 53: 576-600.

Nash, J.E., and J. V. Sutcliffe. 1970. River flow forecasting through conceptual models, Part 1 – A discussion of principles. Journal of Hydrology, 10: 282-290.

Stow et al Skill Assessment for Coupled Biological/Physical Models of Marine Systems JMS submitted.

Taylor, K. E., 2001. Summarizing multiple aspects of model performance in a single diagram, Journal of Geophysical Research, 106: 7183-7192.