Testing of null hypotheses in exploratory community analyses: similarity profiles and biota-environment linkage

https://doi.org/10.1016/j.jembe.2008.07.009

Abstract

Tests for null hypotheses of 'absence of structure' should play an important role in any exploratory study, to guard against interpretation of sample patterns that could have been obtained by chance, and two new tests of this type are described. In the multivariate analyses that arise in community ecology and many other environmental contexts, e.g. in linking assemblage patterns to forcing environmental variables (gradient analysis), the problem of chance associations is exacerbated by the large number of combinations of abiotic variables that can usually be examined. A test which allows for this selection bias is described (the global BEST test), which applies to any dissimilarity measure, utilises only rank dissimilarities, and operates by permutation, assuming no specific distributional form or parametric expression for the biotic to abiotic links. A second permutation procedure, the similarity profile routine (SIMPROF), tests for the presence of sample groups (or more continuous sample patterns) in a priori unstructured sets of samples, for which an a priori structured test (e.g. the widely-used ANOSIM) is invalid. One context is in interpreting dendrograms from hierarchical cluster analyses: a series of SIMPROF tests provides objective stopping rules for ever-finer dissection into subgroups. Connecting these two tests is a third methodological strand, adapting De'ath's multivariate equivalent of univariate CART analysis (Classification And Regression Trees) to a non-parametric context. This produces a divisive, constrained, hierarchical cluster analysis of samples, based on their assemblage data, termed a linkage tree. The constraint is that each binary division of the tree corresponds to a threshold on one of the environmental variables and, consistently with related non-parametric routines, maximises the high-dimensional separation of the two groups, as measured by the ANOSIM R statistic. 
Such linkage trees therefore provide abiotic 'explanations' for each biotic subdivision of the samples but, as with unconstrained clustering, the LINKTREE routine requires objective stopping rules to avoid over-interpretation, these again being provided by a sequence of SIMPROF tests. The inter-connectedness of these three new developments is illustrated by data from the literature of marine ecology.

Introduction

Professor John Gray was a strong advocate of the insights obtainable from exploratory studies of gradients and did much to demonstrate their efficacy in the contexts of monitoring for pollution and studying biodiversity (see Gray et al., 1988, Gray et al., 1990, Ellingsen and Gray, 2002, amongst many others). For multivariate community analyses, however, gradient studies (broadly characterisable as adopting a regression approach) have sometimes suffered in comparison to studies involving factorial designs (broadly speaking, an analysis of variance approach) by their perceived lack of hypothesis testing for structure elucidated only a posteriori. Such criticism is often justified: for example, a search through large numbers of environmental variables, for combinations which 'explain' the among-sample structure of a biotic assemblage, is almost guaranteed to find a combination with some apparent explanatory power, even where there is no real linkage. The process of searching through many solutions for the one that optimises some criterion inevitably involves strong selection bias. At the least, what is required here is a formal test of the null hypothesis that there is no link between the sample patterns of biota and environment, adjusting for this selection bias. If the null can be decisively rejected then there is some objective basis for interpreting the observed correlative links.
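To make the idea of a test that adjusts for selection bias concrete, the following is a minimal Python sketch of a permutation test of this kind. It is not the PRIMER implementation. It assumes a precomputed flat list of biotic dissimilarities (`bio_d`), unnormalised Euclidean distance on the environmental variables, an exhaustive search over subsets of at most `max_size` variables, and a null distribution obtained by shuffling whole environmental samples relative to the biotic ones and repeating the full search each time. All function and parameter names are hypothetical.

```python
import itertools
import random


def ranks(values):
    """Average 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / var ** 0.5 if var > 0 else 0.0


def env_distances(env, cols):
    """Euclidean distances between all sample pairs, using only columns in cols."""
    n = len(env)
    return [
        sum((env[i][c] - env[j][c]) ** 2 for c in cols) ** 0.5
        for i in range(n) for j in range(i)
    ]


def best_rho(bio_d, env, max_size):
    """Search every variable subset up to max_size; return (max rho, subset)."""
    best = (-2.0, ())
    for size in range(1, max_size + 1):
        for cols in itertools.combinations(range(len(env[0])), size):
            rho = spearman(bio_d, env_distances(env, cols))
            if rho > best[0]:
                best = (rho, cols)
    return best


def global_best_test(bio_d, env, max_size=2, n_perm=99, seed=0):
    """Permutation p-value for the optimised rho (assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    observed, subset = best_rho(bio_d, env, max_size)
    exceed = 0
    for _ in range(n_perm):
        shuffled = env[:]
        rng.shuffle(shuffled)  # break the sample-to-sample pairing with the biota
        # The whole subset search is repeated, so selection bias enters the null too.
        if best_rho(bio_d, shuffled, max_size)[0] >= observed:
            exceed += 1
    return observed, subset, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    env = [[rng.random() for _ in range(4)] for _ in range(10)]
    bio_d = [rng.random() for _ in range(10 * 9 // 2)]  # unrelated to env
    rho, subset, p = global_best_test(bio_d, env)
    print(f"best rho = {rho:.3f} for variables {subset}, p = {p:.2f}")
```

Because the search is repeated for every permutation, the null distribution consists of optimised ρ values, so the observed optimum is judged against what selection alone can achieve.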

In similar vein, application of hierarchical cluster analysis to a set of a priori unstructured samples of assemblage data yields a dendrogram, whether agglomerative or divisive, in which ever finer distinctions are drawn between groups of samples, ultimately terminating in each sample placed in a different group. Given that a cluster analysis will produce a grouping from data consisting entirely of random numbers, and thus with no meaningful sample structure, the question naturally arises as to what objective basis there is for interpreting particular groups or subgroups displayed by the dendrogram. Again, statistical testing is needed, this time in the form of a series of null hypothesis tests that particular groups displayed in the dendrogram have no meaningful internal structure. Only if such a hypothesis can be rejected is it permissible to interpret a further subdivision of an existing group.
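A test of this kind, in the spirit of the similarity profile (SIMPROF) routine named in the abstract, might be sketched in Python as follows. The details are assumptions made for illustration rather than a statement of the published procedure: Bray-Curtis similarity as the resemblance measure, a null hypothesis simulated by permuting each variable independently across samples, and a test statistic equal to the summed absolute departure of the observed profile from the mean permuted profile. Function and parameter names are hypothetical.

```python
import random


def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity (0 to 100) between two non-negative vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        return 100.0  # assumption: two empty samples are treated as identical
    return 100.0 * (1.0 - sum(abs(x - y) for x, y in zip(a, b)) / total)


def similarity_profile(data):
    """All pairwise similarities among the samples (rows), smallest to largest."""
    n = len(data)
    return sorted(
        bray_curtis_similarity(data[i], data[j])
        for i in range(n) for j in range(i)
    )


def permute_within_variables(data, rng):
    """Shuffle each variable (column) independently across the samples (rows)."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def simprof_test(data, n_mean=200, n_test=199, seed=0):
    """Return (departure statistic, permutation p-value) for one group of samples."""
    rng = random.Random(seed)
    observed = similarity_profile(data)

    # Expected profile under the null: average over one set of permutations.
    mean_profile = [0.0] * len(observed)
    for _ in range(n_mean):
        profile = similarity_profile(permute_within_variables(data, rng))
        for k, value in enumerate(profile):
            mean_profile[k] += value / n_mean

    def departure(profile):
        return sum(abs(s - e) for s, e in zip(profile, mean_profile))

    # Null distribution of the statistic: a second, separate set of permutations.
    pi_obs = departure(observed)
    exceed = sum(
        departure(similarity_profile(permute_within_variables(data, rng))) >= pi_obs
        for _ in range(n_test)
    )
    return pi_obs, (exceed + 1) / (n_test + 1)
```

A small p-value would be read as evidence of structure within the group, so that a further subdivision may be interpreted; a large one suggests stopping at the current group.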

This paper describes such tests, for analysis of any similarity, distance or dissimilarity matrices (generically referred to as 'resemblance' measures, following Legendre and Legendre, 1998). It places these tests in the context of the non-parametric approach to analysing species-by-samples matrices described by Clarke (1993), which has been widely adopted in marine community ecology in particular, largely through availability of the PRIMER package (v6, Clarke and Warwick, 2001, Clarke and Gorley, 2006). A notable early step in the latter was Professor Gray's enthusiastic encouragement of development of these techniques through a series of workshops held under the auspices of the UNESCO/IOC Group of Experts on the Effects of Pollutants (Bayne et al., 1988) and the FAO/UNEP Mediterranean Pollution Programme. The core routines in this approach include non-metric MDS ordination of samples and ANOSIM tests of a priori factors defined on them, together with indirect gradient analyses linking biotic assemblage patterns to 'best' subsets of environmental variables, exhibiting matching sample structure (BEST routine). These routines are based on unconstrained choice of a resemblance matrix appropriate to the data type and question of interest, and the ANOSIM and BEST routines utilise only the rank values of the among-sample resemblances.

Within the existing framework, this paper adds, firstly, a 'global BEST test' which examines whether the highest rank correlation (ρ), obtainable between the biotic similarity matrix and the matching distance matrix from the optimal subset of environmental variables, exceeds values that would be expected by chance under the null hypothesis (of no real biota-environment link). Secondly, a 'similarity profile' (SIMPROF) test is described, in which the biotic similarities from a group of a priori unstructured samples are ordered from smallest to largest, plotted against their rank (the similarity profile), and this profile compared with that expected under a simple null hypothesis of no meaningful structure within that group. Repeated application of this test generates a stopping rule for a posteriori division of the samples into ever smaller subgroups, as in hierarchical cluster analysis. These two analytical strands converge in a third routine, a counterpart to the BEST procedure of matching environmental information to species patterns, which adapts the Multivariate Regression Trees of De'ath (2002) to the non-parametric framework in PRIMER. The LINKTREE procedure is a form of constrained cluster analysis involving a divisive partition of the biotic community samples into ever smaller groups, but in which each division has an 'explanation' in terms of a threshold on one of the environmental variables. As with agglomerative hierarchical clustering, such linkage trees also need stopping rules to avoid random sampling variation among samples from a single assemblage being interpreted as further sub-group structure. These are again provided by a series of similarity profile (SIMPROF) tests.
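A single division of a linkage tree can be illustrated with a short Python sketch. It uses the standard ANOSIM statistic, R = (mean between-group rank − mean within-group rank) / (M/2) with M = n(n−1)/2 sample pairs, and tries a threshold on each environmental variable in turn. Placing candidate thresholds midway between consecutive distinct values is an assumption made here for simplicity, and the function names are hypothetical. Recursion into subgroups and the SIMPROF stopping rule are not shown.

```python
def ranks(values):
    """Average 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out


def anosim_r(d, labels):
    """ANOSIM R for a square dissimilarity matrix d and one group label per sample."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i)]
    rk = ranks([d[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(rk, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(rk, pairs) if labels[i] != labels[j]]
    if not within or not between:
        return None  # R is undefined without both kinds of pair
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    return (mean_between - mean_within) / (len(pairs) / 2)


def best_split(d, env):
    """Find the (R, variable index, threshold) giving the largest ANOSIM R."""
    best = None
    for v in range(len(env[0])):
        values = sorted(set(row[v] for row in env))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2  # assumed: midpoint between distinct values
            labels = [row[v] > threshold for row in env]
            r = anosim_r(d, labels)
            if r is not None and (best is None or r > best[0]):
                best = (r, v, threshold)
    return best


if __name__ == "__main__":
    # Toy data: biotic dissimilarities from a single made-up gradient,
    # and two made-up environmental variables per sample.
    bio = [0, 1, 2, 10, 11, 12]
    d = [[abs(a - b) for b in bio] for a in bio]
    env = [[5, 1], [1, 2], [4, 3], [2, 7], [6, 8], [3, 9]]
    print(best_split(d, env))
```

Each division found this way comes with an inequality on one abiotic variable, which is the sense in which the tree 'explains' the biotic split.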

It should be borne in mind throughout that, though the above outline and the examples of this paper are couched in terms of tests on species assemblages and their relation to environmental variables, nothing in the formulation of the methods restricts their use to this context. The SIMPROF test will provide stopping rules for any a posteriori subdivision of a group of samples, based on multiple variables of taxa, physical environment, chemical water-quality, measures of diversity, biomarkers, distributions of particle sizes, etc. The global BEST test, rather than matching subsets of environmental variables to a fixed pattern of resemblances for whole communities, can be applied to testing whether subsets of species show a sample pattern that significantly matches that of a fixed environmental gradient. Similarly, an optimal subset of biomarkers (or other metrics) can be tested for its match to an observed chemical gradient, or to manipulated levels of contaminants; patterns in subsets of one group of biotic variables can be tested for their ability to 'explain' patterns in another set (corals structuring assemblages of reef-fish, infaunal macrobenthos structuring meiobenthic communities, etc.); and many other 'linkage' problems could be formulated and tested in this way.

Global BEST test

The idea behind this test is outlined in the schematic diagram of Fig. 1. This shows, in normal typeface, the routine for linking of biota to environment (Bio-Env) described by Clarke and Ainsworth (1993). The data consist of two matrices (left-hand side), both referring to the same set of n samples (locations/times/treatments, or whatever context determines the sampling programme). For the biotic variables (top row), a triangular matrix of resemblances between samples is calculated for

Results and specific discussion

The analyses are not presented in the contexts and under the hypotheses of the original studies. It is not the purpose of this methodological paper to discuss and interpret the specific data in any detail. They are used merely to illustrate the techniques with realistic examples of possible outcomes. The associated discussion, similarly, focuses on general caveats and corollaries of the three methods and their inter-relationships.

Global BEST test

In an indirect way, the BEST routine is trying to solve the same problem for multivariate community data as standard multiple regression does for single response variables (though more direct multivariate analogies of multiple regression are given by the dbRDA and DISTLM procedures of Legendre and Anderson, 1999, and McArdle and Anderson, 2001, since these employ explicit linear models). In multiple linear regression, 'all subset' regressions and basic stepwise selection methods (Efroymson, 1960

Acknowledgments

We thank the referees and the guest editor (RMW) for their most helpful and positive comments. This work is a contribution to the biodiversity component of the Plymouth Marine Laboratory's core strategic research programme. It was supported by the UK Natural Environment Research Council (NERC) and the UK Department for Environment, Food and Rural Affairs (DEFRA) through the AMBLE project ME3109. KRC acknowledges his position as honorary fellow of the Plymouth Marine Laboratory and of the Marine

References (40)

  • Clarke, K.R. et al., 2001. Change in marine communities: an approach to statistical analysis and interpretation.
  • Clarke, K.R. et al.
  • Clarke, K.R. et al., 1993. An index showing breakdown of seriation, related to disturbance, in a coral-reef assemblage. Mar. Ecol. Prog. Ser.
  • Clarke, K.R. et al., 2006. Dispersion-based weighting of species counts in assemblage analyses. Mar. Ecol. Prog. Ser.
  • Collins, N.R. et al., 1982. Zooplankton communities in the Bristol Channel and Severn Estuary. Mar. Ecol. Prog. Ser.
  • Copas, J.B., 1983. Regression, prediction and shrinkage. J. Roy. Statist. Soc. B
  • Danielidis, D.B., 1991. A systematic and ecological study of diatoms in the lagoons of Messolongi, Aitoliko and...
  • De'ath, G., 2002. Multivariate regression trees: a new technique for modeling species environment relationships. Ecology
  • Draper, N. et al., 1981. Applied Regression Analysis.
  • Efroymson, M.A. Multiple regression analysis.
