Methods of investigating forecast error sensitivity to ensemble size in a limited-area convection-permitting ensemble

Ensemble-based predictions are increasingly used as an aid to weather forecasting and to data assimilation, where the aim is to capture the range of possible outcomes consistent with the underlying uncertainties. Constraints on computing resources mean that ensembles have a relatively small size, which can lead to an incomplete range of possible outcomes and to inherent sampling errors. This paper discusses how an existing ensemble can be increased in size relatively easily, develops a range of standard and extended diagnostics to help determine whether a given ensemble is 'large enough' to be useful for forecasting and data assimilation purposes, and applies the diagnostics to a convective-scale case study for illustration. Diagnostics include the effect of ensemble size on various aspects of rainfall forecasts, kinetic energy spectra, and (co-)variance statistics in the spatial and spectral domains. The work here extends the Met Office's 24 ensemble members to 93. It is found that the extra members do develop a significant degree of linear independence, that they increase the ensemble spread (although with caveats to do with non-Gaussianity), that they reduce sampling error in many statistical quantities (namely variances, correlations, and length-scales), and that they improve the effective spatial resolution of the ensemble. The extra members, though, do not improve the probabilistic rain-rate forecasts. It is assumed that the 93-member ensemble approximates the error-free statistics, which is a practical assumption, but the data suggest that this number of members is ultimately not enough to justify the assumption; more members are therefore likely required for such convective-scale systems to further reduce sampling errors, especially for ensemble data assimilation purposes.

and 128 members of different resolutions over 14 case studies. They showed that although increases in both ensemble size and resolution are beneficial, larger ensembles achieve better skill. In particular, they found better probability predictions of temperature and precipitation, as measured by the Brier score. Higher-resolution forecasts did increase the ensemble spread, but insufficiently. Later, Mullen and Buizza (2002) used the ECMWF's EPS to show that increasing only the resolution does provide clear benefits in precipitation forecast skill for lighter rain, but increasing the number of members provides better value for forecasts of heavier rain (as estimated from a simple cost-loss model). More recently, Bonavita et al. (2011) used the ECMWF's ensemble of data assimilations (EDA) to show that the sample forecast error standard deviations of the vorticity fields calculated with 10- and 50-member ensembles are highly correlated (between 80% and 95%), despite the ensembles having larger sampling noise at smaller scales. In order to reduce sampling noise, forecast error variances need to be filtered, which was done in spectral space using a filter with a truncation wavenumber estimated from climatological statistics. Forecast error standard deviations estimated with larger ensembles tend to need less filtering and therefore tend to have a higher effective spatial resolution, leading to better forecast scores.

Sensitivity to ensemble size in convective-scale systems
The conclusions of the above studies do not necessarily apply to EPSs using a convection-permitting ensemble, given the highly varying nature of predictability at convective scales and the potential deviation from Gaussianity. For these reasons, work has also been done to study the effect of changing the number of ensemble members in convection-permitting model forecasts, and some important studies are mentioned below.
Tong and Xue (2005) performed a synthetic observation study, assimilating perfectly modelled synthetic radar Doppler and reflectivity observations with an EnKF into a 2 km grid-length configuration of the ARPS (Advanced Regional Prediction System) model. Although they used 100 members, they commented that as few as 40 members were enough to produce good analyses. Clark et al. (2011) used a 4 km grid-length configuration of WRF over the central United States. They took between 1 and 17 members over multiple cases to show that the area under the ROC curve for 6-h accumulated precipitation at various thresholds increased with increasing ensemble size for all considered scales. They found that there was little impact on the ROC area by increasing the ensemble size above 9 members, although they postulated that at least 60 members would be needed to bring sampling error and under-dispersiveness down to acceptable values. They also argued that more members are required for skillful forecasts of rare events or in low-predictability regimes. Bouallegue et al. (2013) found an improvement in the reliability and resolution of precipitation ensemble forecasts by increasing the number of members from 20 to 60 in a 2.8 km grid-length version of the COSMO-DE model. Ménétrier et al. (2014) discussed the characteristics of the forecast error variances and correlation length-scales for small (6-member) and large (84-member) ensembles of forecasts using the convection-permitting AROME model (2.5 km grid-length) over France, and focused on a case study characterized by strong convection. They investigated the effects of sampling errors in the small and large ensembles by comparing the forecast error standard deviations of near-surface specific humidity derived from each. The small ensemble showed larger variability at small scales in particular, due to larger sampling noise.
Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-260. Manuscript under review for journal Geosci. Model Dev. Discussion started: 20 November 2017. © Author(s) 2017. CC BY 4.0 License.

They calculated anomaly correlations between variance maps computed from the two ensembles for a range of quantities and found values between ∼60% and 80%. They also showed that the correlation length-scales for a range of quantities are, on average, shortened as the number of ensemble members is increased (which is consistent with the need for less localisation for larger ensembles). In the case (at least) of specific humidity fields, though, their study found that this effect is dominated by a decrease in the number of instances of long correlation length-scales in their large ensemble compared to their small ensemble. The number of instances of small correlation length-scales in their large ensemble was actually found to increase (their Fig. 13), thus complicating the effect that sampling error has on the length-scales. A further study considered ensemble sizes of 5, 10, 20, 30, 40, and 50 members in a 3 km grid-length version of WRF over the central United States.
They found that there were sometimes clear improvements in reliability, fractions skill score, and area under the ROC curve for precipitation forecasts. This was especially true for low precipitation thresholds, higher probability events, and longer forecasts.
Their conclusion, though, was that as few as 20 to 30 members showed similar skill to the full 50 members in many respects. Harnisch and Keil (2015) found increases in the Brier skill score for precipitation forecasts, and decreases in the CRPS for 10 m winds and in the average forecast-minus-observation departures of various quantities, by increasing the number of members from 20 to 60 in a 2.8 km grid-length version of the COSMO-DE model. These benefits were found to be significantly larger than those from various ensemble inflation methods or from the introduction of a model error scheme.
The overall conclusion from these studies is that the ensembles give improved forecast skill as the ensemble size is increased, but a judgment of the number of ensemble members required to achieve a particular goal depends on the model and on the application. Running high-resolution forecasts operationally remains an expensive activity, and so any studies indicating the degree of sensitivity of a range of diagnostics to ensemble size are valuable, especially when applied to models and outputs that have not been studied in this way, such as convective-scale model forecasts of rainfall rate, as is done in this paper. Throughout this work the authors had resources to generate a large ensemble for only a single convective-scale case study. This paper is intended to do four things: (i) highlight some issues around sampling error in convective-scale systems, (ii) document a means of generating a larger ensemble from an existing small ensemble, (iii) develop a number of potentially informative diagnostics, and (iv) test the diagnostics for the large ensemble. Due to the availability of only one case study, the results are not intended to provide a definitive answer to the number of ensemble members required in the system studied; to do this, more members and more realisations would be required. It is hoped, though, that the methodology, the choice of diagnostics, and how they are interpreted will be useful to researchers and developers who do have the resources to use the tools to their full potential, especially those who are planning new or extended operational EPSs.

Aims and scope of this paper
A central assumption made in this work is that the large ensemble is sufficient to neglect sampling errors. Sub-sampling from this large ensemble is then done to see how smaller ensembles (of varying size, now assumed to have sampling errors) reproduce aspects of the full ensemble. The methods are illustrated with a case study with an (up to) 93-member convection-permitting forecast ensemble based on a 1.5 km grid-length version of the Met Office's Met UM over the southern UK. We do acknowledge that 93 members is not sufficiently large to neglect sampling errors and, as stated above, that a single case study is not sufficient to reach a definitive conclusion, but it is beyond the scope of our project resources to generate more members or to study further case studies. The case study is meteorologically interesting, though; it is characterized by multiple rain bands generated by a cold front passing over the model's domain. This paper attempts to use and develop methods to help answer the following kinds of questions that arise in ensemble studies.
-How can linearly independent extra members be generated from an existing ensemble?
-How can the ensemble size impact probabilistic forecasts of rainfall?
-How can the ensemble size affect how the kinetic energy spectrum is resolved?
-How can the ensemble size affect estimates of (co-)variability of thermodynamic and moisture fields?
-Is the ensemble used in a particular application large enough to neglect sampling error?
The structure of this paper is as follows. Section 2 is a description of the case study, Sect. 3 explains how the 93-member ensemble is created from the operational 24-member ensemble and examines how linearly independent the extra members are. The remaining sections describe how sensitive various diagnostics are to ensemble size. Section 4 looks at ensemble means and spreads, Sect. 5 looks at probabilistic aspects, Sect. 6 looks at kinetic energy spectra, and Sect. 7 looks at aspects of the forecast error (co)variances. Finally, Sect. 8 discusses the main conclusions.

2 The 20th September 2011 case study
This case study is of interest to the DIAMET project. It comprises multiple cloud bands (labelled "1", "2", and "3" in Fig. 1) over the southern UK, and it was intensively observed with a field campaign (Vaughan et al., 2015). The case is also studied by Baker et al. (2014) using the 24-member Ensemble Transform Kalman Filter (ETKF) ensemble mentioned in Sect. 3.4. The case is characterised by an eastward-moving cold front over the southern UK, as shown in Fig. 3 of Baker et al. (2014). Rainfall maps constructed from the UK's network of radar instruments are shown in Fig. 1.

3 Generating more members from an existing ensemble

3.1 The meteorological model and the existing ensemble prediction system

The model used for this study is a 1.5 km grid-length version of the Met Office's Unified Model for a region over the southern UK (SUK-1.5). This model is no longer operational, but was used during the London 2012 Olympics (Golding et al., 2014; Ballard et al., 2016). The model's initial conditions (ICs) and lateral boundary conditions (LBCs) were produced using a set of nested forecasting systems. The starting point was the Met Office's global model, which produced the ICs and LBCs for an intermediate-resolution regional model (the North Atlantic and Europe [NAE] domain), which in turn produced ICs and LBCs for the convective-scale SUK-1.5 model.
The existing EPS used for this study was the 24-member MOGREPS-G system (the global configuration of the Met Office Global and Regional Ensemble Prediction System (Bowler et al., 2008)). This was run with the global model, whose ensemble members were updated on a 6-h cycling timescale using an ensemble transform Kalman filter (ETKF) for the perturbations and a 4DVar analysis for the control, about which the analysis perturbations were centred (Caron, 2013). The standard 24-member MOGREPS-G system comprised 23 perturbations and one control member, here denoted (23+1).

4 SUK-1.5 was the designation of this model in Bannister et al. (2011), although the model is also known as the Nowcasting Demonstration Project (NDP) by the Met Office.

3.2 Generating the 93 convective-scale ensemble members
Members of a forecast ensemble may be constructed in different ways. Perhaps the most justifiable method is to generate an analysis ensemble using an Ensemble Kalman Filter (EnKF) or an ETKF (Ehrendorfer, 2007; Houtekamer and Zhang, 2016), and then to propagate this ensemble forward in time. Provided that the EnKF is set up in an optimal fashion, that the ensemble is large enough to neglect sampling errors, and that errors are close to Gaussian, such an ensemble should have the correct spread and contain the correct correlations between variables, which reflect the true uncertainty of the system given the available data. This method relies on the existence of a forecast ensemble in the first place, so it cannot be used directly to generate an ensemble from scratch.

Some methods are more suitable for generating an ensemble from scratch. The breeding method (Toth and Kalnay, 1997; Corazza et al., 2003; Wang and Bishop, 2003) starts with random perturbations to a state. It then uses the numerical model to propagate the perturbed and unperturbed states over a defined time interval, and then scales down the propagated perturbations to a specified size. This process is repeated until the states (which become the ensemble members) have lost memory of the initial random perturbations. The states generated are called bred vectors and recognise the non-linearity of the system. The singular vector method (Buizza and Palmer, 1995), on the other hand, finds the perturbations to a linearised system that grow the fastest over a defined time interval. Each perturbation is constrained to have a specified covariance amongst its elements at the start time (typically the analysis covariance is used, in which case the states are called Hessian singular vectors). Singular vectors, though, do not account for non-linearity of the model.
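As a concrete illustration, the breeding cycle described above can be sketched in a few lines. This is a toy sketch, not any operational implementation: the model here is an arbitrary non-linear map standing in for an NWP forecast, and the function names (`model`, `breed`), the amplitude target, and the number of cycles are our own assumptions.

```python
import numpy as np

def model(x, steps=10):
    """Toy non-linear propagator standing in for an NWP forecast
    (an assumption; any function with this signature would do)."""
    for _ in range(steps):
        x = x + 0.05 * np.sin(3.0 * x) - 0.05 * x**3
    return x

def breed(x0, n_members, amplitude=0.1, cycles=20, rng=None):
    """Breeding: perturb, propagate perturbed and unperturbed states,
    rescale each grown perturbation to a fixed size, and repeat until
    the perturbations lose memory of the initial random noise."""
    rng = np.random.default_rng(rng)
    perts = rng.standard_normal((n_members, x0.size)) * amplitude
    for _ in range(cycles):
        base = model(x0.copy())                       # unperturbed forecast
        for i in range(n_members):
            grown = model(x0 + perts[i]) - base       # propagated perturbation
            norm = np.linalg.norm(grown)
            if norm > 0:
                # rescale the bred vector back to the specified amplitude
                perts[i] = grown * (amplitude * np.sqrt(x0.size) / norm)
        x0 = base
    return x0 + perts                                 # bred ensemble members
```

After enough cycles the returned perturbations are bred vectors: directions favoured by the (non-linear) dynamics, at a controlled amplitude.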
The system simulation approach (Houtekamer et al., 1996) builds ensembles by performing multiple forecast and DA cycles for a set of initially random states, while attempting to vary all known uncertainties. Each state is fed through these cycles with perturbed observations, perturbed aspects of the model (e.g. the parametrisation schemes), and perturbed ancillary fields (e.g. sea surface temperatures). This is repeated until the ensemble statistics become stable. The ECMWF uses a similar approach to maintain its ensemble system, e.g. Bonavita et al. (2012).
The initial members may be sampled from the climatological background error covariance matrix (as in e.g. Raynaud and Bouttier (2016)). Alternatively, a set of ensemble members that sample a climatological PDF may be taken from one long (e.g. multi-year) model run. In this method, the ensemble is built by extracting states over a specific season of this run. This is the approach used by Miyoshi et al. (2014) for their very large ensemble.

In this study we already have a 24-member ensemble, and we develop the ability to generate additional members. The extra members are generated by negating existing perturbations. This method is conceptually simple, but is complicated by the nesting and sub-nesting of the various models involved. Three separate modelling systems are involved: the global model, the NAE model, and the SUK model. The procedure is illustrated in Fig. 2, and the upper-case labels (A), (B), etc. in the following refer to this chart. In Fig. 2 the red arrows indicate that DA is performed (ETKF with re-centring on the Var control state), the blue down-arrows labelled "x2" indicate a doubling of the number of perturbations by negation, and the green down-arrows indicate downscaling of the members to a finer grid.
The creation of the extra perturbations by negating the existing ones (as in steps 1(a) and 1(c)) can be continued to generate exponentially more ensemble members as required, and as resources allow.
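The negation step itself is trivial to implement. The sketch below (the function name `double_by_negation` and the array layout are our assumptions) doubles a set of members by reflecting each perturbation about the control, x_new = 2 x_control − x. Note that in practice the doubling is interleaved with forecast propagation: negating the same set twice in a row would merely reproduce the original members, so new directions must develop between doublings.

```python
import numpy as np

def double_by_negation(control, members):
    """Double an ensemble by negating each member's perturbation about
    the control state: x_new = control - (x - control) = 2*control - x.
    `control`: (state_dim,); `members`: (n_members, state_dim)."""
    members = np.asarray(members)
    negated = 2.0 * control - members
    return np.concatenate([members, negated], axis=0)
```

Each new member is the mirror image of an existing one, so the ensemble mean of the perturbations about the control is exactly zero after doubling.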

3.3 Linear independence of the extra members
The extra perturbations are not immediately linearly independent, but it is reasonable to assume that independence will develop during the run of the non-linear models, and that this will happen adequately during the 36 hours from the start of the (92+1)-member MOGREPS-G run to the start of their downscaling to the NAE, and thereafter to the SUK model. Gilmour et al. (2001) showed that similar positive/negative perturbations in global models usually take no more than 48 hours (and often less than 24 hours) to show significant non-linearity. This is expected to happen much more quickly in a convective-scale model.

Given a set of ensemble perturbations at a particular time, the degree of independence can be checked by attempting to construct a set of orthogonal vectors. Considering each perturbation in turn, components of perturbations already considered are systematically subtracted. The size of the residual is then a measure of the linear independence of the perturbation considered with respect to the previous perturbations. Let x_i be the perturbation of member i (with respect to the control member), normalised to unit length, and let x̂_i be the ortho-normalised vector satisfying x̂_i^T x̂_j = δ_ij. x̂_i can be constructed from the Gram-Schmidt procedure:

x̂_i = (1/N̂_i) (x_i − Σ_{j<i} α_ij x̂_j),   (1)

where N̂_i = |x_i − Σ_{j<i} α_ij x̂_j| is for normalisation, and α_ij = x_i^T x̂_j is chosen for orthogonality. If N̂_i is found to be unity then the perturbation x_i is already perfectly orthogonal (linearly independent) to the previous perturbations, and if N̂_i is found to be zero then it does not contain a new direction in phase space (see footnote 5). Figure 3 plots the value of N̂_i as a function of ensemble perturbation member number and time of day, considering separately a single level of (a) temperature and a single level of (b) specific humidity.
Even though the x_i should represent the whole state, rather than specific variables and levels, we assume that it is sufficient to look at this issue in relevant subspaces (a single variable/level has a subspace size of 360 × 288 = 103680). In Fig. 3 the first perturbation at the bottom of each column of bars is always unity by construction, and so acts to set the scale. Although N̂_i broadly decreases with perturbation number for the temperature and specific humidity subspaces, the values remain significant, and no perturbation has N̂_i ≈ 0. This is a demonstration that each respective member has developed some unique information to add to the ensemble as a whole.
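A minimal implementation of this independence check might look as follows. It assumes, as in Eq. (1), that each perturbation is first normalised to unit length so that N̂_i lies in [0, 1]; modified Gram-Schmidt is used for numerical stability, and the function name is ours.

```python
import numpy as np

def independence_norms(perts):
    """For each perturbation in turn, subtract its components along the
    previously orthonormalised perturbations and return the residual
    norm N_i.  N_i = 1 -> fully independent of its predecessors;
    N_i = 0 -> it adds no new direction in phase space."""
    basis = []
    norms = []
    for x in perts:
        x = x / np.linalg.norm(x)      # normalise so that N_i is in [0, 1]
        r = x.copy()
        for q in basis:
            r -= (r @ q) * q           # modified Gram-Schmidt projection
        n = np.linalg.norm(r)
        norms.append(n)
        if n > 1e-12:
            basis.append(r / n)        # extend the orthonormal basis
    return np.array(norms)
```

The first perturbation always gives N_1 = 1 by construction, matching the scale-setting bars in Fig. 3, and a duplicated perturbation gives a residual of zero.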

3.4 The ensembles used in this paper
Although the discussion in Sect. 3.2 emphasises the large ensemble, a number of ensembles are used in this paper. These are summarised as follows:
-The 'large ensemble', N = 93, uses all (92+1) members.

5 Strictly, the orthogonality would be better defined in a norm that respects the PDF of forecast errors, but we use the Euclidean norm for simplicity. The degree of linear independence can differ significantly between these norms if the covariance of the forecast error PDF has a high condition number.
-The 'small ensemble', N = 24, has (23+1) members. These are sub-sampled from the large ensemble in the same way as for the intermediate ensemble. There are ∼3 × 10^21 unique sub-samples, but we again take either one or 1000 at random.
-The 'ETKF ensemble', N = 24, is a single ensemble comprising the (23+1) member ensemble used in a previous study, where a convective-scale ETKF is used to update the state over an hourly cycle (Baker et al., 2014).
In the case of the sub-sampled ensembles, whether just one sub-sample is taken or 1000 depends on the context, and the control member is always included in sub-samples. How many sub-samples are used is specified in each figure caption.

The ensemble-mean rain fields are broader and less intense than the individual members. This is due to the averaging of members at slightly different locations, and results in a mean field that is not very useful (Ancell, 2013; Hollan and Ancell, 2015). There is arguably marginally more noise in the ensemble means for smaller ensemble sizes. This indicates qualitatively that going from (23+1) to (92+1) members does not achieve a significant improvement in the ensemble mean precipitation forecasts for this case. This is consistent with the lack of sensitivity of ensemble mean forecast skill to ensemble size experienced by , and also discussed in Leith (1974).
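The sub-sampling procedure can be sketched as below (the function name and array layout are our assumptions): the control member is always retained, and the remaining members of each sub-sample are drawn without replacement from the rest of the large ensemble.

```python
import numpy as np

def subsample(ensemble, n, control_index=0, n_samples=1000, rng=None):
    """Draw random sub-ensembles of size n from a large ensemble,
    always including the control member.
    `ensemble`: (n_total, state_dim); returns (n_samples, n, state_dim)."""
    rng = np.random.default_rng(rng)
    candidates = np.delete(np.arange(ensemble.shape[0]), control_index)
    samples = []
    for _ in range(n_samples):
        picked = rng.choice(candidates, size=n - 1, replace=False)
        idx = np.concatenate(([control_index], picked))
        samples.append(ensemble[idx])
    return np.stack(samples)
```

For n = 24 drawn from 93 members with the control fixed, the number of distinct sub-samples is C(92, 23) ≈ 3 × 10^21, consistent with the count quoted in Sect. 3.4; taking 1000 of them gives a practical estimate of the sub-sampling variability.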
These results are also compared with the ensemble mean rainfall from the (23+1)-member ensemble of Baker et al. (2014), Fig. 4i. This is the ETKF ensemble (row F of Fig. 2), and is derived from a shorter forecast (T+3) initialised from an ETKF and 3DVar at 12Z. This ETKF ensemble also does not show two rain bands (see Sect. 4 of Baker et al. (2014)), and the band that is forecast is fairly comparable to that in the large ensemble, albeit with more intense rain and better alignment with the observed rain band.
4.2 Impact on the relative standard deviation

Figure 5 shows the relative standard deviation (the ratio of the sample standard deviation to the sample mean, where the mean rain rate is greater than 0.01 mm/h) for each of the ensembles. Each map shows the largest values close to the edge of its respective rain band in Fig. 4, and is well aligned with each band's main axis. This shows that the main rain band tends to have approximately the same orientation but slightly different locations amongst the members of all ensembles. The relative standard deviation for the ETKF ensemble shows more small-scale variability than the others, and we do not know whether this is a consequence of the application of the ETKF at the convective scale, or just the fact that this is a shorter forecast. The large ensemble presents larger values and a larger width of the standard deviation strip, which is a consequence of the wider variability of the centres of the rain bands amongst its ensemble members. The qualitative similarity of the maps for the extended ensembles and the small ensemble in Figs. 4 and 5 suggests that the method used to generate extra members described in Sect. 3.2 produces a physically reasonable ensemble.
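The relative standard deviation maps of Fig. 5 can be computed along these lines (a sketch under the assumption that the rain-rate fields are stacked member-first; the function name is ours):

```python
import numpy as np

def relative_std(rain, threshold=0.01):
    """Relative standard deviation of rain rate: sample standard deviation
    divided by the ensemble mean, defined only where the ensemble-mean
    rain rate exceeds `threshold` (mm/h, as in the text).
    `rain`: (n_members, ny, nx); returns (ny, nx) with NaN where undefined."""
    mean = rain.mean(axis=0)
    std = rain.std(axis=0, ddof=1)      # unbiased sample standard deviation
    out = np.full(mean.shape, np.nan)
    wet = mean > threshold
    out[wet] = std[wet] / mean[wet]
    return out
```

Masking by the ensemble-mean threshold avoids dividing by (near-)zero means in dry regions, which would otherwise dominate the map with meaningless large ratios.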
To assess how the ensembles differ in capturing higher moments of the 'true' rainfall rate distribution, Fig. 6 shows the distributions of rain rate for the large and small ensembles as a function of lead time. The two ensembles do a similar job in representing some aspects of the non-Gaussian rainfall distribution, with the range of the small ensemble being smaller than that of the large ensemble, as expected (note the logarithmic scale of the y-axis). Although we would expect the small ensemble to have a smaller spread than the large ensemble (as found e.g. in Fig. 5), the interquartile ranges are larger in the small ensemble for five of the ten times shown, namely at 2, 4, 6, 7, and 8-hour lead times (note that for a similar interquartile range of the large and small ensembles, the dashed lines marking the edges of each small ensemble's interquartile range should lie at the centre of the thick red lines marking the edges of the large ensemble's range; note also the logarithmic scale). The 8-hour lead time corresponds to 15Z, which shows an increase in variance with more members in Fig. 5. This is an example of a case where the variance is not a complete way of describing a non-Gaussian distribution. The variability in the mean between the large and small ensembles is insignificant compared to the spreads.

[Caption of Fig. 5 (partial): ... small ensemble (c), and the ETKF ensemble (d). The relative standard deviation rain rate is defined as the ratio of the sample standard deviation rain rate to the ensemble mean rain rate (defined where the mean rain rate is greater than 0.01 mm/h). The intermediate and small ensembles each represent a single sample from the large ensemble.]

4.3 Verification of the rainfall spread
The rainfall forecast ensemble is now verified against the radar observations with rank histograms. The ensemble of rain rate forecasts (valid on 20th September 2011 at 15Z) provides the ranks at each point in the model's domain, which are populated by radar measurements that are above a specified threshold rain rate. The ranks are determined at each point according to the sorted ensemble of rain rates, where each member is modified by a value drawn from the observation error distribution (Hamill, 2001).
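A sketch of this rank-histogram construction is given below. It assumes gridded forecasts flattened to points, and uses the Gaussian observation-error standard deviation of 0.316 mm/h quoted in the text; the function name and argument layout are our own.

```python
import numpy as np

def rank_histogram(forecasts, obs, obs_sd=0.316, min_obs=0.0, rng=None):
    """Rank histogram (Talagrand diagram) in the spirit of Hamill (2001):
    each ensemble value is perturbed with observation-error noise before
    the observation is ranked within the ensemble.
    `forecasts`: (n_members, n_points); `obs`: (n_points,)."""
    rng = np.random.default_rng(rng)
    keep = obs > min_obs if min_obs > 0 else np.ones_like(obs, bool)
    fc = forecasts[:, keep]
    ob = obs[keep]
    noisy = fc + rng.normal(0.0, obs_sd, fc.shape)   # perturb members
    ranks = (noisy < ob).sum(axis=0)                 # rank of obs in ensemble
    n_members = forecasts.shape[0]
    return np.bincount(ranks, minlength=n_members + 1)
```

A flat histogram indicates a reliable spread; a U-shape indicates under-spread, a dome over-spread, and a monotone slope a bias, which is how the panels of Fig. 7 are read in the text.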

This distribution is assumed to be Gaussian with a standard deviation of 0.316 mm/h, as used in Migliorini et al. (2011). Focusing first on the large ensemble, the histogram for the 0.0 mm/h threshold (panel a) suggests that the ensemble is over-spread, and with a positive bias (the ensemble tends to over-estimate the observed rainfall). The picture changes after including only grid-points that have more than 0.02 mm/h rain rate (b). This histogram suggests that the ensemble is still over-spread, but to a much lesser extent than in (a). The histogram found by including only grid-points that have more than 0.2 mm/h rain rate (c) is similar to (b), but with evidence of a negative bias in rain rate. These mixed results mean that the ensemble exhibits dispersion and bias characteristics that depend on the rain-rate threshold considered.

It is well known that an ensemble of forecasts provides a means of estimating the probability of certain events happening according to region, and an available ensemble that has many more members than the operational ensemble is a resource with which to study whether probabilistic diagnostics can be improved by increasing the number of members. Figure 8 shows the probability of rain for the large and small ensembles for three rain rate thresholds: 0.02, 0.2, and 2 mm/h.

The regions of high probability do not change considerably between the large and small ensembles for all thresholds, although they do of course shrink with increasing threshold. The main differences between the large and small ensembles are that the probability maps for the large ensemble are smoother and have slightly smaller values than for the small ensemble (a similar conclusion follows for other thresholds and lead times, not shown). As with previous results, these effects can be attributed in principle to the greater variation in positioning of the rainfall between members of the large ensemble than the small ensemble, rather than fundamentally different behaviour of the extra members.
Comparing these probability plots to Fig. 1c allows us to assess whether the ensemble predicts a possibility of rain at all areas where there was rain measured by the radar. This is broadly true at the position of band 1 (but only in patches for the 2 mm/h threshold), but there is a low probability of rain at the position of rain band 2 (∼20% chance of rain at the 0.02 mm/h threshold, and virtually zero chance of rain at the 2 mm/h threshold). It is interesting to note that even though the probability maps of any rain (0 mm/h threshold, not shown) are virtually indistinguishable from the 0.02 mm/h maps in panels (a) and (b), the rank histograms for 0 mm/h and 0.02 mm/h in Fig. 7 are strikingly different.
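The probability maps of Fig. 8 are simply exceedance frequencies across members; a minimal sketch (function name and array layout are ours) is:

```python
import numpy as np

def rain_probability(rain, thresholds=(0.02, 0.2, 2.0)):
    """Probability of the rain rate exceeding each threshold, estimated
    as the fraction of members exceeding it at every grid point.
    `rain`: (n_members, ny, nx); returns {threshold: (ny, nx) map}."""
    return {t: (rain > t).mean(axis=0) for t in thresholds}
```

With more members the fractions are quantised more finely (steps of 1/93 rather than 1/24), which is one reason the large-ensemble maps look smoother than the small-ensemble ones.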

6 Sensitivity of dynamical aspects to ensemble size
It is important to check how this ensemble provides a dynamical description of the flow. An effective way to investigate this issue is to look at the kinetic energy (KE) spectrum as represented by the large and small ensemble systems. This is done using the following procedure (broadly following Skamarock (2004), and omitting field position and time arguments for brevity).
1. Compute the weighted wind components (for each position, time, and ensemble member) as U_i = √(ρ_i Δ_i) u_i, V_i = √(ρ_i Δ_i) v_i, and W_i = √(ρ_i Δ_i) w_i, where u_i, v_i, and w_i are respectively the zonal, meridional, and vertical wind components, ρ_i is the density, and Δ_i is the grid-box thickness ((U_i² + V_i² + W_i²)/2 is then proportional to the KE of each grid box).

2. Modify each of U_i, V_i, and W_i to remove the linear trend in the W-E direction. This is done separately for each ensemble member, time, vertical level, and latitude row, and is done to make the field periodic along the horizontal grid lines in preparation for step 3 (Errico, 1985).
3. Perform a discrete Fourier transform of U_i, V_i, and W_i along the longitudinal direction only, for each ensemble member i. This produces fields Ū_i, V̄_i, and W̄_i, which are each a function of longitudinal wavenumber k, latitude, vertical level, and time.

4. Compute the KE spectrum for each member, Ē_i(k) = (|Ū_i|² + |V̄_i|² + |W̄_i|²)/2, at each latitude, vertical level, and time.

5. Average Ē_i over the domain in latitude, over the free troposphere (between levels 30 and 51, i.e. between ∼3 and ∼9 km above sea level), in time (between 12 and 17Z), and over ensemble members. This results in a KE power spectrum, Ē^(N)(k), that is a function of k only, where N indicates the number of ensemble members. Furthermore, the standard deviation, σ̄^(N)_E(k), amongst the members is also computed.
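The steps above can be sketched for a single member and a single level as follows. This is a simplified illustration, not the authors' code: 2-D arrays, an FFT normalised by 1/nx, and the helper name are our assumptions.

```python
import numpy as np

def ke_spectrum(u, v, w, rho, dz):
    """KE power spectrum along the W-E direction for one member/level:
    (1) weight winds by sqrt(rho*dz), (2) remove the linear W-E trend
    per latitude row for periodicity (Errico, 1985), (3) 1-D FFT along
    longitude, then average over rows.  Arrays: (ny, nx); returns E(k)."""
    weight = np.sqrt(rho * dz)
    spec = np.zeros(u.shape[1] // 2 + 1)
    x = np.arange(u.shape[1])
    for comp in (u, v, w):
        f = weight * comp
        for j in range(f.shape[0]):
            a, b = np.polyfit(x, f[j], 1)        # linear trend of this row
            f[j] = f[j] - (a * x + b)            # detrend -> periodic field
        coeffs = np.fft.rfft(f, axis=1) / f.shape[1]
        spec += 0.5 * (np.abs(coeffs) ** 2).mean(axis=0)
    return spec
```

The full diagnostic would apply this per member, level, and time, and then average (step 5); trimming 10 grid-points from the lateral boundaries, as in the text, amounts to slicing the arrays before calling the function.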
Note that points within 10 grid-points of the lateral boundaries were omitted in all calculations. The resulting spectrum for the large ensemble, Ē^(93)(k), is shown in Fig. 9 (red dots). This shows a characteristic upturned tail at high wavenumbers, typical of spectra from limited-area models (Skamarock, 2004). The variability of the spectrum, σ̄^(N)_E(k) (red error bars), is also shown. From this figure, three main conclusions may be drawn.
1. The slope of the spectrum provides information about the type of turbulence. The slope is much closer to that of k^-3 (characteristic of 2D turbulence, as expected at larger scales in the mid-latitude atmosphere) than that of k^-5/3 (characteristic of 3D turbulence, as expected at convective scales).
2. The effective spatial resolution of the ensemble, here defined as the wavelength at which the slope of Ē^(N)(k) (in the log/log plot) starts to deviate from the best-fit slope at larger wavelengths (Skamarock, 2004) (by eye, taken to be that of k^-3), is ∼8 km (∼5-6 grid-lengths). This is also the scale below which the uncertainty of the KE grows considerably. Scales smaller than 8 km are referred to as the unresolved scales.
3. Ē^(N)(k) has relatively little variability except at the unresolved scales.
The absence of a k^-5/3 spectrum seems puzzling, as the small-scale flow is not expected to be 2D stratified like mid-latitude large-scale flow. These results, though, do appear to be consistent with Laprise et al. (2008), who also looked at the variance of small-scale motion in nested models without small-scale driving information (provided by DA). They found that small-scale motion develops through the large-to-small-scale energy cascade over a period of a few days. Since our SUK model data used for Fig. 9 are a maximum of only ten hours old, this may explain why a small-scale (3D-like) energy spectrum is not seen. Raynaud and Bouttier (2016) also showed that perturbations downscaled into the 2.5 km grid-length AROME EPS took 9-12 hours to develop into realistic small-scale structures. This issue may be particularly relevant to our case study due to the passage of a cold front, as opposed to a case where convection is the dominant source of precipitation (the latter of which may generate 3D motion more quickly).
To quantify how the ensemble size affects the sampled variability in Ē^(N)(k), the relative standard deviation, r^(N)(k) = σ_E^(N)(k)/Ē^(N)(k), is computed for the small and intermediate ensembles (Fig. 10).

7 Sensitivity of forecast error (co)variances to ensemble size

Bayesian data assimilation techniques, such as those based on variational or ensemble methods, make use of short-range forecast uncertainty to regularise the problem of estimating the state of a system from a set of observations. A reliable estimate of the forecast error covariances is then essential to provide near-optimal initial conditions for NWP. Such an estimate is often obtained by means of a forecast ensemble, which is run either in off-line mode (for climatological estimates of error covariances) or in on-line mode (for flow-dependent estimates, e.g. Ménétrier et al. (2014)). It is therefore important to understand how sampling error affects (co)variance estimates for different sized ensembles.

Estimates of forecast error variances and their sampling errors
The top panels of Figs. 11 and 12 are the sample variances of temperature (model level 36) and specific humidity (level 10) respectively, and in each case panel (a) uses the large ensemble and (b) uses the small ensemble. The temperature (T) variances in Fig. 11 are higher towards the NW corner of the area (well behind the cold front) than elsewhere, and the contrast is stronger for the large ensemble. Furthermore, the large ensemble has larger variances and is smoother. The specific humidity (q) variances in Fig. 12 show large values close to the Peninsula in the SW part of the domain (at and just behind the cold front), and again the large ensemble has larger variances.

By assuming that the variances from the large ensemble contain negligible sampling error, we can study the sampling error in the variances computed from the smaller ensembles. We assume that a field of estimated forecast error variances sampled with N members comprises the error-free variance modulated by a relative sampling error g^(N):

v^(N)(x_i) = v^(∞)(x_i) [1 + g^(N)(x_i)],   (2)

so that g^(N)(x_i) = v^(N)(x_i)/v^(∞)(x_i) − 1, which is estimated in practice using the approximation v^(∞) ≈ v^(93):

g^(N)(x_i) ≈ v^(N)(x_i)/v^(93)(x_i) − 1.   (3)

On notation: given a vector representation of a field, e.g. g, the value at position r_i is denoted g(r_i); and in spectral space, e.g. ḡ, the value at a particular wavenumber is ḡ(k_j). We swap between these notations depending upon which is the most appropriate in a given context. Note that g^(N) is defined as the ratio between the sample variance error and the 'true' - rather than the sample - variance. This complies with the prescript "one should never, ever, put a random number in a denominator" (Penland, 2011).
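This sub-sampling procedure is straightforward to sketch. The following is our own construction, with synthetic Gaussian fields standing in for the forecast perturbations and all names hypothetical: the 93-member sample variance plays the role of the 'true' variance, and relative variance errors of random sub-ensembles are averaged over many sub-samples, as in the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proxy "large ensemble": 93 members of a 1-D field with true variance 1.0
large = rng.standard_normal((93, 256))
v_large = large.var(axis=0, ddof=1)        # stands in for v^(inf) (cf. Eq. (3))

def rel_var_error(members, v_ref):
    """Relative error of the N-member sample variance against a reference."""
    vN = members.var(axis=0, ddof=1)
    return vN / v_ref - 1.0

def mean_abs_g(N, trials=1000):
    """Mean absolute relative variance error, averaged over random
    N-member sub-samples drawn without replacement from the 93 members."""
    out = 0.0
    for _ in range(trials):
        idx = rng.choice(93, size=N, replace=False)
        out += np.abs(rel_var_error(large[idx], v_large)).mean()
    return out / trials

print(mean_abs_g(24), mean_abs_g(47))      # sampling error shrinks with N
```

Because the sub-ensembles share members with the reference ensemble, the errors measured this way are somewhat smaller than for fully independent ensembles; the same caveat applies to the paper's diagnostics whenever v^(93) is used as a proxy for v^(∞).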

The covariances of sampling errors in the variances
The covariance of the above normalised variance errors can be written in the following way by adapting a result given as Eq. (6) in Raynaud et al. (2009) (assuming that the sample variance estimates are unbiased):

cov(g^(N)) = (2/(N − 1)) C^(∞) ∘ C^(∞) = (2/(N − 1)) C_g,   (4)

where C^(∞) is the noise-free forecast error correlation matrix, and ∘ denotes the Schur product. In (4), the correlation matrix of variance sampling errors, C_g, has been defined as C_g ≡ C^(∞) ∘ C^(∞) (note that the Schur product of two correlation matrices is also a correlation matrix (Gaspari and Cohn, 1999)). This result confirms that sampling error in the variance is expected to reduce with increasing N, and will have shorter length-scales than those of the forecast error correlations (due to the Schur product).
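The 2/(N − 1) scaling and the Schur-product structure of the variance-error covariances can be verified numerically. The Monte Carlo sketch below is our own (it assumes Gaussian ensemble perturbations drawn from a prescribed exponential correlation matrix, which is not the paper's data); it compares the empirical covariance of g^(N) with (2/(N − 1)) C^(∞) ∘ C^(∞).

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, trials = 30, 24, 4000

# A known correlation matrix C^(inf): exponential, length-scale 5 grid points
x = np.arange(n)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 5.0)
Lchol = np.linalg.cholesky(C)

# Draw N-member ensembles from N(0, C); since the true variances are 1,
# the relative variance error is simply g^(N) = v^(N) - 1
G = np.empty((trials, n))
for t in range(trials):
    ens = (Lchol @ rng.standard_normal((n, N))).T   # shape (N, n)
    G[t] = ens.var(axis=0, ddof=1) - 1.0

emp = np.cov(G, rowvar=False)            # empirical cov of g^(N)
theory = 2.0 / (N - 1) * C * C           # (2/(N-1)) C o C, the Schur square
print(np.abs(emp - theory).max())        # small Monte Carlo residual
```

The diagonal of the theoretical matrix is 2/(N − 1), the familiar Gaussian result for relative variance sampling error, and the off-diagonals decay twice as fast (in log terms) as C^(∞) itself, illustrating the shorter length-scales noted above.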
By defining another normalisation g'^(N) = sqrt((N − 1)/2) g^(N), Eq. (4) then informs us that the covariance cov(g'^(N)) = C_g, whose elements can be estimated using the Wiener-Khinchin theorem. The Wiener-Khinchin theorem says that the correlation function of g'^(N) at separation r_i (which is the matrix element (C_g)_{x_i, x_i+r_i}, here assumed to be independent of x_i for homogeneity) is

c_g^(N)(r_i) = (1/n_x) Σ_j S^(N)(k_j) exp(2πı i j / n_x),   (5)

where i and j represent position and wavenumber indices respectively, ı = sqrt(−1), and n_x is the number of points in the longitudinal direction. This theorem says that c_g^(N)(r_i) is proportional to the inverse Fourier transform of the power spectrum S^(N)(k_j) of g'^(N).

The top panels of Fig. 13 show the power spectra S^(N)(k_i) for T (panel a) and q (b) for the small (blue) and intermediate (green) ensembles. Similar to the processing done for Sect. 6, the spectra are averaged latitudinally, and then vertically (for T this is between vertical levels ∼3 and ∼9 km above sea level, and for q this is below ∼3 km), but they are valid for the same time as the plots in Figs. 11 and 12 (15Z). Before the Fourier transform, the fields are de-trended in the way described in point 2 of the first numbered list of Sect. 6. Since C_g is in principle a constant matrix (i.e. independent of N), the spectra should in principle be the same - any significant deviation being due to departures from the assumptions made for (4). The intermediate ensemble has systematically lower power than the small ensemble. This questions whether the approximation v^(∞) ≈ v^(93) used for (3) is well justified (although we still assume that it is useful). Going from the small to the intermediate ensemble reduces the sampled variance errors to around a third in power. This decrease in sampling error is greater for the larger scales than for the smaller scales. The bottom panels of Fig. 13 show the correlation functions, c_g^(N), found from (5) using the inverse Fourier transform procedure mentioned above, for the small (blue lines) and intermediate (green lines) ensembles.
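The Wiener-Khinchin route from a power spectrum to a correlation function can be sketched with NumPy's FFT. This is a toy illustration of our own on an AR(1) field with known lag-1 correlation, not the paper's processing chain (no de-trending or latitudinal/vertical averaging is done here):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, rho = 4096, 0.8

# A homogeneous AR(1) random field g(x) with known lag-1 correlation rho
g = np.empty(nx)
g[0] = rng.standard_normal()
for i in range(1, nx):
    g[i] = rho * g[i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

def wk_correlation(field):
    """Correlation function via Wiener-Khinchin: the inverse FFT of the
    power spectrum |FFT(field)|^2, normalised so that c(0) = 1."""
    f = field - field.mean()
    S = np.abs(np.fft.fft(f)) ** 2   # power spectrum S(k)
    c = np.fft.ifft(S).real          # circular autocovariance
    return c / c[0]

c = wk_correlation(g)
print(c[0], c[1])                    # c(0) = 1 exactly; c(1) close to rho
```

An AR(1) field has an exponential correlation function, so this toy also previews the exponential fits discussed below; for the paper's fields, the spectrum S^(N)(k) would first be averaged over rows/levels before the inverse transform.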
These functions have a similar shape for T (panel e) and q (f), although q variance errors for the small ensemble do show a slight fluctuation in the correlation at around 200 km separation. The length-scale for variance errors in T is longer than that for q. If all assumptions made above are satisfied - that variance errors behave as in (2), that v^(∞) ≈ v^(93), and that (4) holds - a calculation of the correlation would satisfy c_g^(N)(r = 0) ≈ 1 (the fact that this is not an exact equality is due to sampling error). Applying (5), the value c_g^(N)(r_i = 0) is

c_g^(N)(r_i = 0) = σ²_{g'^(N)} + (ḡ'^(N))²,   (6)

where σ_{g'^(N)} and ḡ'^(N) are the sample standard deviation and mean of g'^(N)(x_i) respectively. In the limit N → ∞, we expect (ḡ'^(N))² → 0, and (4) tells us that σ_{g'^(N)} → 1, leading to c_g^(N)(0) → 1. The failure of c_g^(N)(0) to approach unity in Fig. 13e and f as N increases indicates that one or more of the assumptions mentioned above is not true, including the assumption v^(∞) ≈ v^(93). We regard this diagnostic as a new potential test of whether an ensemble (in this case the large ensemble) is large enough or not. We find that for T the values are c_g^(N)(0) < 1 for both of the smaller ensembles. Interestingly, this may suggest that v^(93) > v^(∞) (connected with the step going from (2) to (3)).

Exponential fit to the correlation functions

Analytic functions of three standard forms were fitted to the correlation curves in Fig. 13e and f (see e.g. Gaspari and Cohn (1999)). Of these three forms, the exponential provided the best fit (by eye the fits are very close, with the exception of the fluctuations of the blue curve in Fig. 13f for lengths greater than ∼100 km). Note that the exponential form is useful because it allows such errors in the variance to be modelled with a first-order autoregressive process (see e.g. Sect. 11.3 of Evensen (2009)). Applied to g^(N)(x_i) (i.e. unprimed as in (2)), errors at subsequent points in the longitudinal direction can be modelled by

v^(N)(x_{i+1}) = v^(∞)(x_{i+1}) [1 + g^(N)(x_{i+1})],  with  g^(N)(x_{i+1}) = ρ g^(N)(x_i) + sqrt((1 − ρ²) 2/(N − 1)) ε(x_i),   (7)

where ρ is the exponential correlation ρ = exp(−Δx/L), Δx = 1.5 km, v^(∞)(x_i) is the error-free variance, and ε(x_i) is white noise with zero mean and unit variance. Equation (7) may be useful in other work to simulate variance fields for an N-sized ensemble (for a given 'true' variance field).
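A minimal sketch of such a first-order autoregressive simulation follows. The function name and demonstration values are our own; we take a unit 'true' variance field and give g^(N) the marginal variance 2/(N − 1) expected from Gaussian sampling theory, so the simulated field has the right pointwise spread and an exponential spatial correlation of the errors.

```python
import numpy as np

def simulate_variance_field(v_true, N, L, dx=1.5, seed=0):
    """Simulate one N-member sample-variance field v^(N)(x): the 'true'
    variance v_true modulated by exponentially correlated relative errors
    g^(N) with marginal variance 2/(N-1) (first-order autoregression)."""
    rng = np.random.default_rng(seed)
    rho = np.exp(-dx / L)            # correlation over one grid step (dx, L in km)
    sig = np.sqrt(2.0 / (N - 1))     # marginal std of g^(N)
    g = np.empty(v_true.size)
    g[0] = sig * rng.standard_normal()
    for i in range(1, v_true.size):
        g[i] = rho * g[i - 1] + sig * np.sqrt(1 - rho**2) * rng.standard_normal()
    return v_true * (1.0 + g)        # v^(N) = v^(inf) (1 + g^(N))

v_true = np.ones(2000)
vN = simulate_variance_field(v_true, N=24, L=30.0)
print(vN.mean(), vN.std())           # mean near 1, std near sqrt(2/23)
```

Note that for very small N the Gaussian error model can produce non-physical negative variances; a truncated or log-normal variant would avoid this, at the cost of departing from the linear model in (7).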
The best-fit parameters a, µ, L, and b of the exponential change according to the number of ensemble members used to compute the variance error. Figure 14 shows how a (top row) and L (bottom row) change with the number of ensemble members for T (left) and q (right), for 5 ≤ N ≤ 47. In Fig. 14, as N increases we would expect the parameters to converge to a value with a small standard deviation. Convergence appears to have been reached for L for both T and q, although the standard deviations do not reduce significantly. Convergence appears not to have been reached for a for both T and q, although the standard deviations do reduce significantly. Based on these results for these test data, it is difficult to judge whether the intermediate ensemble is 'large enough' to neglect sampling error in this context. An ensemble even larger than 93 members (as a proxy of the 'true' statistics) would be required to make a similar assessment up to N = 93.

Estimates of forecast error correlations and their sampling errors
The above arguments may be extended to the analysis of sampling errors in correlations. The analogue of (2) for covariances is

P^(N)_{xx'} = P^(∞)_{xx'} + E^(N)_{xx'},   (8)

where the matrix element indices x and x' correspond to positions. Dividing each side by sqrt(v^(∞)_x v^(∞)_{x'}), and letting primed versions of the matrix elements be such normalised versions of symbols in (8), gives the analogue of (3):

B'^(N)_{xx'} = B^(∞)_{xx'} + G^(N)_{xx'},   (9)

where B^(∞)_{xx'} is the error-free background error correlation. The variance of the normalised covariance error, G^(N)_{xx'}, can be written in the following way by adapting a result given as Eq. (10) in Ménétrier et al. (2014):

var(G^(N)_{xx'}) = (1/(N − 1)) [1 + (B^(∞)_{xx'})²].   (10)

The top row of Fig. 15 shows the forecast fields of T and q at a low model level (∼500 m above sea level). For T (panel a) the air ahead of the cold front (SE corner) is about 8 K warmer than the air behind the front at this level, and for q (panel b) a strip of moist air divides regions ahead of the front (moister by ∼2 g/kg than air ahead of the front at this level) from air behind the front (moister by ∼5 g/kg than air behind the front). The front is more advanced (further SE) at higher levels than at lower levels, which is characteristic of the structure of a mid-latitude cold front.

The second row shows maps of the auto-correlations of forecast errors of T and q, denoted B'^(93)_{xx'}, computed from the large ensemble. The correlation length-scales for T errors are generally larger than those for q errors, which is consistent with the findings for variance error correlation length-scales for these quantities shown in Fig. 13 (panels e and f) and in Fig. 14 (panels c and d), given the relationship between the statistics of forecast and variance errors in (4).
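For Gaussian samples with unit true variances, the variance of the normalised covariance error takes the standard sampling-theory form (1 + (B^(∞)_{xx'})²)/(N − 1), which can be checked by Monte Carlo. The sketch below is our own construction (bivariate pairs with a prescribed correlation; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
N, rho, trials = 24, 0.6, 100_000

# Bivariate Gaussian pairs with true correlation rho and unit variances
Lc = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
z = rng.standard_normal((trials, 2, N))
xy = np.einsum('ij,tjn->tin', Lc, z)      # shape (trials, 2, N)
x, y = xy[:, 0], xy[:, 1]

# Unbiased sample covariances; since the true sigmas are 1, the normalised
# covariance error is simply G = cov^(N) - rho
covN = ((x - x.mean(1, keepdims=True)) *
        (y - y.mean(1, keepdims=True))).sum(1) / (N - 1)
G = covN - rho

print(G.var(), (1 + rho**2) / (N - 1))    # empirical vs (1 + B^2)/(N-1)
```

With N = 24 and B = 0.6 the predicted error standard deviation is sqrt(1.36/23) ≈ 0.24, i.e. comparable to the correlation itself, which is why the third and fourth rows of Fig. 15 are so much noisier than the second.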
The third and fourth rows are maps of B'^(N)_{xx'} for the intermediate and small ensembles respectively. Note that there is no guarantee that B'^(N<93)_{x'x'} = 1, since the normalising variances are here always computed from the large ensemble - see (9). These maps show that the large-scale structure of the second row is maintained, but the progressively larger magnitude of sampling error as N is decreased - as in (10) - introduces small-scale noise and more irregular demarcation lines.

Estimation of forecast error correlation length-scales
The final set of diagnostics to examine are the correlation length-scales, derived from the large, intermediate, and small ensembles. These length-scales are found in the following way:

1. Compute the sample covariance between the forecast error at a chosen point x' and the error at each point x within a square region centred on x'.

2. Estimate the correlation by dividing by the field sqrt(v^(∞)_x v^(∞)_{x'}) in the square region, again using the approximation v^(∞) ≈ v^(93). This results in a field that is about unity at x', and mostly decays with distance from x'.

3. Fit the correlation field to a two-dimensional exponential of the form

c(x, y) = a exp( −sqrt( [((x − x₀)cosθ + (y − y₀)sinθ)/L₁]² + [(−(x − x₀)sinθ + (y − y₀)cosθ)/L₂]² ) ) + b,   (11)

by varying the parameters x₀, y₀, a, b, θ, L₁, and L₂. Here (x, y) = x, the point (x₀, y₀) is the centre of the fitted function, a is the amplitude of the exponential function, b is its offset, θ is its orientation (from lines of constant latitude), and L₁ and L₂ are the length-scales along the principal axes. An exponential form may not be the ideal choice for describing the shape of a correlation function, but we assume that the good fit of exponential functions to the variance error correlations in Sect. 7.3 carries through to forecast error correlations.
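Step 3 amounts to a seven-parameter nonlinear least-squares fit. A sketch using `scipy.optimize.curve_fit` follows; it is our own illustration on a noiseless synthetic field (the function name `exp2d`, the grid, and the parameter values are assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import curve_fit

def exp2d(coords, x0, y0, a, b, theta, L1, L2):
    """2-D exponential in the style of Eq. (11): amplitude a, offset b,
    orientation theta, length-scales L1, L2 along the principal axes."""
    x, y = coords
    dx, dy = x - x0, y - y0
    u = (dx * np.cos(theta) + dy * np.sin(theta)) / L1
    v = (-dx * np.sin(theta) + dy * np.cos(theta)) / L2
    return a * np.exp(-np.sqrt(u**2 + v**2)) + b

# Synthetic "correlation field" on a 41x41 grid with 1.5 km spacing
n, dx_km = 41, 1.5
x, y = np.meshgrid(np.arange(n) * dx_km, np.arange(n) * dx_km)
truth = (30.2, 29.5, 1.0, 0.0, 0.4, 12.0, 6.0)   # x0, y0, a, b, theta, L1, L2
field = exp2d((x.ravel(), y.ravel()), *truth)

p0 = (28.0, 32.0, 0.8, 0.1, 0.3, 10.0, 5.0)      # rough first guess
popt, _ = curve_fit(exp2d, (x.ravel(), y.ravel()), field, p0=p0)
print(max(abs(popt[5]), abs(popt[6])))           # the diagnostic max(L1, L2)
```

Because swapping L₁ and L₂ while rotating θ by 90° leaves the function unchanged, only max(L₁, L₂) (and min) are well defined, which is consistent with the diagnostic chosen below. On real, noisy correlation fields a sensible first guess and parameter bounds become important for convergence.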
Our interest is in the quantity max(L₁, L₂). Maps of this quantity as a function of x' are made for each ensemble size considered. Even though for the smaller ensembles this diagnostic would benefit from averaging over 1000 sub-samples, only one sub-sample is used here for practicality.

8 Summary

The ultimate aim of the diagnostics when applied to a more substantial set of data is to help decide whether the number of ensemble members is 'enough' to neglect sampling error, or at least to help practitioners get a feel for how much sampling error is present. The authors have the application of high-resolution models and precipitation prediction systems in mind, although most of the diagnostics are useful to a wider range of models.
For the purposes of this study it is assumed that the large ensemble gives reference diagnostics that are free of sampling errors, which presents an opportunity to estimate sampling errors of smaller ensemble sizes. Smaller ensemble sizes are studied by choosing members at random from the large ensemble, and the diagnostics then averaged over a set of such sub-samples.
Predominantly, a (23+1)-member ensemble (the 'small' ensemble) and a (46+1)-member ensemble (the 'intermediate' ensemble) were used. Some diagnostics use only a single random sub-sample, as averaging over a large number of sub-samples would give diagnostics similar to those of the large ensemble. For other diagnostics, it was meaningful and practical to average over a larger number of sub-samples (1000). Only one case study is used in this paper, due to limited resources. We acknowledge that a single case is insufficient to make definitive conclusions regarding any system, but here it does serve to illustrate the methodology and the diagnostics developed. We now re-visit the aims as they are set out towards the end of the introduction.

1. How can linearly independent extra members be generated from an existing ensemble?
The large ensemble is developed from an existing small ensemble by creating additional negative versions of the existing perturbations in a set of nested models (23+1)→(46+1)→(92+1), and spinning-up the resulting model states (Sect. 3 and Fig. 2). The negative perturbations of course are not immediately linearly independent of the existing perturbations, but our expectation that a degree of independence develops as the ensemble of non-linear model runs evolves is confirmed (Fig. 3).
2. How does the ensemble size impact the probabilistic forecasts of rainfall?
The purpose of studying how N changes the probabilistic forecast of rainfall is to gain a sense of how much information is added with the extra members. It is useful to do this for a range of precipitation rate thresholds to test an ensemble's ability to predict a sufficient range of precipitation outcomes that match the observations. In our case study the large, intermediate, and small ensembles forecast non-zero probabilities of rain along rain bands 1 and 2 that were positioned inside the SUK domain (Fig. 1), but a zero probability of heavy rain in band 2, even though in reality the rain is heavy there (Figs. 1 and 8). The spread of rainfall outcomes affects the probability diagnostics, and our test ensembles are found to be over-spread and biased (Fig. 7), and increasing the ensemble size does not improve the bias of the rainfall seen in the small ensemble. Adding extra members does increase the variance of the (already too high) rainfall rates as expected (Fig. 5), but this does not always appear in other indicators of spread, such as the interquartile range (Fig. 6), which actually sometimes shows a decrease in the spread for the large ensemble. This latter finding is a non-Gaussian effect and serves to show that one should not rely on the variance alone as an indication of the spread.
The conclusions covering the above two questions suggest that adding extra members in this system does not have a positive impact on the diagnostics shown. On a positive note though, they do suggest that the method of generating the extra members produces results that are consistent with the behaviour of the original 24 members.
3. How does the large ensemble resolve the kinetic energy spectrum?
The KE spectrum contains information concerning the nature of the turbulence in the model - i.e. whether the bulk behaviour is characteristic of a 2D system with a k^-3 kinetic energy (KE) spectrum (as seen in mid-latitude large-scale flows), or of a 3D turbulent system with a k^-5/3 KE spectrum (as might be expected at smaller scales). It is also able to provide information about the effective resolution of the system: the scales below which the spectrum deviates from its bulk character, and where the estimated uncertainty grows considerably. Looking at estimated relative errors in the KE spectrum provides further information on how much sampling error contributes to degrading the effective resolution.
The large ensemble studied appears to show behaviour more characteristic of a 2D system rather than of a 3D turbulent system (Fig. 9). This may be due to the case study (which is not particularly convective), and to the small size of the domain, not allowing enough opportunity for large-scale information from the boundaries to develop into full 3D motion. We estimate that the effective spatial resolution of the large ensemble is ∼8 km (Fig. 9). Interestingly, this agrees roughly with Pielke's definition of model resolution as four or more grid lengths (Pielke, 2001). The sampling error in the relative spread of the KE spectrum (KE spread error divided by ensemble mean KE) is computed for the small and intermediate ensembles. This shows that increasing the ensemble from 24 to 47 members is associated with an increase of effective resolution of 1-2 km and a reduction of sampling error at all scales (Fig. 10).
4. How does ensemble size affect the estimates of (co-)variability of thermodynamic and moisture fields?
There are many diagnostics that are straightforward to produce and to interpret to help study the effect of sampling error in variance and covariance estimates. This includes the computation of sample variances, relative errors in variances, and spatial correlations as a function of N. These simple diagnostics are developed further. This includes the diagnosis of how aspects like correlation length-scales, and their uncertainties, change with N (including the domain-averaged and position-dependent length-scales). It also includes studying the sampling errors in a spectral fashion to help understand how increasing the ensemble size affects errors at different scales. The shape of the spatial correlation functions can also be studied as a function of N. Once the shape is determined, its fitting parameters (amplitude, length-scale, etc.) can be studied as a function of ensemble size to look for possible convergence to their 'true' values. Although these diagnostics are applied to thermodynamic and moisture fields, they are not restricted to these kinds of fields.
For our case study there is a clear overall increase in the variances and their smoothness, and a reduction in the relative variance sampling error as N is increased (Figs. 11 and 12). Examining relative variance error, rather than just variance error, is useful because it highlights sampling errors even when the actual variances are small.
Studying the power spectra of T and q relative variance errors showed that adding the extra ensemble members affected these quantities in similar ways. All scales (resolved and unresolved) are improved when going from the small to the intermediate ensemble. The fitted parameters of the exponential correlation functions have not all converged over the ranges of N studied (Fig. 14). Even though these (domain-averaged) length-scales do not change much with ensemble size over the ranges studied, the position-dependent length-scales do (Fig. 16). The correlation patterns in T and q do show reduced noise as the number of members is increased, which confirms a very well known result (Fig. 15).

5. Is the number of ensemble members in an ensemble enough to neglect sampling error?
It is difficult to say in this study whether it is reasonable to assume that sampling error could be neglected in the large ensemble.
One particular diagnostic developed is Eq. (6), which should tend to unity as N increases provided certain conditions are met, including that the variance of the large ensemble has no sampling error. Applying this diagnostic to the data suggests that this is not the case (Figs. 13e and f), pointing towards the need to study this further with larger ensembles. That said, there are visible reductions in sampling error in many of the fields.

Final comments
Although it is difficult to say for sure whether 93 members are enough to bring sampling errors to sufficiently small values, the fact that there are improvements means that the extra members do contain genuinely independent and useful information.
The reduction in sampling error would benefit data assimilation applications, but the lack of sensitivity of other aspects, like rainfall probabilities and biases, suggests that the quality of probabilistic forecasts would not be improved. It is unwise though to draw general conclusions using only our particular (single case) test data.
There is still a need to study the statistics of ever larger ensembles (in convective- and large-scale EPSs). Various centres around the world have pushed the number of ensemble members to very large values (e.g. operationally N = 256 at Environment Canada (Buehner et al., 2015; Caron et al., 2015), and non-operationally N = 256 (Yashiro et al., 2016) and even N = 10240 (Kondo and Miyoshi, 2016) at RIKEN in Japan). The models used in these systems do not approach convective scales though, so there is still much progress to be made to allow large ensembles to be used routinely with convective-scale models. It is hoped that the new diagnostics presented in this paper will be adopted by others in future studies. A selection of the software used in this paper is open-source and freely available on a GitHub repository (DOI 10.5281/zenodo.1013435).

Author contributions. ACR and LHB adapted and ran the Met Office system to generate and integrate the extra members, and helped decide on some of the diagnostics used in the paper while they worked on the DIAMET project. SM and RNB developed and adapted the suite of diagnostics used in this study. SM provided an outline of the paper and RNB finalised the diagnostics and completed the manuscript.
Competing interests. The authors declare that they have no conflict of interest.

Acknowledgements. The authors wish to thank the Met Office/NERC for use of the Met Office/MONSooN supercomputing facilities. This work was funded by DIAMET (NERC grant NE/I005234/1) and by the National Centre for Earth Observation.