A Minimal Model to Diagnose the Contribution of the Stratosphere to Tropospheric Forecast Skill

Many recent studies have confirmed that variability in the stratosphere is a significant source of surface sub-seasonal prediction skill during Northern Hemisphere winter. It may be beneficial, therefore, to think about times in which there might be windows-of-opportunity for skillful sub-seasonal predictions based on the initial or predicted state of the stratosphere. In this study, we propose a simple, minimal model that can be used to understand the impact of the stratosphere on tropospheric predictability. Our model purposefully excludes state dependent predictability in either the stratosphere or troposphere or in the coupling between the two. Model parameters are set up to broadly represent current sub-seasonal prediction systems by comparison with four dynamical models from the Sub-Seasonal to Seasonal Prediction Project database. The model can reproduce the increases in correlation skill in sub-sets of forecasts for weak and strong lower stratospheric polar vortex states over neutral states despite the lack of dependence of coupling or predictability on the stratospheric state. We demonstrate why different forecast skill diagnostics can give a very different impression of the relative skill in the three sub-sets. Forecasts with large stratospheric signals and low amounts of noise are demonstrated to also be windows-of-opportunity for skillful tropospheric forecasts, but we show that these windows can be obscured by the presence of unrelated tropospheric signals.


Introduction and Motivation
On sub-seasonal and seasonal timescales, coupling between the stratosphere and troposphere is a significant part of the available tropospheric forecast skill (e.g., Domeisen, Butler, et al., 2020; Scaife et al., 2016). The contrasting impact of stratospheric variability during recent winter seasons with similar stratospheric disturbances (Knight et al., 2021) has, however, brought into sharp relief the lack of a full quantitative understanding of the stratospheric contribution to tropospheric prediction skill. The Sudden Stratospheric Warming (SSW) which occurred in February 2018 has been clearly linked to enhanced sub-seasonal predictive skill for cold conditions in Europe during late winter and spring (Karpechko et al., 2018; Kautz et al., 2020). In contrast, the SSW which occurred in January 2019 is not thought to have had a major surface impact or to have contributed to enhanced skill. Early examination of the January 2021 SSW (Lee, 2021) suggests this event was strongly coupled to the surface Northern Annular Mode (NAM). Some authors have also proposed links to weather impacts in both Texas and Greece (Wright et al., 2021). The exceptionally strong and predictable polar vortex during the 2019/20 season also seems to have enhanced tropospheric seasonal forecast skill during late winter
and spring (Lee et al., 2020). Explanations for these differences often focus on several different aspects of the stratosphere-troposphere coupling. On the one hand, there is evidence that not all SSWs produce the necessary lower stratospheric signals associated with strong coupling to the surface (Karpechko et al., 2017). Related work notes a difference in the tropospheric response to the morphology of the SSW, vortex displacement or vortex split, with enhanced coupling following splitting events. Other recent work suggests tropospheric drivers of sub-seasonal skill, unconnected to the stratosphere, significantly influence the sign and predictability of the tropospheric response (e.g., Afargan-Gerstman & Domeisen, 2020; Knight et al., 2021). It has also been proposed that coupling between the stratosphere and troposphere is strongly dependent on the tropospheric state at the time of the stratospheric perturbation (Domeisen, Grams, & Papritz, 2020; Maycock et al., 2020). These ideas relate, more generally, to the concept of intermittent "windows-of-opportunity" for sub-seasonal prediction (Mariotti et al., 2020). In order to critically assess and quantify the importance of these ideas, it would be helpful to have a simple "toy" or "null" model of how predictability in the stratosphere and troposphere is connected. By this, we mean a simple stochastic model that captures the relationships between the predictable signal in the stratosphere and troposphere with as few parameters as possible. A toy model with a minimal set of parameters is important, because it provides a quantitative lens through which to examine explanations of why some stratospheric perturbations appear to have a larger impact on tropospheric predictability than others. Put another way, are the differences in the impact on tropospheric skill of recent SSW and strong vortex events just an effect of random variability in stratosphere-troposphere coupling or do they reflect more complex dynamics?
To the best of our knowledge, no such null model exists. When considering tropospheric predictability more generally, the simple models of, for example, Weigel et al. (2008) and Siegert et al. (2016) have been widely used for similar purposes. We choose to call this model a "minimal" model since it contains what we believe is the simplest description of the coupled distribution of forecasts and observations in the stratosphere and troposphere.
Having proposed a minimal model of the stratospheric contribution to tropospheric predictability, in the second half of the paper we perform a number of simple thought experiments to show the extent to which many of the characteristic properties of stratosphere-troposphere coupling can be reproduced by this simple model. Given the limited size of hindcast datasets that can be exploited to understand the properties of stratosphere-troposphere coupling, we also hope that this minimal model can be used to test and refine diagnostic tools for examining coupling in operational prediction systems.

Model Design
We start from the signal-noise model for ensemble forecasts developed by Charlton-Perez et al. (2019) from Siegert et al. (2016):

y(t) = μ_y + β_y s(t) + σ_y ε_0(t), (1)
x_k(t) = μ_x + β_x s(t) + σ_x ε_k(t), k = 1, …, K. (2)

In this model, y(t) is the observed time series of the parameter of interest for forecasts made at different times, t. x_k(t) are the matching ensemble forecasts; s(t), ε_0(t), ε_1(t), …, ε_K(t) are independent standard normal random variables that are also independent over time (i.e., for different t); μ_y and μ_x are the climatological means. Key to the model is the shared "signal" term that is identical in the forecasts and observations, s(t). The noise terms, ε_0(t), ε_1(t), …, ε_K(t), are uncorrelated with s(t) and with each other. The two parameters β_y and β_x scale the signal term, allowing for under or over confidence in the forecasts. The σ_y and σ_x terms similarly scale the noise components. This model can be further simplified by considering the case where the forecast and observations are normalised climate indices (e.g., the North Atlantic Oscillation index, NAO) with mean of zero and variance of one. In this case μ_y and μ_x are zero and can be removed. If the variances of y(t) and x_k(t) are known to be one, then the amplitudes of the signal and noise terms are related (since the variance of the sum of uncorrelated variables is the sum of the variances of each variable):

β_y² + σ_y² = 1, (3)
β_x² + σ_x² = 1. (4)

To extend this model to consider coupling between the stratosphere and troposphere we adopt a series of principles and assumptions:

1. The stratospheric observations and forecasts should have identical structure to the model above.
2. The predictable signal in the troposphere should have two uncorrelated components: a predictable signal linked linearly to the stratospheric state and one which captures skill from purely tropospheric processes, such as the Madden-Julian Oscillation (MJO), or from other Earth System processes like sea-ice, sea-surface temperature or soil moisture.
3. Coupling between the stratosphere and troposphere in the model forecasts is linked to the full stratospheric state in each ensemble member, not solely the predictable component.
4. The tropospheric and stratospheric variables should represent normalised climate indices. We consider these to be the NAM index in the lower stratosphere (for example at 100 hPa), where forecast skill is high (Son et al., 2020), and the NAO index, but any normalised climate index would be equally appropriate.
By adopting these design choices, the model does not seek to explicitly consider upward coupling, that is, the role of the tropospheric state in determining the predictable signal in the stratosphere. Anomalous stratospheric states have been shown to be a strong function of the integrated lower stratospheric meridional heat flux (Hinssen & Ambaum, 2010; Polvani & Waugh, 2004). The model does not seek to capture this process. Since, on sub-seasonal and seasonal timescales, the lower stratosphere is the most predictable part of the extra-tropical atmosphere, and predicting the tropospheric state is the ultimate goal of any forecasting system, it is natural to design the model in this way. Quoting a reviewer of the paper, in a sentiment we agree with, this means we "do not seek to produce a model which can be judged 'wrong' due to a lack of realism rather one which is relevant and useful." Other authors may wish to produce a model that includes other interesting effects such as upward coupling, interactions between the mean state and wave part of the flow or state dependence of stratosphere-troposphere coupling. Our choice here is to sacrifice model complexity in order to make the model easy to understand and conduct experiments with.
The proposed model is then as follows:

y_S(t) = β_{y,S} s_S(t) + σ_{y,S} ε_{0,S}(t), (5)
x_{k,S}(t) = β_{x,S} s_S(t) + σ_{x,S} ε_{k,S}(t), (6)
y_T(t) = α_y y_S(t) + β_{y,T} s_T(t) + σ_{y,T} ε_{0,T}(t), (7)
x_{k,T}(t) = α_x x_{k,S}(t) + β_{x,T} s_T(t) + σ_{x,T} ε_{k,T}(t), (8)

where an added subscript S means stratosphere and T means troposphere. The predictable signal in the troposphere, which is common to the forecasts and observations but uncorrelated with the stratospheric signal s_S(t), is denoted s_T(t). In the real world, some part of the tropospheric and stratospheric signal may be correlated with a common forcing, for example, from the tropics (Barnes et al., 2019), and so this is an additional way in which the model is minimal rather than realistic. α_y and α_x are the amplitude of the stratospheric contribution in the observations and model respectively. ε_{0,T}(t), ε_{1,T}(t), …, ε_{K,T}(t) are the noise terms in the troposphere. They are uncorrelated with each other and with the noise terms in the stratosphere. σ_{y,T} and σ_{x,T} are the amplitude of the tropospheric noise terms. The correlation between the stratospheric observations and an individual stratospheric ensemble member is

ρ(y_S, x_{k,S}) = β_{y,S} β_{x,S}. (9)

By substitution, the expressions for the tropospheric observations and forecast can then be expanded as:

y_T(t) = α_y β_{y,S} s_S(t) + α_y σ_{y,S} ε_{0,S}(t) + β_{y,T} s_T(t) + σ_{y,T} ε_{0,T}(t), (10)
x_{k,T}(t) = α_x β_{x,S} s_S(t) + α_x σ_{x,S} ε_{k,S}(t) + β_{x,T} s_T(t) + σ_{x,T} ε_{k,T}(t), (11)

so that the correlation between the tropospheric observations and an individual tropospheric ensemble member is

ρ(y_T, x_{k,T}) = α_y α_x β_{y,S} β_{x,S} + β_{y,T} β_{x,T}. (12)

The variance, covariance and correlation structure of the model, along with expressions for the signal-to-noise ratio, are shown in the Supporting Information S1. Since the forecasts and observations are normalised, with variance of one, the correlation and covariance between individual ensemble members and the observations are equal.
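As a check on this structure, the coupled model can be simulated directly with random draws. The minimal sketch below (variable names are ours; the parameter values are those derived for the base case in the Model Parameters section) confirms that the simulated indices have unit variance and the intended correlation structure.

```python
import numpy as np

# Simulate Equations 5-8 of the minimal model with standard-normal draws.
# Parameter values follow the base case: alpha = 0.45, member correlation 0.6
# in the stratosphere and 0.25 in the troposphere.
rng = np.random.default_rng(0)
N, K = 100_000, 51                          # forecast start dates, ensemble size
alpha_y = alpha_x = 0.45                    # stratosphere-troposphere coupling
beta_S = np.sqrt(0.6)                       # stratospheric signal amplitude
sigma_S = np.sqrt(1 - beta_S**2)            # stratospheric noise (Eqs. 3-4)
beta_T = np.sqrt(0.1285)                    # purely tropospheric signal amplitude
sigma_T = np.sqrt(1 - alpha_y**2 - beta_T**2)

s_S = rng.standard_normal(N)                # shared stratospheric signal
s_T = rng.standard_normal(N)                # purely tropospheric signal
y_S = beta_S * s_S + sigma_S * rng.standard_normal(N)                # Eq. 5
x_S = beta_S * s_S + sigma_S * rng.standard_normal((K, N))           # Eq. 6
y_T = alpha_y * y_S + beta_T * s_T + sigma_T * rng.standard_normal(N)        # Eq. 7
x_T = alpha_x * x_S + beta_T * s_T + sigma_T * rng.standard_normal((K, N))   # Eq. 8

print(round(np.corrcoef(y_S, x_S[0])[0, 1], 2))   # ~0.60, member skill (Eq. 9)
print(round(np.corrcoef(y_T, x_T[0])[0, 1], 2))   # ~0.25, member skill (Eq. 12)
print(round(np.corrcoef(y_S, y_T)[0, 1], 2))      # ~0.45, coupling alpha_y
```

Because every term is a scaled standard-normal draw, the unit-variance constraints of Equations 3 and 4 are satisfied by construction.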

Model Parameters
To complete the minimal model, we need to set values of the parameters in Equations 5-8. As in Siegert et al. (2016), a Bayesian approach could be used to fit the model to a set of model hindcasts to determine these parameters. In this study, since the aim is to describe and analyze the simple model, we do not take this approach, but it is a clear extension. Instead, as a simple starting point, we can examine the correlation structure of four sets of example hindcasts available from the Sub-seasonal to Seasonal project database (S2S, Vitart et al., 2017). As a representative stratospheric index we chose the NAM at 100 hPa, derived using the zonal mean principal component method of Baldwin and Thompson (2009). As a representative tropospheric index we chose the NAO, here defined as the mean sea-level pressure difference between a 2.5° × 2.5° grid box centered at 65°N, 20°W and one centered at 37.5°N, 25°W, that is, over Iceland and the Azores respectively. In both cases, these indices are chosen to be illustrative only, and different choices would result in slightly different numerical values for the calculations. All hindcasts in the database initialized between November and February for the particular model version in question are considered. No attempt is made to standardize the period over which the forecasts are made. For the NCEP CFS model, a lagged ensemble is created by combining forecasts initialized over three consecutive days. The lead-time dependent bias is removed prior to analysis. Table 1 shows the correlation structure for the week 3 forecast.
Broadly, the correlation structure in all four models is similar for these climate indices and this lead time:

1. The correlation between the stratosphere and troposphere in the observations, ρ(y_S, y_T), is approximately 0.45.
2. The correlation between the stratosphere and troposphere in the models, ρ(x_{k,S}, x_{k,T}), is also approximately 0.45.
3. The correlation between the observed NAM and an individual ensemble member, ρ(y_S, x_{k,S}), is approximately 0.6.
4. The correlation between the observed NAO and an individual ensemble member, ρ(y_T, x_{k,T}), is approximately 0.25.

Note. Here, Week 3 means the average of days 14-20 of the forecast, and correlations are calculated by first taking the mean value of the index over these days.

Using these four representative correlations, we can derive example parameters for the minimal model. For the remainder of the manuscript, all calculations and estimates assume these values, and no further reference to the four sets of sub-seasonal forecasts is made. We first assume that the amplitude of the signal in the observations and model is the same in the stratosphere (β_{y,S} = β_{x,S}, a perfect model assumption which is relaxed later). We can use the representative correlation ρ(y_S, x_{k,S}) = 0.6 above to calculate values for the stratospheric parameters from Equation 9 and Equations 3 and 4. In this case, β_{y,S} = β_{x,S} = 0.77 and σ_{y,S} = σ_{x,S} = 0.63. This is equivalent to an identical signal-to-noise ratio in the model and observations of β_{y,S}/σ_{y,S} = β_{x,S}/σ_{x,S} = 1.22 (see the Supporting Information S1). Note that we have assumed positive values for β_{y,S} and β_{x,S}, although they could produce the same positive correlation if both values were negative. If we assume that the model has these parameter values, we can calculate a representative correlation between the ensemble mean forecast and the observations by assuming a value for the ensemble size, K (for details of the calculation of the ensemble mean correlation see the Supporting Information S1). For an example ensemble size of 51 members (which corresponds to the size of the operational ECMWF ensemble forecast), the correlation between the ensemble mean forecast and observations, ρ(y_S, x̄_S), is 0.77. Making a further assumption that the amplitude of the uncoupled part of the tropospheric signal is the same in the model and observations (i.e., that β_{y,T} and β_{x,T} are the same) and using Equation 12, we can derive the following parameters for the tropospheric part of the minimal model. The size of the tropospheric signal is β_{y,T} = β_{x,T} = 0.36 and the residual noise terms are σ_{y,T} = σ_{x,T} = 0.82. These values give an overall signal-to-noise ratio in the troposphere of 0.58, and for an assumed ensemble size of 51 members the correlation between the ensemble mean forecast and observations, ρ(y_T, x̄_T), is 0.49.
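The parameter derivation above is a short chain of arithmetic and can be reproduced directly. The sketch below (function and variable names are ours) backs out the base-case parameters from the four representative correlations and then computes the ensemble-mean correlations for K = 51.

```python
import math

# Representative correlations from the S2S comparison (Table 1).
r_SS = 0.60    # obs NAM vs individual member, rho(y_S, x_kS)
r_TT = 0.25    # obs NAO vs individual member, rho(y_T, x_kT)
alpha = 0.45   # stratosphere-troposphere coupling, assumed equal in obs and model

# Stratospheric parameters (perfect-model assumption, Eqs. 3, 4 and 9).
beta_S = math.sqrt(r_SS)               # = beta_yS = beta_xS
sigma_S = math.sqrt(1 - r_SS)
snr_S = beta_S / sigma_S               # stratospheric signal-to-noise ratio

# Tropospheric parameters from Eq. 12 with beta_yT = beta_xT.
beta_T = math.sqrt(r_TT - alpha**2 * r_SS)
sigma_T = math.sqrt(1 - alpha**2 - beta_T**2)

def ens_mean_corr(r_member, noise_var, K=51):
    # Member correlation divided by sqrt(Var[ensemble mean]); with unit total
    # variance, Var[xbar] = 1 - (1 - 1/K) * (member noise variance).
    return r_member / math.sqrt(1 - (1 - 1 / K) * noise_var)

r_S_mean = ens_mean_corr(r_SS, sigma_S**2)                        # ~0.77
r_T_mean = ens_mean_corr(r_TT, alpha**2 * sigma_S**2 + sigma_T**2)  # ~0.49
```

Rounding to two decimals recovers the values quoted in the text: β_S = 0.77, σ_S = 0.63, β_T = 0.36, σ_T = 0.82, a stratospheric signal-to-noise ratio of 1.22 and ensemble-mean correlations of 0.77 (stratosphere) and 0.49 (troposphere).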
In the next sections, these model parameters are further perturbed to explore the importance of the stratosphere-troposphere coupling for sub-seasonal predictability. Later in the paper, the minimal model with the same parameters is used to generate synthetic forecast sets with 1 million forecast initialisations.

Thought Experiment 1: Impact of Stratospheric Skill Improvement and Increased Ensemble Size
A common question posed about the importance of the stratosphere for sub-seasonal prediction concerns the trade-off between spending additional computational resources to improve stratospheric skill (for example, by increasing the complexity of the gravity wave drag parameterization or enhancing model vertical resolution) versus using the same resources to increase the ensemble size. While it is difficult to quantify the impact of any given model improvement on forecast skill, the minimal model does give some insight into this question. Assuming that improving skill in the model stratosphere does not affect tropospheric predictability from other sources (i.e., the β_T terms remain constant) or the coupling between the stratosphere and troposphere (i.e., the α terms remain constant), the impact of increased stratospheric skill, ρ(y_S, x_{k,S}), on the tropospheric member correlation is scaled by α_y α_x = 0.45² ≈ 0.20 for parameters representative of the four sub-seasonal forecast models above. So (from Equations 9 and 12),

ρ(y_T, x_{k,T}) = α_y α_x ρ(y_S, x_{k,S}) + β_{y,T} β_{x,T}.

The model can be used to anticipate the impact of this increased skill on the ensemble mean correlation skill, as shown in the Supporting Information S1. Assuming all other parameters are fixed, for an ensemble size of 51 members, modifying ρ(y_S, x_{k,S}) between 0 (no skill) and 1 (perfect skill) results in ρ(y_T, x̄_T) increasing from 0.34 to 0.56. Further calculations of the impact of changes in stratospheric skill are shown in Figure 1a, with the case with 51 ensemble members shown in the solid line. The close to linear increase of ρ(y_T, x̄_T) for the range of values around ρ(y_S, x_{k,S}) = 0.6 typical of most S2S modeling systems does not depend strongly on the size of the model ensemble (see, for example, the dashed and dotted lines showing the cases with 101 and 11 ensemble members respectively). The contrasting impact of increasing ensemble size on ρ(y_T, x̄_T) is shown in Figure 1b.
The impact of the increasing ensemble size is relatively large as the ensemble size increases between 11 and 50 members but begins to saturate for much larger sizes. This effect is also not strongly dependent on the skill of the stratospheric forecast, with a similar dependence for ρ(y_T, x̄_T) when ρ(y_S, x_{k,S}) is equal to 0.4 (dotted line) or 0.8 (dashed line). Note that the quantification here is only relevant for forecasting systems with similar signal-to-noise properties as the minimal model in this configuration. Further experiments could use the same framework to explore how to target model investment in under or over confident forecasting systems, which we consider next.
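This trade-off follows from the closed-form ensemble-mean correlation, so it can be evaluated without simulation. The sketch below (our own function names; base-case parameter values from the Model Parameters section) sweeps stratospheric member skill and ensemble size.

```python
import math

# Base-case parameters: alpha = 0.45, tropospheric signal/noise variances.
alpha = 0.45
beta_T2 = 0.1285                 # beta_yT * beta_xT for the base case
sigma_T2 = 1 - alpha**2 - beta_T2  # purely tropospheric noise variance

def trop_ens_mean_corr(r_S, K):
    # Tropospheric member correlation from Eq. 12 (beta_yS * beta_xS = r_S),
    # divided by sqrt(Var[ensemble mean]) for a K-member ensemble.
    r_member = alpha**2 * r_S + beta_T2
    noise = alpha**2 * (1 - r_S) + sigma_T2   # tropospheric member noise variance
    return r_member / math.sqrt(1 - (1 - 1 / K) * noise)

print(round(trop_ens_mean_corr(0.0, 51), 2))  # no stratospheric skill, ~0.34
print(round(trop_ens_mean_corr(0.6, 51), 2))  # base case, ~0.49
print(round(trop_ens_mean_corr(1.0, 51), 2))  # perfect stratospheric skill, ~0.56
```

The same function shows the saturation with ensemble size noted above: going from 11 to 51 members gains considerably more tropospheric correlation than going from 51 to 101.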

Thought Experiment 2: Differences in Signal-To-Noise Ratio Between Forecasts and Observations
Given that most forecasting systems are not explicitly designed to perturb the stratosphere (or target error growth there), one could imagine a case whereby the stratospheric forecast is over confident. Equally, there has been much discussion in the literature of under confidence of forecast systems on seasonal and longer timescales, particularly in the North Atlantic and for the NAO (Eade et al., 2014). Using the minimal model, there are a number of different ways to explore the relationships between correlation and signal-to-noise ratio in this simple system, diverging from the perfect ensemble approach in Thought Experiment 1. One approach would be to fix the value of the signal-to-noise ratio in the observations and perturb the signal-to-noise ratio of the forecast ensemble members (not shown). Another approach is to fix the value of the correlation between the forecasts and observations in the stratosphere and examine the different values of signal-to-noise ratio and tropospheric correlation which are consistent with this value. This set of experiments is reported here, since we assume that the correlation between forecasts and observations in the stratosphere, ρ(y_S, x_{k,S}), is the most robustly known quantity (note that in the four systems above, there is little variation in this parameter). Because the correlation is held fixed, different cases can only be generated by varying the signal-to-noise ratio in both the observations and forecasts, so in the extreme cases the signal-to-noise ratio in the observations is either much greater or much smaller than is assumed in the base case in Thought Experiment 1.
In the base case above, we assume that the size of the signal in the model stratosphere is equal to that of the observations, that is, β_{y,S} = β_{x,S} = β = √ρ(y_S, x_{k,S}). We can relax this assumption, without changing ρ(y_S, x_{k,S}), by introducing a parameter, c, which represents the fractional change in the size of β_{x,S} relative to the base case above, that is, β_{x,S} = cβ and β_{y,S} = c⁻¹β.
The impact of model under or over confidence on the skill of tropospheric forecasts, for an ensemble size of 51 members, is shown in Figure 2. For clarity, when we refer to model under confidence, this means that the signal-to-noise ratio of the model is smaller than the signal-to-noise ratio of the observations, and vice versa for model over confidence. As might be expected, the tropospheric signal-to-noise ratio somewhat follows the signal-to-noise ratio in the stratosphere. Where the model is under confident in the stratosphere (blue part of the lines) it is also under confident (although to a lesser degree) in the troposphere. The relative size of the different terms means that for the range in the center of the plot, the impact in the troposphere is modest. Figure 2c demonstrates the impact of stratospheric under or over confidence on the correlation skill of the tropospheric ensemble mean. Compared to the case in which the amplitude of the stratospheric signal-to-noise ratio is correct in the model (black dot), ρ(y_T, x̄_T) is reduced for an ensemble with an over confident stratosphere and increased for an under confident stratosphere. This follows because

ρ(y_T, x̄_T) = (α_y α_x β_{y,S} β_{x,S} + β_{y,T} β_{x,T}) / √(1 − (1 − K⁻¹)(α_x² σ_{x,S}² + σ_{x,T}²)),

and since in this experiment all parameters apart from β_{y,S} and β_{x,S} (and hence σ_{y,S} and σ_{x,S}) are fixed, with the product β_{y,S} β_{x,S} also fixed, this dependence is related to the value of σ_{x,S}. In the over confident case, σ_{x,S} is small, increasing the denominator and therefore decreasing the correlation in the expression above. In the limiting case, where stratospheric forecasts have no noise (but there is still an unpredictable, noise component in the observations), σ_{x,S} = 0 and ρ(y_T, x̄_T) = ρ(y_T, x_{k,T})/√(1 − (1 − K⁻¹) σ_{x,T}²) = 0.41.
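The effect of the c parameter on the tropospheric ensemble-mean correlation can be sketched in a few lines (function and variable names are ours; base-case parameter values as in the Model Parameters section, with ρ(y_S, x_{k,S}) held at 0.6).

```python
import math

# Thought Experiment 2 sketch: hold rho(y_S, x_kS) = beta_yS * beta_xS = 0.6
# fixed while scaling the model stratospheric signal, beta_xS = c * beta.
beta2 = 0.6                        # beta**2; the product beta_yS*beta_xS stays 0.6
alpha, beta_T2, K = 0.45, 0.1285, 51
sigma_T2 = 1 - alpha**2 - beta_T2  # purely tropospheric noise variance

def trop_corr(c):
    beta_xS2 = c**2 * beta2        # model stratospheric signal variance
    sigma_xS2 = 1 - beta_xS2       # model stratospheric noise variance (Eq. 4)
    r_member = alpha**2 * beta2 + beta_T2       # fixed numerator (Eq. 12)
    noise = alpha**2 * sigma_xS2 + sigma_T2     # tropospheric member noise
    return r_member / math.sqrt(1 - (1 - 1 / K) * noise)

# Over confidence (c > 1, small sigma_xS) lowers the ensemble-mean correlation;
# under confidence (c < 1) raises it, as in Figure 2c.
print(round(trop_corr(1.0), 2))   # base case, ~0.49
```

Because the numerator is fixed in this experiment, the entire dependence runs through the noise term in the denominator, which is the point made in the text.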

Thought Experiment 3: Perfect Stratospheric Forecasts
An increasingly common method used to interrogate the stratosphere-troposphere coupling in models is to add artificial physics to the model to nudge the state in the model stratosphere towards the observed state (e.g., Douville, 2009;Hitchcock & Simpson, 2014). The minimal model can be used to think about what these experiments might reveal. Although the minimal model is, of course, much simpler than the real world it can be used to think about limits to the gain in correlation skill that nudging experiments could yield.
This thought experiment starts from a hypothetical case of a perfectly predictable stratosphere in which there is no noise. In this case, y_S(t) = x_{k,S}(t) and so ρ(y_S, x_{k,S}) = 1. The only way this can be achieved in the minimal model is if β_{y,S} = β_{x,S} = 1 and σ_{y,S} = σ_{x,S} = 0. This then implies that ρ(y_T, x_{k,T}) = α_y α_x + β_{y,T} β_{x,T}.
It follows that, for the same set of parameters used in Thought Experiment 1, ρ(y_T, x_{k,T}) = 0.33 and ρ(y_T, x̄_T) = 0.56 with 51 ensemble members, for a perfect, deterministic stratospheric forecast. In practice, stratospheric nudging could never achieve this perfect forecast state, partly because the strength of nudging required would lead to numerical instability, but it predicts an upper limit for the tropospheric correlation that might be achieved in a series of nudging experiments. This value should be compared to the tropospheric ensemble mean correlation derived from the model with standard parameters; for an identical ensemble size of 51 members this value is 0.49, as stated in Section 3. In other words, the minimal model predicts a maximum increase of 0.07 in correlation skill for the week 3 sub-seasonal forecast for a set of forecasts with strong stratospheric nudging, compared to free running control experiments.

Synthetic Experiment 1: Skill for Weak, Neutral and Strong Vortex Cases in the Stratosphere
A common method to explore the impact of the stratosphere on prediction skill is to separate forecasts into categories where the initial stratospheric state has a weak, strong or neutral vortex (as in Domeisen, Butler, et al., 2020; Sigmond et al., 2013; Son et al., 2020; Tripathi et al., 2015). One approach to quantify the differences in skill between the forecasts is to use the correlation skill score, computed with anomalies defined about the full climatological mean (which is zero for these normalised indices):

CSS = Σ_{m=1}^{M} y(m) x̄(m) / √(Σ_{m=1}^{M} y(m)² Σ_{m=1}^{M} x̄(m)²).

In this and subsequent expressions, M indicates the number of forecast initialisations in each sub-set. An alternative is to measure the correlation between the observations and ensemble mean forecasts within each sub-selected ensemble, which we call here the sub-set correlation (SSC):

SSC = Σ_{m=1}^{M} (y(m) − [y])(x̄(m) − [x̄]) / √(Σ_{m=1}^{M} (y(m) − [y])² Σ_{m=1}^{M} (x̄(m) − [x̄])²).

Here, [y] and [x̄] are, respectively, the mean of the observations and ensemble-mean forecasts within each sub-set. A further measure used to quantify the differences in skill of the different sub-sets of forecasts (Domeisen, Butler, et al., 2020) is the root mean square error, RMSE = √(M⁻¹ Σ_{m=1}^{M} (x̄(m) − y(m))²).
To simulate these calculations, and explore their relationship with the predictable signal, we can sub-set synthetic forecasts by the observed stratospheric state, y_S. In the studies referenced above, sub-setting is normally performed on the observed state at the start of the forecast. Since the minimal model does not simulate the time development of the forecast, this experiment assumes that the state during week 3 is well correlated with the initial state, which is a reasonable assumption given the long autocorrelation timescale in the lower stratosphere.
As an illustrative example, we define weak and strong cases as being below the 20th percentile or above the 80th percentile of the index, and generate one million synthetic forecasts using simple random draws for the signal and noise terms in Equations 5-8.
The minimal model reproduces the behavior seen in real forecast ensembles. The CSS for the weak and strong sub-sets is significantly larger than for the neutral sub-set in both the stratosphere and troposphere. Note also that the CSS and SSC are equal in the neutral sub-set but that the CSS is substantially higher than the SSC in the weak and strong sub-sets. This difference does not reflect greater correlation within each sub-set. Rather, it represents a shift of the PDF of the signal term. Since the signal term is common to the observations and forecasts, this results in a larger CSS for the weak and strong cases. Put another way, the larger CSS in the weak and strong sub-sets reflects their larger signal-to-noise ratio. Even more simply put, forecasts made during weak and strong stratospheric states can have greater CSS than forecasts made during neutral states even when the correlation between the stratosphere and troposphere is identical for all the sub-sets, as is explicitly the case in the minimal model.
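This CSS/SSC contrast can be reproduced with a short synthetic calculation (names are ours; for economy the ensemble-mean noise is drawn directly with variance σ²/K rather than averaging K members, which is equivalent in distribution).

```python
import numpy as np

# Synthetic Experiment 1 sketch: draw forecasts from the minimal model,
# sub-set on the observed stratospheric state y_S, and compare CSS with SSC.
rng = np.random.default_rng(1)
N, K = 200_000, 51
alpha, beta_S, sigma_S = 0.45, np.sqrt(0.6), np.sqrt(0.4)
beta_T, sigma_T = np.sqrt(0.1285), np.sqrt(0.669)

s_S, s_T = rng.standard_normal(N), rng.standard_normal(N)
y_S = beta_S * s_S + sigma_S * rng.standard_normal(N)
xbar_S = beta_S * s_S + sigma_S / np.sqrt(K) * rng.standard_normal(N)
y_T = alpha * y_S + beta_T * s_T + sigma_T * rng.standard_normal(N)
xbar_T = alpha * xbar_S + beta_T * s_T + sigma_T / np.sqrt(K) * rng.standard_normal(N)

def css(y, x):   # uncentred correlation: anomalies about the full climatology
    return (y * x).sum() / np.sqrt((y**2).sum() * (x**2).sum())

def ssc(y, x):   # ordinary correlation computed within the sub-set
    return np.corrcoef(y, x)[0, 1]

weak = y_S < np.quantile(y_S, 0.2)                                # weak vortex
neutral = (y_S > np.quantile(y_S, 0.4)) & (y_S < np.quantile(y_S, 0.6))
print(css(y_T[weak], xbar_T[weak]) > ssc(y_T[weak], xbar_T[weak]))        # inflated CSS
print(abs(css(y_T[neutral], xbar_T[neutral]) - ssc(y_T[neutral], xbar_T[neutral])) < 0.02)
```

In the weak sub-set, both the observations and the ensemble means have a shifted mean, so the uncentred CSS picks up the product of those means even though the within-sub-set correlation (the SSC) is unchanged.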
These results shed new light on the results of Sigmond et al. (2013), who compared the forecast skill of a group of forecasts initialized during SSW events and a control case with no SSWs. The CSS reported in that study for the NAM at 1,000 hPa is 0.55 for the SSW case and −0.01 for the control case, a difference that is statistically significant (p < 0.01). We reanalyzed the Sigmond et al. (2013) results and found that the SSC is 0.23 for the SSW case and −0.02 for the control case, a difference that is not statistically significant. This implies that the difference in CSS reported in that study is mainly due to a shift of the PDF of the signal term, and is not the result of a (significantly) greater correlation within the SSW sub-set compared to the control sub-set.
In contrast, the RMSE in the weak and strong sub-sets is much larger than in the neutral sub-set for the stratospheric forecasts. The difference in RMSE between the weak/strong and neutral sub-sets is replicated in the troposphere, although the relative size of the difference between the sub-sets is smaller. Since the signal term is common to the observations and forecasts, the RMSE for the stratosphere depends only on the properties of the noise terms, ε_{0,S}(t) and K⁻¹ Σ_{k=1}^{K} ε_{k,S}(t). Sub-setting the forecasts by the magnitude of y_S means that the distribution of ε_{0,S}(t) is biased towards negative or positive values in the weak or strong sub-sets (see Supporting Information S1). There is no corresponding bias in K⁻¹ Σ_{k=1}^{K} ε_{k,S}(t). As the RMSE is a property of the distribution of x̄_S − y_S, and this distribution is dominated by the distribution of ε_{0,S}(t), this leads to the difference in RMSE demonstrated in Figure 3.
Another approach to assess skill in different states is to make the sub-sets on the basis of the ensemble mean forecast rather than the observed state (Figure 4). Sub-setting on this basis produces a very similar result to the one in Figure 3 for CSS and SSC, but with a clean separation of the three states by the size of the signal term through the effective elimination of the noise term when taking the ensemble average. Since there is no bias in ε_{0,S}(t) in the three sub-sets introduced by this method, the RMSE is identical for the three sub-sets and very close to the observed noise amplitude, σ_{y,S}.

When analyzing real hindcast data, there are arguments for using either of these methods to compare skill in different sub-sets. When comparing forecasts sub-set based on the initial state in the stratosphere, this analysis shows that caution is needed when interpreting simple measures of forecast skill, something well-known in the forecast verification community but perhaps not so widely appreciated when considering atmospheric processes, such as stratosphere-troposphere coupling. The arguments presented in this section are not unique to coupling between the stratosphere and troposphere. Our model could also be applied to other cases in which a predictable component is weakly coupled to the extra-tropical troposphere. Examples might include coupling of the MJO and El Niño Southern Oscillation to the North Atlantic.

Synthetic Experiment 2: Windows of Opportunity
There has been much recent interest in understanding when there might be "windows of opportunity" (Mariotti et al., 2020) for sub-seasonal forecasts that are more skillful than average, based on the presence of a particular dynamical forcing. The minimal model can be used to explore to what extent the stratospheric signal can provide windows of opportunity when stratosphere-troposphere coupling is linear and independent of the tropospheric state.
Since there is no widely agreed diagnostic of a window of opportunity, here a simple diagnostic from Ziehmann (2001) is used. First, observations and forecasts are assigned to categorical bins (in this case based on terciles of the observed tropospheric state). The forecast "state" for each forecast is the modal category (i.e., the one with most forecasts, with random assignment for rare bimodal cases). A forecast is counted as successful when the forecast and observed states are the same, with skill measured simply as the fraction of forecasts for which this is true. Windows of opportunity can be explored by sub-setting the forecasts based on the occupation frequency of the modal category. A more confident forecast is one in which the number of forecasts in the modal category is large, a less confident forecast one in which the number of forecasts in the modal category is small. Figure 5a shows the result of this calculation. The dashed black line shows the average fraction of successful forecasts. The green solid line shows the success rate of forecasts with high confidence, as a function of the percentile of the number of members in the modal category used to define the "high confidence" sub-set. Forecasts with high confidence are much more likely to be successful than an average forecast. Conversely, forecasts with low confidence, shown in the brown line, are much less likely to be successful. In other words, the forecast spread, as quantified here by the number of members in the modal category, is a good predictor of forecast windows of opportunity.

Figure 5. (a) Shows the fraction of tropospheric forecasts which are correct for two categories of forecast, those classified as highly (green) or poorly (brown) predictable. Results are presented as a function of the percentile used to define each class.
For example, the value plotted at 20% on the green line is for all forecasts with modal occupation frequency greater than the highest 20% of forecasts and the value plotted at 20% on the brown line is for all forecasts with modal occupation frequency smaller than the lowest 20% of forecasts. The dashed lines show the same calculation, but with forecasts sub-set based on the modal category of the stratospheric forecast. (b) Shows 2D histograms of and for the two sub-sets based on the tropospheric forecasts for the 20% case (dots in the left panel). Dashed black lines show the values of the observed state that define the weak, neutral and strong forecast categories. (c) Shows histograms of the part of the tropospheric signal due to the stratosphere (on the x-axis) and intrinsic to the troposphere (on the y-axis).
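The modal-category diagnostic described above is straightforward to reproduce on synthetic data. The sketch below is our own illustration, not the authors' code: ensemble forecasts are drawn from a simple signal-plus-noise model (all parameter values are assumed), binned into observed terciles, and sub-set by the occupation frequency of the modal category. For simplicity, ties are resolved deterministically rather than randomly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fcst, n_ens = 2000, 50        # number of forecasts, ensemble size (illustrative)
sig, eps = 0.5, 1.0             # signal and noise standard deviations (assumed)

# Predictable signal shared by observation and ensemble; noise is independent.
signal = rng.normal(0, sig, n_fcst)
obs = signal + rng.normal(0, eps, n_fcst)
ens = signal[:, None] + rng.normal(0, eps, (n_fcst, n_ens))

# Tercile boundaries from the observed climatology define three categories.
edges = np.quantile(obs, [1 / 3, 2 / 3])
obs_cat = np.digitize(obs, edges)            # 0, 1 or 2
ens_cat = np.digitize(ens, edges)

# Modal category of each ensemble; its occupation count measures confidence.
counts = np.stack([(ens_cat == k).sum(axis=1) for k in range(3)], axis=1)
modal_cat = counts.argmax(axis=1)            # ties resolved by argmax, not randomly
modal_n = counts.max(axis=1)

success = modal_cat == obs_cat
print(f"average success rate: {success.mean():.2f}")

# Sub-set by confidence: top / bottom 20% of modal occupation frequency.
hi = modal_n >= np.percentile(modal_n, 80)
lo = modal_n <= np.percentile(modal_n, 20)
print(f"high-confidence success: {success[hi].mean():.2f}")
print(f"low-confidence success:  {success[lo].mean():.2f}")
```

With these (assumed) parameters the high-confidence sub-set is markedly more successful than average, and the low-confidence sub-set less so, reproducing the qualitative behaviour of Figure 5a.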

How much of the windows of opportunity is due to the signal present in the stratosphere? One way to quantify this effect is to repeat the calculation but sub-set the forecasts based on the size of the modal category in the stratospheric forecast. This is shown by the dashed lines in Figure 5a. While not as good a predictor of skill as the size of the modal category of the tropospheric forecast, this diagnostic can also be used to identify windows of opportunity. The forecasts in the high and low confidence sub-sets for the case shown by the dots in Figure 5a are shown in Figure 5b. Forecasts with high confidence (green) are those in which the ensemble mean is generally large in magnitude, either positive or negative. Since the signal and noise terms are uncorrelated by construction, when the signal term is large the likelihood that more members of the ensemble fall in the same category as the ensemble mean forecast and the observations is increased. Confident forecasts generally have a large signal resulting from the stratosphere together with a large tropospheric signal of the same sign, as shown in Figure 5c. Forecasts with low confidence include both those with little signal from either process and cases where there are opposing signals from the stratosphere and troposphere (brown points). Forecasts in which there is a large stratospheric signal are therefore windows of opportunity for skillful tropospheric sub-seasonal predictions. Sometimes, as seen, for example, in the contrasting forecasts of the 2018 and 2019 SSW events, opposing stratospheric and tropospheric signals might mask this predictability.
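This masking effect can be illustrated with a small synthetic calculation (our own sketch; the coupling coefficient and variances are assumed, not taken from the paper). Ensemble agreement, measured by the modal-category occupancy, grows with the magnitude of the combined signal, so opposing stratospheric and intrinsic tropospheric signals reduce forecast confidence:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_ens = 4000, 50
alpha = 0.5                            # stratosphere->troposphere coupling (assumed)

strat_sig = rng.normal(0, 1.0, n)      # predictable stratospheric signal
trop_sig = rng.normal(0, 0.5, n)       # predictable intrinsic tropospheric signal
combined = alpha * strat_sig + trop_sig

# Each member adds independent, unpredictable noise to the shared signal.
ens = combined[:, None] + rng.normal(0, 1.0, (n, n_ens))

# Confidence: number of members in the modal tercile category.
edges = np.quantile(ens.ravel(), [1 / 3, 2 / 3])
cat = np.digitize(ens, edges)
modal_n = np.stack([(cat == k).sum(axis=1) for k in range(3)]).max(axis=0)

# Agreement grows with |combined signal|: opposing stratospheric and
# tropospheric signals shrink |combined| and hence the confidence.
r = np.corrcoef(np.abs(combined), modal_n)[0, 1]
print(f"corr(|combined signal|, modal occupancy): {r:.2f}")

opposing = np.sign(strat_sig) != np.sign(trop_sig)
print(f"mean modal occupancy, same-sign signals: {modal_n[~opposing].mean():.1f}")
print(f"mean modal occupancy, opposing signals:  {modal_n[opposing].mean():.1f}")
```

Cases with same-sign stratospheric and tropospheric signals show systematically higher ensemble agreement than cases with opposing signals, consistent with the brown points in Figure 5b.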

Conclusions
In this study, we have attempted to define and investigate a minimal model which describes how skillful forecasts in the stratosphere contribute to forecast skill in the troposphere. The model is developed from the earlier toy model of Siegert et al. (2016). The key addition that allows the link between the stratosphere and troposphere to be examined is a term coupling the observed and forecast indices in the troposphere with those in the stratosphere. This coupling is independent of the state in the stratosphere or troposphere (i.e., it does not depend on the value of or ), and should be thought of as the simplest possible representation of stratosphere-troposphere coupling. The model does not rule out the need for more complex explanations of the contribution of the stratosphere to tropospheric forecast skill, but it should be regarded as a minimum standard against which more complex explanations are judged. The simplicity of the model, and its similarity to other simple models of forecasts and observations, mean that the conclusions about conditional skill and windows of opportunity that we highlight below are likely to be familiar to a reader who is well versed in thinking about ensemble forecasts and skill scores. Indeed, many of the conclusions below could be derived from a model like that of Weigel et al. (2008) or Siegert et al. (2016) in which only the relationship between a set of forecasts and observations is considered. A key message of this paper is that this body of understanding and literature is also relevant to stratosphere-troposphere and other types of coupling, provided that coupling is independent of the underlying state of the system.
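The structure of such a minimal model can be sketched in a few lines. This is our own illustrative reconstruction, not the authors' implementation: the observed and forecast tropospheric indices each receive a linear, state-independent contribution from the corresponding stratospheric index, plus a predictable intrinsic tropospheric signal and unpredictable noise (all variable names and parameter values here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000                  # number of forecast cases (illustrative)
alpha = 0.5               # linear stratosphere->troposphere coupling (assumed)
s_sig, s_eps = 1.0, 0.5   # stratospheric signal / noise std (assumed)
t_sig, t_eps = 0.5, 1.0   # intrinsic tropospheric signal / noise std (assumed)

# Predictable signals shared between the observations and every forecast.
strat_signal = rng.normal(0, s_sig, n)
trop_signal = rng.normal(0, t_sig, n)

# Observed indices: signal plus unpredictable noise. The coupling term
# alpha * obs_strat is the same whatever the stratospheric state
# (linear and state-independent by construction).
obs_strat = strat_signal + rng.normal(0, s_eps, n)
obs_trop = alpha * obs_strat + trop_signal + rng.normal(0, t_eps, n)

# A forecast member shares the signals but draws independent noise.
def forecast_member():
    fc_strat = strat_signal + rng.normal(0, s_eps, n)
    return alpha * fc_strat + trop_signal + rng.normal(0, t_eps, n)

fc_trop = forecast_member()
r = np.corrcoef(obs_trop, fc_trop)[0, 1]
print(f"single-member tropospheric correlation: {r:.2f}")
```

Because the coupling is linear and the noise terms are independent, the forecast-observation correlation is set entirely by the variance budget of the shared signals, with no state dependence anywhere in the system.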
For the broader community involved in analyzing and understanding weak coupling between a relatively well predictable component of the Earth system, like the stratosphere, and a more weakly predictable component, like the extra-tropical troposphere, we hope that the model provides a useful interpretive framework with which to consider sets of real sub-seasonal forecasts.
The model reproduces a number of features of the observed properties of real sub-seasonal and seasonal prediction systems, particularly when considering sub-sets of forecasts during weak, neutral and strong lower stratospheric NAM states. The increased CSS for these sub-sets reflects the greater signal-to-noise ratio in these cases. By construction, and as demonstrated by the calculations of SSC, the correlation between forecast and observed states is identical within each sub-set. The analysis of the minimal model also demonstrates that care should be taken when constructing sub-sets of forecasts. Sub-setting based on the observed stratospheric state can bias the RMSE because this method inherently selects cases with larger-than-average noise. An alternative approach for the analysis of sub-seasonal forecasts could be to sub-set based on the ensemble mean forecast in the stratosphere, since this better isolates cases with a large predictable signal.
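The RMSE bias described above can be demonstrated with a short Monte Carlo sketch (our own illustration, with assumed parameter values): sub-setting on extreme observed states preferentially selects cases whose unpredictable noise happens to be large, inflating the RMSE, while sub-setting on the ensemble-mean forecast does not:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_ens = 20000, 50
sig, eps = 1.0, 1.0          # signal and noise std (assumed equal, illustrative)

signal = rng.normal(0, sig, n)
obs = signal + rng.normal(0, eps, n)
ens = signal[:, None] + rng.normal(0, eps, (n, n_ens))
ens_mean = ens.mean(axis=1)

def rmse(mask):
    return np.sqrt(np.mean((ens_mean[mask] - obs[mask]) ** 2))

# "Weak/strong" sub-sets: outer terciles of either index.
q_obs = np.quantile(obs, [1 / 3, 2 / 3])
q_fc = np.quantile(ens_mean, [1 / 3, 2 / 3])
extreme_by_obs = (obs < q_obs[0]) | (obs > q_obs[1])
extreme_by_fc = (ens_mean < q_fc[0]) | (ens_mean > q_fc[1])

print(f"RMSE, all forecasts:             {rmse(np.ones(n, bool)):.3f}")
print(f"RMSE, sub-set on observed state: {rmse(extreme_by_obs):.3f}")
print(f"RMSE, sub-set on ensemble mean:  {rmse(extreme_by_fc):.3f}")
```

Because the ensemble mean is (approximately) the predictable signal and is independent of the observation noise, the ensemble-mean sub-set leaves the error statistics essentially unchanged, whereas the observed-state sub-set inflates them.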
In a similar vein, windows-of-opportunity for skillful tropospheric forecasts can be identified by considering the spread of forecasts in the stratosphere. Results from the simple model suggest that focusing detailed dynamical analysis on stratospheric forecasts with high confidence could be a way to identify windows-of-opportunity for skillful sub-seasonal and seasonal forecasts. Often, confidence in stratospheric forecasts is largest once the signal of vortex disturbances is present in the upper and middle stratosphere. For the parameter choices used in this study, the similar size of the tropospheric signal derived from coupling to the stratosphere and that derived from other unrelated tropospheric processes means that there can often be confounding between the two signals. If this model is a good representation of real forecasting systems, then on the sub-seasonal timescale the development of methods to disaggregate these signals could be an important forecast post-processing tool.
In the future, we aim to use Bayesian methods to fit this model to sub-seasonal and seasonal hindcast datasets in order to compare and contrast different prediction systems. The synthetic forecast datasets that can be easily generated by the minimal model could be used to test other ideas about the predictability associated with stratosphere-troposphere coupling. Obvious examples are tests of the windows-of-opportunity for tropospheric forecasts produced by models that have a lower signal-to-noise ratio than the real world, and a comparison of a broader range of skill metrics to test their performance in detecting model skill derived from stratosphere-troposphere coupling.

Data Availability Statement
Sub-seasonal forecast data used in this study were obtained from the Sub-seasonal to Seasonal (S2S) Prediction Project archive, which can be found at: https://apps.ecmwf.int/datasets/.