Tests of Sunspot Number Sequences: 2. Using Geomagnetic and Auroral Data

We compare four sunspot-number data sequences against geomagnetic and terrestrial auroral observations. The comparisons are made for the original SIDC composite of Wolf-Zurich-International sunspot number [$R_{ISNv1}$], the group sunspot number [$R_{G}$] by Hoyt and Schatten (Solar Phys., 1998), the new"backbone"group sunspot number [$R_{BB}$] by Svalgaard and Schatten (Solar Phys., 2016), and the"corrected"sunspot number [$R_{C}$] by Lockwood at al. (J.G.R., 2014). Each sunspot number is fitted with terrestrial observations, or parameters derived from terrestrial observations to be linearly proportional to sunspot number, over a 30-year calibration interval of 1982-2012. The fits are then used to compute test sequences, which extend further back in time and which are compared to $R_{ISNv1}$, $R_{G}$, $R_{BB}$, and $R_{C}$. To study the long-term trends, comparisons are made using averages over whole solar cycles (minimum-to-minimum). The test variations are generated in four ways: i) using the IDV(1d) and IDV geomagnetic indices (for 1845-2013) fitted over the calibration interval using the various sunspot numbers and the phase of the solar cycle; ii) from the open solar flux (OSF) generated for 1845 - 2013 from four pairings of geomagnetic indices by Lockwood et al. (Ann. Geophys., 2014) and analysed using the OSF continuity model of Solanki at al. (Nature, 2000) which employs a constant fractional OSF loss rate; iii) the same OSF data analysed using the OSF continuity model of Owens and Lockwood (J.G.R., 2012) in which the fractional loss rate varies with the tilt of the heliospheric current sheet and hence with the phase of the solar cycle; iv) the occurrence frequency of low-latitude aurora for 1780-1980 from the survey of Legrand and Simon (Ann. Geophys., 1987). For all cases, $R_{BB}$ exceeds the test terrestrial series by an amount that increases as one goes back in time.


Introduction
The article by Svalgaard and Schatten (2016) contains a new sunspot-group number composite. The method used to compile this data series involves combining data from available observers into segments that the authors call "backbones", which are then joined together by linear regressions.
We here call this the "backbone" sunspot-group number [R BB ] to distinguish it from other estimates of the sunspot-group number. What is different about the R BB composite is that instead of the recent grand maximum being the first since the Maunder Minimum (circa 1650  1710), as it is in other sunspot data series, it is the third; there being one maximum of approximately the same magnitude in each century since the Maunder minimum. In itself, this is not such a fundamental change as it arises only from early values of R BB being somewhat larger than for the previous sunspot number or sunspot group number records. However, the new series does suggest a flipping between two states rather than a more sustained rise from the Maunder minimum to the recent grand maximum, with implications for solar-dynamo theory and for reconstructed parameters, such total and UV solar irradiances. Note that several of these features of the R BB data composite are also displayed by the second version of the composite of Wolf/Zürich/International sunspot number [R ISNv2 ], recently generated by SIDC (the Solar Influences Data Centre of the Solar Physics Research department of the Royal Observatory of Belgium); however, unlike R BB , R ISNv2 does not extend back to the Maunder minimum.
The standard approach to calibrating historic sunspot data is "daisy-chaining", whereby the calibration is passed from one data series (be it a backbone or the data from an individual observer) to an adjacent one, usually using linear regression over a period of overlap between the two. Svalgaard and Schatten (2016) claim that daisy-chaining was not used in compiling R BB . However, avoiding daisy-chaining requires deployment of a method to calibrate early sunspot data, relative to modern data, without comparing both to data taken in the interim: because no such method is presented in the description of the compilation of R BB , it is evident that daisy-chaining was employed. Another new sunspot-group number data series has recently been published by Usoskin et al. (2016): these authors describe and employ a method that genuinely does avoid daisy-chaining because all data are calibrated by direct comparison with a single reference data set, independent of the calibration of any other data.
As discussed in Article 3 (Lockwood et al., 2016b), there are major concerns about the use of daisy chaining. Firstly, rigorous testing of all regressions used is essential and Lockwood et al. (2016b) show that the assumptions about linearity and proportionality of data series made by Svalgaard and Schatten (2016) when compiling R BB cause both random and systematic errors. The use of daisychaining means that these errors accumulate over the duration of the data series. Another problem has been addressed by Usoskin et al. (2016) and Willis, Wild and Warburton (2016), namely that the day-to-day variability of sunspot-group data make it vital only to compare data from two observers that were taken on the same day. Hence the use of annual means by Svalgaard and Schatten (2016) is another potential source of error.
Other sunspot data composites are also compiled using daisy-chaining, such as the original sunspotgroup number [R G ] generated by Hoyt, Schatten and Nesme-Ribes (1994) and Hoyt and Schatten (1998); versions 1 and 2 of the composite of the Wolf/Zürich/International sunspot number [R ISNv1 and R ISNv2 ], and the corrected R ISNv1 series [R C ], proposed by Lockwood, Owens and Barnard (2014). Some of these series also employ linear regressions of annual data. Hence these data series, like R BB , have not been compiled with the optimum and most rigorous procedures and so also require critical evaluation.
These problems give the potential for calibration drifts and systematic errors, which means that uncertainties (relative to modern values) necessarily increase in magnitude as one goes back in time. By comparing with early ionospheric data, Article 1 (Lockwood et al., 2016a) finds evidence that such calibration drift is present in R BB as late as Solar Cycle 17, raising concerns that there are even larger drifts at earlier times.
It is undesirable to calibrate sunspot data using other, correlated solar-terrestrial parameters because the regression may well vary due to a factor, or factors, that were not detected above the noise in the study that determined the regression. Such factors could introduce spurious long-term drift into the sunspot calibration. In addition, the independence of the two data series is lost in any such calibration, which takes away the validity of a variety of studies that assume (explicitly or implicitly) that the two datasets are independent. Article 1 (Lockwood et al., 2016a) discusses this point further and presents some examples. On the other hand, sunspots are useful primarily because they are proxy indicators of the correlated solar-terrestrial parameters and phenomena. Hence if the centennial-scale drift in any one sunspot number does not match that in a basket of solar-terrestrial activity indicators this would mean that either i) there is calibration drift in the sunspot-number data or ii) sunspot numbers are not a good metric of solar-terrestrial influence on centennial timescales.
From the above, we do not advocate using ionospheric, geomagnetic, auroral, and cosmogenic isotope data to calibrate sunspot data but note that a sequence is most successful, as a way of parameterising and predicting the terrestrial parameters, if it does reproduce their long-term drift. In this article we study the consistency of four sunspot-number sequences with geomagnetic and auroral data. The sunspot data sequence used here are: i) the original composite of Wolf/Zürich/International sunspot number generated by SIDC [R ISNv1 ]; ii) the group sunspot number [R G ] of Hoyt, Schatten and Nesme-Ribes (1994) and Hoyt and Schatten (1998); iii) the new "backbone" sunspot-group number [R BB ] proposed by Svalgaard and Schatten (2016); and iv) the "corrected" sunspot number [R C ] proposed by . Figure 1 shows annual means of these data: it also shows (in black) the variation of the new version of the composite of Wolf/Zürich/International sunspot number recently issued by SIDC [R ISNv2 ], which uses some, but not all, of the re-calibrations of the original data that were derived to generate R BB .
Note that, in order to aid comparison, R BB is here scaled by a constant factor of  BB = 12.6, which makes the mean values of  BB R BB and R ISNv1 (and hence by its definition R C ) the same over the calibration interval of 1982  2012 that is used here. The designated factor of 0.6 is used in the case of R ISNv2 . (mauve) the new "backbone" sunspot-group number [R BB ] proposed by Svalgaard and Schatten (2016), here multiplied by a normalising factor of  BB = 12.6 that makes the averages of  BB R BB and R ISNv1 (and hence R C ) the same over the calibration interval adopted here (1982  2014); (black) the new (version 2) SIDC composite of Wolf/Zürich/International sunspot number [R ISNv2 ], here multiplied by the designated 0.6 scaling factor. Background white and grey bands denote, respectively, even and odd sunspot cycles (minimum to minimum) which are numbered near the top of the plot. The light-cyan band marks the Maunder minimum.
There are two major concerns in relation to the different behaviour of R BB evident in Figure 1. The first is the stability of the calibration of each backbone over the interval it covers, and the second is the regression fits used to daisy-chain the backbones. Even for very highly correlated data segments, the best-fit regression can depend on the regression procedure used (see Article 1) and it is vital to ensure that the most appropriate procedure is employed (Ryan, 2008). Options include median least squares, Bayesian least squares, minimum distance estimation, non-linear fits as well as the ordinary least squares (OLS) that was used to generate R BB . Even the OLS fits can be carried out in different ways in that they can either minimise the sum of the squares of the verticals (appropriate when the x-parameter is fixed or of small uncertainty such that the dominant uncertainty is in the y-parameter) or they can minimise the sum of the squares of the perpendiculars (usually more appropriate when there are uncertainties of comparable magnitude in both x and y).
It is very important to test that fits are robust and the data do not violate the assumptions of OLS least-squares fitting procedure: Q  Q plots can be used to check the residuals are normally distributed, the Cook-D leverage parameter can test for data points that are having undue influence on the overall fit, and the fit residuals can be checked to ensure they are "homoscedastic" (i.e. that the dependent variable exhibits similar variance across the range of values for the other variable).
All these can invalidate a fit because the data are violating one or more the assumptions of the regression technique used (Lockwood et al., 2006). Any daisy-chaining used to generate a longterm sunspot number sequence is of particular concern because if the random fractional uncertainty of the i th intercalibration is  i , then the total fractional uncertainty will be ( n i=1  i 2 ) 1/2 , where n is the number of intercalibrations (provided the uncertainties  i , are uncorrelated). Even more significantly, systematic fractional errors at each intercalibration  i , will lead to a total systematic fractional error of  n i=1  i . Both will inevitably grow larger as one goes back in time. Hence considerable uncertainties and systemic deviations are both possible for the earliest data compared to the modern data for any sunspot number sequence compiled by daisy-chaining. The ability for these uncertainties to become amplified as one goes back in time makes it vital to check that the regressions are not influenced by an inappropriate fit procedure. None of the compilers of daisychained data series have investigated these potential effects, for example by using a variety of regression procedures, and instead implicitly trust the one procedure that they adopt. In the absence of tests against other procedures, comparison with other solar-terrestrial parameters becomes important as a check that the daisy-chained calibrations have not led to a false drift in the sunspot calibration.

Analysis
In this article, we compare the long-term drifts inherent in sunspot-data series with indices derived from terrestrial measurements that have been devised to vary in a manner that is as close to linear as Because we are interested in long-term drifts, we here average all data series over full solar cycles (from minimum for minimum), ensuring successive data points are fully independent. To normalise the data we then divide these means by the value for Cycle 19. This cycle is chosen because it is the largest in the series and because much of the interest in the new sunspot series [R BB and R ISNv2 ] is in the relative sizes of the peaks in the secular variation and, in particular the relationship of earlier peaks to Cycle 19. Figure 2 shows the results for R ISNv1 , R C , R G , R BB , and R ISNv2 . It can be seen that as we move to earlier times, from Cycle 19 back to Cycle 14, R ISNv1 decreases most rapidly whereas R G and R BB decrease the least rapidly. It is noticeable that R G and R BB are both group numbers and so the definitions may have something to do with the difference in behaviour. This interval (Cycles14  18) includes the Waldmeier discontinuity (see Articles 1 and 4 ), which influences both Wolf numbers and group numbers generated in Zürich, but not necessarily in the same way. It is an allowance for this discontinuity that gives the difference in behaviour between R ISNv1 and R C . Moving to yet earlier times, the difference between R BB or R ISNv2 and the other estimates (with the exception of R G ) remains roughly the same over Cycles 13, 12, and 11 but grows considerably over Cycle 10.

Figure 2.
Example of the comparison method adopted in this article. Solar-cycle averages (minimum-to-minimum) of the various sunspot sequences are shown. To facilitate comparison, each has been normalised to its value for Solar Cycle 19 (i.e. <R> Cn /<R> 19 is shown where <R> Cn is the mean of cycle number C n ) which has the advantage that scale factors (such as the 0.6 for R ISNv2 and  BB for R BB ) cancel out. Background white and grey bands denote, respectively, even-and odd-numbered sunspot cycles (minimum to minimum) which are numbered near the top of the plot.
In this article, we apply the same analysis as in Figure 2 to indices derived from terrestrial measurements that have been designed, or found, to vary monotonically, and as close as possible to linearly, with sunspot numbers. This enables us to compare like-with-like when we assess the longterm variations. We use the IDV (Svalgaard and Cliver, 2005) and IDV(1d) (Lockwood et al., 2013a; b; 2014a; b) geomagnetic indices. One application of these geomagnetic indices exploited here is an empirical, statistical property (one which varies with the phase of the solar cycle so allowance must be made for that factor) . More from geomagnetic activity are valid and valuable has been presented by Owens et al. (2016). These authors showed that both the statistical and theoretical relationships between the geomagneticactivity indices and sunspot numbers mean that the sunspot numbers and the geomagnetic-activity indices, including both IDV and IDV(1d), give reconstructions of the near-Earth interplanetary magnetic field that are almost identical. Lastly, we look at the annual occurrence of low latitude aurorae [N A ] compiled by Legrand and Simon (1987). In this case we have no quantitative theoretical relationship to exploit, although we do have a good qualitative understanding (Lockwood and Barnard, 2015), and simply compare the variations in the normalised averages of N A and sunspot numbers.

Tests Using the IDV(1d) and IDV Geomagnetic Indices
The IDV and IDV(1d) indices are both based on Bartels' u-index (Bartels, 1936)  were very similar indeed if IDV was used.  analysed the known correlations between the IDV and IDV(1d) geomagnetic indices and the square root of the sunspot number [R]. This arises from the approximate correlation between the near-Earth IMF, B, and R 1/2 , when averaged over the solar cycle that was noted by Wang, Lean, and Sheeley (2005). Because the IDV and IDV(1d) indices depend primarily on B (Svalgaard and Cliver, 2005;Lockwood, 2013;, correlations over a solar cycle with R n (with n  0.5) are also expected. This correlation is found in annual-mean data but  have shown that this relationship is more complex than it first appears because it depends on the phase of the solar cycle. A different manifestation of the same property was found by Owens et al. (2016) who showed that the best fit over the solar cycle is different from that on centennial timescales.  showed that the scatter in the relationship between IDV(1d) 1/n and sunspot number (they used R C ) is much larger than that in a plot of R IDV(1d) has been designed to vary linearly with sunspot number and so its long term variation can be compared to that in sunspot number. Lockwood, Owens, and Barnarrd (2014b) used all of the data in the IDV(1d) data series (since 1845) to derive the required coefficients s, c, and n and the function F(Φ). This is repeated in the present article but using only a 30-year "training" interval of 1982  2012. The coefficients obtained are then applied to the whole IDV(1d) data sequence to derive R IDV(1d) . The "training" is the evaluation of s, c, and n and F(Φ) and this is here done separately using the sunspot numbers R C , R ISNv2 , R G , and R BB (note R C and R ISNv1 are, by the definition of R C , identical over the training interval used). identical plots are obtained using R C , R G , and R ISNv2 . (Note that R ISNv1 is identical to R C over the training interval). Figure 3b shows the values of R IDV(1d) derived from IDV(1d) using Equation (1) with the above coefficients derived from Figure (1) with the coefficients n, s and c derived from the calibration interval using: R BB (gives R IDV(1d),BB shown by the mauve line); R G (gives R IDV(1d),G shown by the green line); R C (gives R IDV(1d),C shown by the blue line) and R ISNv2 (gives R IDV(1d),ISNv2 shown by the black line). The grey band shows the 2 band for the combination of these four records and allows for the uncertainties in the fitted n, s, and c in each case. Note that for most years the series agree to within less than a line width and so only the line plotted last (the mauve one) can be seen.
The above procedure for "training" the algorithm to generate R IDV(1d) using R BB over the interval 1982  2012 was repeated using R ISNv2 , R G , and R BB (not R ISNv1 because R C and R ISNv1 are the same over the training interval used). Figure 4 shows that the results are almost independent of the sunspot-number series used to train the algorithm and hence the variation depends almost exclusively on the IDV(1d) data and not on the training procedure. The difference between the various lines in Figure 4 is usually smaller than the plot line width and the most visible is the mauve line because it was plotted last (and so is on top of the others). The correlation coefficients [r] (and their significance levels [S] evaluated against the autoregressive AR1 red noise model) are given in the first column of Table 1. The correlation using R C is very slightly lower than the others and Figure 4 shows that R IDV(1d),C (the notation used means that this is sunspot number derived from IDV(1d) using R C during the training interval) tends to very slightly overestimate the values at each sunspot minimum. This is almost certainly due to the fact that R C has not been corrected for the effects of the recently-revealed calibration drift in the Locarno sunspot data (Clette et al., 2016): After 1981, the international sunspot number was compiled using data from the Specola Solare Ticinese Observatory in Locarno, which took over from Zürich in 1981 as the main reference station. The calibration drift is between 15% and +15% in modern-day R ISNv1 values and has been corrected for in R BB and R ISNv2 but not in R C and is not relevant to R G (which is based on different data and only extends to 1979). The best evidence for this drift is the large number of observing stations that show the same variation relative to the Locarno station, but we note that Article 1 (Lockwood et al., 2016a) provides independent support for this correction as the ionosonde data agree better with R BB than R C over the training interval. It can be seen from Table 1 that the correlations are all high and comparable, and we assign the four sets of fits equal weight.
The same procedure as was used by  was then employed to combine these four sets of results into an optimum R IDV(1d) reconstruction with two- uncertainties. Specifically, the p-value distributions were generated from each of the four estimates of R in any one year. From the correlation of the proxy used with R, we can evaluate the p-value distribution of the R-values derived from an annual mean of that proxy. This was repeated for all four sunspot number estimates (R BB , R C , R G , and R ISNv2 ) and, making the simplest assumption that the resulting four p-value distributions are independent, this allows them to be combined into a single distribution by multiplying them together. The optimum value is the peak of this combined distribution and the error limits are taken to be the 2 points. The 2 uncertainty values determined this way delineate the grey area shown in Figure 4. The black line in Figure 5a shows the cycle means of the optimum values of R IDV(1d) , normalised to the value for Cycle 19 as in Figure 2, as a function of the cycle number. The grey area is bounded by the same variations for the maximum R IDV(1d) (at the +1 level) and the minimum R IDV(1d) (at the 1 level). Also shown in Figure 5(a) are the corresponding variations for R ISNv1 , R C , R G , and R BB , as in Figure 2. A comparison of these variations will be described and discussed later.
We need to present a very important caveat about this test. It is based on a purely empirical relationship between IDV(1d), sunspot number and the phase of the solar cycle. The relationship appears to work well for the interval for which we have IDV(1d) data (1845  present) and over which we here apply the test. However, because it is a purely empirical relationship this does not mean it would necessarily work well for other intervals (and the Maunder minimum in particular).
The same is equally true for any application of the purely empirical relationships between the IMF B and R n and the IDV index and R n . Note that the test presented here was also carried out using the IDV index (not shown) and the results were the same on all important points.  Legrand and Simon (1987). The grey areas is the mark the variations for the optimum values 1 uncertainties. (See text for details).

Test Using OSF Derived from Geomagnetic Indices and a Continuity Model
The open solar flux (OSF) [F S ] is the total "signed" flux (i.e. we here define it as of one toward/away magnetic polarity) threading a nominal source surface in the solar corona. It was first reconstructed from historic geomagnetic activity data by Lockwood et al. (1999). OSF is a much more satisfactory parameter on which to base a comparison with sunspot numbers because it is, like sunspot number, a global property of the Sun rather than a local parameter specific to near-Earth space (such as the near-Earth IMF or near-Earth solar-wind speed or geomagnetic indices, including IDV(1d), which depend on these near-Earth interplanetary conditions). OSF has been reconstructed for 1845  2014 by  using four pairings of geomagnetic indices: aa C and IDV, aa C and IDV(1d), IHV and IDV,and IHV and IDV(1d), where aa C is a version of the aa-index that has been corrected using the Ap-index (and extended back to 1845 using comparable "range" data from the Helsinki observatory)  and IHV is the "Inter-Hour Variability" index introduced by Svalgaard et al. (2003) and developed by Svalgaard and Cliver (2007). The IDV and IDV(1d) indices were discussed in the last section. Note that recently Holappa and Mursula (2015) have suggested that errors in the geomagnetic data make the To relate OSF to sunspot numbers, in this section we use the model of Solanki et al. (2000), based on the continuity equation for OSF: where S is the global OSF source rate and L is its global loss rate. In the first application of this model by Solanki et al. (2000), the loss rate was assumed to be linear so that L = F S / where  is the loss time constant. From Equation (2) where S S is the OSF source term derived using the Solanki et al. (2000) formulation of L. S S can be estimated using Equation (3) for 1845  2014 using an OSF reconstruction from geomagnetic activity. We here use the most accurate and robust reconstruction which is by  where A f and A s are the areas of faculae and sunspots on the solar surface, the ratio of which is given by a polynomial in R (their Equation 3) which is incorporated into Equation (4)  The black line in figure 5b gives the optimum value of normalised cycle averages of R OSF1 and the grey area around it the 1 uncertainty band, in the same format and derived in the same way as for R IDV(1d) in the previous section.

Test Using OSF Derived from Geomagnetic Indices and a Second Continuity Model
The correlation between R BB and R OSF1,BB [R OSF1 derived from the training using R BB ] shown in Figure 6a is 0.907. The method employed means that a regression fit in Figure 6a is never used; however, it is not ideal that the scatter is not homoscedastic and larger at larger values. There is also a suggestion of some non-linearity in the dependence. Hence in this section we investigate a second version of the OSF continuity model by Owens and Lockwood (2012). This model is also based on the continuity Equation (2) where I HCS is the current-sheet tilt index, the fraction of longitudinally adjacent pixels of opposite field polarity on the coronal source surface (defined from magnetograph observations mapped up to the coronal source surface using the potential field source surface method). I HCS has regular variation with solar-cycle phase [] and f() is the best-fit function to I HCS : k 1 and k 2 are constants. Owens and Lockwood (2012) showed that this loss rate gave good fits to reconstructed OSF for a simple linear relationship between S OL and sunspot number but optimum fits were obtained by Lockwood and Owens (2014) using the more complex form given by their equation 8). This was originally based on the idea that much OSF emerges through the source surface as a result of CME eruptions and F was the estimated from the average OSF enhancement associated with each event.
However, Wang, Lean, and Sheeley (2015) have recently pointed out that the rapid rise in OSF in the second half of 2014 does not appear to have been accompanied by a corresponding rise in CME occurrence (although the possibility of a smaller number of CMEs each causing unusually large emergence cannot be discounted). The requirement here is to equate OSF emergence rate to sunspot numbers and the 2014 rise in OSF did indeed follow a rise in sunspot numbers. A readily invertible equation for S OL that gives higher correlations with observed annual OSF data (including 2014) is where F = 2.110 14 Wb. Inverting Equation (6) yields a sunspot-number estimate that we here call R OSF2 . This is then processed in exactly the same way as was R OSF1 in the previous section. Figure   6b shows the scatter plot of R OSF2 , derived using R BB for the training interval, as a function of R BB and Figure 5c shows the variation of cycle averages derived using all four independent sunspot data series (R C , R ISNv2 , R G , and R BB ) in the same way as was done for R IDV(1d) and R OSF1 . Note that the difference between the results using the different training sunspot number series is always smaller than the one- uncertainties that are set by the procedure used to extrapolate to times before the training interval. This is true for both R OSF1 and R OSF2 as well as for R IDV(1d) (see Figure 4).

Tests using Occurrence Frequency of Low-Latitude Aurora
The last comparison made here is much simpler. The long-term variation of the occurrence frequency of low-latitude aurora has been studied by many authors using many sources (see reviews by Silverman, 1992;Lockwood and Barnard, 2015;and Vasquez et al., 2016). The number of auroral nights at low geomagnetic latitudes [N A ] has long been known to vary with sunspot number, but the correlation in annual means is not high. This is largely because low-latitude aurorae can be generated after the Earth intersects both coronal mass ejections (CMEs) and co-rotating interaction regions (CIRs) and so N A peaks at sunspot maximum because of the effects of CMEs, but can be almost as high during the declining phase because of CIRs, despite the lower sunspot numbers. This solar cycle variation in the relationship between N A and sunspot numbers lowers the correlation in annual means but is averaged out when solar cycle means are taken. (Table 2, discussed below, shows that correlations for annual means are in the range 0.64-0.68 whereas for solar-cycle means they are 0.88-0.96). In addition, we have no quantitative theory to elucidate the connection between N A and sunspot number despite our good qualitative understanding of the link (see Lockwood and Barnard, 2015). We here make use of the variation of N A from the auroral catalogue by Legrand and Simon (1987). Figure 4d shows the variation of solar cycle means of N A (normalised to the value for Cycle 19) and compares them to corresponding the variations for various sunspot-number sequences.     Table 2 show that dividing the data into these two phase bins does not, in general, significantly alter the correlation coefficients. Note that although the scatter in both the red and blue dots in Figure 7 is still large, there is no suggestion of non-linearity in the relationship and so it is reasonable to compare the long-term variations of N A and sunspot numbers.  Figure 8 repeats the comparison of Figure 5d for these two halves of the solar cycle separately.
Note that the cycle means of all the sunspot numbers have also been averaged over the half of the solar cycle around solar maximum and minimum in Figure 8a and 8b respectively. Table 2b gives the associated correlation coefficients or the solar-cycle means: note that there are just 18 pairs of data points and the autocorrelation function at lag 1 is high (high data persistence) for both data series and so significance levels against the AR1 red-noise model cannot be computed. 3. Discussion Figure 5 shows that R BB becomes increasingly larger than the other sunspot-number estimates as one goes back in time. None of the series derived here from geomagnetic or auroral activity [R IDV(1d) , R OSF1 , R OSF2 , and N A ] reproduce this behaviour. In each case, extrapolating back in time from the algorithm training period (1982  2012) gives a time-series that lies closest to the variations for R C and R ISNv1 . In each case, R BB lies above the extrapolation in almost all years by an amount that exceeds the two- uncertainty (the grey bands). This trend is seen for all series back to the start of the geomagnetic activity data in 1845 and is consistent with the findings for cycle 17 in Article 1 (Lockwood et al., 2016a).
The scatter plots for the training interval indicate that the best proxy sunspot number, in terms of the correlation coefficient, is R IDV(1d) . However, this is a purely empirical relationship. It is useful to compare with the results from the proxies R OSF1 and R OSF2 , which are based on the physical continuity equation for OSF. R OSF1 and R OSF2 , like R IDV(1d) , both depend on empirical fit parameters, but the use of the continuity equations means that the fits are more constrained than is the case for R IDV(1d) . In addition, OSF is more satisfactory because it is a global solar parameter, like the sunspot number, whereas IDV(1d), and hence R IDV(1d) , are local parameters relating to the near-Earth heliosphere.
In addition, whereas using R IDV(1d) means that one has to assume that the IDV(1d) geomagnetic index depends only on the simultaneous sunspot number, R OSF1 and R OSF2 both allow for the effect of persistence in the data series (see Lockwood et al., 2011;Lockwood, 2013), whereby the current value also depends upon recent history, to a degree that is defined by the best-fit parameters. For the training period the correlation of all sunspot numbers with R OSF1 is consistently slightly lower than with R OSF2 (Table 1) and R OSF2 reveals lower scatter and heteroscedasticity (shown in Figure 6b for the comparison of R OSF2,BB with R BB but also true for all other series tested). Hence R OSF2 provides the most satisfactory test, which is shown in figure 4c. Note that the training procedure for R OSF2 , R OSF1 , and R IDV(1d) all employ four sunspot number series [R BB , R C , R G , and R ISNv2 ] that give almost identical variations. All are here given equal statistical weight.
The auroral data show the same tendency extends back to 1780, which means it covers the Dalton minimum (around Solar Cycle 6) and before. Dividing these data by solar cycle phase reveals an interesting feature of the data ( Figure 8): for both the solar-maximum and solar-minimum data the long-term variation in N A is closer to those in R C and R ISNv1 while R BB is consistently larger. It is noticeable that the variations for sunspot minimum and sunspot maximum have similar forms for R C , R ISNv1 , R G , and N A . However, R BB is different. For cycles before the Dalton minimum (Solar Cycles 5 and before) the sunspot minimum values exceed those seen in modern times (the normalised cycle averages values frequently exceed unity, whereas the same cycles are giving values near unity for solar maximum). Thus the drift to larger values in R BB is greater in the solarminimum values than it is in the solar-maximum values. This implies that the cause of the drift in R BB is more than the effect of the calibration observer k-factors as they would influence the values around solar minimum and around solar maximum to the same fractional extent.
The consistency with which the geomagnetic and auroral data give lower values (normalised to modern values) than R BB , and the way that the difference grows as one goes back in time, strongly suggests that there may be calibration drift in the values of R BB . In particular this calls for a check on the compilation of R BB . This could be done by repeating it with different regression procedures as the necessary daisy-chaining of calibrations means that both systematic and random errors will be amplified as one goes back in time. Article 3 is this series (Lockwood et al., 2016b) shows that the inflation of R BB as one goes back in time is consistent with the effect of regressions and the assumptions made by Svalgaard and Schatten (2016), in particular that the sunspot group counts by different observers are proportional. This assumption of proportionality was initially made by Wolf (1861) when he devised sunspot numbers because he envisaged the k-factors as being a constant for each combination of observer and observing instrument. However, in 1872 he realised that this was an invalid assumption (Wolf, 1873), and thereafter observer k-factors were computed either quarterly or annually (using daily data) at the Zürich observatory: Wolf also re-calculated all prior calibrations the same way (see Friedli, 2016). It is also important to recognise that the common practice of taking ratios of different sunspot numbers or sunspot-group numbers either to make or to test calibrations of sunspot observers inherently assumes proportionality and will also give misleading values.