High-resolution global climate simulations: representation of cities

Ensemble runs of high-resolution (~10 km; N1280) global climate simulations (2005–2010) with the Met Office HadGEM3 model are analysed over large urban areas in the south-east UK (London) and south-east China (Shanghai, Hangzhou, Nanjing region). With a focus on urban areas, we compare the simulations with meteorological observations to study the response of modelled surface heat fluxes and screen-level temperatures to urbanization. HadGEM3 has a simple urban slab scheme with prescribed, globally fixed bulk parameters. Misrepresenting the magnitude or the extent of urban land cover can result in land-surface model bias. As urban land-cover fractions are severely under-estimated in China, this impacts surface heat-flux partitioning and quintessential features, such as the urban heat island. Combined with the neglect of anthropogenic heat emissions, this can result in misrepresentation of heat-wave intensities (or cold spells) in cities. The model performance in urban areas could be improved if bulk parameters are modelled instead of prescribed, but this necessitates the availability of local morphology data on a global level. Improving land-cover information and providing more flexible ways to account for differences between cities (e.g


| INTRODUCTION
Cities are key to global climate change mitigation strategies (e.g., Mi et al., 2019 and references therein), and, at the same time, have to respond to the local impacts of climate change through adaptation measures (e.g., Landauer et al., 2019). Rapidly growing urban populations (UN, 2019) increase the need to conduct projections of future city climates to inform and develop integrated urban climate services, for example, to provide guidance for sustainable urban adaptation and planning (e.g., Cortekar et al., 2016; Baklanov et al., 2018; Grimmond et al., 2020).
The volume and density of buildings and the thermal and radiative properties of urban materials affect the surface-energy balance in cities, for example, through augmentation of heat storage, enhanced radiative trapping and increased aerodynamic roughness. A well-studied manifestation of this is the urban heat island (Oke, 1973), that is, warmer cities compared to their rural surroundings. The prevalence of impervious surfaces also affects the water balance in cities, with increased surface runoff and strongly reduced potential for infiltration and evaporation (Grimmond, 2007).
Such localized impacts of urbanization can be represented in atmospheric models through the use of specialized urban schemes in land-surface models. Urban land-surface models (ULSM) parameterise the urban surface energy balance (e.g., Oke et al., 2017),
$Q_N + Q_F = Q_H + Q_E + \Delta Q_S + \Delta Q_A$, (1)
where the net all-wave radiation, $Q_N = (K_\downarrow - K_\uparrow) + (L_\downarrow - L_\uparrow)$, is driven by the net-radiative forcing from incoming ($\downarrow$) and outgoing ($\uparrow$) shortwave ($K$) and longwave ($L$) radiation. Additional energy input from anthropogenic heat emissions ($Q_F$) depends on factors such as population density, traffic volume and seasonal heating/cooling demands, and hence can vary strongly in space and time (e.g., Sailor, 2011; Iamarino et al., 2012; Gabey et al., 2019). In some situations, $Q_F$ can be the dominant energy source in Equation (1), for example, in midlatitude cities in winter when $Q_N$ is low and space heating is widely used (e.g., Hamilton et al., 2009). A portion of the energy received is stored ($\Delta Q_S$) in the urban volume (e.g., walls, streets) and the ground. The remaining energy, partitioned into turbulent sensible ($Q_H$) and latent ($Q_E$) heat fluxes, is essential to drive boundary-layer dynamics. When the ULSM is coupled to a large-scale atmospheric model, the net horizontal advection of energy into the system ($\Delta Q_A$) is accounted for in Equation (1) (right-hand side).
With increasing resolution of both regional and global climate models, cities and conurbations start to become resolvable on the model grid (if using adequate land-cover information), so that local and regional effects of urbanization can, in principle, be represented in climate simulations (e.g., Oleson et al., 2008a, 2008b). However, global simulations still rarely model urban effects explicitly (cf. discussions in Daniel et al., 2019). This requires not only adequate urban land-cover information, but for more complex urban models also the availability of bulk information on building morphology (e.g., mean building height) and anthropogenic heat emissions (Grimmond et al., 2009), which are not readily available at the global scale.
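The bookkeeping implied by the surface energy balance described above can be illustrated with a short Python sketch. It assumes the balance takes the common form Q_N + Q_F = Q_H + Q_E + ΔQ_S + ΔQ_A; the class name, field names and all flux values are hypothetical and chosen for illustration only, and nothing here is taken from the HadGEM3 or JULES code.

```python
from dataclasses import dataclass


@dataclass
class SurfaceEnergyBalance:
    """Illustrative urban surface energy balance terms (all in W m^-2)."""

    k_down: float        # incoming shortwave radiation
    k_up: float          # outgoing (reflected) shortwave radiation
    l_down: float        # incoming longwave radiation
    l_up: float          # outgoing longwave radiation
    q_f: float = 0.0     # anthropogenic heat emissions
    dq_s: float = 0.0    # storage heat flux
    dq_a: float = 0.0    # net horizontal advection

    @property
    def q_n(self) -> float:
        """Net all-wave radiation: (K_down - K_up) + (L_down - L_up)."""
        return (self.k_down - self.k_up) + (self.l_down - self.l_up)

    @property
    def turbulent_flux(self) -> float:
        """Q_H + Q_E, obtained as the residual of the assumed balance."""
        return self.q_n + self.q_f - self.dq_s - self.dq_a


# Hypothetical midday values, for illustration only.
seb = SurfaceEnergyBalance(k_down=600.0, k_up=108.0, l_down=350.0,
                           l_up=450.0, q_f=20.0, dq_s=150.0)
print(f"Q_N = {seb.q_n:.1f} W m^-2")
print(f"Q_H + Q_E = {seb.turbulent_flux:.1f} W m^-2")
```

In this form, setting `q_f` to zero lowers the energy available for the turbulent fluxes by exactly the omitted anthropogenic term.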
In this study, we address the following questions:
• How well do simulations of recent climate (2005–2010) with a simple urban (bulk) scheme agree with surface observations of screen-level temperatures?
• What are the main factors affecting the representation of urban signals in these simulations?
• What data/modelling capabilities are needed to improve the representation of urban signatures in climate projections and to provide integrated urban climate services for future planning scenarios?
Data from high-resolution (10 km; N1280) global climate simulations with the Met Office HadGEM3 model (Section 2.1) are analysed over highly urbanized and populous metropolitan regions (Section 2.2): (a) the south-east UK (London) and (b) south-east China (Yangtze River Delta region including Shanghai, Hangzhou and Nanjing).

| Setup
The frontier climate simulations of the H2020 PRIMAVERA project (Vidale et al., in prep.), with a global horizontal resolution of 0.1° (10 km; N1280), use the Met Office HadGEM3 climate modelling system (Hewitt et al., 2011). The atmosphere-only simulations for a recent 6-year period (2005–2010) use the Global Atmosphere (GA7.1) and Global Land (GL7.0) science configurations for the Met Office Unified Model (UM; Walters et al., 2019; Wiltshire et al., 2020) and the Joint UK Land Environment Simulator (JULES; Best et al., 2011). Sea surface temperatures and sea ice concentrations are used as boundary conditions. These are obtained from the HadISST2.1.1 dataset (Titchner and Rayner, 2014) and are modified to provide daily increments suitable for high-resolution simulations.
Given the large computational costs of the simulation, the model is initialised from a converged state at coarser horizontal resolution (25 km, N512; Roberts et al., 2019). To help eliminate sensitivity to initial conditions, an ensemble of high-resolution simulations is constructed. We analyse three ensemble members, labelled HAD$_{1-3}$, available from the high-resolution HadGEM3-PRIMAVERA N1280 suite. The ensemble members are spawned by randomly perturbing the initial conditions of the surface temperature field and by applying to each member a further stochastic perturbation to the full set of initial and restart conditions as they are read in. Stochastic perturbations are needed as the model uses stochastic physics. These are additionally injected at regular intervals during the simulation, with different time scales depending on the spatial scale (Sanchez et al., 2016).
The JULES land-surface model employs a tile approach for the sub-grid-scale heterogeneity of the land surface in a grid-box (Best et al., 2011). In the JULES-GL7.0 science configuration, the built part (i.e., buildings, roads) of urban areas is represented by a single tile (Best-1T model; Best, 2005) with only one globally constant set of surface parameter values for heat capacity, albedo, emissivity and surface roughness. Thus, they do not reflect morphometric differences between cities and across a city (e.g., Grimmond and Oke, 1999; Kent et al., 2019). The values used are (Wiltshire et al., 2020): aerodynamic roughness length of $z_0$ = 1 m, bulk urban albedo of $\alpha$ = 0.18, bulk urban emissivity of $\varepsilon$ = 0.97 and surface heat capacity of $C$ = 0.28 MJ K$^{-1}$ m$^{-2}$. The ratio of roughness length for heat ($z_h$) and momentum ($z_0$) remains constant at $10^{-7}$ (Best et al., 2006). As buildings are assumed to be below the surface, a zero-plane displacement is assumed to be unneeded (i.e., 0 m).
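The globally fixed bulk parameters listed above can be collected in a small Python container, which also makes the implied roughness length for heat explicit. This is a minimal sketch: the class and attribute names are hypothetical and do not correspond to JULES namelist variables; only the numerical values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrbanSlabParameters:
    """Globally fixed bulk values quoted in the text for the Best-1T urban tile."""

    z0_m: float = 1.0                 # aerodynamic roughness length (m)
    albedo: float = 0.18              # bulk urban albedo
    emissivity: float = 0.97          # bulk urban emissivity
    heat_capacity: float = 0.28e6     # surface heat capacity (J K^-1 m^-2)
    zh_over_z0: float = 1e-7          # ratio of heat to momentum roughness length
    displacement_height_m: float = 0.0

    @property
    def zh_m(self) -> float:
        """Roughness length for heat implied by the fixed ratio (m)."""
        return self.z0_m * self.zh_over_z0

    def reflected_shortwave(self, k_down: float) -> float:
        """Outgoing shortwave (W m^-2) for a given incoming shortwave."""
        return self.albedo * k_down


params = UrbanSlabParameters()
print(params.zh_m)
print(params.reflected_shortwave(500.0))
```

Because the dataclass is frozen, the same values apply to every urban tile, mirroring the single global parameter set described above.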
In these simulations, irrigation or other anthropogenic moisture sources are not modelled for urban or vegetation tiles. Following precipitation, the built surface sheds most of the water immediately (i.e., urban run-off rates are large; Hertwig et al., 2020) as the water-holding capacity of the impervious tiles is limited (Best and Grimmond, 2016a). Hence, in most situations (i.e., in the absence of rainfall), the turbulent latent heat flux ($Q_E$) is limited to the vegetated tiles, which do not receive any urban runoff. Anthropogenic heat emissions ($Q_F$) in the HadGEM3-PRIMAVERA runs are assumed to be 0 W m$^{-2}$ everywhere. JULES allows $Q_F$ to be prescribed monthly for urban tiles, but these values are spatially unvarying as, for example, used in UK operational numerical weather prediction with the UM (Lean et al., 2011). While global $Q_F$ models exist (e.g., Allen et al., 2011; Lindberg et al., 2013), JULES currently only allows the anthropogenic forcing to be prescribed, which requires access to local energy-use statistics. These are difficult to obtain with appropriate timeliness at the global scale. Hence, including $Q_F$ in global HadGEM3-JULES climate model runs is not currently viable.
Vegetation in HadGEM3-JULES is represented using five plant functional types (PFTs): broad-leaf tree, needleleaf tree, shrubs, C3 and C4 grass. Urban vegetation (if resolved) is modelled with one of these vegetation tiles, but without representing interactions with impervious surfaces. Leaf-area index and canopy height for the vegetation tiles are prescribed as monthly values that vary at the grid-scale (Wiltshire et al., 2020). Neither crop dynamics nor crop irrigation is modelled.
Evaluations of HadGEM3 (coupled to an ocean model, Roberts et al., 2019; and/or atmosphere-only, Vannière et al., 2019) have shown that increasing horizontal resolution from 130 km (N96) to 25 km (N512) reduces global biases of near-surface temperatures and improves rainfall patterns and amplitudes. Detailed evaluations of the HadGEM3-PRIMAVERA N1280 runs are ongoing (Vidale et al., in prep.; Volonté, pers. comm.) as the model continues to be developed. Biases identified at 25-km model resolution are not expected to fundamentally shift with the change to 10 km, consistent with Vellinga et al. (2016) for the switch from 25 to 12 km.

| Analysis domains
Model output and surface observations are compared for domains centred on the metropolitan areas of London (Figure 1a,b) and Shanghai (Figure 1c,d). As London is central to development and evaluation of the urban models in JULES (Best, 2005;Porson et al., 2010;Bohnenstengel et al., 2011), the urban parameters used in HadGEM3-PRIMAVERA are expected to be most representative of this urban area.
As London is the largest metropolitan area in Europe, it covers several HadGEM3 grid cells (Figure 1b). In central London, the urban land-cover fraction is at or close to 100% (i.e., vegetation and the River Thames are missing; Table 1a). Shanghai, China's most populous city, and surrounding regions of the Yangtze Estuary have undergone rapid urbanization over the last decades (Yin et al., 2011; Cui and Shi, 2012; Tan et al., 2015). However, this is not evident in the HadGEM3 land cover. The vast cities of Shanghai and Hangzhou (Figure 1c), for example, are only represented by a few urbanized grid-boxes (Figure 1d). The urban fraction is too low compared to GUF data (Table 1b), with the maximum $f_{Urban}$ across the domain being 60.2% (inner-city Shanghai). In both domains, C3 grass is the dominant PFT with 90% of grid-boxes having tile-fractions larger than 59% (south-east UK) or 46% (south-east China), while less than 10% of the grid-boxes have trees.
Similarly, large contrasts are evident between IGBP $f_{Urban}$ and GUF in other highly urbanized and populous areas across China (Figure S1 for Beijing and Chongqing). Given this clear bias in China, this study focuses only on a small-area comparison centred on Shanghai and cities located in the lower reaches of the Yangtze

| Output
Surface turbulent sensible and latent heat fluxes, $Q_H$ and $Q_E$, and net all-wave radiation ($Q_N$) are available as daily averages from three ensemble members (HAD$_{1-3}$). These are analysed for the entire 6-year simulation period (2005–2010). Hourly samples of air temperature ($T_{air}$) at screen-height (1.5 m above local orography) are available from two ensemble members (HAD$_{1,2}$). $T_{air}$ is a model diagnostic obtained from interpolating air temperatures between the first model level and the surface using Monin-Obukhov similarity theory (Essery et al., 2001; Bohnenstengel et al., 2014). As ensemble member HAD$_2$ has no output for 2010, the seasonal and diurnal comparisons are only for 5 years (2005–2009). Given the model years have only 360 days, the comparisons with observations (Section 3) cannot be for individual short periods (e.g., hourly) but need to be based on aggregated data (e.g., monthly, seasonal; Section 4.2) and use frequency distributions and occurrence of particular conditions (e.g., hot/cold days; Section 4.3). Comparisons of model statistics of $T_{air}$ with the observations (Section 3) are conducted for common data periods. Frequency distributions (Section 4.3) use the same sample frequency for model and observations; that is, the hourly model output is sampled to match the 3-hourly observation frequency available for sites in the Shanghai domain (for the UK sites it is hourly; see Section 3). Data are analysed by month and season: summer (June, July, August; JJA), autumn (SON), winter (DJF), and spring (MAM). The model bias is analysed as the difference between model and observations.
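The aggregation steps described above (sub-sampling hourly model output to the 3-hourly observation times, grouping by season and hour, and differencing model and observed statistics) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' processing code: the record layout `(month, hour, value)`, the function names and the use of a difference of seasonal medians as the bias measure are all hypothetical choices, and the data are synthetic.

```python
from collections import defaultdict
from statistics import median

SEASON_OF_MONTH = {12: "DJF", 1: "DJF", 2: "DJF",
                   3: "MAM", 4: "MAM", 5: "MAM",
                   6: "JJA", 7: "JJA", 8: "JJA",
                   9: "SON", 10: "SON", 11: "SON"}


def subsample_3hourly(records):
    """Keep only samples at 0, 3, ..., 21 UTC from (month, hour, value) records."""
    return [r for r in records if r[1] % 3 == 0]


def seasonal_diurnal_median(records):
    """Median value for each (season, hour) pair."""
    groups = defaultdict(list)
    for month, hour, value in records:
        groups[(SEASON_OF_MONTH[month], hour)].append(value)
    return {key: median(values) for key, values in groups.items()}


def median_bias(model_records, obs_records):
    """Model-minus-observation difference of the seasonal medians."""
    model = seasonal_diurnal_median(model_records)
    obs = seasonal_diurnal_median(obs_records)
    return {key: model[key] - obs[key] for key in model.keys() & obs.keys()}


# Synthetic example: hourly model output, 3-hourly observations (July only).
model_hourly = [(7, h, 20.0 + 0.5 * h) for h in range(24)]
obs_3hourly = [(7, h, 19.0 + 0.5 * h) for h in range(0, 24, 3)]
bias = median_bias(subsample_3hourly(model_hourly), obs_3hourly)
print(sorted(bias.items()))
```

Grouping by month and hour rather than by calendar date sidesteps the mismatch between the 360-day model year and the observed calendar, which is why only aggregated statistics are compared.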

| OBSERVATIONS
Evaluation of modelled $T_{air}$ uses single-site surface observations from weather stations. For the south-east UK, hourly samples of screen-level air temperature from the Met Office MIDAS surface observations archive (Met Office, 2006) are used at four sites (Figure 1b; Table 1a). In the south-east China domain, ground-based observations from NOAA's National Climatic Data Center (NCDC) archive (Climate Data Online; https://www7.ncdc.noaa.gov/CDO/cdo) at five sites (Figure 1d, Table 1b) are selected based on data availability during the evaluation period. From NCDC, 3-hourly $T_{air}$ samples are available at 0, 3, …, 21 UTC. Both MIDAS and NCDC $T_{air}$ data have a resolution of 0.1 °C.
Urban land-cover characteristics of the surface stations are summarized (Table 1) using both GUF ($25^2$ and $2.5^2$ km$^2$ footprints around the sites) and HadGEM3 land cover (Section 2.2). The latter is given for a 3-by-3 grid-box area ($30^2$ km$^2$) centred on the evaluation sites. For some stations, the differences between GUF$_{2.5\,km}$ and GUF$_{25\,km}$ are large (Figures S2, S3). For example, the immediate surroundings of WIS (UK) are primarily rural ($f_{Urban}$ = 3.5%) compared to the wider regional average (23.1%). The latter is close to the HadGEM3 value (26.1%).
The opposite occurs for LIY (China), where GUF$_{25\,km}$ indicates settlements in rural surroundings, while the site's immediate neighbourhood is highly urbanized (81.9%). For SHA (Baoshan/Shanghai), the close proximity to the Yangtze River causes GUF$_{25\,km}$ to be low, whereas GUF$_{2.5\,km}$ ($f_{Urban}$ = 87.3%) indicates extensive urban surroundings to the station, and of the inner-city area south of the site. However, $f_{Urban}$ in the HadGEM3 land cover is zero in the 3-by-3 grid-box area around the LIY, XIA and LUK sites (China), in strong contrast to GUF (Table 1b).
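The 3-by-3 grid-box averaging of the model land cover around each evaluation site can be sketched in Python as below. The function name, the clipping of windows at the domain edge and the example urban fractions are assumptions made for illustration; they are not taken from the study's processing chain.

```python
def neighbourhood_mean(grid, row, col, half_width=1):
    """Mean of grid values in a (2*half_width+1)^2 window centred on (row, col).

    Windows that extend beyond the grid are clipped to the available cells.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    rows = range(max(0, row - half_width), min(n_rows, row + half_width + 1))
    cols = range(max(0, col - half_width), min(n_cols, col + half_width + 1))
    values = [grid[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


# Hypothetical urban fractions (%) on a small patch of model grid-boxes.
f_urban = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 90.0, 45.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(neighbourhood_mean(f_urban, 1, 1))
```

The example shows how one or two strongly urbanized grid-boxes are diluted in the area mean, which is one reason footprint size matters when comparing site surroundings with model land cover.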

| Surface forcing
Before evaluating screen-level temperatures (Sections 4.2, 4.3), modelled turbulent heat fluxes at the surface from the three model ensemble members (HAD$_{1-3}$) are analysed, as these directly impact the boundary-layer temperature profiles. The nature of the energy partitioning into sensible and latent turbulent heat flux (Equation 1) is often strongly altered in cities compared to rural areas in response to the prevalence of impervious surfaces and reduced amounts of vegetation (e.g., Grimmond and Oke, 2002; Goldbach and Kuttler, 2013; Ward et al., 2016). The relative dominance of $Q_H$ or $Q_E$ for a site can be assessed by the evaporative fraction,
$\mathrm{EF} = Q_E/(Q_H + Q_E) = 1/(1 + \beta)$, (2)
where $\beta = Q_H/Q_E$ is the Bowen ratio, another widely used method to characterize surface energy-flux partitioning.
[…] (Table 1a). Patterns of enhanced (reduced) JJA $Q_H$ ($Q_E$) are clearly linked to urban fractions for individual HadGEM3 surface grid-boxes (Figure 1b). The Greater London area stands out distinctly from its rural surroundings. An observed heat-flux climatology for central London (Kotthaus and Grimmond, 2014a) informs expectations of model behaviour. With only daily-mean model fluxes available, diurnal variability is unknown and the magnitudes are small (cf. hourly-mean observations). In central London, the observed JJA 25th and 75th percentiles of $Q_H$ at the time of day when the median $Q_N$ is largest can be between 150 and 250 W m$^{-2}$, and at night between 50 and 100 W m$^{-2}$ when $Q_N$ is lowest and negative (see Figure 6 in Kotthaus and Grimmond, 2014a). Hence, the JJA model median of $Q_H$ < 80 W m$^{-2}$ in central London is lower than expected.
In grid-boxes with $f_{Urban}$ > 90% in central London (Figure 1b), $Q_E$ predominantly occurs immediately after rainfall, but JULES eliminates subsequent urban tile evaporation through large runoff rates, which results in JJA median daily-mean $Q_E$ of ≤10 W m$^{-2}$ (Figure 2b). Earlier studies demonstrate that even small amounts of vegetation in cities can have a strong impact on the local surface energy balance (e.g., Grimmond, 2012a, 2012b; Best and Grimmond, 2016b). The steep spatial gradients of $Q_E$ (Figure 2b) as $f_{Urban}$ decreases away from central London demonstrate this as well. Like $Q_F$, anthropogenic moisture sources are not modelled in JULES for urban tiles. However, these can be important in some cities. Recent studies in Beijing (Dou et al., 2019) and Shanghai (Ao et al., 2018) demonstrate that irrigation of vegetation, street cleaning and/or wetting (e.g., to reduce dust, to cool) have a noticeable, non-negligible effect on observed $Q_E$.
The median JJA evaporative fractions (Equation 2; Figure 2c) for the model grid-boxes covering central London are clearly dominated by $Q_H$, with values of 0.075 < EF < 0.3. Hence, the corresponding Bowen ratio ($\beta$) is between 12 and 2.3, which is high, but not unreasonable for central London. Kotthaus and Grimmond (2014a) report monthly median hourly $\beta$ between 5 and 10 in London's central business district. In rural grid-boxes ($f_{Urban}$ = 0), modelled EF is larger (0.65 < EF < 0.75; i.e., $\beta \approx$ 0.54–0.33) and more spatially homogeneous.
To illustrate the sensitivity of JULES surface fluxes to land cover and urban parameter choices, JULES is run offline mimicking the HadGEM3-PRIMAVERA GL7.0 setup at a site in central London (KCL) for which high-quality flux measurements are available for 3 years (from 2011; Ward et al., 2016; Hertwig et al., 2020). A description of JULES settings, parameters (Table A1) and observations is given in Appendix A. The response of the Best-1T urban model (Section 2.1) is tested in two configurations: (a) a control case using high-resolution land-cover data, observed roughness and radiative parameters and realistic (modelled) $Q_F$ (hereafter CTRL-1T) […] (Table A1), leading to a reduction of energy input ($Q_N$) through an increase of $K_\uparrow$ (Figure S4a,c). In both configurations, $Q_H$ has a substantial phase delay (rise and peak times), as a result of the large thermal inertia (through $C$ and $z_h$) of the urban slab impacting the temporal response of surface temperatures (Hertwig et al., 2020). The phase delay, also present in the diurnal cycle of $L_\uparrow$ (Figure S4b), impacts diagnostics like $T_{air}$ (Section 4.2).
In CTRL-1T (Figure 3b), vegetation (13% land-cover fraction within a 500-m radius around the KCL site; Table A1) and the River Thames (21%) influence $Q_E$, while in HAD-1T there is no vegetation and only 2% water. This results in a median $Q_E$ close to 0 W m$^{-2}$ in HAD-1T (Figure 3b). This echoes the patterns of very low $Q_E$ in central London in the HadGEM3-PRIMAVERA simulations (Figure 2b,e). The river influences the eddy-covariance heat-flux source areas only for some wind directions (Kotthaus and Grimmond, 2014b). As the relative location of land cover is not captured in the JULES tiling, CTRL-1T overestimates the effect of the river on $Q_E$, which in turn leads to reduced $Q_H$ (Figure 3a).
The impact of over- or under-representation of urbanization in the HadGEM3-PRIMAVERA land cover becomes more apparent when comparing modelled heat-flux characteristics in the London domain (LWC, WIS; Table 1a) with those in the Shanghai domain (Table 1b). Comparison is undertaken using the monthly ensemble median (2005–2010) surface heat fluxes and EF from the HadGEM3-PRIMAVERA climate simulations for the three ensemble members (Figure 2d-f). To account for radiative forcing differences in the two regions, the daily-mean surface heat fluxes are normalized by the local daily-mean net all-wave radiation ($Q_N$; Figure 2d,e). The central London LWC site (HadGEM3 $f_{Urban}$ = 97.7%) clearly has the largest 'urban' response, with bulk characteristics similar to the HAD-1T offline test (e.g., negative $Q_H$ occurs in DJF; too low $Q_E$). The results for the WIS background station are dominated by the large fraction of non-urban surfaces (C3, C4 grass: 66.7%, bare soil: 7.2%), causing much larger $Q_E$ in summer (cf. LWC). As the model flux output consists of daily means, the winter $Q_N$ values in the UK are small and often negative (Figure S5a), impacting the sign of the ratios ($Q_E/Q_N$; $Q_H/Q_N$). The daily-mean DJF $Q_E$ is mostly positive (Figure S5c) and $Q_H$ mostly negative (Figure S5b). Similarly, EF (Equation 2) is negative when the daily-mean $Q_H$ < 0 W m$^{-2}$ and $|Q_H| > |Q_E|$, resulting in larger variability in the monthly statistics (Figure 2d-f) in winter at the UK sites (cf. China).
With the SHA grid-box (Baoshan district of Shanghai; Table 1b) having only 6.3% built land cover but 79.2% vegetation (grass and shrubs), it is unsurprising that the heat fluxes are similar to those modelled at the rural WIS site (UK), especially in MAM and JJA (Figure 2d,e). Similarly, the model fluxes at both LUK (Nanjing/Lukou airport) and XIA (Hangzhou airport) with zero built land cover in HadGEM3 show a typical rural response with the monthly median EF > 0.5 year-round. Here, the model assumes 86.3% (XIA) and 90.9% (LUK) vegetated surfaces. As a point of reference, Ao et al. (2016a) report observed monthly mean daytime Bowen ratios between 2 and 4.7 (i.e., EF between 0.33 and 0.18) for a central business district in Shanghai (XJH site; Xujiahui district). The modelled median EF at SHA is much higher and ranges between 0.42 (August) and 0.68 (June; Figure 2f). Observed mean daily $Q_H$ peaks in Shanghai can exceed 290 W m$^{-2}$ in the early afternoon in JJA, while $Q_E$ is low (65 W m$^{-2}$; Ao et al., 2016a). Misrepresenting the energy partitioning over cities in such a way will negatively impact any use of these data, such as for climate-service applications that use heat-flux ratios to assess urban heat stress for health of citizens or irrigation demands for maintaining green infrastructure or reducing dust resuspension.
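The evaporative-fraction and Bowen-ratio values quoted in this section can be cross-checked with a few lines of Python. The sketch assumes the usual definitions EF = Q_E/(Q_H + Q_E) and β = Q_H/Q_E, so that EF = 1/(1 + β); this relation reproduces the EF and β pairs given in the text. The function names are hypothetical.

```python
def evaporative_fraction(q_h, q_e):
    """EF = Q_E / (Q_H + Q_E); undefined when the fluxes sum to zero."""
    total = q_h + q_e
    if total == 0:
        raise ValueError("Q_H + Q_E is zero; EF is undefined")
    return q_e / total


def ef_from_bowen(beta):
    """EF = 1 / (1 + beta) for Bowen ratio beta = Q_H / Q_E."""
    return 1.0 / (1.0 + beta)


def bowen_from_ef(ef):
    """Inverse relation: beta = (1 - EF) / EF."""
    return (1.0 - ef) / ef


for ef in (0.075, 0.3, 0.65, 0.75):
    print(f"EF = {ef:.3f} -> beta = {bowen_from_ef(ef):.2f}")
for beta in (2.0, 4.7):
    print(f"beta = {beta:.1f} -> EF = {ef_from_bowen(beta):.2f}")
print(f"EF for Q_H = 290, Q_E = 65 W m^-2: {evaporative_fraction(290.0, 65.0):.2f}")
```

With a negative daily-mean Q_H whose magnitude exceeds Q_E, `evaporative_fraction` returns a negative value, which is the winter behaviour noted for the UK sites.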

| Urban heat-island intensity
Turbulent heat fluxes (Section 4.1) and surface temperatures play an important role in determining and co-modulating local boundary-layer dynamics in atmospheric models over cities and therefore impact characteristics of near-surface air temperatures (e.g., Omidvar et al., 2020). Distinct canopy-layer air-temperature differences between urban and rural areas have long been observed worldwide (e.g., literature reviewed in Oke et al., 2017 and Stewart, 2019), particularly a few hours after sunset when the urban heat island is strongest. Radiative and thermal properties of prevalent urban materials, together with the density and volume of buildings, result in heat being effectively stored ($\Delta Q_S$) during the day and released at night when $Q_N$ becomes negative (Equation 1). This, together with $Q_F$, partially offsets radiative night-time cooling in cities, a process that is critical to represent in land-surface models. Figure 4 shows a comparison of seasonal median diurnal $T_{air}$ cycles between observations and climate model (HAD$_{1,2}$; Section 2).
With nearly 100% built surface cover (Table 1a), the response of the urban Best-1T scheme in JULES dominates the $T_{air}$ characteristics at LWC (Figure 4a). Only in summer (when $Q_N$ is largest) does the model not underestimate the median $T_{air}$. However, large differences exist in both magnitude and timing of modelled and observed median JJA $T_{air}$ peaks. Compared to the observations, both ensemble members show a 1 h delay in the morning temperature rise and a further phase shift of the afternoon peaks of 1-2 h, together with a positive offset of the maximum median $T_{air}$ of 2 °C (HAD$_1$) and 1 °C (HAD$_2$). These phase-delay features are also seen offline in central London (HAD-1T) in $Q_H$ (Figure 3a).
The positive afternoon $T_{air}$ bias in JJA at the highly urbanized sites (Figures 4a and S6a) can be partially explained by an over-prediction of surface temperatures (see also JJA $L_\uparrow$ bias in Best-1T offline tests in central London; Figure S4b). This feature of the urban model affects the grid-box $T_{air}$ less at sites with lower $f_{Urban}$ (LHR, WIS; Table 1a). While at these sites in JJA HAD$_1$ performs much better (Figure S6b,c), the positive bias persists for HAD$_2$, showing that the initial conditions (Section 2.1) and the resulting response of the atmospheric model in the region have an influence.
During colder seasons, especially winter, there is a persistent negative bias in modelled $T_{air}$ at LWC (Figure 4a). This is stronger at night and in the early morning (up to 2 °C in DJF). As discussed in Section 4.1, this is likely related (in part) to the climate model $Q_F$ being 0 W m$^{-2}$, whereas observations at the highly urbanized LWC are impacted by $Q_F$. In central London in winter, $Q_F$ can be as large as, or larger than, $Q_N$ and can therefore be the main driver of the surface energy balance (Hamilton et al., 2009; Kotthaus and Grimmond, 2012). Offline tests with the JULES Best-1T model in central London (KCL site; Appendix A) show that switching off $Q_F$ in CTRL-1T can account for up to 1 °C difference in DJF $T_{air}$ (Figure S7). This agrees with anthropogenic temperature increments of 0.5-1 °C determined by Bohnenstengel et al. (2014) for other sites across London. Compared to SJP, located in a highly vegetated park and therefore less impacted by $Q_F$ (cf. Figure S2), LWC has a consistently warmer median night-time $T_{air}$, by 0.6 °C year-round (Figure S9), in agreement with Jones and Lister (2009). As both sites are within the same 0.1° model grid-box, these local differences are not represented in the model. Consequently, the comparison of the HadGEM3-PRIMAVERA $T_{air}$ with the statistically cooler SJP site (cf. LWC) has a better agreement of median night-time temperatures (except summer), but an exacerbated positive model bias of JJA afternoon temperatures (Figure S6a). These differences in model assessment depending on site choice (within the same model grid-box) show that it is crucial to have measurements available in characteristically urban settings that reflect the added effects of $Q_F$ and storage heat flux from the building volume.
The intensity of the UHI can be assessed by the difference between urban and rural air temperatures. Figure 5a shows the modelled and observed seasonal median diurnal cycle of screen-level air temperature differences ($\Delta T_{air}$) between LWC and the rural WIS station 35 km south-west of LWC (Figure 1a,b). As expected, the observed median $\Delta T_{air}$ is highest (and nearly constant) at night, with largest values in JJA (2.5 °C) and lowest in DJF (1.5 °C). The magnitudes agree with previous long-term observations (Jones and Lister, 2009). In all seasons, the climate model underestimates the median UHI intensity between sunset and sunrise by up to 1 °C (DJF), while in MAM and JJA the median $\Delta T_{air}$ is overestimated in the afternoon and early evening by up to 0.5 °C. Similarly, the large magnitude and seasonal variability of the observed inter-quartile range of $\Delta T_{air}$ is not reproduced by the model, indicating a smaller sample spread. Strong differences also exist in the seasonal 90th percentiles of $\Delta T_{air}$, which range between 3.5 °C (DJF) and 4.9 °C (SON) in the observations and between 1.4 °C (DJF) and 3.1 °C (JJA) in the model ensembles, implying that more extreme urban-rural temperature contrasts are not captured.
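The heat-island statistics discussed above (median diurnal cycle and 90th percentile of the urban-minus-rural temperature difference) can be computed along the following lines. This is a sketch with synthetic numbers; the record layout, the function names and the percentile interpolation method are assumptions, since the text does not specify how percentiles were calculated.

```python
from collections import defaultdict
from statistics import median, quantiles


def uhi_diurnal_median(records):
    """Median urban-minus-rural temperature difference for each hour of day.

    `records` holds (hour, t_urban, t_rural) tuples sampled at the same times.
    """
    by_hour = defaultdict(list)
    for hour, t_urban, t_rural in records:
        by_hour[hour].append(t_urban - t_rural)
    return {hour: median(diffs) for hour, diffs in sorted(by_hour.items())}


def uhi_percentile_90(records):
    """90th percentile of all urban-minus-rural differences."""
    diffs = [t_urban - t_rural for _, t_urban, t_rural in records]
    # Last of the nine decile cut points; the interpolation method is a choice.
    return quantiles(diffs, n=10, method="inclusive")[-1]


# Synthetic paired samples (degC), for illustration only.
records = [
    (0, 12.0, 10.0), (0, 13.0, 10.0), (0, 14.0, 10.5),
    (12, 18.0, 18.0), (12, 19.0, 18.5), (12, 20.0, 19.0),
]
print(uhi_diurnal_median(records))
print(uhi_percentile_90(records))
```

Applying the same two functions to model grid-box pairs and to station pairs gives directly comparable statistics, provided both are sampled at the same hours.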
Observed $T_{air}$ differences between central London (LWC) and the less urbanized Heathrow airport (LHR; $f_{Urban}$ = 48.3% in GUF$_{2.5\,km}$, 33.1% in HadGEM3; Table 1a) are noticeably smaller compared to LWC-WIS (Figure 5b). The observed median $\Delta T_{air}$ between sunset and sunrise is relatively constant throughout the year with only small variations around 1 °C, while the model has larger seasonal variability of $\Delta T_{air}$ with nocturnal maxima of only 0.5 °C in DJF, but up to 1.8 °C in JJA. These magnitudes are very similar to the modelled $\Delta T_{air}$ between LWC and the rural WIS site (Figure 5a). While more vegetation surrounds LHR (cf. LWC), the site is also affected by anthropogenic heat emissions related to the airport infrastructure and operations.
FIGURE 4 HadGEM3-PRIMAVERA (HAD$_{1,2}$) and observed seasonal median diurnal cycles of screen-level air temperature ($T_{air}$) with inter-quartile range for (a) LWC (London Weather Centre; Table 1a) and (b) SHA (Shanghai/Baoshan; Table 1b). Observations are (a) hourly and (b) 3-hourly. See Figure S6 for SJP, LHR and WIS (UK); Figure S8 […]
Systematic differences between HAD$_{1,2}$ and observations also exist at the SHA site (Figure 4b). Nocturnal and early morning median $T_{air}$ are under-estimated in all seasons except winter (DJF), while $T_{air}$ peaks are overestimated in all seasons except for spring (MAM). This occurs at all sites in the Shanghai domain (Figure S8) and can be clearly traced in the bias patterns of the median 3-hourly $T_{air}$ in each season (Figure 6). As the model $f_{Urban}$ severely under-represents the actual built land-cover fraction at all sites (Figure 1d; Table 1b), the nocturnal negative model bias tendency partially relates to the lack of urban heat storage/release. For sites in populous cities (e.g., SHA, XIA) with a large building volume, anthropogenic heat emissions are also likely to affect the near-surface air temperatures. While in London and other higher-latitude cities $Q_F$ typically peaks in winter when space heating is needed, subtropical Shanghai and surrounding cities like Hangzhou have $Q_F$ peaks in summer from air-conditioning use. While space cooling is used in both commercial and residential buildings, heating during winter can occur in offices, but is uncommon in residences (Ao et al., 2016a). This could explain the overall better agreement of nocturnal $T_{air}$ in DJF between model and observations, as both sites are situated outside of high-rise central business districts with predominantly commercial buildings and instead are dominated by low-rise buildings (residences and industry).
FIGURE 5 HadGEM3-PRIMAVERA (HAD$_{1,2}$) and observed seasonal median (lines/markers) diurnal cycles of screen-level air temperature differences ($\Delta T_{air}$) with inter-quartile range (shading/error-bars) for (a) LWC-WIS and (b) LWC-LHR (Table 1a) and (c) SHA-DON (Table 1b), with differences (vertical bars) of the median $\Delta T_{air}$ between model ensemble members (HAD$_{1,2}$) and observations (a, b) hourly and (c) 3-hourly; and seasonal 90th percentile of the data (arrows). See Figure S10 for $\Delta T_{air}$ for SHA-XIA, SHA-LIY and SHA-LUK (China)
Of the five evaluation sites in the Shanghai region (Table 1b), none qualify as truly rural background stations (Figure S2). Nevertheless, it may be expected that differences in urbanization levels between SHA, with the highest $f_{Urban}$, and the other sites are to a degree reflected in $\Delta T_{air}$. However, given the large geographical distances, $\Delta T_{air}$ primarily reflects regional climate variations and differences between coastal sites (SHA, XIA), which can be affected by land-sea breeze circulations, and inland stations.
Seasonal diurnal cycles of ΔT air between SHA and DON (Figure 5c; SHA: f Urban = 87.3%, DON: 36.1% based on GUF 2.5 km ; Table 1b) have large magnitudes and variability in both observations and model. As the model f Urban is very low at both sites (SHA: 6.3%; DON: 9.7%), the modelled ΔT air mainly reflects climatological differences. The observed median ΔT air is consistently higher than the modelled equivalent, with particularly large differences at night in MAM and SON and during the day in JJA. This trend agrees with the expected role of urban ΔQ S and Q F for SHA. Similar trends are seen in the observed ΔT air between SHA and the other sites in the domain (Figure S10), with the overall smallest differences occurring between the coastal SHA and XIA stations.
At the inland sites (DON, LIY, LUK), a large overestimation of the median JJA screen-level temperatures occurs (Figures 6 and S8). This is smaller for the coastal SHA and XIA stations. Unlike at LWC (Figure 4a), this cannot be attributed to features of the Best-1T model given the largely missing f Urban characterization in the model, but must relate to the regional weather pattern representation by the atmospheric model. Note that differences between surface elevations reported for the stations and the model orography are small.
An increased resolution of global climate models can modify the hydrological cycle simulated, with an increase in precipitation over land. In June and July, the analysis region centred on Shanghai is affected by meso-scale convective systems associated with the passage of the Meiyu front, which results in very intense localized rainfall for short durations (e.g., Guan et al., 2020). While the climate model can resolve the Meiyu frontal system and its passage through the region (A. Volonté, pers. comm.), differences in timing, intensity and location of associated convective precipitation compared to the observations (not shown) may play a role in the JJA temperature bias (Figure 6).
The urban model behaviour (offline and online) in highly urbanized central London suggests the positive JJA model bias of T air in the Shanghai domain will likely be exacerbated once more realistic urban land cover is used. Hence, improving land-cover characteristics is expected to cause model performance to deteriorate at these sites, which prompts the need to further investigate reasons for the bias in the atmospheric model to improve the model performance.
In some seasons, there are large quantitative differences between T air from the two ensemble members. Compared to HAD 2 , HAD 1 for the UK domain has a larger 75th percentile and sometimes larger median T air (e.g., DJF, JJA; Figures 4a, S6) at all sites across the common period of data availability. For the China domain, in contrast, HAD 1 has a lower 25th percentile in MAM and SON and lower night-time medians (Figures 4a, S8). These ensemble differences can be as large as 1-2 °C, and hence can have a meaningful impact on potential uses of these data, such as for local climate assessment and planning, if extremes are of interest (Section 4.3). Similarly, the HAD 1 surface heat fluxes in the UK have larger differences to the other ensemble members, especially in summer, while HAD 2 and HAD 3 are more similar (not shown). As the implications for urban services can be significant, this raises more general questions about the use of probabilistic versus deterministic forecasts.
FIGURE 6 Bias (model minus observations) of median 3-hourly T air by season at sites in the Shanghai region (Table 1b) for two ensemble members (HAD 1,2 ).

| Occurrence of temperature extremes
The frequency of occurrence and magnitude of very warm (e.g., heat waves) or cold air temperatures in cities can be affected by the UHI (e.g., Lemonsu et al., 2015; Ramamurthy and Bou-Zeid, 2017; Ao et al., 2019) and can be further exacerbated by feedback mechanisms with Q F (e.g., through air-conditioning; Takane et al., 2020). Representing such processes in climate projections for cities is crucial for various urban applications, such as those related to thermal comfort/heat stress and associated mitigation strategies (e.g., urban greening; Zölch et al., 2016).
While the HadGEM3-PRIMAVERA simulation period is too short to derive statistically robust occurrence likelihoods of heat waves, the occurrence frequency of recent temperature extremes as represented in the model and captured in the observations can be compared for the available output periods. Modelled and observed seasonal frequency distributions (normalized) of T air are compared for daytime and night-time periods in the UK and China domains (Figure 7; Figures S11, S12). For the UK, where hourly data are analysed, the daytime period varies between seasons (all UTC): DJF 09:00-16:00; MAM 07:00-20:00; JJA 06:00-21:00; and SON 08:00-18:00. In the Shanghai domain, daytime length varies much less through the year and the observation frequency is only 3-hourly. Hence, a constant daytime period (06:00-18:00 UTC + 8 h) is used. Common periods and data frequencies for both the observations and model output are used (see Section 2.3).
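The day/night split described above can be sketched in a few lines of Python. This is only an illustration, not the processing code used for the study: the function and constant names are hypothetical, the Shanghai window is read as local time (UTC + 8 h), and treating the window start as inclusive and the end as exclusive is an assumption.

```python
from datetime import datetime, timedelta

# Daytime windows for the UK domain in UTC hours (start, end), as listed in the text.
UK_DAYTIME_UTC = {"DJF": (9, 16), "MAM": (7, 20), "JJA": (6, 21), "SON": (8, 18)}
# Constant daytime window for the Shanghai domain, read here as local time (UTC + 8 h).
CHINA_DAYTIME_LOCAL = (6, 18)
CHINA_UTC_OFFSET = timedelta(hours=8)


def season_of(month):
    """Map a calendar month (1-12) to its meteorological season label."""
    if month in (12, 1, 2):
        return "DJF"
    if month in (3, 4, 5):
        return "MAM"
    if month in (6, 7, 8):
        return "JJA"
    return "SON"


def is_daytime(timestamp_utc, domain):
    """Return True if a UTC timestamp falls inside the daytime window.

    Assumption: the window start is inclusive and the end is exclusive.
    """
    if domain == "UK":
        start, end = UK_DAYTIME_UTC[season_of(timestamp_utc.month)]
        local = timestamp_utc
    elif domain == "China":
        start, end = CHINA_DAYTIME_LOCAL
        local = timestamp_utc + CHINA_UTC_OFFSET
    else:
        raise ValueError(f"unknown domain: {domain!r}")
    hour = local.hour + local.minute / 60.0
    return start <= hour < end
```

For example, `is_daytime(datetime(2007, 1, 15, 8, 0), "UK")` is night-time under the DJF window, whereas 00:00 UTC in the China domain corresponds to 08:00 local time and is classed as daytime.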
The overall shape of the observed T air frequency distributions during day and night is captured by the model in both domains, with a mostly good agreement for the range of values and skewness patterns (Figure 7). However, some of the discrepancies between the diurnal air temperature ranges (identified in Section 4.2) are reflected in the tails of the distributions. The higher observed DJF air temperatures at the London city-centre site LWC (Figure 4a) are reflected in the cool-end tail of the T air distributions (Figure 7a). Similarly, the overprediction of JJA and SON daytime temperatures by the model impacts the warm-end T air tail (stronger for HAD 2 ). At the rural WIS site (Figure 7b), minimum nocturnal temperatures are over-predicted throughout the year, in agreement with characteristics of the 25th percentiles of the seasonal diurnal cycles (Figure S6c), likely linked to the higher model f Urban (Table 1a).
At the SHA (Shanghai/Baoshan) site in the China domain (Figure 7c), the observed nocturnal maximum T air in MAM and JJA are slightly warmer than in the model (see also Figure 4b). This could be connected to the limited nocturnal heat release from building volumes, with the low urban tile weighting in the grid-box (Table 1b). The largest difference between the T air tails in the Shanghai domain is found in the JJA daytime maximum temperatures at the inland sites (Figure 7d for DON; Figure S12b,c for LIY, LUK), with a strong overprediction (much longer tails) from both ensemble members. This agrees with earlier discussions (Section 4.2). Nocturnal distributions of temperatures at these sites are much better modelled in all seasons, but as for the coastal SHA and XIA sites the warm-end tail is slightly underpredicted in MAM and JJA.
The frequency of days with extreme T air for each simulation year is analysed (Figures 8, 9) using observed and modelled daily maximum and minimum temperatures at each site. Cold and hot thresholds are set to be below/above the 25th/75th percentiles of observed temperatures (DJF, JJA, respectively) at all sites. For both domains, the same two cold thresholds are used: 0 and 5 °C. However, the hot thresholds differ between domains, with London thresholds (25/30 °C) lower than for the Shanghai region (30/35 °C). A day is counted as cold (hot) if the minimum (maximum) T air over the 24 h falls below (above) these limits. Qualitatively, the results (Figures 8 and 9) are insensitive to the exact threshold chosen (as long as they are within the distribution tails). The frequency is relative to the total number of days in the year that have valid data, considering both missing observations and model output. In the UK, hourly T air data are analysed, whereas in China 3-hourly samples (Section 3) of the hourly model output are used to match the observations (Section 2.3).
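The counting of cold and hot days can be illustrated with a short Python sketch. It is not the authors' code: the function name and input layout are hypothetical, and a day is treated as valid as soon as it has at least one non-missing sample, which is an assumption (the study additionally restricts the analysis to periods common to observations and model output).

```python
from collections import defaultdict
from datetime import date
import math


def extreme_day_frequency(samples, cold_threshold, hot_threshold):
    """Fraction of valid days classed as cold and as hot.

    samples: iterable of (date, temperature in degC); None or NaN marks missing data.
    A day is cold if its minimum is below cold_threshold and hot if its
    maximum is above hot_threshold. Frequencies are relative to the number
    of days with at least one valid sample (assumption).
    """
    by_day = defaultdict(list)
    for day, temp in samples:
        if temp is not None and not math.isnan(temp):
            by_day[day].append(temp)
    n_valid = len(by_day)
    if n_valid == 0:
        return float("nan"), float("nan")
    cold = sum(1 for temps in by_day.values() if min(temps) < cold_threshold)
    hot = sum(1 for temps in by_day.values() if max(temps) > hot_threshold)
    return cold / n_valid, hot / n_valid


# Made-up samples for illustration only (not data from the study).
samples = [
    (date(2006, 1, 1), -1.0), (date(2006, 1, 1), 3.0),
    (date(2006, 1, 2), 2.0), (date(2006, 1, 2), 6.0),
    (date(2006, 7, 1), 24.0), (date(2006, 7, 1), 31.5),
    (date(2006, 7, 2), None),  # missing day, excluded from the denominator
]
cold_fraction, hot_fraction = extreme_day_frequency(samples, 0.0, 30.0)
```

With these made-up samples, three days have valid data, one of which is cold (below 0 °C) and one hot (above 30 °C), so both fractions equal one third. Applying the same function to sub-sampled 3-hourly model output mirrors how the China domain is treated in the text.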
As the two central London sites (LWC, SJP; Figure 8) are within the same model grid-box, the observed frequencies of hot and cold days reflect the climatological differences between the stations (Section 4.2). The park station (SJP) has a notably higher occurrence of cold days cf. LWC with its more extensive built/impervious surfaces and slower nocturnal radiative cooling ( Figure S9). The observed relative occurrence of hot days is only slightly higher at LWC (cf. SJP), in agreement with the expected smaller role of UHI intensity type processes during the day.
The modelled cold days for HAD 1,2 agree better with SJP (cf. LWC). In 2007 and 2008, the modelled occurrence of both T air < 5 °C and T air < 0 °C at LWC is too high. For rural WIS, in contrast, the model ensemble members in all years underestimate the occurrence frequency of T air < 0 °C. This agrees with overestimation of the 25th percentile of T air in DJF (cf. observations, Figure S6c). This may be partly caused by f Urban being too high in the model relative to the actual surroundings of the site (26.1% and a maximum of 75.5% in a 3-by-3 grid-box area versus 3.5% and 23.1% in GUF 2.5 km and GUF 25 km ; Table 1a).
In the China domain, model performance differs between the coastal, highly urbanized sites (SHA, XIA) and the sites inland and to the north (DON, LIY, LUK). This is reflected in the temperature extremes (Figure 9). For the latter, in all years, the model noticeably overpredicts the occurrence of hot (T air > 30 °C) and very hot (T air > 35 °C) days compared to the observations, in agreement with the JJA bias patterns (Figure 6). Cold days at these sites are better predicted, with no clear inter-annual trend of the model performance. At SHA and XIA, hot days are over-predicted in some years, while in others (2005, 2007 at XIA) the occurrence frequency is under-predicted. Both the model and the observations have a lower occurrence of cold days at the more coastal stations. However, the model results do not reflect the typical urban response of the land-surface at these sites due to the severe under-representation of urban land cover (Sections 2.2 and 3). Hence, some of the better agreement observed here may deteriorate if more realistic land cover is used.
FIGURE 7 HadGEM3-PRIMAVERA (HAD 1,2 ) and observed normalized frequency distributions of T air (0.5 °C bin size) per season from daytime and night-time samples for (a) LWC (London Weather Centre), (b) WIS (Wisley), (c) SHA (Shanghai/Baoshan) and (d) DON (Dongtai). Samples in (a,b) are hourly, in (c,d) 3-hourly with model output frequency reduced to match the observations. See Figure S11 for results at SJP and LHR (UK); Figure S12 for XIA, LIY and LUK (China).

| CONCLUSIONS
High-resolution (10 km; N1280) global climate simulations (2005-2010) with the Met Office HadGEM3 model are analysed over large urban areas in the south-east UK (London) and south-east China (Shanghai, Hangzhou, Nanjing region) to study the response of modelled surface heat fluxes and diagnostic screen-level temperatures (T air ) to urbanization levels. Modelled T air is evaluated using weather station data. The climate model uses a simple urban slab scheme with prescribed, globally fixed parameters (JULES-GL7.0) and land cover derived from IGBP.
While any detected T air bias could be partially attributed to bias in the large-scale atmospheric model (and that needs to be investigated further), differences can also be linked to both the model land cover and the specifications used in the urban land-surface model. We draw the following conclusions regarding the key factors affecting representation of urban signals in the simulations and the potential for improvements if model output is intended to inform applications in urban areas (e.g., urban climate services):
• The representation of urban land cover is identified as the primary source of bias in the JULES land-surface model. Comparisons of recent (2011)  heat fluxes that drive boundary-layer dynamics are predominantly rural/non-urban in nature. This is inevitably reflected in the near-surface air temperatures, as urban (e.g., heat-island) effects are absent. These, for example, can impact heat-wave intensity climatologies. For some sites analysed in the Shanghai region, it is anticipated that resolving the land-cover characterization may increase the JJA T air bias, prompting the need to further investigate reasons for the bias in other parts of the model system. The land cover of Greater London is more realistic in the model, but non-built surface covers (vegetation, water) are too low in central London (model f Urban ≈ 100%), causing bias in the energy partitioning (e.g., Q E too small). Thus, both too small and too large f Urban negatively impact model results and, therefore, the use of these data for applications such as urban climate services (e.g., heat stress assessment or external water use requirements).
Hence, use of appropriate (current and future) urban land-cover information is crucial, and perhaps the part of the modelling chain that is easiest to fix as high-resolution satellite-derived global land-cover products have become more widely available in recent years (e.g., high-resolution (300 m) global land-cover data of the European Space Agency's Climate Change Initiative, ESA-CCI). High-resolution climate simulations in future climates need to include potential future changes in land cover, such as urban expansion and/or changes in land-cover types (e.g., from enhanced green infrastructure; Li et al., 2017; Carter, 2018).
• Urban anthropogenic emissions of heat and water are absent in the current simulations. This causes biases in turbulent heat fluxes and T air . In both London and Shanghai, Q F plays an important role in the surface energy balance. Currently, JULES has the capability to prescribe Q F to the urban tile as spatially unvarying monthly values. Offline tests of the HadGEM3-PRIMAVERA configuration in central London showed that even this simplistic representation can improve the model performance if suitable magnitudes of Q F are used. As retrieval of local-scale energy-consumption information is challenging at the global scale, Q F should ideally be modelled based on more accessible parameters like population density and temperature-dependent heating/cooling demands (e.g., Sailor and Vasireddy, 2006; Lindberg et al., 2013). Anthropogenic water emissions also can have a non-negligible effect on the urban energy balance in some cities (e.g., street cleaning, irrigation; Ao et al., 2018; Dou et al., 2019), but currently only vegetated tiles of JULES can be irrigated.
• Urban scheme physics and parameters used in HadGEM3-PRIMAVERA (JULES-GL7.0) have fixed singular parameter values, making appropriate choices to represent cities worldwide challenging.
For example, the selected default urban albedo (0.18) is large compared to central London (0.11 from observations; Kotthaus and Grimmond, 2014a) and central Shanghai (0.14; Ao et al., 2016b). This reduces the energy input into the urban system (Q N ), which, combined with the absence of Q F in the model, can cause underprediction of Q H . In London, the large thermal inertia of the Best-1T urban scheme causes a 1-h delay of temperature increase in the morning and up to 2-h delay in the afternoon T air peak in the climate model output. Furthermore, modelled JJA daytime temperatures in central London are overestimated by up to 2 °C. This is partially explained by the large heat capacity and roughness length for heat used in the scheme. The JULES two-tile canopy model MORUSES, with separate surface-energy balance calculations for roofs and street canyons, can improve this by explicitly modelling the bulk radiative, thermal and aerodynamic parameters as a function of building morphology (Hertwig et al., 2020). At grid-box scale, the fast response of the (insulated) roof tile to radiative forcing can partially offset the large heat storage and correspondingly delayed sensible heat flux of the canyon tile (Porson et al., 2010). However, urban canopy models require more characteristics than built land-cover fractions, which are currently not available globally; for example, roof and street fractions need to be separated, and mean building heights, height-to-width ratios of street canyons and radiative/thermal material characteristics of built surfaces need to be known. It is also noted that the impact of very tall buildings is not captured at all in the JULES urban schemes. In central Shanghai, for example, over 1200 buildings are taller than 100 m and extend to 632 m (Tan et al., 2015). Hence, the buildings are much larger than the local topography (mean elevation above sea level is 4 m).
Given the increasing verticality of cities worldwide, more research is needed to better understand the impacts of tall buildings on the urban surface-energy balance and how these can be represented in urban models (Barlow et al., 2017).
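To give a sense of scale for the albedo point made in the conclusions, the following Python sketch compares the shortwave radiation absorbed with the model default albedo (0.18) against the observed values for central London (0.11) and central Shanghai (0.14). It uses only the standard relation that absorbed shortwave equals (1 − albedo) times incoming shortwave; the incoming flux of 800 W m-2 is a hypothetical clear-sky midday value chosen for illustration, and the function names are not part of JULES. Longwave terms of Q N are not considered.

```python
def absorbed_shortwave(k_down, albedo):
    """Shortwave radiation absorbed by a surface with bulk albedo `albedo` (W m-2)."""
    return (1.0 - albedo) * k_down


def shortwave_deficit(k_down, model_albedo, observed_albedo):
    """Absorbed shortwave missing in the model relative to the observed albedo (W m-2)."""
    return absorbed_shortwave(k_down, observed_albedo) - absorbed_shortwave(k_down, model_albedo)


K_DOWN = 800.0        # hypothetical incoming shortwave flux (W m-2), illustration only
MODEL_ALBEDO = 0.18   # default urban albedo quoted in the text
OBSERVED_ALBEDO = {"central London": 0.11, "central Shanghai": 0.14}  # values quoted in the text

deficits = {
    site: shortwave_deficit(K_DOWN, MODEL_ALBEDO, albedo)
    for site, albedo in OBSERVED_ALBEDO.items()
}
for site, deficit in deficits.items():
    print(f"{site}: about {deficit:.0f} W m-2 less absorbed shortwave with the default albedo")
```

Under this assumed forcing the default albedo removes roughly 56 W m-2 (London) and 32 W m-2 (Shanghai) of shortwave input, i.e. the deficit scales linearly with the albedo difference and the incoming flux.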