Structural validity of the Eating Disorder Examination—Questionnaire: A systematic review

Abstract
Objective: The main aim was to perform a systematic literature review of studies investigating the factor structure of the Eating Disorder Examination-Questionnaire (EDE-Q), a widely used measure of eating pathology. Secondary aims were to summarize the quality of reporting of latent variable (factor) analyses in these studies and review support for different factor solutions.
Method: Literature published up to February 23, 2022 was identified through the Scopus, Medline, PsycInfo, and ProQuest databases and through outreach via an international listserv. All studies published in English reporting factor analysis of the EDE-Q were included with few restrictions. Sixty studies including 63,389 participants met inclusion criteria.
Results: The originally proposed four-factor solution received little empirical support, although few alternative models have been robustly evaluated. Items assessing shape and weight concerns frequently coalesce in factor solutions, suggesting that these constructs are closely related. Investigations of brief versions of the EDE-Q have produced more consistent findings, suggesting that these measures, particularly a seven-item version, might be useful alternatives to the full version. Quality of studies was reasonable, with important methodological elements of factor analysis often reported.
Discussion: The findings are of relevance to practitioners and researchers, suggesting that the "original" factor structure of the EDE-Q should be reconsidered and that use of a seven-item version is to be encouraged.
Public Significance: Self-report questionnaires are widely used in the assessment of disordered eating. The current study found that there is little consensus about the structure of a common measure of eating psychopathology. There is more consistent support for a brief, seven-item version assessing dietary restraint, body dissatisfaction, and overvaluation of weight and shape.


KEYWORDS
assessment, eating disorder, Eating Disorder Examination-Questionnaire, factor analysis, patient-reported outcome measures, psychometric

1 | INTRODUCTION
Efficient assessment of eating pathology is integral to experimental studies, epidemiological work and clinical settings, and several psychometric measures have been designed for this need. Whilst EDs have traditionally been considered as discrete "categories" (e.g., American Psychiatric Association, 2013), continuous measures can capture the full variation in eating pathology that is seen in both clinical and nonclinical samples (e.g., Luo et al., 2016). Such an approach is consistent with a "network perspective" to conceptualizing mental health problems, whereby EDs, as with other mental health problems, are seen as occurring on a spectrum and demonstrate patterns of interacting symptoms with multifactorial causes, rather than existing as discrete disease entities (e.g., Borsboom, 2017; Monteleone & Cascino, 2021).
A number of self-report measures are widely used in the assessment of eating pathology, one of the most popular of which is the Eating Disorder Examination-Questionnaire (EDE-Q; Fairburn & Beglin, 1994). In line with the semi-structured interview from which it was derived (the Eating Disorder Examination, or EDE; Cooper & Fairburn, 1987), the EDE-Q assesses a variety of behaviors and cognitive features relevant to eating pathology, the latter of which are summarized by four subscale scores ([Dietary] Restraint, Eating Concern, Shape Concern, and Weight Concern; Fairburn & Beglin, 2008) obtained from item scores. From a psychometric point of view, although there is support for the reliability and validity of the EDE-Q in the assessment of ED symptoms (Berg et al., 2012), the suggested factor structure of the measure has proven difficult to corroborate (Grilo et al., 2013), perhaps as the items and constructs of both measures were developed based on "rational rather than empirical grounds" (Cooper et al., 1989, p. 809).
The EDE-Q includes definitions and time frames for key symptoms and typically takes a few minutes to complete. Twenty-two "attitudinal" items are scored on a 0-6 scale based on either: (a) number of days in the previous 28; or (b) "Not at all" to "Markedly." They include questions such as "Have you had a definite fear that you might gain weight?" and "How dissatisfied have you been with your shape?" Six further "behavioral" items assess the frequency of disordered eating behaviors, such as binge eating and self-induced vomiting, and are scored on a ratio scale. These items are typically excluded from calculations of subscales, although some authors have included them due to their centrality in the diagnosis of EDs, often by adapting them to a Likert (ordinal) scale (e.g., Hrabosky et al., 2008; Lev-Ari et al., 2021).
Critiques of these measures have included a bias towards the assessment of bulimia nervosa (Thomas et al., 2014), with a similar criticism that ED measures in general have often been developed with young-adult females in mind (Forbush et al., 2013; Mitchison & Mond, 2015). Concepts integral to EDs, such as weight and shape concerns, can be problematic to assess and are often difficult for respondents to understand, even when prompted (Thomas et al., 2014). The EDE-Q aligns with popular cognitive-behavioral models of eating pathology, presenting items and scoring that reflect the theory that a drive for thinness underpins much eating pathology.
However, such an assumption may not hold for those from non-Western cultures (e.g., Mitsui et al., 2017) or male samples, who typically report lower scores (Schaefer et al., 2018). For example, a study of 1150 adult men suggested that, whilst the concept of body image was relevant, a focus on the "thinness ideal" is restrictive and undervalues the role of muscularity concerns (Forrest et al., 2019). Such difficulties can affect both the interpretation of scores and the computation of scales assumed to reflect single constructs (e.g., Weight Concern or Shape Concern), with an additional risk that the strength of association between certain items and latent factors (e.g., factor loadings) varies across groups (e.g., Serier et al., 2018; cf. Machado et al., 2018).
As noted above, the attitudinal items of the EDE-Q can be used to compute four subscales although Eating Concern was not included as a distinct subscale in the original description of the EDE-Q (Fairburn & Beglin, 1994). A Global score can also be computed by summing the scores of the four subscales and dividing the resulting total by the number of scales (i.e., four). Widely used in research, the EDE-Q is also recommended as an outcome measure within the United Kingdom National Health Service (National Collaborating Centre for Mental Health, 2019). However, a lack of support for the structural validity of the subscales of the EDE-Q can lead to inconsistency around what outcomes are reported, with many studies and clinical services reporting outcomes according to the originally proposed subscales (e.g., for population norms [Hilbert et al., 2012] and treatment studies [e.g., Fischer et al., 2014]). Further, the discriminant validity of the EDE-Q has been limited by variable item loadings and inconsistent identification with a latent factor (Forbush et al., 2013), and a significant proportion of individuals with anorexia nervosa report Global EDE scores in the "normative" range at pre-treatment (Thomas et al., 2014). Given the frequent reliance on self-report measures in evaluating outcome from treatment and assessing symptoms, clarification of the constructs being assessed, and accurate measurement thereof, is vital (Flake & Fried, 2021; Mokkink et al., 2018; Prinsen et al., 2018).
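The Global score computation described above can be sketched as follows. This is a minimal illustration only: the function name, the dictionary keys, and the example subscale values are assumptions introduced here, and the subscale scores are taken as already computed from item scores.

```python
# Illustrative sketch: Global score as the mean of the four subscale scores.
# The key names and example values are hypothetical, not taken from the EDE-Q manual.
SUBSCALES = ("restraint", "eating_concern", "shape_concern", "weight_concern")

def global_score(subscale_scores):
    """Sum the four subscale scores and divide by the number of subscales."""
    missing = [name for name in SUBSCALES if name not in subscale_scores]
    if missing:
        raise ValueError(f"missing subscale score(s): {missing}")
    total = sum(subscale_scores[name] for name in SUBSCALES)
    return total / len(SUBSCALES)

# Hypothetical respondent with subscale scores on the 0-6 metric.
example = {
    "restraint": 1.2,
    "eating_concern": 0.8,
    "shape_concern": 2.5,
    "weight_concern": 2.0,
}
print(round(global_score(example), 3))
```

Because each subscale contributes equally regardless of how many items it contains, any change to the assumed factor structure changes what the Global score represents.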
Internal structure is directly related to scoring and interpretation (Messick, 1995) and the absence of structural validity might undermine support for the (construct) validity of a measure (Keith & Kranzler, 1999). Whilst the clinical utility of the EDE-Q has often been promoted as a strength, this is likely to be more reliable if a consistent factor structure of the measure can be established.
Investigations of the underlying factor structure (and hence the EDE-Q's structural validity) have produced inconsistent findings and there has been little systematic evaluation of data-driven models. Rand-Giovannetti et al. (2020) evaluated alternative models of the EDE-Q factor structure in a sample of 940 undergraduate students.
They concluded that a four-factor model (without a higher order factor representing the "Global" score and with some differences to the "Original" model) provided the best fit, although fit statistics were similar across several competing models (Sellbom & Tellegen, 2019).
To define models for their study, they identified 24 studies looking at the factor structure of the EDE-Q (generating almost as many unique latent structures). Aside from a four-factor model of attitudinal items (often labeled the "Original" model), alternative factor solutions have combined items from two factors ("Shape Concern" and "Weight Concern"; e.g., Peterson et al., 2007), provided different interpretations of the full scale (e.g., Becker et al., 2010; Friborg et al., 2013), or reduced the number of items by removing those which do not consistently load onto a factor (e.g., Gideon et al., 2016; Grilo et al., 2015; Hrabosky et al., 2008). In many studies, a novel interpretation of latent dimensions is presented, oftentimes departing only slightly from existing suggestions. Whilst sample differences, for example, might explain heterogeneity in findings regarding the factor structure of the EDE-Q, it is also possible that methodological differences, such as how factor analysis was performed, may account for discrepancies in proposed solutions (e.g., Vogt et al., 2017), an issue not considered in depth by Rand-Giovannetti et al. Methods to determine model fit, for example, can be susceptible to multivariate non-normality (Fabrigar et al., 1999) and processes for establishing how many factors to retain in exploratory factor analysis (EFA) are often debated (Preacher et al., 2013).
Although the summary provided by Rand-Giovannetti et al. (2020) is helpful, there have been more than 25 investigations of the EDE-Q's factor structure since the initial online publication of this paper, and so a systematic review of all existing literature is warranted. Further, it is unclear how systematically and comprehensively the literature was reviewed (given that this was not the primary aim of their study), and some previous studies that may be relevant (e.g., Machado et al., 2014) do not appear to have been included. A review of the EDE-Q's structural validity, which is the primary aim of the current study, would help focus efforts to refine use of the EDE-Q and to suggest where the weight of evidence lies regarding an optimal factor structure and recommendations for its use in both clinical and research settings. A further goal of the current study is to formulate recommendations for the conduct of future studies, as has been done in other areas (e.g., DiStefano & Hess, 2005; Jackson et al., 2009).
Whilst establishing the clearest factor structure of a measure is only part of an evaluation of its utility, this is necessary for the appropriate assessment of internal consistency as well as for tests of measurement invariance, which afford (mean) comparison across different groups.

| Protocol and registration
Studies were eligible if they provided full-texts in English and included latent variable analysis (EFA or confirmatory factor analysis, CFA) of the EDE-Q. All versions of the EDE-Q (i.e., where all or some of the items from the measure are included) were considered for inclusion in the review, although youth and child versions were not included as these were developed based on "major changes" (Goldschmidt et al., 2007, p. 462) to the EDE-Q, which, alongside possible developmental differences (Forsén Mantilla et al., 2017), might affect psychometric properties. Searches were conducted from 1993 (just before the EDE-Q was first published) to February 23, 2022. The protocol was registered on PROSPERO and can be accessed at https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=245357 and the review followed the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines (Page et al., 2021; Table S1).

| Search strategy
Three electronic databases were searched (Scopus, Medline, PsycInfo) using the following search terms in All Fields: "EDE-Q" AND "factor analysis" OR "exploratory factor analysis" OR "confirmatory factor analysis" OR "factor structure." A search of gray literature was also conducted by using the same search terms through ProQuest and posting a request for relevant literature on an international eating disorders listserv. Where Abstracts indicated that latent variable analysis (LVA) of the EDE-Q was conducted, full texts were subsequently reviewed. Following the database search, reference lists of identified studies were searched for additional studies to be included in the review. Abstracts were collated into an electronic document and duplicates were removed.

| Selection criteria
There were no restrictions on the population covered (e.g., gender and participant nationality). Studies were only included if they described an investigation of the structural validity of the EDE-Q (including some or all of its original items), either as a primary objective or as part of a wider investigation of its psychometric properties.
When the EDE-Q was translated into another language, this was included as long as the previous criterion was met. Similarly, studies using multi-group CFA (e.g., testing for measurement invariance) were included if an analysis of factor structure was reported, and only findings relating to structural validity are discussed in this review.
The current review collated studies of the structural validity of the EDE-Q. Assessments solely of unidimensionality (i.e., the structural validity of a single subscale) were not considered. Key information about EDE-Q studies using factor analysis was summarized and findings organized to inform recommendations for the most appropriate subscales to report. The review also aimed to summarize the factor solution(s) with the most consistent evidence.

| Data extraction
Around 80% of abstracts were assessed by both authors, with good agreement (κ = .86). The authors independently selected full texts for inclusion in the review, noting the: type of analysis used (e.g., EFA and CFA); population sampled (including country, gender, age, race/ethnicity, and socioeconomic status); sample size; and language used (Tables 1, S2-S5). In the case of EFA, details of the software used, association matrix (e.g., correlation), estimation method, nature of rotation (e.g., varimax and promax), and criteria for factor selection (e.g., scree plot and Kaiser-Guttman criterion) were recorded (e.g., Henson & Roberts, 2006). Presence of a pattern matrix and reference to communalities were also noted. For CFA studies, the estimation procedure (e.g., maximum likelihood), software used, matrix analyzed, and whether more than one model was tested were recorded (e.g., Jackson et al., 2009). Whether studies made relevant statements about missing data, normality, and fit indices (including whether cutoffs were reported a priori) was also noted. The strategy was piloted on three papers to refine the extraction template and to ensure consistency across reviewers, following which independent reviews of the remaining 75 papers were performed.
Following independent selection of full-texts, the authors compared responses and identified any discrepancies or omissions (e.g., where only one author had recorded a methodological element of the study). Inter-reviewer agreement for inclusion of studies was good (κ = .85). The full-texts were re-read to ensure that the information was, in fact, presented and this was recorded on an electronic database of all studies. If crucial information was unclear, an attempt was made to contact the corresponding author of the study.

| Quality assessment and data synthesis
Assessment of the methodological quality of the studies included elements of COSMIN standards and reporting of information based on guidance for EFA (Henson & Roberts, 2006) and CFA (Jackson et al., 2009). As many previous studies have assessed other measurement properties of the EDE-Q (e.g., construct reliability) and the current study looks in detail at structural validity, the full COSMIN risk of bias tool (and a potential 116 items) is not appropriate. For example, questions assessing the relevance of each questionnaire item or whether a comparator instrument was included were felt not to be pertinent, and some COSMIN items covering methodological quality criteria differ from suggestions from EFA- or CFA-specific guidance (e.g., sample size and missing data). In addition, although COSMIN guidance provides one section concerning structural validity, one of three relevant questions affords a higher score (and thus lower risk of bias) to studies which have included CFA as opposed to EFA. Given the aims of this study to appraise both EFA and CFA studies, it was decided to adapt COSMIN items on sample size and internal consistency. More detail is provided in Table S2 but, briefly, studies were accorded a score of either 1 or 0 for 10 items (seven each for EFA and CFA, and three across all studies) assessing elements of factor analysis reporting. A total score was therefore taken as an indicator of study quality. Where one paper reported both EFA and CFA, two separate scores were computed.
[...] considered but decided against due to the wide inclusion criteria (e.g., age, geography, and methods), which would have introduced significant "clinical" and "methodological" heterogeneity and potentially obscured genuine differences across samples. Recognizing that reporting findings for different subgroups might be of interest, Table 1 presents study findings according to sample characteristics.
Cohen's κ was computed for some key binary outcomes to estimate inter-rater reliability of the coding scheme; two were assessed for EFA (Use of parallel analysis, κ = 1.00; Total variance reported, κ = 1.00) and two for CFA (Discussion of missing data, κ = .81; Cutoff criteria reported a priori, κ = .83).
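For readers unfamiliar with the statistic, Cohen's κ corrects the observed proportion of agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch of that calculation for two raters; the function name and the example codings are hypothetical and are not the review's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items (nominal codes)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: proportion of items given the same code by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over codes.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical binary codings (1 = element reported, 0 = not reported) for ten studies.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(coder_1, coder_2), 2))
```

In this made-up example the raters agree on nine of ten studies (p_o = .90) while chance agreement is .50, so κ is lower than raw agreement.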

| Study selection
The results of the search and selection process are presented in Figure S1. After removing duplicates, 1410 papers were identified, of which 60 were included after screening full-texts. Three studies of note were excluded from the systematic review, all because the full texts were published in languages other than English and it was therefore not possible to make a full assessment of their methods (Gu et al., 2017; Hilbert et al., 2007; Pennings & Wojciechowski, 2004).
One further study (Richter et al., 2018) was excluded for the same reason but seemed to offer a narrative review of measures rather than LVA of the EDE-Q. The study of Mohd Taib and Khaiyom (2020) was included, although it was stated in the paper that this was a pilot study preceding another using different participants (Mohd Taib et al., 2021). The majority of studies included exclusively (k = 16; 27%) or predominantly (k = 53; 88%) female participants. Aside from the study of Peterson et al. (2020), which recruited a sample of transgender youth, there were three studies which explicitly stated genders other than male or female, with .1% (Rand-Giovannetti et al., 2020), .25% (Jenkins & Davey, 2020), and 1.3% (Zickgraf et al., 2020) of the respective samples comprising other gender identities. Some were more evenly balanced across genders (e.g., Klimek et al., 2021), although predominantly non-female samples tended to be purposively sampled (e.g., Peterson et al., 2020; Scharmer et al., 2020). Only a minority of studies (around one-quarter) recruited from clinical settings, with nearly half of these including both clinical and non-clinical participants.

| Latent structures
Where tested, studies using CFA failed to find support for the "original" four-factor structure of the EDE-Q, with two exceptions (Franko et al., 2012; Villarroel et al., 2011), although several caveats should be noted. Using a Spanish translation of the EDE-Q in college women, Villarroel et al. (2011) reported "satisfactory" fit, noting that they [...] four studies appeared to offer support for either a three- or four-factor solution, depending on the criteria used to determine eligible factors, and one used a "forced" four-factor solution in EFA which resulted in a different interpretation from the Original.
The Weight Concern and Shape Concern subscales have been found to be highly correlated, and several studies in the current review generated latent structures through EFA whereby items of these subscales were considered under a "Weight and Shape Concern" subscale (e.g., Carey et al., 2019; Darcy et al., 2013; White et al., 2014). There was mixed evidence for the presence of a "Global" index of eating pathology, with several studies (Friborg et al., 2013; Klimek et al., 2021; Rand-Giovannetti et al., 2020) generating conflicting findings regarding higher-order models, suggesting that interpretation of the Global score should remain cautious, particularly in non-female or ethnic minority groups (Goel, Burnette et al., 2022).
Similarly, whilst some studies found that a one-factor solution emerged from EFA (e.g., Peterson et al., 2020), others failed to find support using CFA (e.g., Calugi et al., 2017; Penelo et al., 2013). Investigations of the "full" (i.e., 22-item) measure using CFA (Table S4) offered some support for a three-factor model (Peterson et al., 2007) and a four-factor model departing from the "Original" (Goel, Burnette et al., 2022; Friborg et al., 2013), although further work in different samples is needed. Interestingly, these models show similarities, such as combining Weight and Shape Concern into one factor (Rand-Giovannetti et al., 2020).
Studies of briefer versions of the EDE-Q tended to report positive results in terms of model fit, often suggesting preference over longer alternatives (e.g., Machado et al., 2020). The version proposed by Grilo et al. (2010), originally for the EDE but since applied to the EDE-Q (e.g., Grilo et al., 2013), comprises seven items from the original EDE-Q and provides three subscales: Dietary Restraint (three items); Shape/Weight Overvaluation (two items); and Body Dissatisfaction (two items). This model has been supported across several studies and samples (Table S6) and seems particularly well-suited to assessing aspects of eating pathology in university student populations (Jenkins & Davey, 2020). A proposed alternative to this which has received some support is a one-factor solution, developed in a sample of adolescent female twins by Wade et al. (2008), comprising eight items. However, the items within this are very similar to the "Weight and Shape Concern" subscales suggested by Friborg et al. (2013) and Peterson et al. (2007), suggesting that Weight and Shape Concern is a reliable construct that can be assessed through several items of the EDE-Q.

| Study quality and certainty of evidence
As noted above, studies were generally of moderate-good quality (interquartile range for EFA = 4.25-8; for CFA = 6-8) and overall quality ratings suggested that many studies adequately reported a number of key elements of LVA. Those less frequently reported include the input matrix and communalities for EFA (46% and 15% of studies, respectively) and a relevant statement about normality and the matrix analyzed for CFA (57% and 26%).
Several studies included LVA as a secondary aim, often secondary to establishing norms (e.g., Villarroel et al., 2011). However, there has been little replication of latent structures of the EDE-Q, with some studies failing to find support with CFA and subsequently generating a novel version of the EDE-Q using EFA.

| DISCUSSION
The current review included 60 studies comprising over 60,000 participants, confirming that the EDE-Q is a widely used self-report measure for the assessment of eating pathology. The structural validity of the EDE-Q has been investigated across a range of BMIs and ages, across five continents and 19 languages. Validation has included individuals with varying dietary choices and gender identities (e.g., Peterson et al., 2020), and both adolescents and adults have been studied, often in mixed samples. However, despite this wealth of research, the four-factor solution commonly reported has not been consistently supported.

| Structural validity of the EDE-Q
The lack of support for the "Original" structure (Restraint, Eating Concern, Shape Concern, and Weight Concern) was perhaps unsurprising.
The assertions of previous authors regarding flaws in the EDE-Q seem, therefore, to be partially supported, although some of these "serious limitations" (Forbush et al., 2013, p. 861) may be driven by inconsistent interpretation of the factor structure of the "full" EDE-Q. For example, the "linear dependency" between the Shape Concern and Weight Concern subscales (Parker et al., 2016, p. 567) suggests that they are measuring the same variable (or that there is little to discern worries about shape or weight), possibilities highlighted by the developers of the EDE (e.g., Cooper et al., 1989). This conclusion is supported by several studies in the current review endorsing aggregation of relevant items under a "combined" subscale (e.g., Barnes et al., 2012; Rand-Giovannetti et al., 2020). There was limited evidence supporting the constructs of Restraint and Eating Concern (the latter of which was not included in the original conceptualization of the EDE-Q; Fairburn & Beglin, 1994), with some studies suggesting removal and/or reclassification of these items (e.g., Parker et al., 2015; Penelo et al., 2013; White et al., 2014).
For full-item models, strongest support appears to exist for those of Friborg et al. (2013) and Peterson et al. (2007) and, with briefer models, that of Grilo et al. (2010, 2013) has been investigated across several samples (Table S6). Given that many studies have made modifications to latent structures, it is difficult to say whether differences in factor structure are consistent across subgroups such as gender or diagnostic status, although some items of the EDE-Q appear to lack measurement stability, particularly across groups (e.g., Compte et al., 2019; Rand-Giovannetti et al., 2020).
Turning to the identification of a "Global" score, a bifactor (or "nested") model, where a latent "Global" factor reflecting common variance across all items is orthogonal (uncorrelated) to the EDE-Q subscales, performed well compared to models with correlated subscales (Friborg et al., 2013), suggesting that the EDE-Q Global score represents a useful measure of eating pathology and may thus be a valid indicator of treatment outcome (Tatham et al., 2015). However, given that few studies have explicitly addressed this issue, further work is required in light of other work challenging the computation of a "Global" score (Rand-Giovannetti et al., 2020), perhaps through greater correspondence with other clinical indicators (Goel, Burnette et al., 2022).
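As a point of reference, the bifactor structure described above can be written in generic notation. The symbols below are assumptions introduced for illustration and are not taken from the cited studies: x_i is the response to item i, G is the general ("Global") factor, F_{s(i)} is the specific factor for the subscale s(i) to which item i is assigned, and ε_i is the item residual.

```latex
x_{i} = \lambda_{i,G}\, G + \lambda_{i,s(i)}\, F_{s(i)} + \varepsilon_{i},
\qquad
\operatorname{Cov}\!\left(G, F_{s}\right) = 0 \quad \text{for every subscale } s .
```

Each item therefore loads on the general factor and on exactly one specific factor. In many bifactor specifications the specific factors are additionally constrained to be uncorrelated with one another, although whether that constraint was applied in any given study is not stated here.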
More consistent support was found for a brief, seven-item measure (the EDE-Q7; Grilo et al., 2013), which seems to circumvent some of the issues with the longer measure (although admittedly sacrifices a degree of thoroughness). Interestingly, the EDE-Q7 seems to demonstrate structural validity even when "behavioral" items

| Study quality
Looking at the quality of studies, sample sizes were often presented alongside justification and/or discussion, as were methodological details of EFA such as the estimation method, the rotation used, and a pattern matrix. Of note, five EFA studies reported using principal component analysis (PCA) rather than EFA, although the two are conceptually (and mathematically) distinct procedures (Fabrigar et al., 1999), and several studies based factor extraction on the Kaiser-Guttman criterion (often referred to as the "Eigenvalues >1 Rule"), despite recommendations against this (Fabrigar et al., 1999; Henson & Roberts, 2006; Osborne, 2014). Thus, future research should continue to report important details of EFA procedures, use multiple criteria for factor extraction (e.g., Fabrigar et al., 1999; Henson & Roberts, 2006), and employ oblique rotation methods, given high inter-item (e.g., Hilbert et al., 2012) and inter-scale correlations.
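To illustrate two of the factor-retention criteria mentioned above, the following sketch compares the Kaiser-Guttman rule with a simple form of parallel analysis. It is a minimal illustration under stated assumptions: the data are simulated (eight hypothetical items driven by two uncorrelated factors), Pearson correlations are used even though EDE-Q items are ordinal, eigenvalues are taken from the full correlation matrix, and the function names are introduced here rather than drawn from any reviewed study.

```python
import numpy as np

def eigenvalues(data):
    """Eigenvalues of the Pearson correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def kaiser_guttman(data):
    """Number of eigenvalues greater than 1 (the 'Eigenvalues >1 Rule')."""
    return int(np.sum(eigenvalues(data) > 1.0))

def parallel_analysis(data, n_iter=200, percentile=95, seed=0):
    """Retain leading factors whose eigenvalues exceed those of random normal data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    random_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        random_eigs[i] = eigenvalues(rng.standard_normal((n, p)))
    thresholds = np.percentile(random_eigs, percentile, axis=0)
    retained = 0
    for observed, threshold in zip(eigenvalues(data), thresholds):
        if observed <= threshold:
            break  # stop at the first eigenvalue that random data can match
        retained += 1
    return retained

# Simulated data: 500 respondents, 8 items, 2 uncorrelated factors, loadings of 0.8.
rng = np.random.default_rng(42)
factors = rng.standard_normal((500, 2))
loadings = np.zeros((8, 2))
loadings[:4, 0] = 0.8
loadings[4:, 1] = 0.8
data = factors @ loadings.T + 0.6 * rng.standard_normal((500, 8))

print("Kaiser-Guttman:", kaiser_guttman(data))
print("Parallel analysis:", parallel_analysis(data))
```

With clean simulated data both criteria point to the same number of factors; the criteria tend to diverge when factors are weak or highly correlated, which is one reason reporting more than one criterion is recommended.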
Findings were similar in CFA studies, with issues such as internal consistency, normality, and discussion of missing data often mentioned. Reporting cutoffs for fit indices was common (but see Clark & Bowles, 2018), and, in general (e.g., Jackson et al., 2009), future studies should include both incremental and absolute measures of fit.
Reporting of some indices (e.g., goodness-of-fit index [GFI]) is recommended against (Jackson et al., 2009, p. 10) and should perhaps be phased out.
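As an illustration of the distinction between index types, the sketch below computes one index usually classed as absolute (the root mean square error of approximation, RMSEA) and one incremental index (the comparative fit index, CFI) from model and baseline chi-square values. The formulas are the commonly cited ones (some software uses N rather than N − 1 in the RMSEA denominator); the function names and all numeric inputs are hypothetical and are not drawn from any reviewed study.

```python
import math

def rmsea(chi2, df, n):
    """RMSEA from the model chi-square, its degrees of freedom, and sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """CFI: improvement of the tested model over the baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    if d_baseline == 0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - d_model / d_baseline

# Hypothetical CFA output: model chi2 = 100 on 50 df, baseline chi2 = 1050 on 66 df, N = 501.
print(round(rmsea(100.0, 50, 501), 4))
print(round(cfi(100.0, 50, 1050.0, 66), 4))
```

Because the two indices answer different questions (misfit per degree of freedom versus improvement over a baseline), a model can look acceptable on one and poor on the other.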
Several studies generated novel latent structures using EFA, with few having subsequently been subjected to rigorous evaluation through CFA. Several studies have set out to compare the performance of different models through CFA rather than generating additional novel solutions in future samples, particularly given the exploratory, and at times volatile, nature of EFA (Osborne, 2014). Such studies (e.g., Calugi et al., 2017; Goel, Burnette et al., 2022; Machado et al., 2020; Rand-Giovannetti et al., 2020; Scharmer et al., 2020) are usually preferable to those evaluating the fit of only one model (Jackson et al., 2009) and, although more data are needed (particularly in under-represented groups), findings appear to suggest (statistical) superiority of briefer models, particularly that attributed to Grilo et al. (2015; Table S6).
Whilst it should also be borne in mind that a "perfect" latent structure of the EDE-Q may not emerge, further (confirmatory) validation of 22-item (e.g., Friborg et al., 2013) and brief versions (e.g., Gideon et al., 2016; Grilo et al., 2013) seems warranted, as well as greater investigation into the optimal construction of a "Global" score.

| Recommendations for use of the EDE-Q in clinical and research settings
As has been previously argued (e.g., Friborg et al., 2013) [...] (Osteen, 2010), and to assess the performance of a "Global" score in predicting treatment outcome, for example. Further comparison of alternative versions, particularly in clinical groups, seems warranted, in addition to critical evaluation of the predictive validity of subscales and investigation in more diverse samples.

| Recommendations for reporting of factor analyses
Whilst the overall quality of reporting was good, the matrix (for both EFA and CFA) was not usually specified, although it could be inferred in some cases (e.g., through reference to specific software). This finding is common in methodology reviews of factor analysis as many statistics programmes have this as a default, but should nonetheless be stated explicitly (Jackson et al., 2009). Researchers should continue to report key elements of factor analysis methods (mindful of the influence of "default" program settings) and also note matrices and estimation methods wherever possible. Consistent with the recommendations of methodologists, we encourage researchers to consider their aims and choose appropriate strategies for employing factor analysis to ensure that the results are both generalisable and interpretable (Osborne, 2014;Preacher et al., 2013). Finally, given the ordinal nature of "attitudinal" items on the EDE-Q, appropriate robust estimation methods should be used in CFA (e.g., Rhemtulla et al., 2012).

| Limitations
There were some limitations of this review which bear mention.
Behavioral items were typically excluded from factor analyses, and hence from this review, partly because the EDE-Q suggests a ratio (rather than ordinal) scoring for these items. Future work might consider how these items can be integrated into a consistent scoring framework (e.g., Forbush et al., 2013; He et al., 2021). Detailed discussion of structural validity was limited to EFA and CFA, although some alternatives (e.g., Rasch analysis) were identified in the searches and are noted (e.g., Gideon et al., 2016; He et al., 2021). Three articles were found through reviewing reference lists which, although a minority of those included in the final review (5.0%), were not identified through database searching.
Although the latent structure of the EDE-Q has been investigated in many countries, none from the continent of Africa was identified.
Studies of EDs in (particularly Southern) Africa since the 1970s suggest that their presence is more complex than simple "Westernization" and requires greater cultural understanding (Szabo & Le Grange, 2001), indicating that replication attempts in African samples would be informative. Perhaps surprisingly, relatively few studies have included exclusively clinical samples, with some including these as part of a larger sample for LVA (e.g., Machado et al., 2014). As a result, the factor structure of the EDE-Q in clinical samples remains under-studied. Information on participants' socioeconomic status was reported in 25% (15/60) of included studies, usually according to different criteria (e.g., parents' highest education, household income).
Although most studies (k = 44, 73.3%) included information on race or ethnicity, sample characteristics were sometimes unclear and, despite the wealth of studies, there remains a need for future research on samples with greater diversity, particularly regarding gender and ethnicity, to enhance generalizability to historically under-represented groups (Goel, Jennings Mathis, et al., 2022). In line with reporting in treatment trials, data on race/ethnicity were often focused on "White," often including a binary distinction between "White" and "Other." Moving forward, studies should collect (and report) detailed data rather than broad categories, and provide data on all represented races/ethnicities, not just the majority group. Papers not in English were excluded from the review and one highly cited paper in particular (Hilbert et al., 2007) may have been helpful to include as it seems to have influenced several subsequent empirical studies. Lastly, translated versions of the EDE-Q were included and it is possible that this may have influenced the findings, for example, due to errors in translation (Hawkins et al., 2020).

DATA AVAILABILITY STATEMENT
The study protocol has been published. As a systematic literature review, data were extracted from existing papers and tabulated and synthesized (see Tables and Supplemental Material). The quality appraisal tool used is described in the manuscript and Supplemental Material.