Established by Australia’s federal government in 2006, headspace provides physical and mental health services, drug and alcohol services, and vocational assistance to people aged 12–25 years with emerging and established mental health problems. Because headspace delivers a broad set of services informed by the principles of early intervention, service users may present with diverse psychosocial problems of varying severity, although mood and affective symptoms predominate. Clinical services are delivered by general practitioners, psychologists, psychiatrists, and other allied health professionals, and are largely subsidised through publicly funded health-care schemes. Most young people either self-refer or are referred by family, friends, health professionals, or school counsellors.
All young people who attended one of four headspace services in Melbourne or Sydney, Australia, between January 2011 and August 2012, who spoke English and were capable of providing informed consent, were approached about participating in a longitudinal cohort study examining the course of psychiatric disorders in this population. Three of the centres are in outer-city suburbs characterised by socioeconomic disadvantage and limited private sector investment in mental health; the fourth is in a relatively affluent inner-city suburb. Because headspace centres have no defined geographical catchment areas, however, young people can attend a service irrespective of their place of residence. Prospective participants included both those who were receiving clinical services at the time of the study and those who were waitlisted. Those with significant intellectual impairment (i.e., IQ < 65) who could neither provide informed consent nor complete the assessment tasks were excluded from the study, while those who were acutely suicidal (as determined by their assessing or treating headspace clinician) were not approached until their risk had reduced.
The human research ethics committees at the University of Melbourne and the University of Sydney approved the study. Following assessment by a headspace Access Team clinician or completion of their first treatment session, prospective participants were contacted by telephone or in person by research assistants (RAs), each holding at least a four-year degree in psychology, who discussed the aims and nature of the study and determined their interest in participating. Participants aged 15 years and older provided written informed consent; those aged 12–14 years provided assent, and written informed consent was provided by a parent or guardian. The RAs conducted semi-structured interviews with each participant using a range of clinical measures. Registered psychologists trained the RAs in the use of these measures, such that the RAs achieved very good inter-rater reliability on each measure (kappa ≥ 0.8) before recruitment commenced. Following the interview, the RAs provided participants with an iPad or laptop on which they completed several self-report measures. Participants received a $20 gift voucher to compensate them for the time taken to complete each assessment.
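The inter-rater reliability criterion (kappa ≥ 0.8) can be illustrated with a minimal sketch of unweighted Cohen's kappa for two raters assigning categorical ratings. The text does not state which kappa variant was used (for example, a weighted kappa for ordinal items), so the unweighted form is an assumption, and the function name and example ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same, non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: proportion of cases assigned the same category.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater assigned categories independently
    # at their own marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical training ratings: trainer vs. trainee RA on ten interviews.
trainer = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
trainee = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "no"]
kappa = cohens_kappa(trainer, trainee)
print(f"kappa = {kappa:.2f}")  # kappa = 0.80
```

Here observed agreement is 0.9 and chance agreement is 0.5, so kappa is (0.9 − 0.5) / (1 − 0.5) = 0.8.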
Demographic and socioeconomic information
Participants’ age, sex, country of birth (and their parents’ country of birth), languages spoken, relationship status, accommodation status, living arrangements, education and employment status, financial difficulties, and social welfare entitlements were ascertained using questions adapted from the national census and other published sources [38, 39].
Cultural and linguistic diversity
Researchers and practitioners alike often describe people as culturally and linguistically diverse (CALD) if their primary language, cultural norms, and values differ from those of the mainstream community in which they reside. The term is typically applied to those from non-English speaking backgrounds. Here, we have used the term to represent “people of colour”, a term that generally excludes White/Caucasian populations with largely European ancestry. As such, the CALD group in this study largely comprises people who were born (or whose parent[s] were born) in Africa, Asia, the Pacific Islands, Latin America, the Caribbean, or the Middle East. This was to ensure that our CALD group was likely to be from a visibly non-White minority group in addition to having non-English speaking ancestry. Those who were born (or whose parents were born) in a primarily European-language-speaking country (including Australia) but who reported primarily speaking a non-European language were also considered to be CALD. We considered people who identified (or whose parents identified) as Maori to be CALD: although Maori people are Indigenous to New Zealand, the term Indigenous in the Australian context refers solely to people of Aboriginal and Torres Strait Islander ancestry. Aboriginal and Torres Strait Islander people are typically not classed as CALD in the Australian context and as such were not included in the CALD group in this study. Because of the strong European heritage of both countries, those with Argentine or Uruguayan backgrounds were not considered to be CALD unless their primary language suggested otherwise (i.e., they spoke a non-European language). Although Zulu is the language spoken most often in South Africa, English is the principal language of most South African migrants to Australia. Therefore, people with South African backgrounds whose self-reported primary language was English or Afrikaans were not considered to be CALD, whereas South African migrants who reported primarily speaking a non-European (and non-Afrikaans) language were considered to be CALD.
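The CALD classification rules amount to a decision procedure, and a minimal sketch can make their order of precedence explicit. Everything here is hypothetical: the `Background` record, the region labels, and especially the small `EUROPEAN_LANGUAGES` set, which stands in for whatever coding scheme the study actually used. Indigenous Australian status is deliberately not an input, reflecting the rule that it does not by itself confer CALD status.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical region labels whose birthplace counts towards CALD status.
CALD_REGIONS = {"Africa", "Asia", "Pacific Islands", "Latin America",
                "Caribbean", "Middle East"}

# Hypothetical, deliberately incomplete set of languages that do not trigger
# CALD status. Afrikaans is included to mirror the South African rule.
EUROPEAN_LANGUAGES = {"English", "Spanish", "Portuguese", "French", "Italian",
                      "German", "Greek", "Dutch", "Afrikaans"}

# Countries in CALD regions that count only via a non-European primary language.
LANGUAGE_ONLY_COUNTRIES = {"Argentina", "Uruguay", "South Africa"}


@dataclass
class Background:
    # (country, region) pairs for the participant and each parent.
    birthplaces: List[Tuple[str, str]]
    primary_language: str
    maori: bool = False  # participant or a parent identifies as Maori


def is_cald(b: Background) -> bool:
    if b.maori:
        return True
    # A non-European primary language makes anyone CALD, including people
    # born in Australia or other European-language-speaking countries.
    if b.primary_language not in EUROPEAN_LANGUAGES:
        return True
    for country, region in b.birthplaces:
        if country in LANGUAGE_ONLY_COUNTRIES:
            continue  # these count only through the language rule above
        if region in CALD_REGIONS:
            return True
    return False


print(is_cald(Background([("Argentina", "Latin America")], "Spanish")))  # False
print(is_cald(Background([("Australia", "Oceania"), ("Vietnam", "Asia")], "English")))  # True
```

Checking the language rule before the birthplace loop means the Argentine, Uruguayan, and South African exceptions need no special handling beyond being skipped in the loop.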
The Kessler 10 (K-10) was used as a broad measure of psychological distress. The scale comprises 10 questions about negative emotional states the respondent experienced during the past four weeks. Each item is rated on a five-point scale, and item scores are summed to give a total ranging from 10 to 50. Scores between 25 and 29 suggest the presence of a moderately severe mental disorder, while scores of 30 and above indicate more severe psychopathology.
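A minimal scoring sketch for the K-10 follows. It assumes items are coded 1 to 5, which is consistent with the 10–50 range, and it uses only the cut-offs stated above. The function names, and the label for totals below 25, are hypothetical.

```python
from typing import Sequence


def k10_total(items: Sequence[int]) -> int:
    """Sum the 10 K-10 items, each coded 1-5, giving a total of 10-50."""
    if len(items) != 10 or any(not 1 <= x <= 5 for x in items):
        raise ValueError("expected 10 item scores coded 1-5")
    return sum(items)


def k10_band(total: int) -> str:
    """Map a K-10 total onto the bands described in the text."""
    if not 10 <= total <= 50:
        raise ValueError("K-10 totals range from 10 to 50")
    if total >= 30:
        return "severe"
    if total >= 25:
        return "moderately severe"
    return "below moderate threshold"  # hypothetical label; not defined in the text


total = k10_total([3, 3, 2, 4, 3, 2, 3, 3, 2, 2])
print(total, k10_band(total))  # 27 moderately severe
```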
The generalised anxiety disorder scale (GAD-7) is a seven-item self-report measure of the most salient diagnostic features of generalised anxiety disorder (GAD). Respondents rate how often they experienced each symptom in the previous two weeks on a four-point scale. Research suggests that the GAD-7 is a valid tool for screening for GAD in primary care settings and for assessing its severity in clinical practice and research. Total scores range from 0 to 21, with higher scores indicating more severe psychopathology; cut-off scores of 5, 10, and 15 indicate “mild”, “moderate”, and “severe” anxiety, respectively.
The overall anxiety severity and impairment scale (OASIS) consists of five items that assess the frequency and severity of anxiety, the use of avoidance behaviours, and the extent to which anxiety interferes with the respondent’s social and occupational functioning. Items are rated on a five-point scale and reflect the respondent’s self-reported experience over the past week.
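The GAD-7 scoring and cut-offs translate directly into code. This sketch assumes items are coded 0 to 3, consistent with the 0–21 range. The label "minimal" for totals below 5 follows common usage of the instrument rather than this text.

```python
from typing import Sequence, Tuple


def gad7_severity(items: Sequence[int]) -> Tuple[int, str]:
    """Return the GAD-7 total (0-21) and its severity band."""
    if len(items) != 7 or any(x not in (0, 1, 2, 3) for x in items):
        raise ValueError("expected 7 item scores coded 0-3")
    total = sum(items)
    if total >= 15:
        label = "severe"
    elif total >= 10:
        label = "moderate"
    elif total >= 5:
        label = "mild"
    else:
        label = "minimal"
    return total, label


print(gad7_severity([2, 2, 1, 2, 1, 1, 2]))  # (11, 'moderate')
```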
The clinician-rated quick inventory of depressive symptomatology (QIDS-C16) assesses the presence, during the previous seven days, of the major symptoms of depression as defined by the fourth edition of the diagnostic and statistical manual of mental disorders (DSM-IV). Its 16 items, which reflect depressed mood, sleep disturbance, appetite/weight disturbance, psychomotor changes, diminished interest, lowered energy/fatigue, poor concentration, self-criticism, and suicidal ideation, are each rated on a four-point scale (0–3). Only the highest rating among the items assessing sleep, appetite/weight, and psychomotor changes contributes to the total, so the nine symptom domains sum to a score ranging from 0 to 27. Scores above 16 are considered to indicate the presence of a severe depressive disorder.
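Sixteen items rated 0 to 3 could naively sum to 48, so the 0–27 range needs explaining. Under the standard QIDS scoring rule, only the highest rating within each of three groups is counted: sleep (four items), appetite/weight (four items), and psychomotor changes (two items). That leaves nine domain scores. The sketch below assumes the conventional item order:

- items 1–4: sleep
- item 5: mood
- items 6–9: appetite/weight
- item 10: concentration
- item 11: self-criticism
- item 12: suicidal ideation
- item 13: interest
- item 14: energy
- items 15–16: psychomotor

Check this order against the form actually used.

```python
from typing import Sequence


def qids16_total(items: Sequence[int]) -> int:
    """QIDS-16 total (0-27) from 16 item ratings in the assumed standard order."""
    if len(items) != 16 or any(x not in (0, 1, 2, 3) for x in items):
        raise ValueError("expected 16 item ratings coded 0-3")
    domains = [
        max(items[0:4]),    # sleep disturbance: worst of items 1-4
        items[4],           # depressed mood
        max(items[5:9]),    # appetite/weight: worst of items 6-9
        items[9],           # concentration
        items[10],          # self-criticism
        items[11],          # suicidal ideation
        items[12],          # interest
        items[13],          # energy/fatigue
        max(items[14:16]),  # psychomotor changes: worst of items 15-16
    ]
    return sum(domains)


ratings = [1, 3, 0, 0, 2, 0, 1, 0, 2, 1, 1, 0, 2, 2, 0, 1]
print(qids16_total(ratings), sum(ratings))  # 14 16: the naive sum overstates severity
```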
Rumination was assessed using a 10-item questionnaire derived from a longer, validated scale. Respondents indicated the extent to which they experienced each item on a four-point scale.
The young mania rating scale (YMRS) measures the nature and severity of core manic symptoms experienced within the past 48 h. Ratings for the 11 items are based on the interviewee’s subjective report of his or her clinical condition, supplemented by the interviewer’s clinical observations. Seven items are scored from 0 to 4, and the remaining four items are scored from 0 to 8. These latter items are weighted more heavily to compensate for poor cooperation by those who are severely ill. Item scores are summed to give a total ranging from 0 to 60.
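The 0–60 range follows from seven items scored 0–4 plus four double-weighted items scored 0–8 (7 × 4 + 4 × 8 = 60). The sketch below validates and sums a set of ratings. It assumes the standard YMRS numbering, in which the double-weighted items are 5 (irritability), 6 (speech), 8 (content), and 9 (disruptive-aggressive behaviour).

```python
from typing import Dict

# Assumed standard YMRS numbering of the four double-weighted (0-8) items.
DOUBLE_WEIGHTED = {5, 6, 8, 9}


def ymrs_total(scores: Dict[int, int]) -> int:
    """Validate and sum YMRS ratings keyed by item number 1-11."""
    if set(scores) != set(range(1, 12)):
        raise ValueError("expected ratings for items 1-11")
    for item, score in scores.items():
        max_score = 8 if item in DOUBLE_WEIGHTED else 4
        if not 0 <= score <= max_score:
            raise ValueError(f"item {item} must be scored 0-{max_score}")
    return sum(scores.values())


# Maximum possible score: 7 items at 4 plus 4 items at 8.
worst = {i: (8 if i in DOUBLE_WEIGHTED else 4) for i in range(1, 12)}
print(ymrs_total(worst))  # 60
```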
The risk of psychosis was assessed using the comprehensive assessment of the at-risk mental state (CAARMS), a semi-structured interview designed for use by mental health professionals to assess the presence and severity of psychotic symptoms over the past 12 months. The schedule measures symptoms across several domains: positive symptoms, concentration and attention, emotional disturbance, negative symptoms, behavioural change, motor abnormalities, and general psychopathology. The Positive Symptom scale, used in this study, comprises four subscales: unusual thought content, non-bizarre ideas, perceptual abnormalities, and disorganised speech. These subscales were rated according to the intensity, frequency, and duration of the symptoms, their relationship to substance use, and associated level of distress. Pre-set thresholds on both the intensity and frequency of these symptoms were used to classify participants as “psychotic”, “at risk” for psychosis (based on their subthreshold psychotic symptoms), or “not at risk” for psychosis.
The behavioural inhibition/behavioural activation system (BIS/BAS) was used as a broad measure of participants’ subjective personality style. It comprises 24 items that index the person’s behavioural inhibition, responsiveness to reward and punishment, drive, and fun-seeking. The items are rated on a four-point scale and summed.
Participants recalled whether they had ever received a diagnosis of a mental (i.e., neurocognitive, emotional, or behavioural) disorder during childhood.
The alcohol, smoking and substance involvement screening test (ASSIST) is an eight-item self-report measure of tobacco, alcohol, and illicit drug use and associated concerns. Responses to six of the items are summed to indicate the level of risk (i.e., “lower”, “moderate”, or “high”) associated with each substance. Responses to the final item are not used in these calculations but, when endorsed, indicate the recency of injecting substance use (which may itself indicate an elevated risk of substance-related harm).
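The sketch below classifies ASSIST risk using the cut-offs published for WHO ASSIST version 3.0:

- alcohol: 0–10 lower, 11–26 moderate, 27 or more high
- other substances: 0–3 lower, 4–26 moderate, 27 or more high

The text does not state which version or cut-offs were used, so treat these values as assumptions. Per-item response weights are not validated here because they differ between questions.

```python
from typing import Sequence, Tuple


def assist_risk(substance: str, item_scores: Sequence[int]) -> Tuple[int, str]:
    """Specific substance involvement score and risk level for one substance.

    item_scores are the weighted responses to the scored questions (Q2-Q7).
    In ASSIST v3.0, Q5 is not asked for tobacco, so tobacco has five scores.
    """
    expected = 5 if substance == "tobacco" else 6
    if len(item_scores) != expected:
        raise ValueError(f"expected {expected} item scores for {substance}")
    score = sum(item_scores)
    moderate_cutoff = 11 if substance == "alcohol" else 4  # assumed v3.0 cut-offs
    if score >= 27:
        level = "high"
    elif score >= moderate_cutoff:
        level = "moderate"
    else:
        level = "lower"
    return score, level


print(assist_risk("alcohol", [3, 4, 0, 0, 0, 0]))   # (7, 'lower')
print(assist_risk("cannabis", [3, 4, 0, 0, 0, 0]))  # (7, 'moderate')
```

The same score falls into different risk bands for alcohol and cannabis because alcohol has a higher moderate-risk threshold.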
Each participant was assigned a clinical stage based on criteria established by McGorry et al. (2006). The clinical staging model comprises six discrete stages: stage 0 (asymptomatic people at risk of a disorder who have not yet presented for care); stage 1a (help-seekers with mild symptoms and functional impacts); stage 1b (people with attenuated syndromes, often with mixed or ambiguous symptomatology and moderate or severe functional impacts); stage 2 (people with discrete disorders [i.e., those presenting with clear psychotic, manic, or severe depressive episodes]); stage 3 (people with a recurrent or persistent disorder); and stage 4 (people with a severe, persistent, and unremitting illness). Staging decisions are informed by clinical assessment of the person’s current symptomatology (severity, frequency, and type); characteristic mental features; age of onset and course of illness before presentation to health services; previous “worst ever” symptoms and treatment episodes (including hospital admissions); current level of risk of harm due to the person’s illness; previous suicide attempts or other risky behaviours; and current (compared with premorbid) levels of social and occupational functioning.
Rated on a seven-point scale, the single-item clinical global impressions scale (CGI) indicates the severity of the person’s illness, with higher scores indicating a more severe presentation. The item is completed by the interviewer and represents his or her clinical impression of the interviewee’s presentation and level of functioning, based on all information obtained during the assessment.
Occupational functioning and disability
The social and occupational functioning assessment scale (SOFAS) is an observer-rated scale that provides a global assessment of a person’s social and occupational functioning. Scores range from 0 to 100, with higher scores indicating better functioning. For this study, scores were based on participants’ lowest level of functioning in the past year. The rating is independent of the severity of the person’s symptomatology and includes impairments caused by either physical or mental disorders. To be counted, an impairment must be caused by the illness per se rather than reflect a lack of opportunity or other environmental limitations.
The 12-item version of the World Health Organization Disability Assessment Schedule (WHODAS 2.0) was used to examine participants’ difficulties in performing daily activities during the past 30 days. Items are rated on a five-point scale and reflect six domains of functioning: cognition, mobility, self-care, social interactions, life activities (e.g., domestic responsibilities, leisure, work, and school), and participation in community activities. Total scores are the simple sum of the 12 items and range from 0 to 48; they can also be rescaled to range from 0 to 100. Higher scores indicate a greater degree of disability.
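The sketch below computes the simple total and a 0–100 version for the 12-item WHODAS 2.0. It assumes items are coded 0 (none) to 4 (extreme or cannot do) and that the 0–100 score is a linear rescaling of the simple sum. The WHODAS 2.0 manual also describes an item-response-theory-based "complex" scoring. That method is not shown here and may be the one the study used.

```python
from typing import Sequence, Tuple


def whodas12_scores(items: Sequence[int]) -> Tuple[int, float]:
    """Return the simple sum (0-48) and its linear rescaling to 0-100."""
    if len(items) != 12 or any(x not in range(5) for x in items):
        raise ValueError("expected 12 item scores coded 0-4")
    total = sum(items)
    return total, total / 48 * 100


print(whodas12_scores([2] * 12))  # (24, 50.0)
```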
Quality of life
Participants’ perceptions of their overall quality of life in the preceding four weeks were assessed using a single Likert-type item derived from the World Health Organization Quality of Life (WHOQOL) instrument.
Family history of mental disorder
Participants were asked whether, to the best of their knowledge, any immediate family members (i.e., parents or siblings) had (1) ever experienced a serious psychological or emotional problem or (2) died by suicide. Those who had no knowledge of their biological family did not complete this measure.
Participants were asked whether they had ever been (1) charged with a criminal offence; (2) convicted of a criminal offence; or (3) a victim of crime. Affirmative responses were followed up to clarify the nature of the crime and the outcome of any charges or convictions. Violence—whether perpetrated or experienced—was defined as any intentional behaviour involving threatened or actual physical harm (e.g., sexual or non-sexual assault, threats to kill or inflict injury). All other criminal behaviour (including theft, drug use or possession, property damage) was considered non-violent.
The childhood trauma questionnaire (CTQ) was administered to assess participants’ experience of abuse and neglect during childhood and adolescence. Its 28 items measure three forms of abuse (physical, sexual, and emotional) and two forms of neglect (physical and emotional), and include a three-item scale used to detect false-negative reports. Respondents rate the extent to which they experienced each item on a five-point scale. Higher scores indicate more frequent maltreatment, and defined thresholds are used to classify cases according to their severity (i.e., “none”, “low”, “moderate”, or “severe”).
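Each CTQ clinical subscale has five items rated 1 to 5, so subscale scores range from 5 to 25. The sketch below classifies a subscale score using the cut-offs commonly cited from the CTQ manual. These values are not given in the text and should be checked against the manual before use.

```python
# Lower bounds of the "low", "moderate", and "severe" bands for each subscale,
# as commonly cited from the CTQ manual (verify before use).
CTQ_CUTOFFS = {
    "emotional_abuse": (9, 13, 16),
    "physical_abuse": (8, 10, 13),
    "sexual_abuse": (6, 8, 13),
    "emotional_neglect": (10, 15, 18),
    "physical_neglect": (8, 10, 13),
}


def ctq_severity(subscale: str, score: int) -> str:
    """Classify a CTQ subscale score (5-25) as none, low, moderate, or severe."""
    if not 5 <= score <= 25:
        raise ValueError("CTQ subscale scores range from 5 to 25")
    low, moderate, severe = CTQ_CUTOFFS[subscale]
    if score >= severe:
        return "severe"
    if score >= moderate:
        return "moderate"
    if score >= low:
        return "low"
    return "none"


print(ctq_severity("emotional_neglect", 14), ctq_severity("sexual_abuse", 14))  # low severe
```

The same raw score can map to very different bands, which is why thresholds are applied per subscale rather than to a pooled total.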
A short version of the parental bonding instrument (PBI) was used to assess participants’ recollection of their parents’ behaviour toward them. The scale comprises 18 items, which measure both maternal and paternal care, overprotection, and authoritarianism. Respondents rate the frequency with which they experienced each item during childhood on a four-point scale and the scores are summed.
A 20-item questionnaire was used to assess the quality of participants’ interactions with their family, friends, and partner (where applicable). The items are scored on a four-point scale and summed to yield six indexes that measure either positive or negative qualities within these domains. The composite scores are scaled such that they theoretically range from 0 to 1 with higher scores reflecting greater endorsement of the underlying items.
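One way to scale a summed index to a theoretical 0–1 range is min–max rescaling by the lowest and highest possible sums. The sketch assumes items are coded 1 to 4. The text gives neither the item coding nor the exact transformation, so both are assumptions.

```python
from typing import Sequence


def scaled_index(items: Sequence[int], low: int = 1, high: int = 4) -> float:
    """Rescale the sum of items coded low..high to the interval [0, 1]."""
    if not items or any(not low <= x <= high for x in items):
        raise ValueError(f"expected a non-empty list of items coded {low}-{high}")
    n = len(items)
    return (sum(items) - n * low) / (n * (high - low))


print(scaled_index([2, 3]))  # 0.5
```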
Three items were adapted from the discrimination scale in the quality of life in newly diagnosed epilepsy instrument (NEWQOL). Participants were asked whether or not, because of their mental health problems, others: (1) are uncomfortable with them; (2) treat them as inferior; or (3) prefer to avoid them.
All analyses were conducted using R version 4.0.2. For all standard analyses, we used two-tailed tests with a significance level of α = 0.05. We corrected for multiple comparisons by controlling the false discovery rate (FDR) at 5%; q-values were calculated using the qvalue package.
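The qvalue package implements Storey's q-value, which estimates the proportion of true null hypotheses (π0). As a simpler, swapped-in illustration, the sketch below computes Benjamini–Hochberg adjusted p-values. These correspond to q-values with π0 fixed at 1 and are therefore at least as conservative. The function name is hypothetical, and this is not the qvalue algorithm itself.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (step-up FDR control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.01, 0.04, 0.03, 0.20]
q = bh_adjust(p)
print([round(x, 4) for x in q])  # [0.04, 0.0533, 0.0533, 0.2]
print([x <= 0.05 for x in q])    # [True, False, False, False]
```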
Of the 1615 eligible help-seekers approached by the RAs, 806 consented to participate (of whom four subsequently withdrew), a participation rate of 49.9%. Most of the 802 records retained for analysis (662 [82.5%]) were incomplete. Nearly three-quarters (483 [73.0%]) of these incomplete records were missing ≤ 5.3% of their values (range, 0.9–84.7%). Only three (2.7%) variables were complete; the number of missing values per variable ranged from one (0.1%) to 319 (39.8%).
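The reported percentages use two different denominators. The participation rate is based on the 1615 help-seekers approached, whereas the completeness figures are based on the 802 records retained after four withdrawals. The short check below reproduces them from the counts given in the text.

```python
approached, consented, withdrew = 1615, 806, 4
retained = consented - withdrew  # records retained for analysis

figures = {
    "participation rate": 100 * consented / approached,
    "incomplete records": 100 * 662 / retained,
    "incomplete records missing <= 5.3%": 100 * 483 / 662,
    "most-missing variable": 100 * 319 / retained,
    "least-missing variable": 100 * 1 / retained,
}
print(retained)  # 802
for name, pct in figures.items():
    print(f"{name}: {pct:.1f}%")  # 49.9, 82.5, 73.0, 39.8, 0.1
```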
Missingness indicators were created for each variable with missing data. Large correlations among these indicators suggested that the propensity for missingness in a given variable was related to the presence of missing values in other variables. Binary logistic regression models fitted with the indicator variables as outcomes showed that missingness was associated with observed values, which is consistent with the data being at least partly missing at random.
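The sketch below covers the first step of this check: building a 0/1 missingness indicator for each variable and correlating the indicators. For binary variables, the Pearson correlation is the phi coefficient. The logistic regression step is not reproduced. The record structure (dicts with `None` marking a missing value) and the variable names are hypothetical.

```python
from math import nan, sqrt
from typing import Dict, List, Optional, Sequence


def missing_indicators(records: Sequence[Dict[str, Optional[float]]],
                       variables: Sequence[str]) -> Dict[str, List[int]]:
    """1 where a variable is missing (None or absent) in a record, else 0."""
    return {v: [1 if r.get(v) is None else 0 for r in records] for v in variables}


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return nan
    return sxy / sqrt(sxx * syy)


records = [
    {"k10": 27, "sofas": 60},
    {"k10": None, "sofas": None},
    {"k10": 31, "sofas": None},
    {"k10": None, "sofas": None},
]
ind = missing_indicators(records, ["k10", "sofas"])
print(ind)  # {'k10': [0, 1, 0, 1], 'sofas': [0, 1, 1, 1]}
print(round(pearson(ind["k10"], ind["sofas"]), 3))  # 0.577
```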
We estimated missing values using multiple imputation by chained equations (MICE) [64, 65], a technique in which each missing value is replaced with multiple values drawn from the posterior predictive distribution of the missing data conditional on the observed data. Variables were imputed using flexible additive regression models as implemented in the aregImpute function from the Hmisc package. aregImpute accounts for all aspects of uncertainty in the imputations by using the bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution. We generated 80 imputed datasets. Trace plots showed that the results were stable across iterations for each imputation, suggesting that convergence had been achieved. Density plots showed that the distributions of the imputed and observed data were similar.
Numeric variables were expressed as an arithmetic mean and standard deviation. Categorical variables were expressed as counts and percentages. The distributions of characteristics of CALD and non-CALD help-seekers were compared using independent-sample t-tests and Pearson chi-square tests of independence. Mean differences (MD) and odds ratios (OR), and their associated 95% confidence intervals, were reported as the primary measures of effect.
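A common choice for odds ratio confidence intervals is the Woolf (log-scale Wald) 95% interval from a 2 × 2 table. The text does not say which interval method was used, so this is one assumed option. The cell layout and the counts in the example are hypothetical.

```python
from math import exp, log, sqrt
from typing import Tuple

Z_975 = 1.959963984540054  # standard normal 97.5th percentile


def odds_ratio_ci(a: int, b: int, c: int, d: int) -> Tuple[float, float, float]:
    """Odds ratio and Woolf 95% CI for the 2x2 table
        CALD:      a with characteristic, b without
        non-CALD:  c with characteristic, d without
    """
    if min(a, b, c, d) == 0:
        raise ValueError("a zero cell makes the Woolf interval undefined")
    or_ = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(or_) - Z_975 * se_log_or)
    upper = exp(log(or_) + Z_975 * se_log_or)
    return or_, lower, upper


print([round(x, 2) for x in odds_ratio_ci(20, 80, 10, 90)])  # [2.25, 0.99, 5.09]
```

Because the interval is built on the log scale, it is asymmetric around the odds ratio. Here it includes 1 even though the point estimate is 2.25.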
Correlates of CALD status
Sixty-six variables of interest were classified into one of nine domains: demographic, financial security, mental disorder, substance use, functional impairment, forensic history, child maltreatment, social support, and stigma. Given the large number of variables, we performed a series of principal components analyses (PCA) using the psych package to reduce the dimensionality of the dataset and eliminate redundancy among the covariates. We used the Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy to determine the proportion of variance in each set of variables that might be explained by underlying factors, and Bartlett’s test of sphericity to assess the suitability of the data for PCA. Principal components with eigenvalues greater than 1 were retained. Oblique (promax) rotation was applied to the retained components to improve interpretability (see supplementary information). Component scores were computed for each participant and retained for subsequent analyses.
We constructed several logistic regression models using the lrm function in the rms package to identify factors associated with CALD status. Participants were categorised as CALD or non-CALD as delineated earlier. We performed univariate analyses to examine the association between CALD status and each candidate variable. We then examined the association between CALD status and each of the eight variable domains by developing a series of restricted multivariate models comprising only the variables classified to each domain. We assessed model fit using a likelihood ratio test comparing each model with the null model, and assessed model performance using the Nagelkerke R². Calibration was determined by computing the mean squared deviation of each predicted probability from the observed value of the outcome (i.e., the Brier score). The Brier score can be considered a loss function in which increasing distance between predicted and observed values is penalised quadratically. It is a proper scoring rule that ranges from 0 to 1; although its numerical value has no direct interpretation, lower scores indicate better performance. Internal validation was performed using bootstrap resampling (5000 replicates). Finally, we quantified the importance of each domain by comparing the fit of the restricted models with that of a fully adjusted multivariate model, in which all variables were entered simultaneously, using a likelihood ratio test.
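Both performance measures can be computed from predicted probabilities and observed outcomes. The sketch uses the usual definitions. The Brier score is the mean squared difference between the predicted probability and the 0/1 outcome. Nagelkerke's R² rescales the Cox–Snell R² by its maximum attainable value. The function names and example data are hypothetical.

```python
from math import exp, log
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_likelihood(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Bernoulli log-likelihood of the outcomes under the predicted probabilities."""
    return sum(log(p) if y == 1 else log(1 - p) for p, y in zip(probs, outcomes))


def nagelkerke_r2(ll_model: float, ll_null: float, n: int) -> float:
    """Cox-Snell R^2 divided by its maximum attainable value."""
    cox_snell = 1 - exp(2 * (ll_null - ll_model) / n)
    max_cox_snell = 1 - exp(2 * ll_null / n)
    return cox_snell / max_cox_snell


outcomes = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.6, 0.1]
base_rate = sum(outcomes) / len(outcomes)  # intercept-only (null) model prediction
ll_null = log_likelihood([base_rate] * len(outcomes), outcomes)
ll_model = log_likelihood(probs, outcomes)
print(round(brier_score(probs, outcomes), 3))  # 0.075
print(round(nagelkerke_r2(ll_model, ll_null, len(outcomes)), 3))
```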
Following multiple imputation (MI), each completed dataset is typically analysed separately using standard methods. The M parameter estimates (one per imputed dataset) and their associated variances are then pooled using Rubin’s rules to provide a single parameter estimate that incorporates both between- and within-imputation variability, thereby enabling valid inference. However, problems arise when using MI in the context of PCA. Because of the variability in the imputed values, there is no guarantee that the eigenvector corresponding to a given eigenvalue is comparable across datasets. Consequently, pooling the eigenvectors (principal axes, factor loadings) according to the order of the eigenvalues of the covariance matrix estimated from each imputed dataset is likely to produce misleading or meaningless results. Similarly, determining a common set of principal components across imputed datasets can be problematic, because the variability in the imputed values may lead to different decisions for different datasets. Given these difficulties, we performed our analyses on a single imputed dataset selected at random. To assess whether our results were sensitive to the variability in the imputed values, we repeated the analyses in 10 additional randomly selected datasets.
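For scalar quantities such as regression coefficients, Rubin's rules work as follows. The pooled estimate is the mean of the M estimates. Its variance combines the average within-imputation variance with the between-imputation variance, inflated by (1 + 1/M). The sketch below shows that pooling and omits degrees of freedom and confidence intervals. It also shows why pooling needs a parameter that means the same thing in every dataset, which is the assumption that fails for principal axes.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, round(var, 4))  # 2.0 1.8333
```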