A Rasch model to test the cross-cultural validity of the Positive and Negative Syndrome Scale (PANSS) across six geo-cultural groups

Background The objective of this study was to examine cross-cultural differences in the PANSS across six geo-cultural regions. The specific aims were (1) to examine the measurement properties of the PANSS; and (2) to examine how each of the 30 items functions across geo-cultural regions. Methods Data were obtained for 1,169 raters from six regions: Eastern Asia (n = 202), India (n = 185), Northern Europe (n = 126), Russia & Ukraine (n = 197), Southern Europe (n = 162), and the United States (n = 297). A principal components analysis assessed the unidimensionality of the subscales. Rasch rating scale analysis examined cross-cultural differences for each item of the PANSS. Results Lower item values reflect items on which raters showed less variation in scores; higher item values reflect items with more variation in scores. Positive Subscale: Most regions found item P5 (Grandiosity) the most difficult item to score. Item severity ranged from −0.93 (item P6, Suspiciousness/Persecution, USA) to 0.69 (item P4, Excitement, Eastern Asia). Item P3 (Hallucinatory Behavior) was the easiest item to score for all geographical regions. Negative Subscale: The most difficult item to score for all regions was N7 (Stereotyped Thinking), with India showing the most difficulty (Δ = 0.69) and Northern Europe and the United States each showing the least (Δ = 0.21). The second most difficult item for raters to score was N1 (Blunted Affect) for most regions, including Southern Europe (Δ = 0.30), Eastern Asia (Δ = 0.28), Russia & Ukraine (Δ = 0.22) and India (Δ = 0.10). General Psychopathology: The most difficult item for raters to score across all regions was G4 (Tension), with difficulty levels ranging from Δ = 1.38 (India) to Δ = 0.72.
Conclusions There were significant differences in response to a number of items on the PANSS, possibly caused by a lack of equivalence between the original and translated versions, cultural differences in the interpretation of items, or differences in scoring parameters. Knowing which items are problematic for various cultures can help guide PANSS training and allow training to be specialized for specific geographical regions.


Background
Psychopathology encompasses different types of conditions, causes and consequences, including cultural, physical, psychological, interpersonal and temporal dimensions. Diagnosing and measuring the severity of psychopathology in evidence-based medicine usually implies a judgment by a clinician (or rater) of the experience of the individual, and is generally based on the rater's subjective perceptions [1]. Structured or semi-structured interview guides have helped increase rater consistency by standardizing the framework in which diagnostic severity is measured. In clinical trials, good inter-rater reliability is central to reducing error variance and achieving adequate statistical power for a study, or at least to preserving the estimated sample size outlined in the original protocol. Inter-rater reliability is typically established in these studies through rater training programs to ensure competent use of the selected measures.
The Standards for Educational and Psychological Testing (American Educational Research Association, AERA [2]) indicate that assessing test equivalence includes construct, functional, translational, cultural and metric categories. Although many assessments used in psychopathology have examined the construct, functional, translational and metric categories of rating scales, the significance of clinical rater differences across cultures in schizophrenia rating scales has, except for a handful of studies [3,4], rarely been investigated. There is ample research demonstrating the penchant for clinical misdiagnosis and broad interpretation of symptoms between races, ethnicities, and cultures, usually Caucasian American or European vis-à-vis an "other." For example, van Os and Kapur [5] and Myers [6] point to variation in cross-cultural psychopathology ratings. These findings suggest that psychiatric rating scales may not adequately capture cultural disparities not only in symptom expression but also in rater judgment of those symptoms and their severity. Several primary methods have been championed in the past decade as means to aid the implementation of evaluation methods in the face of cultural diversity [7][8][9]. These approaches, still in their infancy, have yielded positive results in the diagnosis, treatment, and care of patients, but they still require reevaluation and additional adjustment [10][11][12]. As clinical trials become increasingly global, it is imperative to understand the limitations of current tools and to adapt or augment methods where, and when, necessary.
One of the most widely used measures of psychopathology of schizophrenia in clinical research is the Positive and Negative Syndrome Scale (PANSS) [13][14][15]. Since its development, the PANSS has become a benchmark when screening and assessing change, in both clinical and research patients. The strengths of the PANSS include its structured interview, robust factor dimensions, reliability [13,16,17], availability of detailed anchor points, and validity. However, a number of psychometric issues have been raised concerning assessment of schizophrenia across languages and culture [18]. Given the widespread use of the PANSS in schizophrenia and related disorders as well as the increasing globalization of clinical trials, understanding of the psychometric properties of the scale across cultures is of considerable interest.
Most international prevalence data for mental health is difficult to compare because of diverse diagnostic criteria, differences in perceptions of symptoms, clinical terminology, and the rating scales used. For example, in cross-cultural studies with social variables, such as behavior, it is often assumed that differences in scores can be compared at face value. In non-psychotic psychiatric illnesses, cultural background has been shown to have substantial influence on the interpretation of behavior as either normal or pathological [19]. This suggests that studies using behavioral rating scales for any disorder should not be undertaken in the absence of prior knowledge about cross-cultural differences when interpreting the behaviors of interest.
There are a number of methodological issues when evaluating cross-cultural differences using results obtained from rating scales [20][21][22][23]. Rasch models have been used to examine, and account for, cross-cultural bias [24]. Riordan and Vandenberg [25] (p. 644) discussed two focal issues in measurement equivalence across cultures: (1) whether rating scales elicit the same frame of reference in culturally diverse groups, and (2) whether raters calibrate the anchor points (or scoring options) in the same manner. Non-equivalence in rating scales among cultures can be a serious threat to the validity of quantitative cross-cultural comparison studies, as it is difficult to tell whether the differences observed reflect reality. To guide decision-making on the most appropriate differences within a sample, studies advocate more comprehensive analyses using psychometric methods such as Rasch analysis [24][25][26]. To date, few studies have used Rasch analysis to assess the psychometric properties of the PANSS [27][28][29][30]. Rasch analysis can provide evidence of anomalies between two or more cultural groups, in which an item can show differential item functioning (DIF). DIF can be used to establish whether a particular group shows different scoring patterns within a rating scale [31][32][33]. DIF has been used to examine differences in rating scale scores with respect to translation, country, gender, ethnicity, age, and education level [34,35].
The goal of this study was to examine the cross-cultural validity of the PANSS across six geo-cultural groups (Eastern Asia, India, Northern Europe, Russia & Ukraine, Southern Europe, and the United States of America) for data obtained from United States training videos (translated and subtitled for other languages). The study examines (1) the measurement properties of the PANSS, namely dimensionality and score structure across cultures, (2) the validity of the PANSS across geo-cultural groups when assessing a patient from the United States, and (3) ways to enhance rater training based on cross-cultural differences in the PANSS.

Measures
The PANSS [13] is a 30-item scale used to evaluate the presence, absence and severity of Positive, Negative and General Psychopathology symptoms of schizophrenia. The 30 items are arranged as seven positive symptom subscale items (P1–P7), seven negative symptom subscale items (N1–N7), and 16 general psychopathology symptom items (G1–G16). All 30 items are rated on a 7-point scale (1 = absent; 7 = extreme). The PANSS was developed with a comprehensive anchor system to standardize administration and improve the reliability of ratings. The potential range of scores on the Positive and Negative subscales is 7–49, with a score of 7 indicating no symptoms; the potential range of scores on the General Psychopathology subscale is 16–112. The PANSS was scored by clinicians trained in psychiatric interview techniques with experience working with the schizophrenia population (e.g., psychiatrists, mental healthcare professionals). A semi-structured interview for the PANSS, the SCI-PANSS [36], was used as a guide during the interview.
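The subscale structure described above can be sketched in code. This is an illustrative example only (not part of the study's analyses); the function name and the assumption that items arrive in P1–P7, N1–N7, G1–G16 order are ours:

```python
# Illustrative sketch: PANSS subscale totals from a 30-item rating vector
# ordered P1-P7, N1-N7, G1-G16 (ordering assumed for this example).

def panss_subscale_totals(ratings):
    """ratings: 30 integers, each from 1 (absent) to 7 (extreme)."""
    assert len(ratings) == 30
    assert all(1 <= r <= 7 for r in ratings)
    positive = sum(ratings[0:7])    # P1-P7, possible range 7-49
    negative = sum(ratings[7:14])   # N1-N7, possible range 7-49
    general = sum(ratings[14:30])   # G1-G16, possible range 16-112
    return positive, negative, general

# A rating of "absent" (1) on every item yields the subscale minima:
print(panss_subscale_totals([1] * 30))  # prints: (7, 7, 16)
```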
Currently there are over 40 official language versions of the PANSS. This translation work has been carried out according to international guidelines, in cooperation between specific sponsors and translation agencies in the geo-cultural groups concerned. Translation standards for the PANSS followed internationally recognized guidelines with the objective of achieving semantic equivalence, as outlined by Multi-Health Systems (MHS Translation Policy, available at http://www.mhs.com/info.aspx?gr=mhs&prod=service&id=Translations). Semantic equivalence is concerned with the transfer of meaning across languages.

Rater training
For the data used in this study, each PANSS rater was required to obtain rater certification through ProPhase LLC, Rater Training Group, New York City, New York, and to achieve inter-rater reliability with an intraclass correlation coefficient ≥ 0.80 with the "Expert Consensus PANSS" scores (or Gold Score rating), in addition to other specified item- and scale-level criteria. The Gold Score is described below. Only a Master's-level psychologist with one year of experience working with schizophrenic patients and/or using clinical rating instruments, or a PhD-level psychologist or psychiatrist, was eligible for PANSS rater certification. Rater training on the PANSS required the following steps: 1. First, a comprehensive, interactive, didactic tutorial was administered prior to the investigator meeting for the specified clinical trial. The tutorial was available at the Investigator's Meeting, online, or on DVD or cassette. The tutorial included a comprehensive description of the PANSS and its associated items, after which the rater was required to view a video of a PANSS interview and rate each item. 2. Second, the rater was provided with feedback indicating the Gold Score rating of each item along with a justification for that score. The Gold Score rating was established by a group of four to five psychiatrists or PhD-level psychologists who had administered the PANSS for ≥5 years. These individuals rated each interview independently.
Scores for each of the interviews were combined and reviewed collectively in order to determine the Gold Score rating. 3. Once the rater completed the above steps and met the qualifying scoring criteria, the rater was provisionally certified to complete the PANSS evaluations.

Table 1 presents sample characteristics and the distribution of countries per geo-cultural group. Data for African raters were not included in the analysis (i.e., 0.85% of the total sample, n = 10; N = 1,179) due to the inadequate sample size needed for comparison. The percentages of data removed for raters (from Africa, 0.85%) and for missing PANSS items (0.0%) are both small, so it is very unlikely that excluding these raters compromised the analyses. It is not surprising to observe virtually no missing responses on the PANSS, as scores on the instrument are incremental for training and raters are required to score every item for rater training and certification prior to the initiation of the study.

Data
The study protocol was approved by Western Institutional Review Board, Olympia, WA for secondary analysis of existing data. Research involving human subjects (including human material or human data) that is reported in the manuscript was performed with the approval of an ethics committee (Western Institutional Review Board (WIRB) registered with OHRP/FDA; registration number is IRB00000533, parent organization number is IORG0000432.) in compliance with the Helsinki Declaration.

Rasch analysis sample considerations
There are no established guidelines on the sample size required for Rasch and DIF analyses. The minimum number of respondents depends on the type of method used, the distribution of the item responses in the groups, and whether there are equal numbers in each group. Previous suggestions for the minimum sample size for DIF analyses have usually been in the range of 100–200 per group [37,38] to ensure adequate performance (>80% power). For the present study, an item shows DIF if raters from different groups do not have an equal probability of scoring consistently on a particular PANSS item [39] (p. 264).

Selection of Geo-Cultural Groups
For this study, we assembled our data according to culture, with special attention to the presence and impact of clinical trials and to the geographic residence of the raters. The resultant groups were defined prior to considering the amount of available data for each geo-cultural group. An attempt was made to include raters who were likely to share more cultural common ground within each group. The geo-cultural groups aim to gather the raters of a town, region, country, or continent on the basis of the realities and challenges of their society. Using geography in part to inform our cultural demarcations is not unproblematic or without limitations. Culture is necessarily social and is not strictly rooted in geography or lineage. However, the categories we elected for this study take geography into account because this was the criterion by which data were organized during rater training.
A few of our groups may appear unconventional at first glance. We separated India from other parts of Asia [38]. Table 1 presents the composition of the geo-cultural groupings. The groups are discursive and artificial constructs intended solely for the purpose of this study. No study of culture can involve all places and facets of life simultaneously, and thus will reflect only generalities and approximations. For this reason, we were forced to overlook the multiple cultural subjectivities and hybridity [40], acculturation and appropriation [41], and fluidity that exist within and between the groups we constructed. The authors chose to keep the United States of America (US) as its own category since the scale is a cultural product of the US and was initially validated in this region.
As with any statistical analysis, if the categories were assembled differently (i.e., including or excluding certain groups, or following a different organizing rationale), the analyses may have yielded slightly different results. However, the authors felt that there were enough similarities within the groupings: symptom expression and perception [42][43][44], clinical interview conduct [45], educational pedagogy and experience [46,47], intellectual approach [48], ideas about individuality versus group identity [49], etc., to warrant our arrangement of data. An attempt also was made to group countries with related histories, educational and training programs, and ethnicities, under the assumption that the within-grouping differences are likely to be less than the between-grouping differences. Prevalence of English-language fluency and exposure was not considered in our categorization. While local-language training materials were made available in all cases (i.e., transcripts of patient videos), some training events included additional resources (i.e., translated didactic slides, on-site translators). The range of English-language comprehension varied greatly among raters, as well as between and within many of the categories. The variance caused by language itself, or as a complex hybrid with cultural understanding and clinician experience with a measure or in clinical trials, deserves more attention [50]. Therefore, it is recommended that a separate analysis of the effects of language on inter-rater reliability be conducted.

Statistical methods
The Rasch measurement model assumes that the probability of a rater scoring an item is a function of the difference between the subject's level of psychopathology and the level of psychopathology symptoms expressed by the item. Analyses conducted included assessment of the response format, overall model fit, individual item fit, differential item functioning (DIF), and dimensionality.
Inter-rater reliability: The internal consistency of the PANSS was tested through Cronbach's α reliability coefficients, whereas inter-rater reliability [51] was tested based on the intraclass correlation coefficient (ICC). The inter-rater reliability of the PANSS across all regions was assessed. We classified an ICC above 0.75 as excellent agreement and below 0.4 as poor agreement [52].
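For illustration, an ICC can be computed from two-way ANOVA mean squares as below. This is a hedged sketch, not the study's code: the text does not specify which ICC form was used, so ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption here, and the agreement cutoffs are those cited in the text.

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x: (n_targets, k_raters) array of scores."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # targets
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_total = ((x - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def classify(icc):
    # Cutoffs used in the text: >0.75 excellent, <0.4 poor [52].
    if icc > 0.75:
        return "excellent"
    if icc < 0.4:
        return "poor"
    return "fair to good"
```

With perfect agreement across raters, `icc_2_1` returns 1; a constant offset between raters lowers the coefficient, since ICC(2,1) penalizes absolute disagreement.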
Unidimensionality: DIF analyses assume that the underlying distribution of θ (the latent variable, i.e., psychopathology) is unidimensional [53], with all items measuring a single concept; for this reason, the PANSS subscales (Positive symptoms, Negative symptoms, and General Psychopathology) were used, as opposed to a total score. Dimensionality was examined by first conducting a principal components analysis (PCA) to assess unidimensionality as follows: (1) a PCA was conducted on the seven Positive Symptom items, (2) the eigenvalues for the first and second components produced by the PCA were compared, and (3) if the first eigenvalue was about three times larger than the second, unidimensionality was assumed. Similar eigenvalue comparisons were conducted for the seven items of the Negative Symptoms subscale and the 16 items of the General Psychopathology subscale (see [54] for methods of assessing unidimensionality using PCA). Suitability of the data for factor analysis was tested by Bartlett's Test of Sphericity [55], which should be significant, and the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, which should be >0.6 [56].
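The eigenvalue-ratio screen in step (3) can be sketched as follows. This is an illustrative example, not the study's code; the 3:1 rule of thumb is taken from the text above.

```python
import numpy as np

def first_to_second_eigenvalue_ratio(item_scores):
    """item_scores: (n_raters, n_items) matrix for one subscale.
    Returns the ratio of the first to the second eigenvalue of the
    item correlation matrix (the basis of the 3:1 rule described here)."""
    corr = np.corrcoef(np.asarray(item_scores, dtype=float), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    return eigvals[0] / eigvals[1]

# Unidimensionality is assumed when the ratio is roughly 3 or more.
```

A usage example: data generated from a single strong factor plus small noise produces a ratio far above 3, while truly multidimensional data pushes the second eigenvalue up and the ratio down.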
Rasch Analysis: For each PANSS item, a separate model was estimated using the response to that item as the dependent variable. The overall subscale score (Positive symptoms, Negative symptoms, or General Psychopathology) and each cultural grouping were the independent variables.
Two sets of Rasch analyses were conducted for each of the 30 items from the PANSS scale.

Rasch analyses by geo-cultural grouping
To assess the measurement invariance of item calibrations across countries in the present study, the Rasch rating scale model was used [57]. The primary approach to addressing measurement invariance involves the study of group similarities and differences in patterns of responses to the items of the rating scale. Such analysis is concerned with the relative severity of individual test items for groups with dissimilar cultural backgrounds. It seeks to identify items for which equally qualified raters from different cultural groups have different probabilities of endorsing a particular score on a PANSS item. To be used in different cultures, items must function the same way regardless of cultural differences. The Rasch model proposes that the responses to a set of items can be explained by a rater's ability to assess symptoms and by the characteristics of the items. The Rasch rating scale model is based on the assumption that all PANSS subscale items share a common structure for the response choices. The model provides estimates of the item locations that define the order of the items along the overall level of psychopathology.
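For context, the rating scale model referenced here (Andrich's formulation) gives the probability of each response category from the person location θ, the item location δ, and a set of category thresholds shared across the subscale's items. A minimal sketch, assuming the standard formulation (the function name and example threshold values are illustrative, not the study's):

```python
import math

def rsm_category_probs(theta, delta, taus):
    """Andrich rating-scale model: probability of each response category.
    theta: person/rater location; delta: item location (Δ difficulty);
    taus: category thresholds shared across items (len = n_categories - 1;
    the 7-point PANSS scale would use six thresholds).
    Returns probabilities for categories 0..len(taus)."""
    logits = [0.0]           # category 0 contributes exp(0) = 1
    running = 0.0
    for tau in taus:
        running += theta - delta - tau
        logits.append(running)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

When θ equals the item location and the thresholds are symmetric about zero, the category probabilities are symmetric, which is the sense in which all items "share a structure" for the response choices.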
Rasch analysis calibrates items based on the likelihood of endorsement (symptom severity). Item location is presented as average item calibrations (Δ Difficulty), goodness of fit (weighted mean square), and standard error (SE). The Rasch analysis was performed using jMetrik [58], where a lower Δ Difficulty (i.e., negative Δ) indicates that raters had less difficulty with that item. Taking into account the order of the item calibrations, ranked by Δ from smallest to largest, the adequacy of each item can be further evaluated by examining the pattern of easy and difficult items to rate by culture (see Tables 2, 3 and 4b). When there is a good fit to the model (i.e., weighted mean square (WMS)), responses from individuals should correspond well with those predicted by the model. If the fit of most of the items is satisfactory, then the performance of the instrument is accurate. WMS fit statistics show the size of the randomness, i.e., the amount of distortion of the measurement system. Values less than 1.0 indicate observations are too predictable (redundancy; data overfit the model). Values greater than 1.0 indicate unpredictability (unmodeled noise; data underfit the model). Therefore, a mean square of 1.5 indicates 50% more randomness (i.e., noise) in the data than modeled. High mean squares (WMS > 2.0) were evaluated before low ones, because the average mean square is usually forced to be near 1.0. Since mean-square fit statistics average about 1.0, if an item was accepted with a large mean square (low discrimination, WMS > 2.0), then counter-balancing items with low mean squares (high discrimination, WMS < 0.50) were also accepted.
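The WMS (infit) statistic described above is the sum of squared score residuals weighted by the model variances. A minimal sketch of the computation and the screening rule used here (illustrative, not jMetrik's implementation; the function names are ours):

```python
def weighted_mean_square(observed, expected, variances):
    """Infit (information-weighted) mean square for one item:
    sum of squared residuals divided by the sum of model variances.
    observed/expected/variances: per-response lists for that item."""
    num = sum((o - e) ** 2 for o, e in zip(observed, expected))
    den = sum(variances)
    return num / den

def flag(wms):
    # Screening rule described in the text: examine WMS > 2.0 first,
    # then counter-balancing WMS < 0.50.
    if wms > 2.0:
        return "underfit (noise): examine first"
    if wms < 0.5:
        return "overfit (redundancy)"
    return "acceptable"
```

A WMS of exactly 1.0 means the observed residual variance matches the model's expectation; 1.5 means 50% more randomness than modeled, matching the interpretation given above.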

DIF analyses by geo-cultural grouping
Based on the results of the Rasch analyses, different approaches can be taken to account for weaknesses in the scoring properties of the PANSS post hoc. The Mantel-Haenszel statistic is commonly used in studies of DIF because it makes meaningful comparisons of item performance for different geographical groups by comparing raters of similar cultural backgrounds, instead of comparing overall group performance on an item. In a typical differential item functioning (DIF) analysis, a significance test is conducted for each item. As the scale consists of multiple items, such multiple testing increases the possibility of making a Type I error at least once. For the 30-item PANSS, approximately 2 item response strings (30 × .05 = 1.5) would be expected to reach p ≤ .05 by chance alone under the Rasch model. α is the Type I error rate for a single test (incorrectly rejecting a true null hypothesis). When the data fit the model, the probability of a correct finding for one item is (1 − α), and for n independent items, (1 − α)^n; consequently, the familywise Type I error for n independent items is 1 − (1 − α)^n. To control this, the level for each single test is set at α/n. Thus, for a familywise finding of p ≤ .05 across 30 items, at least one item would need to be reported with p ≤ .0017 on a single-item test for the hypothesis that "the entire set of items fits the Rasch model" to be rejected.
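The familywise error arithmetic above can be verified directly:

```python
# Familywise Type I error across n independent item tests, and the
# Bonferroni-style per-test level that keeps the familywise rate at alpha.
alpha, n = 0.05, 30
familywise = 1 - (1 - alpha) ** n   # error rate if each test uses .05
per_test = alpha / n                # corrected single-test level
print(round(familywise, 3), round(per_test, 4))  # prints: 0.785 0.0017
```

That is, testing all 30 items at α = .05 would produce at least one spurious DIF flag about 79% of the time, which is why the per-item criterion is tightened to p ≤ .0017.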
As the PANSS was developed in the US and the rater training was conducted by a training facility in the US, the authors chose to compare each geo-cultural group to the US. Additionally, raters in similar geo-cultural groups were compared (e.g., Northern European raters vs. Southern European raters, Eastern Asian raters (hereafter referred to as Asia or Asian) vs. Indian raters, Northern European raters vs. Russia & Ukraine raters). The Mantel-Haenszel procedure was performed in jMetrik and produces effect size computations and Educational Testing Service (ETS) DIF classifications as follows: A = Negligible DIF; B = Slight to Moderate DIF; C = Moderate to Large DIF. Operational items categorized as C are carefully reviewed to determine whether there is a plausible reason why any aspect of the item may be unfairly related to group membership, and may or may not be retained on the test.
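As an illustrative sketch (not jMetrik's implementation, and simplified to a dichotomized item), the Mantel-Haenszel common odds ratio can be converted to the ETS delta metric (MH D-DIF = −2.35 ln α_MH) and classified by magnitude. The full ETS rule also conditions on statistical significance, which is omitted here:

```python
import math

def mh_ddif(strata):
    """Mantel-Haenszel D-DIF for one dichotomized item.
    strata: list of (a, b, c, d) counts per matched score level, where
    a/b = reference-group endorse/not, c/d = focal-group endorse/not."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return -2.35 * math.log(num / den)   # ETS delta metric

def ets_class(ddif):
    """Simplified ETS labels (magnitude only)."""
    m = abs(ddif)
    if m < 1.0:
        return "A"   # negligible DIF
    if m < 1.5:
        return "B"   # slight to moderate DIF
    return "C"       # moderate to large DIF
```

With identical odds in every matched stratum the common odds ratio is 1, D-DIF is 0, and the item is classified A; the sign of D-DIF carries the −/+ direction (reference vs. focal group) described below.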
Additionally, each category A, B or C is scored as either − or +, where −: favors the reference group (indicating the item is easier to score for this group than for the comparison group), and +: favors the focal group (indicating the item is easier to score for this group than for the comparison group) (see Table 2).

Reliability
The subscale measures also showed excellent reliability across all three subscales for each of the six geo-cultural groups.

Assessment of unidimensionality
Principal Components Analysis (PCA) without rotation revealed one component with an eigenvalue greater than one for the Positive Symptoms subscale, one component with an eigenvalue greater than one for the Negative Symptoms subscale, and four components with an eigenvalue greater than one for the General Psychopathology subscale. Bartlett's Test of Sphericity was significant (p < .001) for all three subscales, and the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy produced values of 0.790, 0.877, and 0.821 for the Positive, Negative, and General Psychopathology subscales, respectively.

Rasch analysis
Most items showed high mean squares (WMS > 2.0, or low discrimination). Poor fit does not mean that the Rasch measures (parameter estimates) are not additive (appropriate); the Rasch model forces its estimates to be additive. A WMS > 2.0 therefore suggests a deviation from unidimensionality in the data, not in the measures. Values greater than 2.0 (see Table 3) indicate unpredictability (unmodeled noise; model underfit). Items with high WMS were examined first (to assess which items may have been influenced by outliers) and temporarily removed from the analysis before investigating the items with low WMS, until WMS values were closer to 1.0.

Positive symptoms
Average item calibrations and goodness-of-fit values for each PANSS Positive subscale item for the six geo-cultural groups are presented in Table 3. Lower item calibrations reflect items that were easy to endorse, on which raters showed less difficulty scoring; higher item calibrations reflect items that raters found more difficult to score.

General psychopathology symptoms
Average item calibrations and goodness-of-fit values for each PANSS General Psychopathology subscale item for the six geo-cultural groups are presented in Table 3. Lower item calibrations reflect items that were easy to endorse, on which raters showed less difficulty scoring; higher item calibrations reflect items that raters found more difficult to score. Raters from the United States of America (Δ = 0.69) found this item difficult to score. Other geo-cultural groups also showed difficulties rating this item (see Table 2). Raters from Asia also had difficulty scoring item G5 (Mannerisms and Posturing) (Δ = 0.81). Raters from Southern Europe, India, Asia, Russia & Ukraine and the United States of America also had difficulty scoring item G14 (Poor Impulse Control), with item difficulty ranging from Δ = 0.90 (Southern Europe) to Δ = 0.40 (Asia). Russia & Ukraine also had significant difficulty with item G15 (Preoccupation). Overall, the goodness-of-fit of the PANSS Positive items was satisfactory across all geo-cultural groups. Moderate to large DIF (Class C) was also seen for P5 (Grandiosity), favoring the US (see Table 4).

Differential item functioning analysis
India: Significant DIF was found for items P3. Hallucinatory Behavior, P6. Suspiciousness/Persecution, and P7. Hostility for Indian raters compared to USA raters. Of the significant items, P6. Suspiciousness/Persecution shows slight to moderate DIF (Class B) favoring the United States (reference group), whilst P7. Hostility shows slight to moderate DIF (Class B) favoring India. Negligible DIF (Class A) was observed for P3. Hallucinatory Behavior (see Table 5).
Asia: Significant DIF was found for P7. Hostility for Asian raters compared to USA raters. P7. Hostility showed slight to moderate DIF (Class B) favoring Asian raters (see Table 5).

Negative symptoms
Northern Europe: Significant DIF was found for items N1. Blunted Affect and N7. Stereotyped Thinking for Northern European raters compared to US raters. Of the significant items, N1. Blunted Affect showed slight to moderate DIF (Class B) favoring the USA, and N7. Stereotyped Thinking showed slight to moderate DIF (Class B) favoring Northern Europe (see Table 4).
Russia & Ukraine: Significant DIF was found for N6. Lack of Spontaneity and Flow of Conversation; Negligible DIF (Class A) was observed for scores obtained for raters from Russia & Ukraine compared to US raters (see Table 4).
India: Significant DIF was found for items N3. Poor Rapport, N4. Passive Apathetic Social Withdrawal, and N6. Lack of Spontaneity and Flow of Conversation for Indian raters compared to USA raters. Of the significant items, only N3. Poor Rapport showed slight to moderate DIF (Class B), favoring Indian raters. Negligible DIF (Class A) was observed for N4. Passive Apathetic Social Withdrawal and N6. Lack of Spontaneity and Flow of Conversation (see Table 5).
Asia: No significant DIF was found for Asian raters compared to US raters (see Table 5).
India: As Table 6 shows, significant slight to moderate DIF (Class B) was observed for G3. Guilt Feelings and G14. Poor Impulse Control; G3. Guilt Feelings favored the US raters and G14. Poor Impulse Control favored the Indian raters. Significant moderate to large DIF (Class C) was found for G2. Anxiety, G6. Depression, G10. Orientation, and G12. Lack of Judgment and Insight, with G2. Anxiety and G6. Depression favoring US raters, and G10. Orientation and G12. Lack of Judgment and Insight favoring Indian raters.

Discussion
This article is the first to publish a cross-cultural comparison of psychometric performance, mean scale scores, and item- and scale-level summaries for the PANSS using qualified raters who rated one of 13 standardized training videos of a patient in the United States. Our aim was to perform a cross-cultural validity assessment by checking for DIF due to cultural factors in a sample of qualified raters from six different geo-cultural groups. The results showed significant differences in response to a number of items on the PANSS.
The intraclass correlations (ICCs) for the PANSS total score for the United States group were marginally higher than the ICCs for the other geo-cultural groups. Although all reliability estimates were excellent (i.e., ≥0.80), all groups had the lowest ICCs for the Negative symptom subscale compared to the Positive symptom and General Psychopathology subscales, suggesting increased variability among scores for the Negative symptom subscale.

Rasch analysis and differential item functioning (DIF)
Although the PANSS was originally designed with three subscales (Positive, Negative, and General Psychopathology), studies examining the internal structure of the scale [59][60][61] have all identified the same two underlying factors, Positive and Negative. Other factors have varied and have included Disorganized, Excitement, Hostility, Dysphoric, Catatonic and many more [15,62,63]. Given that Rasch analysis depends on how symptom severity is defined, the appropriateness of modeling items via their subscale scores, rather than a total PANSS score, was confirmed by conducting a PCA on each subscale to assess unidimensionality. Although the PCA of the General Psychopathology subscale did not support unidimensionality, it is common practice in clinical trials to examine the Positive and Negative subscales independently from the rest of the scale, since these symptoms are considered a key component of the disease [15] and are the symptom clusters primarily targeted in drug development.
While variation was present in the order and location of some PANSS items across geo-cultural groups, the overall pattern of item calibration was generally congruent. Within each of the six groups, the Rasch model also confirmed the hierarchical structure of the PANSS items, as evidenced by the pattern of average item calibrations and goodness-of-fit indices. In each region, most item calibrations were well spaced along the continuum of psychopathology, suggesting that, for the groups included in this study, the PANSS is able to measure a wide range of function in schizophrenia. Items found easy to score by all geo-cultural groups included P3 (Hallucinatory Behavior), P6 (Suspiciousness/Persecution), G12 (Lack of Judgment and Insight), and G2 (Anxiety). Additionally, results indicated that Northern European raters were more likely to endorse higher scores on all Positive symptom items except N7 (Stereotyped Thinking) compared to other regions. It should be noted that the first three items generally load on the Positive symptom factor domain in factor analytic studies [60,61]. The Positive factor comprises the most active and first-rank symptoms that define schizophrenia, and it is primarily from these symptoms that a clinical diagnosis of schizophrenia is made. Therefore, raters may find it easier to score items that are first-rank or core features of schizophrenia, as these symptoms are also present in the diagnostic criteria.
In addition to the items listed above, raters from Northern Europe, the US, and India found item N5 (Difficulty in Abstract Thinking) to be an easier item to score. It should be noted that this item is intended to be based on objective responses by the patient, not on the rater's subjective interpretation. It can be suggested that items with clear scoring instructions tied to objective responses (e.g., if the subject answers four out of four proverbs correctly, a score of one should be given) are easier to score across most geo-cultural groups. Other items observed to be easier to score include N2 (Emotional Withdrawal; Southern Europe and Asia), N6 (Lack of Spontaneity and Flow of Conversation; Russia & Ukraine), G6 (Depression; Northern Europe), and G16 (Active Social Avoidance; United States, India, and Asia). With the exception of G6 (Depression), the latter items generally load on a Negative factor domain [60,61]. The Negative factor reflects the difficulties in social relatedness often exhibited by many patients with schizophrenia and comprises second-rank symptoms. Again, symptoms that are more prevalent in schizophrenia and that constitute first- and second-rank symptoms are easier to score across most countries. It should be noted that, upon evaluation of mean scores, Southern European raters scored higher on most Negative symptom subscale items (N1 Blunted Affect, N2 Emotional Withdrawal, N5 Difficulty in Abstract Thinking, and N6 Lack of Spontaneity and Flow of Conversation), whereas the lowest scores on Negative symptom items came from raters from Asia for N1 Blunted Affect, N3 Poor Rapport, N5 Difficulty in Abstract Thinking, and N7 Stereotyped Thinking compared to other geo-cultural groups.
All geo-cultural groups had significant DIF for P4 (Excitement), N7 (Stereotyped Thinking), and G10 (Disorientation). Santor and colleagues [28] demonstrated that several items (N7 and G10) have problematic features and fundamental issues in relation to the level of psychopathology measured by the overall PANSS. Our own item response analysis [30] has also demonstrated significant DIF for item G10 (Disorientation) with regard to its contribution to the assessment of psychopathology as measured by the PANSS. Additionally, previous psychometric investigations have indicated that item G10 (Disorientation) either does not discriminate well in terms of assessing overall severity or does not reflect dimensional individual differences between patients with schizophrenia [15,62]. Similarly, all groups showed significant slight to moderate DIF relative to the US raters for item G10 (Disorientation). This item measures the subject's lack of awareness of his or her relationship to the surroundings and is assessed with specific questions about the subject's knowledge of his or her doctor, address, and political figures. Therefore, DIF on this item may reflect a weak relationship to psychopathology rather than cultural/geographical differences in scoring patterns.
The main source of rater differences among items was observed for the General Psychopathology subscale, with five items showing different rating patterns across geo-cultural groups: G3 Guilt Feelings (Russia & Ukraine), G5 Mannerisms and Posturing (Asia), G16 Active Social Avoidance (Southern Europe and Russia & Ukraine), G14 Poor Impulse Control (all regions except Northern Europe), and G15 Preoccupation (Russia & Ukraine). Although support for the PANSS General Psychopathology subscale has been found in other studies [13], the current findings suggest that ratings for items on the General Psychopathology subscale differ for European and Japanese raters, and it should not be assumed that the same, standard rating criteria were applied uniformly across these groups.
There are several possible explanations for discrepancies among raters both between and within the geo-cultural groups examined [63]. One possible reason may be interpretation variance: even once raters agree on common criteria, differences more frequently arise from decision-making differences in the scoring of an item. Thus, when training a cohort of raters, it may be necessary to focus part of the training on cultural differences and expectations about different symptom thresholds. There was significant moderate to large DIF (i.e., ETS classifications of C) for most items scored by Southern European raters compared to US raters. Different social views arising from cultural influences might lead to different ratings of social and emotional behaviors. Although the Positive and Negative subscales met the other Rasch model requirements, the presence of DIF by region means that culture might contribute to the scores on these items. Therefore, when clinical trial investigators pool PANSS data from different countries, items showing DIF should be removed or split. An iterative "top-down purification" splitting approach for items showing uniform DIF has been applied elsewhere [26].
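Flagging items for between-group DIF before pooling can be illustrated with a crude item-contrast check. This sketch uses a simple logit transform of item means and a 0.5-logit flagging threshold as stand-ins for full Rasch rating-scale estimation and the ETS classification, so the function names, data, and cut-off here are all illustrative assumptions rather than the study's method.

```python
import numpy as np

def item_difficulties(scores: np.ndarray, max_score: float = 6.0) -> np.ndarray:
    """Crude logit approximation of item difficulty (not full Rasch estimation).

    scores: raters x items matrix on a 0..max_score scale.
    Items with lower mean ratings come out as more 'difficult' (higher logit),
    and difficulties are centered so groups share a common origin.
    """
    p = np.clip(scores.mean(axis=0) / max_score, 1e-3, 1 - 1e-3)
    d = -np.log(p / (1 - p))
    return d - d.mean()

def flag_dif(d_ref: np.ndarray, d_focal: np.ndarray, threshold: float = 0.5) -> dict:
    """Return {item_index: contrast} where |difficulty contrast| exceeds threshold."""
    contrast = d_focal - d_ref
    return {i: float(c) for i, c in enumerate(contrast) if abs(c) > threshold}

# Illustrative data: the focal group systematically under-scores item 2
ref = np.full((100, 5), 3.0)
focal = np.full((100, 5), 3.0)
focal[:, 2] = 1.0

flags = flag_dif(item_difficulties(ref), item_difficulties(focal))
print(sorted(flags))  # -> [2]
```

Items flagged this way would then be candidates for removal, or for the item-splitting step described above, before data from the two groups are pooled.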
Yet another possible source of differing reliability may be cultural biases in commonplace standardized training methods and materials. Using standardized patients (patients trained to portray specific sets of symptoms), it has been demonstrated that video-recorded or tape-recorded interviews increase inter-rater reliability even among raters with limited exposure to the PANSS (e.g., [64]) and to affective measures [65]. As such, a culturally diverse group of raters was asked to evaluate cultural idioms, symptom expressions, and social dynamics during the interview that may have been unfamiliar. It can be argued that higher inter-rater reliability may therefore be as much an indication of a higher degree of acculturation among raters as of rater comprehension and agreement.

Limitations
There are some limitations to this study. First, this analysis focuses on a small cohort of raters from only six geo-cultural groups who rated US training videos. This sample is not representative of all PANSS raters and patients within the geo-cultural groups (except the US); hence, findings may not be generalizable to other regions (e.g., South America, Latin America, and areas of Africa). A similar analytic technique using data obtained in clinical trials (utilizing patients from the specific geo-cultural region) is currently underway by this research team. Second, this study addresses reliability cross-sectionally, not longitudinally; additional studies would be needed to address differences in PANSS scores across time. Unfortunately, in this analysis we were unable to access data to adjust for possible confounding variables such as level of rater qualifications, amount of experience in the field of schizophrenia, or gender, and we recognize that these may influence the differences in mean scale scores. This is, however, a common limitation of cross-cultural comparisons of any subjective or objective data where local socio-demographic conditions vary in their definition and measurement. Since this is the first cross-cultural study of the PANSS and it does not take into account confounding sample characteristics other than geographically defined culture, these findings should be taken as preliminary and interpreted with caution. More specifically, how rater training is translated in the geo-cultural regions could not be examined using the currently available data and should be addressed in future cross-cultural studies. Additionally, culture may influence responses to the PANSS for many reasons. For example, some Hispanics have been noted to express emotional and mental health problems as physical health symptoms [66], which may differ from non-Hispanics.
Also, Kleinman and Good [67] reported that individuals with depression may be less likely to report sadness or anxiety but more likely to report sleep problems or appetite changes. The authors recognize that wide variations exist in educational level, occupational status, and cultural identity within communities of raters; therefore, although geo-cultural matching was attempted, the limitations of the selection of countries per group are acknowledged. Finally, although Rasch analysis allows for the detection of DIF within the current sample size, future studies should attempt to replicate these results using larger and more balanced sample sizes across regions.

Conclusions
This is the first Rasch analysis of the PANSS in a global setting across cultures. One strength of the Rasch analysis is that problematic items are clearly flagged and specific modifications can be identified to improve rater training and data surveillance of the PANSS. The results showed support for the two subscales (i.e., Positive and Negative symptoms), with recommendations to further assess administration and scoring of specific items; the General Psychopathology subscale, however, was shown to be multidimensional and warrants further review. Attention to cultural bias in training curricula and delivery may help to reduce these elements as confounders in future inquiries. The results of the current study further emphasize the need for rigorous individualized training and rater surveillance of PANSS scores across different groups, to decrease sources of unreliability. Clearly, further research is warranted to confirm these findings and establish good sensitivity and specificity across cultures.

Competing interests
Financial competing interests
• In the past five years, none of the authors have received reimbursements, fees, funding, or salary from an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future.
• All authors indicate that they do not hold any stocks or shares in an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future.
• All authors indicate that they do not hold, and are not currently applying for, any patents relating to the content of the manuscript. No author has received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript.
• MO has received funding from the National Institute of Mental Health. AK has received funding from Janssen Pharmaceuticals, LLP, and the National Institute of Mental Health. All remaining authors have no competing funding interests.
Non-financial competing interests All authors confirm that they have no non-financial competing interests (political, personal, religious, ideological, academic, intellectual, commercial, or any other) to declare in relation to this manuscript.
Authors' contributions AK and SL participated in the development of the concept for the study. AK and CY participated in the design of the study, performed the statistical analysis, and drafted the manuscript. CY and MO assisted with the statistical analysis and helped draft the manuscript. CY, GDC, LY, and AK, along with MO, conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All remaining authors participated in the design and coordination and drafted the manuscript. All authors read and approved the final manuscript.
Authors' information AK is a Statistician at ProPhase LLC and Manhattan Psychiatric Center, New York, NY; she has 10 years' experience working in psychopharmacology research as a statistician and has peer-reviewed publications on clinical trials in schizophrenia, including collaborations on book chapters. AK's research interests are Item Response Theory, testing and measurement, and Bayesian applications in clinical research. AK obtained a degree in Psychometrics from Fordham University, NY. CY studied at Manchester Metropolitan University, UK, and is the current Clinical Director at CROnos CCS. CY's research interests are testing and measurement, data monitoring, surveillance, and mental health studies. CY has publications in statistics, data monitoring, and testing and measurement. MO is a medical doctor and the Vice President of Seiwa Hospital, Institute of Neuropsychiatry in Tokyo, Japan. His work includes adenosine A2A receptor-associated methamphetamine dependence in Japanese patients, cross-cultural comparisons, and pharmacological treatments for Japanese patients with schizophrenia. LY is an epidemiologist at Columbia University, Mailman School of Public Health, New York. LY's work focuses on several key areas of psychiatric epidemiology, including identification of subtypes of schizophrenia, cultural implications of schizophrenia, and developing interventions for Asian Americans with psychosis.