This article is the first to publish a cross-cultural comparison of the psychometric performance, mean scale scores, and item and scale summaries for the PANSS, using qualified raters who rated one of 13 standardized training videos of a patient in the United States. Our aim was to assess cross-cultural validity by testing for differential item functioning (DIF) attributable to cultural factors in a sample of qualified raters from six geo-cultural groups. The results showed significant differences in responses to a number of PANSS items.
The intraclass correlation coefficient (ICC) for the PANSS total score was marginally higher for the United States group than for the other geo-cultural groups. Although all reliability estimates were excellent (i.e., >= 0.80), every group had its lowest ICC for the Negative symptom subscale, compared with the Positive symptom and General Psychopathology subscales, suggesting greater variability among scores for the Negative symptom subscale.
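For readers who want to see how a reliability estimate of this kind can be computed, the following is a minimal sketch and not the authors' procedure. The ICC form used in the study is not stated here, so the example assumes ICC(2,1) (two-way random effects, absolute agreement, single rater) in the Shrout and Fleiss notation. The function name `icc2_1` and the example matrix are illustrative only; the matrix is the classic six-target, four-judge textbook example, not PANSS data.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: array of shape (n_targets, k_raters).
    ASSUMPTION: this ICC form is chosen for illustration; the article
    does not state which form was used.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for targets (rows), raters (columns), and residual
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)             # between-target mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative data only (6 targets rated by 4 judges), not PANSS ratings.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(example), 2))
```

Under the threshold quoted above, an estimate would be read as excellent when it is at least 0.80; the illustrative matrix falls well below that because the judges differ systematically in level.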
Rasch analysis and differential item functioning (DIF)
Although the PANSS was originally designed with three subscales (Positive, Negative, and General Psychopathology), studies examining the internal structure of the scale [59–61] have all identified the same two underlying factors, Positive and Negative. Other factors have varied and have included Disorganized, Excitement, Hostility, Dysphoric, Catatonic, and many more [15, 62, 63]. Given that Rasch analysis depends on how symptom severity is defined, the appropriateness of modeling items via their subscale scores, rather than a total PANSS score, was confirmed by conducting PCA on each subscale to assess unidimensionality. Although the PCA of the General Psychopathology subscale did not indicate unidimensionality, it is common practice in clinical trials to examine the Positive and Negative subscales independently from the rest of the scale, since these symptoms are considered a key component of the disease and are the symptom clusters primarily targeted in drug development.
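The PCA-based unidimensionality check described above can be sketched as follows. This is a hedged illustration under stated assumptions: the criteria the authors applied are not given in this section, so the sketch simply reports the share of variance carried by the first principal component of the item correlation matrix and the ratio of the first to the second eigenvalue. The function name `first_component_summary` and the simulated seven-item subscale are hypothetical.

```python
import numpy as np

def first_component_summary(scores):
    """Summarize how dominant the first principal component is.

    scores: array of shape (n_ratings, n_items) for ONE subscale.
    Returns (share of total variance on component 1,
             ratio of first to second eigenvalue).
    ASSUMPTION: these two summaries are illustrative heuristics, not the
    specific criteria used in the article.
    """
    x = np.asarray(scores, dtype=float)
    corr = np.corrcoef(x, rowvar=False)       # item-by-item correlations
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigenvalues, largest first
    share = eigvals[0] / eigvals.sum()
    ratio = eigvals[0] / eigvals[1]
    return share, ratio

# Simulated data only: seven items driven by a single latent severity.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
items = 0.8 * latent + 0.6 * rng.normal(size=(500, 7))
share, ratio = first_component_summary(items)
print(f"first-component share: {share:.2f}, eigenvalue ratio: {ratio:.1f}")
```

A large first-component share together with a large eigenvalue ratio is consistent with a single dominant dimension; a subscale with several comparable eigenvalues, as may be the case for General Psychopathology, would not show this pattern.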
While variation was present in the order and location of some PANSS items across geo-cultural groups, the overall pattern of item calibration was generally congruent. Within each of the six groups, the Rasch model also confirmed the hierarchical structure of the PANSS items, as evidenced by the pattern of average item calibrations and goodness-of-fit indices. In each region, most item calibrations were well spaced along the continuum of psychopathology, suggesting that for the groups included in this study, the PANSS is able to measure a wide range of function in schizophrenia. Items found to be easy to score by all geo-cultural groups included P3 (Hallucinatory Behavior), P6 (Suspiciousness/Persecutory Behavior), G12 (Lack of Judgment and Insight), and G2 (Anxiety). Additionally, results indicated that Northern European raters were more likely to endorse higher scores on all Positive symptom items except N7 (Stereotyped Thinking) compared to other regions. It should be noted that the first three items generally load on the Positive symptom factor domain in factor analytic studies [60, 61]. The Positive factor comprises the most active and first rank symptoms that define schizophrenia, and it is primarily on these symptoms that a clinical diagnosis of schizophrenia is made. Therefore, raters may find it easier to score items that are first rank or core features of schizophrenia, as these symptoms are also present in the diagnostic criteria.
In addition to the items listed above, raters from Northern Europe, the US, and India found item N5 (Difficulty in Abstract Thinking) to be an easier item to score. It should be noted that this item is intended to be based on objective responses by the patient, and not on the rater’s subjective interpretation. It can be suggested that items with clear scoring instructions related to objective responses (e.g., if the subject answers four out of four proverbs correctly, a score of one should be given) are easier to score across most geo-cultural groups. Other items observed to be easier to score include N2 Emotional Withdrawal (Southern Europe and Asia), N6 Lack of Spontaneity and Flow of Conversation (Russia & Ukraine), G6 Depression (Northern Europe), and G16 Active Social Avoidance (United States of America, India, and Asia). With the exception of G6 (Depression), the latter items generally load on a Negative factor domain [60, 61]. The Negative factor reflects the difficulties in social relatedness often exhibited by many schizophrenic patients, which are considered second rank symptoms. Again, the more prevalent symptoms of schizophrenia, which are first and second rank symptoms, are easier to score across most countries. It should be noted that, upon evaluation of mean scores, Southern European raters scored higher on most Negative symptom subscale items (N1 Blunted Affect, N2 Emotional Withdrawal, N5 Difficulty in Abstract Thinking, and N6 Lack of Spontaneity and Flow of Conversation), whereas the lowest scores for Negative symptom items came from raters from Asia for N1 Blunted Affect, N3 Poor Rapport, N5 Difficulty in Abstract Thinking, and N7 Stereotyped Thinking compared to other geo-cultural groups.
All geo-cultural groups had significant DIF for P4 (Excitement), N7 (Stereotyped Thinking), and G10 (Disorientation). Santor and colleagues have demonstrated that many items (including N7 and G10) have problematic features and some fundamental issues in relation to the level of psychopathology measured by the overall PANSS. Our own item response analysis has also demonstrated significant DIF for item G10 (Disorientation) with regard to its contribution to the assessment of psychopathology as measured by the PANSS. Additionally, previous psychometric investigations have indicated that item G10 (Disorientation) either does not discriminate well in terms of assessing overall severity or does not reflect dimensional individual differences between patients within schizophrenia [15, 62]. Similarly, all groups showed significant slight to moderate DIF relative to the US raters for item G10 (Disorientation). This item measures the subject’s lack of awareness of their relationship to their surroundings and is assessed through specific questions relating to the subject’s knowledge of his/her doctor, address, and political figures. Therefore, the DIF observed for this item may reflect a weak relationship to psychopathology rather than cultural or geographical differences in scoring patterns.
The main source of rater differences among items was observed for the General Psychopathology subscale, with five items showing different rating patterns across geo-cultural groups (i.e., G3 Guilt Feelings (Russia & Ukraine), G5 Mannerisms and Posturing (Asia), G16 Active Social Avoidance (Southern Europe and Russia & Ukraine), G14 Poor Impulse Control (all regions except Northern Europe), and G15 Preoccupation (Russia & Ukraine)). Although support for the PANSS General Psychopathology subscale has been found in other studies, the current findings suggest that ratings for items on the General Psychopathology subscale differ for European and Japanese raters, and it should not be assumed that the same, standard rating tools were applied indiscriminately across these groups.
There are several possible explanations for discrepancies among raters both between and within the geo-cultural groups examined. One possible reason is interpretation variance. Interpretation variance implies that, once raters agree on the common criteria, any remaining differences arise more frequently from decision-making differences in the scoring of the item. Thus, when training a cohort of raters, it may be necessary to focus part of the training on cultural differences and on expectations about different symptom thresholds. There was significant moderate to large DIF (i.e., ETS Class scores of C) for most items scored by Southern European raters compared to US raters. Different social views due to cultural influence might lead to different ratings of social and emotional behaviors. Although the Positive and Negative subscales met the other Rasch model requirements, the presence of DIF by region means that culture might contribute to the scores on these items. Therefore, when clinical trial investigators pool PANSS data from different countries, items showing DIF should be removed or split. An iterative “top-down purification” splitting approach for items showing uniform DIF has been applied elsewhere.
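The ETS classes referred to above (A for negligible, B for slight to moderate, C for moderate to large DIF) can be illustrated with a short sketch. It assumes the commonly cited Educational Testing Service rules stated on the delta scale, with a DIF contrast in logits converted at 2.35 delta units per logit, and approximate one-sided and two-sided z criteria at the 0.05 level. The function name `ets_dif_class`, its parameters, and the example values are hypothetical; the software and exact significance tests used in the study are not specified in this section.

```python
DELTA_PER_LOGIT = 2.35  # conversion from logits to ETS delta units

def ets_dif_class(dif_logits, se_logits):
    """Classify a DIF contrast into ETS class A, B, or C.

    dif_logits: item difficulty difference between two groups, in logits.
    se_logits:  standard error of that difference, in logits (must be > 0).
    ASSUMPTION: thresholds follow the commonly cited ETS rules
      C: |D| >= 1.5 and significantly greater than 1.0
      B: |D| >= 1.0 and significantly different from 0 (and not C)
      A: otherwise
    with normal-approximation tests at the 0.05 level.
    """
    if se_logits <= 0:
        raise ValueError("se_logits must be positive")
    d = abs(dif_logits) * DELTA_PER_LOGIT
    se = se_logits * DELTA_PER_LOGIT
    z_vs_zero = d / se         # test of |D| against 0 (two-sided, 1.96)
    z_vs_one = (d - 1.0) / se  # test of |D| against 1 (one-sided, 1.645)
    if d >= 1.5 and z_vs_one > 1.645:
        return "C"
    if d >= 1.0 and z_vs_zero > 1.96:
        return "B"
    return "A"

# Hypothetical contrasts, not values from the study.
for contrast, se in [(0.80, 0.10), (0.50, 0.10), (0.20, 0.10)]:
    print(contrast, ets_dif_class(contrast, se))
```

On these rules, the logit cut points are roughly 0.43 (class B) and 0.64 (class C), so an item flagged as class C would be a candidate for removal or splitting before data are pooled across regions.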
Yet another possible source of differing reliability may be cultural biases found in commonplace standardized training methods and materials. Using standardized patients (patients who are trained to portray specific sets of symptoms), it has been demonstrated that video-recorded or tape-recorded interviews increase inter-rater reliability even among raters with limited exposure to the PANSS and affective measures. As such, a culturally diverse group of raters was asked to evaluate cultural idioms, symptom expressions, and social dynamics during the interview which may have been unfamiliar. It can be argued that higher inter-rater reliability may therefore be associated with a higher degree of acculturation among raters as much as it is an indication of rater comprehension and agreement.
There are some limitations to this study. First, this analysis focuses on a small cohort of raters from only six geo-cultural groups who rated US training videos. This sample is not representative of all PANSS raters and patients within the geo-cultural groups (except the US); hence, findings may not be generalizable across different regions (e.g., South America, Latin America, and areas of Africa). A similar analysis using data obtained in clinical trials (utilizing patients from the specific geo-cultural region) is currently underway by this research team. Second, this study addresses reliability cross-sectionally, not longitudinally. Additional studies would be needed to address differences in PANSS scores across time. Unfortunately, in this analysis we were unable to access data to adjust for possible confounding variables such as level of rater qualifications, amount of experience in the field of schizophrenia, or gender, and we recognize that these may influence the differences in mean scale scores. This is, however, a common limitation of cross-cultural comparisons of any subjective or objective data where local socio-demographic conditions vary in their definition and measurement. Because this is the first cross-cultural study of the PANSS and it does not account for confounding sample characteristics other than geographically defined culture, these findings should be treated as preliminary and interpreted with caution. More specifically, how rater training is translated in the geo-cultural regions could not be examined using the currently available data and should be addressed in future cross-cultural studies. Additionally, culture may influence responses to the PANSS for many reasons. For example, some Hispanics have been noted to express emotional and mental health problems as physical health symptoms, which may differ from non-Hispanics.
Also, Kleinman and Good reported that individuals with depression may be less likely to report sadness or anxiety, but more likely to report sleep problems or appetite changes. The authors recognize that wide variations exist in educational level, occupational status, and cultural identity within communities of raters; therefore, as much as geo-cultural matching was attempted, the authors recognize the limitations of the selection of countries per group. Finally, although Rasch analysis allows for the detection of DIF within the current sample size, future studies should attempt to replicate these results using larger and more balanced sample sizes across regions.