Study design
The study consisted of two visits. During the first visit (cross-sectional validity study design), participants were randomly allocated to either the digital RFFT or the paper-and-pencil RFFT using block randomization (block size of four), stratified by gender, age group (< 40 years, 40–59 years, or ≥ 60 years), and highest level of completed education (low, middle, or high, based on the International Standard Classification of Education [14]; Additional file 1: Appendix 1). We used a random number generator for the randomization. After the first test (digital RFFT or paper-and-pencil RFFT), the other test was performed (cross-over). Participants were invited to return for a second visit 1 week after the first, in which only the digital RFFT was repeated (test–retest reliability study design). Ethical approval for this study was obtained from the medical ethical committee of the University Medical Centre Groningen (trial number METc 2019/389, date of approval 23/07/2019). This research was carried out in accordance with relevant guidelines and regulations.
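To make the allocation procedure concrete, the following is a minimal Python sketch of block randomization (block size four) within strata defined by gender, age group, and education level. The arm labels, stratum keys, and seed are illustrative and not taken from the study protocol.

```python
# Minimal sketch of stratified block randomization (block size four).
# Arm names and stratum values are illustrative assumptions.
import random

ARMS = ["digital_first", "paper_first"]
BLOCK_SIZE = 4  # two allocations per arm within each block

def new_block(rng: random.Random) -> list[str]:
    """Return one randomly permuted block with equal numbers per arm."""
    block = ARMS * (BLOCK_SIZE // len(ARMS))
    rng.shuffle(block)
    return block

class StratifiedBlockRandomizer:
    def __init__(self, seed: int = 2019):
        self.rng = random.Random(seed)
        self.blocks: dict[tuple, list[str]] = {}  # one open block per stratum

    def allocate(self, gender: str, age_group: str, education: str) -> str:
        """Draw the next allocation from the open block of this stratum."""
        stratum = (gender, age_group, education)
        if not self.blocks.get(stratum):       # no block yet, or block used up
            self.blocks[stratum] = new_block(self.rng)
        return self.blocks[stratum].pop()

randomizer = StratifiedBlockRandomizer()
print(randomizer.allocate("female", "40-59", "high"))
```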
Study population
Participants were recruited during a 6-week period in July and August 2019 through posters and flyers, convenience sampling, and online advertising. Individuals interested in participating could make an appointment via an online registration website (or by telephone) for a first and a second visit (1 week later) at the research site. Afterwards, participants received a 10-euro voucher as an incentive to participate. Participants were deemed eligible if they (1) were 18 years or older, (2) provided written informed consent, (3) understood the Dutch language, and (4) did not have impairments in writing with the dominant hand, hearing, or vision.
Data collection
Digital RFFT (first and second visit)
Participants performed the digital RFFT independently within an application on an Apple iPad Pro (2018, 12.9 inch, 64 GB), using an Apple Pencil (2nd generation) and headphones. The software for the application was developed by Bruna & Bruna (www.brunabruna.nl). The digital RFFT started with a video instruction about the assignment. In line with the Standard Operating Procedure of the RFFT [15], participants also received feedback on their performance on the practice sheets through correction videos on the iPad. If the instructions were still unclear, participants could also watch example videos before and during the tests, showing both simple and more complex designs for each point configuration.
Paper-and-pencil RFFT (first visit only)
During the paper-and-pencil RFFT, a trained examiner provided test instructions according to the Standard Operating Procedure of the RFFT [15]. First, participants received a practice sheet with three boxes in which they could draw unique designs by connecting two or more dots. The trained examiner corrected the participant if needed. Then, participants performed this task on a sheet with 35 boxes with identical configurations of dots, in which they had to draw as many unique designs as possible within 60 s. The participants performed these tasks on a total of five different practice and test sheets, which consisted of different point configurations (Fig. 1).
Fig. 1 The five RFFT test sheets, each with a different point configuration. The paper-and-pencil RFFT was performed on an 8.5 × 11″ sheet of paper with a red marker
Scoring of the RFFT sheets
For the digital RFFT, an algorithm automatically identified each individual box as a unique design, perseverative error, erroneous design, or empty box. The criteria for identifying unique designs, perseverative errors, erroneous designs, and empty boxes are shown in Additional file 1: Appendix 2. Subsequently, the numbers of unique designs and perseverative errors were automatically computed and stored in a database.
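As a rough illustration only (the actual criteria are given in Additional file 1: Appendix 2), the sketch below shows one way such per-box classification could work if each drawn design is represented as a set of connected dot pairs: a repeat of an earlier design within the test counts as a perseverative error. Erroneous designs, which would require additional geometric checks, are omitted here.

```python
# Hedged sketch of per-box classification; NOT the study's actual algorithm.
# A design is modeled as a frozenset of connected dot pairs; erroneous
# designs (rule-violating drawings) are omitted for simplicity.
from typing import Iterable

def classify_boxes(boxes: Iterable[frozenset]) -> list[str]:
    seen: set[frozenset] = set()
    labels = []
    for design in boxes:
        if not design:
            labels.append("empty")
        elif design in seen:
            labels.append("perseverative_error")  # duplicate of earlier design
        else:
            seen.add(design)
            labels.append("unique_design")
    return labels

boxes = [frozenset({(1, 2)}), frozenset(), frozenset({(1, 2)}),
         frozenset({(1, 3), (3, 4)})]
print(classify_boxes(boxes))
# ['unique_design', 'empty', 'perseverative_error', 'unique_design']
```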
For the digital and the paper-and-pencil RFFT at the first visit, two independent, trained human raters identified each individual box as a unique design, perseverative error, erroneous design, or empty box, and scored the number of unique designs and perseverative errors. Additional scoring was performed when the two raters’ numbers of unique designs or perseverative errors differed by more than two points on any single sheet or by more than four points on the total score of the five sheets [13]; agreement between the two raters was then reached in a consensus meeting. If the two raters’ numbers differed by two points or less on each sheet and by four points or less on the total score of the five sheets, the scores of the two raters were averaged. Because the digital RFFT at the first visit was also scored by human raters, we could compare the automatic and manual scoring of the digital RFFT and thereby evaluate the scoring performance of the algorithm against a common reference standard.
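The discrepancy rule above can be summarized in a short sketch; the function below is illustrative and assumes per-sheet integer counts from each rater.

```python
# Sketch of the rater-agreement rule: scores are averaged unless the raters
# differ by more than two points on any sheet or by more than four points
# on the five-sheet total, in which case consensus scoring is needed.

def resolve_scores(rater1: list[int], rater2: list[int]) -> float | None:
    """Return the averaged total score, or None if consensus scoring is needed.

    rater1/rater2: per-sheet counts (e.g. unique designs), one entry per sheet.
    """
    per_sheet_diffs = [abs(a - b) for a, b in zip(rater1, rater2)]
    total_diff = abs(sum(rater1) - sum(rater2))
    if max(per_sheet_diffs) > 2 or total_diff > 4:
        return None  # discrepancy too large: additional scoring + consensus
    return (sum(rater1) + sum(rater2)) / 2  # scores averaged

print(resolve_scores([20, 18, 17, 15, 14], [21, 18, 16, 15, 14]))  # 84.0
```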
Questionnaire
Participants filled out a questionnaire on the socio-demographic characteristics age, gender, and highest level of completed education. Highest level of completed education was categorized into low, middle, and high based on the International Standard Classification of Education [14] (Additional file 1: Appendix 1), and additionally dichotomized into ≤ 12 years and > 12 years of education [16]. Furthermore, for practicability purposes, the trained examiner recorded any problems with the digital RFFT as well as how often the participants watched the example videos.
Statistical methods
Descriptive statistics were provided for the entire study population and separately for the two randomized groups (i.e. the group that started with the digital RFFT and the group that started with the paper-and-pencil RFFT). Differences between the randomized groups in demographic characteristics and in the numbers of unique designs and perseverative errors on the RFFT (digital and paper-and-pencil) were assessed using two-sample t-tests (normally distributed continuous variables), Mann–Whitney U tests (non-normally distributed continuous variables), and Chi-squared tests (categorical variables).
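For illustration, comparisons of this kind could be run with SciPy as sketched below; the arrays and the contingency table are placeholders, not study data.

```python
# Illustrative between-group comparisons with SciPy; placeholder data only.
import numpy as np
from scipy import stats

digital_first = np.array([92, 85, 101, 78, 96])  # e.g. unique designs, group 1
paper_first = np.array([88, 90, 83, 99, 94])     # group 2

# Normally distributed continuous variable: two-sample t-test
t_stat, p_t = stats.ttest_ind(digital_first, paper_first)

# Non-normally distributed continuous variable: Mann–Whitney U test
u_stat, p_u = stats.mannwhitneyu(digital_first, paper_first)

# Categorical variable (e.g. education level): Chi-squared test on a
# contingency table of counts per group
table = np.array([[10, 25, 15],   # group 1: low / middle / high
                  [12, 22, 16]])  # group 2
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```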
First, we examined the criterion validity of the digital RFFT from two perspectives. From the first perspective, we examined the congruence between the scores provided by the digital RFFT and those from the human raters (gold standard): the numbers of unique designs and perseverative errors were compared between the automatic and manual scoring. For this purpose, we computed the intraclass correlation coefficient (ICC; absolute agreement, two-way mixed), Lin’s concordance correlation coefficient (LCCC) for replacement testing, and a Bland–Altman plot. Moreover, to further examine whether the automatic scoring of the digital RFFT correctly identified individual boxes as unique designs and perseverative errors, we calculated the sensitivity and specificity using the manual scoring as the reference standard. From the second perspective, we examined the congruence between the scores provided by the digital RFFT and those from the paper-and-pencil RFFT. For this purpose, we computed the ICC (absolute agreement, two-way mixed) and Bland–Altman plots.
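As a sketch of these agreement statistics, the snippet below computes Lin’s concordance correlation coefficient, the quantities underlying a Bland–Altman plot, and per-box sensitivity and specificity with NumPy; all arrays are placeholder values, not study data.

```python
# Minimal NumPy sketch of three agreement statistics; placeholder data only.
import numpy as np

automatic = np.array([90.0, 85, 101, 78, 96, 88])  # algorithm scores
manual = np.array([91.0, 84, 100, 80, 95, 89])     # averaged rater scores

# Lin's CCC: 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
cov_xy = np.cov(automatic, manual, ddof=0)[0, 1]
ccc = (2 * cov_xy) / (automatic.var() + manual.var()
                      + (automatic.mean() - manual.mean()) ** 2)

# Bland–Altman: mean difference (bias) and 95% limits of agreement
diff = automatic - manual
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

# Per-box classification: sensitivity and specificity of the automatic
# scoring against the manual reference (1 = unique design, 0 = not)
auto_labels = np.array([1, 1, 0, 1, 0, 1, 1, 0])
manual_labels = np.array([1, 1, 0, 0, 0, 1, 1, 1])
tp = np.sum((auto_labels == 1) & (manual_labels == 1))
tn = np.sum((auto_labels == 0) & (manual_labels == 0))
sensitivity = tp / np.sum(manual_labels == 1)
specificity = tn / np.sum(manual_labels == 0)
print(f"CCC = {ccc:.3f}, bias = {bias:.2f}, LoA = {loa}")
```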
Second, we investigated convergent validity, which refers to the congruence between the digital RFFT and theoretically related constructs [17]. Specifically, we examined the correlation of the numbers of unique designs and perseverative errors on the digital RFFT during the first visit (automatic scoring) with age and education level. For this purpose, we used Pearson’s correlation coefficient for normally distributed variables and Spearman’s correlation coefficient for non-normally distributed variables.
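Correlations of this kind are a one-liner in SciPy, as sketched below; the vectors are placeholders, not study data.

```python
# Illustrative convergent-validity correlations with SciPy; placeholder data.
from scipy import stats

unique_designs = [92, 85, 101, 78, 96, 88]
age = [34, 52, 28, 67, 41, 59]
education_years = [16, 12, 18, 10, 14, 12]

r, p = stats.pearsonr(unique_designs, age)                    # normal variables
rho, p_s = stats.spearmanr(unique_designs, education_years)   # non-normal/ordinal
```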
Third, we investigated the test–retest reliability of the digital RFFT, which refers to the congruence between test scores on different occasions, assuming that the participant’s ability remains the same [17]. For this purpose, we compared the numbers of unique designs and perseverative errors between the first and second visit based on the automatic scoring, and computed an ICC (absolute agreement, two-way mixed) and a Bland–Altman plot.
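A test–retest ICC of this kind can be obtained, for example, with the pingouin package; the long-format data frame below is illustrative, and the mapping to the Shrout & Fleiss ICC2 form (absolute agreement) is our reading, not a statement about the authors’ software.

```python
# Hedged sketch of a test–retest ICC using pingouin; placeholder data only.
import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    "participant": [1, 2, 3, 4, 1, 2, 3, 4],
    "visit":       ["t1"] * 4 + ["t2"] * 4,
    "score":       [92, 85, 101, 78, 94, 83, 99, 81],  # e.g. unique designs
})
icc = pg.intraclass_corr(data=df, targets="participant",
                         raters="visit", ratings="score")
# pingouin reports all six ICC forms; the row for the two-way model with
# absolute agreement (ICC2) corresponds most closely to the "absolute,
# two-way" ICC used in this study.
print(icc[["Type", "ICC", "CI95%"]])
```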
For criterion validity, convergent validity, and test–retest reliability, we considered ICC values below 0.50, between 0.50 and 0.74, between 0.75 and 0.90, and above 0.90 as poor, moderate, good, and excellent, respectively [18].
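For reference, these interpretation thresholds amount to a trivial mapping, sketched below purely for illustration.

```python
# Small helper encoding the ICC interpretation thresholds [18]; illustrative.
def interpret_icc(icc: float) -> str:
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"

assert interpret_icc(0.82) == "good"
```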