This article describes an automated data monitoring system, using the disease register RABBIT-SpA as an example. Monitoring real-world data from large observational cohort studies that include several thousand patients is a major challenge. Although it is highly recommended to describe “[...] quality assurance and quality control procedures” (10, 11) when publishing results, we have identified only one publication describing the handling of quality assurance in cohort studies (3). Data generated by cohort studies support clinical decision-making and guideline recommendations (2, 12–14). The relevance and acceptance of such studies have increased in recent years, and the demands on data quality are coming into focus in a way that was not common before (15, 16).
Monitoring data manually can lead to further errors, for example when editing the conflict list and manually incorporating corrections from the queries. Furthermore, manually driven monitoring is very time-consuming and personnel-intensive (9).
An automated monitoring system must take into account the challenges and problems that can arise during digital data collection and data cleaning. The prerequisite for automated monitoring is a consistent data structure. This also requires that follow-up eCRFs are consistent in content, design and data structure.
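The requirement of a consistent data structure across follow-up eCRFs can be illustrated with a minimal sketch. It assumes that each eCRF version is described by a simple mapping from variable name to data type; the variable names, the types, and the function `schema_differences` are hypothetical and do not reflect the actual RABBIT-SpA implementation.

```python
from typing import Dict, List

# Hypothetical eCRF description: variable name -> expected data type.
Schema = Dict[str, str]


def schema_differences(reference: Schema, other: Schema) -> List[str]:
    """List the ways in which `other` deviates from `reference`."""
    problems: List[str] = []
    # Every reference variable must exist in the other eCRF with the same type.
    for name, dtype in reference.items():
        if name not in other:
            problems.append(f"missing variable: {name}")
        elif other[name] != dtype:
            problems.append(f"type mismatch for {name}: {dtype} vs {other[name]}")
    # Variables that exist only in the other eCRF also break consistency.
    for name in other:
        if name not in reference:
            problems.append(f"unexpected variable: {name}")
    return problems


# Illustrative (invented) follow-up definitions.
t1 = {"visit_date": "date", "crp_mg_l": "float", "pain_score": "float"}
t2 = {"visit_date": "date", "crp_mg_l": "str", "fatigue_score": "float"}

for problem in schema_differences(t1, t2):
    print(problem)
```

A check of this kind could be run whenever an eCRF is revised, so that structural deviations are detected before they affect the automated queries.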
The benefit of digital data collection, compared to paper-based documentation, is the possibility of giving immediate feedback on missing and implausible data via plausibility checks directly in the eCRF. The analysis showed that 40–64% of the eCRFs submitted between baseline and T3 had conflicts. Thus, incorrect or implausible data were still received despite the initial error checking in the eCRF. Only 1% of the baseline eCRFs were queried as entire eCRF conflicts, because baseline visits are mandatory for study inclusion. However, for technical reasons, 12 questionnaires had only one to three variables filled in, which resulted in the entire eCRF being queried as missing. Most of the entire eCRF conflicts were queried at T1 (3 months after baseline). Since RABBIT-SpA is an observational study, patients are invited at the physician’s discretion, and some study centers do not regularly perform a patient visit after three months, which explains the high number of missing entire eCRFs at this time point. Most of the single conflicts were queried at baseline and T3, which is probably related to the fact that these eCRFs contain more variables and are more complex, so that the number of possible conflicts is higher than at the follow-up visits T1 and T2. Whereas 388 conflicts are possible at baseline and 208 at T3, a maximum of only 118 conflicts can be queried at T1 and T2. Thus, the number of eCRFs that need to be queried depends on the number of possible conflicts and on the complexity of the eCRF.
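The distinction between entire eCRF conflicts and single conflicts can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in RABBIT-SpA: the variables, the plausibility ranges in `RULES`, and the threshold `MIN_FILLED` are hypothetical. The threshold is chosen only to mirror the observation that eCRFs with one to three filled variables were queried as entirely missing.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical plausibility rules: variable name -> check on the entered value.
RULES: Dict[str, Callable[[float], bool]] = {
    "crp_mg_l": lambda v: 0 <= v <= 500,
    "esr_mm_h": lambda v: 0 <= v <= 150,
    "pain_score": lambda v: 0 <= v <= 10,
    "weight_kg": lambda v: 30 <= v <= 300,
    "height_cm": lambda v: 100 <= v <= 230,
}

# Assumption: an eCRF with three or fewer filled variables counts as missing.
MIN_FILLED = 4

ECRF = Dict[str, Optional[float]]


def find_conflicts(ecrf: Optional[ECRF]) -> List[str]:
    """Return the conflicts to be queried for one eCRF."""
    # An eCRF that was never submitted is queried as a whole.
    if ecrf is None:
        return ["entire eCRF"]
    filled = sum(1 for name in RULES if ecrf.get(name) is not None)
    # A nearly empty eCRF is also queried as a whole.
    if filled < MIN_FILLED:
        return ["entire eCRF"]
    conflicts: List[str] = []
    for name, is_plausible in RULES.items():
        value = ecrf.get(name)
        if value is None:
            conflicts.append(f"missing: {name}")
        elif not is_plausible(value):
            conflicts.append(f"implausible: {name}")
    return conflicts


# Invented example: laboratory value not yet available, pain score out of range.
visit = {"crp_mg_l": None, "esr_mm_h": 12, "pain_score": 14,
         "weight_kg": 80, "height_cm": 175}
print(find_conflicts(visit))
```

In such a design, the maximum number of single conflicts per eCRF equals the number of checked variables, which is consistent with the observation that more complex eCRFs generate more queries.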
The most common single conflicts were missing laboratory values. It is plausible that in many cases the laboratory results were not yet available at the time of documentation. It is therefore important to provide a query system that allows these missing values to be completed at a later time.
One of the goals of the automated monitoring system was to make the answering of queries as easy as possible for the study centers. Therefore, the queries reflect parts of the eCRF so that they can be easily recognized and answered quickly. In this analysis, 67% of the queries were answered. Positive feedback from participating study centers suggests that a user-friendly monitoring system was successfully created that meets the predefined requirements.
The automated monitoring has proven beneficial on several levels. It is much less personnel-intensive: one data manager is able to take responsibility for the complete data collection and monitoring, whereas in the paper-based RABBIT study considerably more staff is required for queries (9). For the participating rheumatology practices, the workload is considerably reduced because subsequent queries are eliminated and incorrect data are recognized and reported directly in the system. With an answering rate of 69% for entire eCRF conflicts and 66% for single conflicts, the completeness of the data will improve significantly.
In a comparison of cohort studies covering pregnancy in rheumatology patients, a significant variation in the number of missing values has been described, depending on the variable (17). The extent of missing data in our analysis is considerably lower than in other cohorts (17, 18). Nevertheless, querying the missing values will increase the data quality even further. Whereas the dropout rate is higher in other studies (9, 19), the dropout rate after one year of observation is only 1.8% in RABBIT-SpA, which is a remarkably low rate for an observational study. The successful implementation of our monitoring system might be one of the reasons for this low number.
Prior to active monitoring, there was a large amount of incomplete or missing data in our observational cohort study, which is probably typical of this type of study. However, even though almost half of the eCRFs had to be queried, this resulted in only a few (1.1 to 2.4) conflicts per eCRF, which kept the workload for the study centers at an acceptable level.