The study aimed to clarify whether physical function tests in patients undergoing HD are reproducible when the testing day changes (before the HD session vs. non-dialysis day). The sample size reached the recommended number of 30 participants31.
Although high ICC coefficients were obtained, the ICC is a ratio index of within- and between-subject variability; therefore, agreement at the group level does not provide information about individual change or error in scores. Additionally, the ICC depends on the sample variability and thus should not be used in isolation32. The Bland-Altman plots were useful in exposing the relationship between the trials: there was a tendency towards better scores when the physical function test was performed before the HD session (except for the HG tests).
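As an illustration of why Bland-Altman analysis complements the ICC, the following minimal sketch computes the mean difference (bias) and the 95% limits of agreement (LOA) between two trials. It assumes the conventional definition (bias ± 1.96 × SD of the paired differences); the function name and the example scores are hypothetical and are not data from this study.

```python
import statistics


def bland_altman(trial1, trial2):
    """Return (bias, lower LOA, upper LOA) for paired scores.

    Assumes the conventional 95% limits of agreement:
    bias +/- 1.96 * SD of the paired differences.
    """
    if len(trial1) != len(trial2):
        raise ValueError("trial1 and trial2 must be paired (same length)")
    diffs = [a - b for a, b in zip(trial1, trial2)]
    bias = statistics.mean(diffs)        # systematic difference between trials
    sd_diff = statistics.stdev(diffs)    # sample SD of the differences
    return bias, bias - 1.96 * sd_diff, bias + 1.96 * sd_diff


# Hypothetical scores (illustration only): before HD session vs. non-HD day
before_hd = [10.0, 12.0, 14.0, 16.0]
non_hd_day = [9.0, 12.0, 13.0, 14.0]
bias, loa_low, loa_high = bland_altman(before_hd, non_hd_day)
print(f"bias={bias:.2f}, LOA=({loa_low:.2f}, {loa_high:.2f})")
```

A bias far from zero, or wide LOA relative to the scale of the test, indicates poor agreement for individual participants even when the ICC is high.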
The present study shows a high degree of agreement between measurements on different days (HD day before the session vs. non-HD days) and good or excellent ICC results (above 0.86) only for some tests (STS-10, STS-60, TUG and HG tests), demonstrating a lack of systematic bias when the measurement day changed. Thus, our results support the use of these tests when the timing of assessment changes.
The scores of our participants were similar to those reported in previous research by our group, with a slight difference only for the handgrip tests (STS-10: 25.2–25.6 s vs 25.1–24.0 s; STS-60: 22–22.5 repetitions vs 25.6–25.5 repetitions; TUG: 8.9–9.1 s vs 9.0–8.6 s; HG right: 22–23 kg vs 26.9–25.9 kg; HG left: 20.5–20.5 kg vs 23.8–23.4 kg)19,23. Our sample was around 5 years older than the previously studied samples. Results are also similar to those of other studies in HD patients aged around 62 and 57 years: for the STS-60, 26–28 repetitions23 and 20.5–19.8 repetitions33 have been reported, the latter differing from the rest probably because of its small sample of only 10 patients. For the TUG, 8.9–8.1 s has been reported33.
Our results suggest that the HG test without arm support is also reliable and has even lower MDC values, which would make it easier to detect true changes beyond the variability of the measurement.
The present ICC results concur with those from our previous studies in similar samples (39 participants for the STS-10, STS-60 and HG)17 or in larger samples (71 participants for the TUG)19 (STS-10: 0.861 vs 0.88; STS-60: 0.925 vs 0.97; TUG: 0.945 vs 0.96; HG right: 0.945 vs 0.96; HG left: 0.925 vs 0.95). They are also in agreement with the value reported by another study for the STS-60 (0.927)23. Our ICC values are higher than those of a small study with 10 patients (STS-60: 0.84; TUG: 0.71)33. However, to the best of our knowledge, this is the first work to check agreement and reproducibility when the timing of test administration (before the HD session vs. non-HD day) is changed.
The SEM and MDC90 values found in the current study are similar to those of previous studies, both for the SEM (STS-10: 3.6 vs 3.6 s; STS-60: 2.3 vs 1.7 repetitions; TUG: 0.9 vs 1.24 s; HG right: 2.3 vs 1.5 kg; HG left: 2.9 vs 1.5 kg) and for the MDC90 (STS-10: 8.5 vs 8.4 s; STS-60: 5.4 vs 4 repetitions; TUG: 2.1 vs 2.9 s; HG right: 5.5 vs 3.4 kg; HG left: 6.8 vs 3.4 kg). In general, apart from the TUG, measurement variation is higher when measures are taken on different days (before the HD session and on non-dialysis days), so these data support the recommendation to avoid changing the testing day in order to keep measurement error (SEM and MDC) low. These data are in agreement with previous data for the STS-60 (SEM: 1.3 repetitions23 and 2.43 repetitions33; MDC95: 4 repetitions23; MDC90: 5.47 repetitions33).
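The relationship between the SEM and the MDC90 can be sketched as follows. This is a minimal sketch that assumes the conventional definitions, SEM = SD × √(1 − ICC) and MDC90 = 1.645 × √2 × SEM; this section does not state which formulas were used, and the function names are hypothetical. The SEM inputs are the rounded values reported above for the current study, so small differences from the reported MDC90 values would be expected.

```python
import math

Z_90 = 1.645  # z-score for a 90% confidence level (assumed convention)


def sem_from_icc(sd, icc):
    """Standard error of measurement from a sample SD and an ICC."""
    return sd * math.sqrt(1.0 - icc)


def mdc90(sem):
    """Minimal detectable change at 90% confidence for a test-retest design.

    The sqrt(2) factor accounts for measurement error in both trials.
    """
    return Z_90 * math.sqrt(2.0) * sem


# Rounded SEM values reported in this section for the current study
reported_sem = {
    "STS-10 (s)": 3.6,
    "STS-60 (repetitions)": 2.3,
    "TUG (s)": 0.9,
    "HG right (kg)": 2.3,
    "HG left (kg)": 2.9,
}

for test_name, sem in reported_sem.items():
    print(f"{test_name}: SEM={sem}, MDC90={mdc90(sem):.1f}")
```

Under these assumptions, a change in an individual's score smaller than the MDC90 cannot be distinguished from measurement error at the 90% confidence level.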
Our results show that there was no systematic bias for the STS-10, STS-60, TUG, or HG tests, and so these tests can be measured on different days. Nevertheless, this study shows a systematic bias for the SPPB, gait speed, and 6MWT when the timing (before the HD session vs. non-dialysis day) changes. Systematic bias has been explained by a learning effect, whereby the participant repeats the test and improves results during the retest, albeit to a non-significant degree34. A previous intra-rater study also showed no learning effect19. Our results do not show this learning effect, since gait speed and 6MWT performance were better before the HD session on trial 1 than in the retest session on non-HD days (Table 2). Some authors suggest that testing before the HD session may reduce the effects of fatigue from the previous HD session33. Additionally, the high variability of functional results in this cohort is well known17,20, so it seems very important to keep the testing circumstances the same when testing this cohort.
Hence, the Bland-Altman method showed that the 6MWT, gait speed, OLST and SPPB had substantial bias and a large disproportion in the LOA. This situation, with large ICC values but a lack of agreement according to the Bland-Altman method, was also found when establishing the reliability of some motor tests32. Gait speed and the 6MWT achieved higher results when tested before the HD session, while balance achieved higher results on non-HD days. Fatigue resulting from administering all the tests in a row on a non-HD day could explain why some tests yielded poorer results on non-HD days; such fatigue should not affect balance. Previous research has tested a battery of three tests on non-HD days33. Clinical feasibility does not allow us to test patients on several non-HD days, because these participants already spend many hours in a clinical setting for their treatments and so it would be difficult to convince them to spend extra time there for physical function testing alone. Finally, our results may help to clarify which tests could be measured before the HD session by the same rater, because there is no consensus in this regard and clinical applicability should be considered in order to extend testing into routine treatment.
The main strength of this study is that, to the best of our knowledge, this is the first time that the reproducibility of physical function tests in patients undergoing HD has been tested with different test administration timings. Assessment at nephrology units can be difficult to implement because of a lack of human resources and logistics in many clinical settings. Thus, it is important to be flexible regarding the test timing in this cohort, but it is also important to note that these changes affect the reproducibility of several commonly used physical function tests. The main weakness of this work is that the sample size was relatively small. Another limitation is that we did not take two measurements at each timing. Since there was only a one-week interval between measurements, we believe we may assume that there were no systematic biases between measurements within subjects and that the within-subject SDs were similar for all measurements.
Our results have important implications for the implementation of physical function testing in HD units and indicate that the same assessors should test patients. Future work should be multicentre and include larger sample sizes to confirm these findings, and should also aim to clarify the ideal battery for clinical assessment in this population by assessing other tests, such as lower-limb muscle strength tests.