CHEST Guidelines-Are-Statistical-Tests-Really-Needed-to-Compare-Tra

Pdf Summary

The article argues that statistical tests comparing training and validation datasets are unnecessary when developing and evaluating predictive models. It considers two ways a validation set can be created: as a randomly selected subset of the original dataset, or as a completely external, separately curated set. The authors contend that in neither case do such tests contribute meaningfully to understanding model performance.

When the validation set is randomly selected from the original data, statistical tests are redundant because both subsets are drawn from the same population and therefore share the same population parameters. Any statistically significant difference between them is then, by construction, a false positive that reflects random sampling variation. The focus should be on assessing model performance with specific metrics rather than on testing for such differences.

For completely external validation datasets, the paper argues that statistical differences from the training data have no bearing on the evaluation of model performance. Attention should instead go to performance measures, such as discrimination and calibration, which determine the model's efficacy. The authors emphasize that a model's worth rests on how it performs on external validation data, not on whether those data differ statistically from the training data.

Overall, the paper advises that predictive modeling research should focus on model performance rather than on statistical differences between datasets. Descriptive characteristics of each dataset should still be reported for clarity, but findings of statistical significance are not pivotal for model evaluation. Accordingly, improving model performance should take priority over the significance of statistical tests between datasets.

The authors declared no financial or nonfinancial disclosures.
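The random-split argument summarized above can be checked with a short simulation. The sketch below is not taken from the article: the covariate, its distribution, the sample size, the 70/30 split, and the helper names `two_sample_p_value` and `false_positive_rate` are all assumptions chosen for illustration. It repeatedly splits one dataset at random and counts how often a two-sample test on a single covariate comes out "significant".

```python
import random
from statistics import NormalDist, fmean, variance


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Uses the Welch statistic with a normal approximation, which is
    reasonable for the large samples assumed in this sketch.
    """
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_splits=2000, n_rows=500, train_fraction=0.7,
                        alpha=0.05, seed=0):
    """Share of random train/validation splits flagged as different."""
    rng = random.Random(seed)
    # Hypothetical single covariate (for example, age) for one dataset.
    data = [rng.gauss(60, 12) for _ in range(n_rows)]
    n_train = int(n_rows * train_fraction)
    significant = 0
    for _ in range(n_splits):
        rng.shuffle(data)  # a fresh random split of the same dataset
        train, validation = data[:n_train], data[n_train:]
        if two_sample_p_value(train, validation) < alpha:
            significant += 1
    return significant / n_splits


rate = false_positive_rate()
print(f"Share of random splits flagged as significantly different: {rate:.3f}")
```

Because both subsets come from the same data, every flagged split is a false positive, and the flagged share should sit near the chosen significance level `alpha`. The test therefore says nothing about the model that the split procedure does not already guarantee.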

Keywords

predictive models

statistical tests

validation datasets

model performance

random sampling

external validation

discrimination

calibration

false positive

descriptive data