Machine learning is undoubtedly an essential technology for harnessing the ever-growing data pools. This applies to both scientific and commercial applications.
In a recent paper[1], Sarash Kapoor and Arvid Narayana draw attention to a problem called “leakage” in ML-based studies. Leakage, be it data leakage or leakage in features, leads to an overestimation of model performance.
In their article, Kapoor and Narayana provide a list of scientific papers where leakage was detected. A comparable overview is currently not available for commercial ML applications.
The paper by Kapoor and Narayana is a recommended read for all those interested in ML. It helps the addressees of ML evaluations to ask the right questions to model builders and data scientists.
[1] Kapoor S., Naravanan A.: Leakage and the Reproducability Crisis in ML-based Science. Draft Paper. Center for Information Technology Policy at Princeton University. 2022.