Leakage in Machine Learning Studies

Machine learning (ML) has become an essential technology for harnessing ever-growing pools of data, in both scientific and commercial applications.

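In a recent paper [1], Sayash Kapoor and Arvind Narayanan draw attention to a problem called "leakage" in ML-based studies. Leakage occurs when information that would not be available at prediction time finds its way into model training or evaluation. Whether it enters through the data or through the features, it leads to an overestimation of model performance.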
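To make the effect concrete, here is a minimal sketch of one simple form of data leakage: test rows that also appear in the training set. It is an illustration written for this post, not an example taken from the paper. The data is synthetic random noise, and the names one_nn_predict, accuracy, clean_test and leaky_test are hypothetical.

```python
import random


def one_nn_predict(train, x):
    """Return the label of the training row closest to x (1-nearest neighbour)."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )
    return nearest[1]


def accuracy(train, test):
    """Fraction of test rows whose label the 1-NN model predicts correctly."""
    hits = sum(one_nn_predict(train, x) == y for x, y in test)
    return hits / len(test)


rng = random.Random(0)

# Assumption: features and labels are pure noise, so there is nothing to learn.
data = [
    ([rng.random() for _ in range(5)], rng.randint(0, 1))
    for _ in range(400)
]

train, clean_test = data[:300], data[300:]

# Leaky evaluation: the "test" rows are drawn from the training set itself.
leaky_test = rng.sample(train, 100)

clean_acc = accuracy(train, clean_test)
leaky_acc = accuracy(train, leaky_test)

print(f"accuracy on a clean test set: {clean_acc:.2f}")
print(f"accuracy on a leaky test set: {leaky_acc:.2f}")
```

Because every leaked test row has an exact copy in the training set, the model simply looks up the memorized label and the leaky score is perfect, even though the data contains no signal. On the clean test set the same model can only guess.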

In their article, Kapoor and Narayanan provide a list of scientific papers in which leakage was detected. A comparable overview is currently not available for commercial ML applications.

The paper by Kapoor and Narayanan is recommended reading for anyone interested in ML. It helps those who receive ML evaluations to ask model builders and data scientists the right questions.


[1] Kapoor S., Narayanan A.: Leakage and the Reproducibility Crisis in ML-based Science. Draft Paper. Center for Information Technology Policy at Princeton University. 2022.

https://reproducible.cs.princeton.edu