Leakage in Machine Learning Studies

Machine learning is undoubtedly an essential technology for harnessing ever-growing pools of data. This applies to both scientific and commercial applications.

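In a recent paper[1], Sayash Kapoor and Arvind Narayanan draw attention to a problem called “leakage” in ML-based studies. Leakage, be it data leakage or leakage in features, means that information which would not be available when the model is actually used finds its way into model building or evaluation. The result is an overestimation of model performance.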
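The effect can be illustrated with a small simulation. The following is a minimal sketch and not taken from the paper: it assumes pure-noise binary features and labels, and the helper names (best_rule, accuracy, simulate) are made up for this illustration. A single feature is selected either on all rows (leaky, because the test rows take part in the selection) or on the training rows only (clean), and is then evaluated on the same test rows.

```python
import random


def best_rule(features, labels, rows):
    """Return (column, polarity) of the feature that agrees best with the labels on `rows`."""
    best = (0, 1, -1)  # (column, polarity, number of agreements)
    for col in range(len(features[0])):
        hits = sum(features[r][col] == labels[r] for r in rows)
        # polarity 1: predict the feature value; polarity 0: predict its opposite
        for polarity, score in ((1, hits), (0, len(rows) - hits)):
            if score > best[2]:
                best = (col, polarity, score)
    return best[0], best[1]


def accuracy(features, labels, rows, col, polarity):
    """Share of `rows` on which the one-feature rule predicts the label correctly."""
    correct = 0
    for r in rows:
        prediction = features[r][col] if polarity == 1 else 1 - features[r][col]
        correct += prediction == labels[r]
    return correct / len(rows)


def simulate(repetitions=100, n_samples=40, n_features=200, seed=0):
    """Average test accuracy with leaky and with clean feature selection."""
    rng = random.Random(seed)
    leaky_total = clean_total = 0.0
    for _ in range(repetitions):
        # Assumption: pure noise, so the features carry no information about the labels.
        features = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
        labels = [rng.randint(0, 1) for _ in range(n_samples)]
        train = list(range(n_samples // 2))
        test = list(range(n_samples // 2, n_samples))

        # Leaky: the feature is chosen using ALL rows, including the test rows.
        col, pol = best_rule(features, labels, train + test)
        leaky_total += accuracy(features, labels, test, col, pol)

        # Clean: the feature is chosen using the training rows only.
        col, pol = best_rule(features, labels, train)
        clean_total += accuracy(features, labels, test, col, pol)
    return leaky_total / repetitions, clean_total / repetitions


leaky_acc, clean_acc = simulate()
print(f"leaky selection: {leaky_acc:.2f}   clean selection: {clean_acc:.2f}")
```

Since there is nothing to learn in this data, an honest evaluation should land near 50% accuracy. The leaky variant reports a clearly higher figure only because the test rows were used to pick the feature.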

In their article, Kapoor and Narayanan provide a list of scientific papers in which leakage was detected. A comparable overview is currently not available for commercial ML applications.

The paper by Kapoor and Narayanan is recommended reading for anyone interested in ML. It helps the recipients of ML evaluations to ask model builders and data scientists the right questions.


[1] Kapoor S., Narayanan A.: Leakage and the Reproducibility Crisis in ML-based Science. Draft Paper. Center for Information Technology Policy at Princeton University. 2022.

https://reproducible.cs.princeton.edu

A Remark on Transparency in Statistical Reporting

Scientific research reports are usually generously peppered with statistics, most often of an inferential nature. Weissgerber et al., for example, examined all original research articles published in June 2017 in the top 25% of physiology journals. There were 328, of which 85% included either a t-test or an analysis of variance[1]. Inferential statistical methods are (also) used to underpin the objectivity and reliability of studies. There is nothing wrong with this intent, provided that the information[2] on such methods is documented well enough to allow them to be assessed and, if necessary, reproduced. The contribution by Weissgerber et al. is a committed plea for transparency in statistical procedures and highly recommended reading overall.
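As an illustration of what such documentation can look like for a t-test, here is a minimal sketch using only the Python standard library. The function name welch_t_report and the example numbers are made up for this illustration. It assumes an unpaired comparison without an equal-variance assumption (Welch's t-test) and collects the design, group sizes, test statistic and degrees of freedom in one place. The exact p-value and the power are left out because they require the t distribution, which the standard library does not provide; a statistics package would be needed for those.

```python
import math
import statistics


def welch_t_report(group_a, group_b):
    """Collect the details a reader needs to assess an unpaired t-test."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Sample variances (n - 1 in the denominator), each divided by its group size.
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return {
        "test": "Welch t-test",
        "design": "unpaired",
        "variance_assumption": "equal variances not assumed",
        "n": (n_a, n_b),
        "means": (mean_a, mean_b),
        "t": t_stat,
        "df": df,
    }


# Made-up example data, for illustration only.
report = welch_t_report([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
for key, value in report.items():
    print(f"{key}: {value}")
```

Reporting these items together, rather than a bare "analyzed by t-test", lets a reader see which variant of the test was run and check the result.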

In many cases, the application of inferential procedures requires data from probability samples, in which every population element has a known, non-zero chance of being included in the sample. In certain research settings, however, probability samples are not possible. This is the case, for example, when the target population is not well known or when study participants are not easily accessible for a survey. To gain any insights at all in such a setting, it may be necessary to use convenience sampling: respondents are included based on their accessibility and willingness to collaborate (self-selection). Clearly, this method does not allow inferences to be drawn about the underlying population. The identification and handling of outliers and the presence of biases are further concerns. Convenience samples are nevertheless a workable approach if all survey steps are transparently documented. A high degree of transparency allows the reader to interpret the study results in an informed way.
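The difference between the two sampling approaches can be made tangible with a small simulation. This is a sketch under loudly stated assumptions, not a model of any real survey: the population is hypothetical, each person has one score between 0 and 1, and the chance of volunteering is assumed to equal that score, so that self-selection is tied to the very quantity being measured.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: one score per person, uniform between 0 and 1.
population = [rng.random() for _ in range(20_000)]
population_mean = statistics.mean(population)

# Probability sample: every person has the same known chance of selection.
probability_sample = rng.sample(population, 1_000)
probability_mean = statistics.mean(probability_sample)

# Convenience sample: people opt in, and (by assumption) the chance of
# opting in equals the score itself, so high scorers are over-represented.
convenience_sample = [score for score in population if rng.random() < score]
convenience_mean = statistics.mean(convenience_sample)

print(f"population mean:         {population_mean:.3f}")
print(f"probability sample mean: {probability_mean:.3f}")
print(f"convenience sample mean: {convenience_mean:.3f}")
```

Under these assumptions the probability sample lands close to the population mean of about one half, while the self-selected respondents average about two thirds in expectation. The convenience sample is the larger of the two, and a larger sample does not repair the bias. In a real study the selection mechanism is unknown, which is why transparent documentation of the survey steps matters so much.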

One social group whose overall size is difficult to quantify, and whose members have little interest in scientific findings, consists of the participants in anti-Corona protests, commonly referred to as “Querdenker”. This group is very diverse and includes, among others, corona deniers, supporters of conspiracy theories and followers of esotericism.

It is obviously not easy to conduct a statistical survey in this group. A team of scientists from the University of Basel has nevertheless turned its attention to this heterogeneous group and recently presented first results[3]. The aim of their research was to analyze the motives, values and beliefs of the participants in rallies, actions and demonstrations directed against the corona-related measures in Germany, Switzerland, and Austria. To achieve this goal, the researchers used a convenience sampling approach.

This first report of results contains a comprehensive and very precise description of the survey work. It not only presents the challenges that arose in identifying the study population, but also discusses aspects of data collection and names factors that may have biased the results. The extremely interesting survey results are presented in a neutral way, and hypotheses, interpretations, and conclusions are accordingly drawn very cautiously.

This is transparency at its best!

Visit us at www.evaluaid.consulting or contact us at agieshoff@evaluaid.consulting to learn more about evaluaid and our professional statistical services for you.


[1] Weissgerber et al. eLife 2018;7:e36163

[2] e.g., for t-tests: exact p-values, power, sample size/degrees of freedom, paired/unpaired, variance assumption

[3] O. Nachtwey, R. Schäfer, N. Frei: Politische Soziologie der Corona-Proteste. Universität Basel, Institut für Soziologie. December 2020. (https://doi.org/10.31235/osf.io/zyp3)