Leakage in Machine Learning Studies

Machine learning is undoubtedly an essential technology for harnessing ever-growing pools of data. This applies to both scientific and commercial applications.

In a recent paper[1], Sayash Kapoor and Arvind Narayanan draw attention to a problem called “leakage” in ML-based studies. Leakage, whether it occurs in the data or in the features, leads to an overestimation of model performance.

In their article, Kapoor and Narayanan provide a list of scientific papers in which leakage was detected. A comparable overview is currently not available for commercial ML applications.

The paper by Kapoor and Narayanan is recommended reading for everyone interested in ML. It helps the recipients of ML evaluations to ask model builders and data scientists the right questions.
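To make the idea of leakage more tangible, here is a minimal Python sketch of two simple safeguards: estimating preprocessing parameters on the training data only, and checking whether identical rows appear in both the training and the test set. The function names and the toy data are assumptions made for this illustration; they are not taken from the paper.

```python
from statistics import mean, stdev


def fit_scaler(train_values):
    """Estimate scaling parameters from the training data only."""
    return mean(train_values), stdev(train_values)


def apply_scaler(values, center, scale):
    """Standardize values with parameters that were estimated elsewhere."""
    return [(v - center) / scale for v in values]


def shared_rows(train_rows, test_rows):
    """Return rows that occur in both sets (a simple duplicate check)."""
    return sorted(set(train_rows) & set(test_rows))


# Toy data (hypothetical): one numeric feature, split BEFORE any preprocessing.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 12.0]

# Leak-free: the test set plays no part in estimating the parameters.
center, scale = fit_scaler(train)
test_scaled = apply_scaler(test, center, scale)

# Leaky variant, for comparison only: parameters estimated on train and test.
leaky_center, leaky_scale = fit_scaler(train + test)

print("center from training data only:", center)
print("center when test data leak in: ", leaky_center)
print("rows shared by both sets:", shared_rows([(1, "a"), (2, "b")], [(2, "b"), (3, "c")]))
```

In the leaky variant, information about the test data flows into the preprocessing step, so any performance measured on that test set afterwards is no longer an honest estimate.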


[1] Kapoor S., Narayanan A.: Leakage and the Reproducibility Crisis in ML-based Science. Draft Paper. Center for Information Technology Policy at Princeton University. 2022.

https://reproducible.cs.princeton.edu

A Remark on Transparency in Statistical Reporting

Scientific research reports are usually generously peppered with statistics, most often inferential ones. Weissgerber et al., for example, examined all original research articles published in June 2017 in the top 25% of physiology journals. Of these 328 articles, 85% included either a t-test or an analysis of variance[1]. Inferential statistical methods are (also) used to underpin the objectivity and reliability of studies. There is nothing wrong with this intent, provided that the information[2] on such methods is documented in a way that allows readers to assess and, if necessary, reproduce them. The contribution by Weissgerber et al. is a committed plea for transparency in statistical procedures and highly recommended reading.
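As an illustration of what a transparent t-test report might contain, the following Python sketch computes the figures for an unpaired t-test without the equal-variance assumption (Welch's test) and returns them together with the sample sizes and the test variant. The function name and the example data are hypothetical. The exact p-value is deliberately left out, because it requires the t distribution, which is not part of Python's standard library; a statistics package would be needed for that step.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_report(group_a, group_b):
    """Return the key figures of an unpaired Welch t-test as a dictionary."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances
    se_sq_a, se_sq_b = var_a / n_a, var_b / n_b
    t = (mean(group_a) - mean(group_b)) / sqrt(se_sq_a + se_sq_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
    )
    return {
        "test": "unpaired t-test, unequal variances (Welch)",
        "n_a": n_a,
        "n_b": n_b,
        "t": t,
        "df": df,
    }


# Hypothetical measurements from two independent groups
report = welch_t_report([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
for key, value in report.items():
    print(f"{key}: {value}")
```

Reporting the test variant, the sample sizes, and the degrees of freedom alongside the test statistic lets a reader check which procedure was actually used.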

In many cases, the application of inferential procedures requires data from probability samples, in which every population element has a known, non-zero chance of being included. In certain research settings, however, probability samples are not possible, for example when the target population is not well known or study participants are not easily accessible for a survey. To gain any insights at all in such a setting, it may be necessary to use convenience sampling: respondents are included based on their accessibility and willingness to take part (self-selection). Clearly, this method does not allow inferences to be drawn about the underlying population. The identification and handling of outliers and the presence of biases are further concerns. Convenience samples are nevertheless a workable approach if all survey steps are transparently documented. A high degree of transparency allows the reader to interpret the study results in an informed way.
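The difference between the two sampling schemes can be illustrated with a small simulation. The following Python sketch is purely illustrative: the population, the 30% share, and the assumed response behaviour are invented for this example and do not come from any study.

```python
import random

# Hypothetical population: 10,000 people, 30% of whom hold opinion "1".
population = [1] * 3000 + [0] * 7000
true_share = sum(population) / len(population)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Probability sample: simple random sampling gives every element the same
# known inclusion probability n / N.
n = 1000
inclusion_probability = n / len(population)
srs = rng.sample(population, n)
srs_share = sum(srs) / n


def responds(opinion):
    """Assumed self-selection: holders of opinion "1" respond three times as often."""
    willingness = 0.3 if opinion == 1 else 0.1
    return rng.random() < willingness


# Convenience sample: whoever is willing to respond ends up in the sample.
convenience = [person for person in population if responds(person)]
convenience_share = sum(convenience) / len(convenience)

print("true share:              ", true_share)
print("simple random sample:    ", srs_share)
print("convenience sample:      ", round(convenience_share, 3))
```

Under these assumptions the simple random sample lands close to the true share, while the self-selected sample overstates it considerably, and nothing in the convenience sample itself reveals by how much.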

One social group whose overall size is difficult to quantify and whose members show little interest in scientific research is made up of the participants in anti-corona protests, commonly referred to as “Querdenker”. This group is very diverse and includes, among others, corona deniers, supporters of conspiracy theories, and followers of esotericism.

It is obviously not easy to conduct a statistical survey in this group. A team of scientists from the University of Basel has nevertheless turned its attention to this heterogeneous group and recently presented initial results[3]. The aim of their research was to analyze the motives, values, and beliefs of the participants in rallies, actions, and demonstrations directed against the corona-related measures in Germany, Switzerland, and Austria. To achieve this goal, the researchers used a convenience sampling approach.

This first report of results contains a comprehensive and very precise description of the survey work. It presents the challenges that arose in identifying the study population, discusses aspects of data collection, and names factors that may have biased the results. The extremely interesting survey results are presented in a neutral way, and hypotheses, interpretations, and conclusions are accordingly drawn very cautiously.

This is transparency at its best!

Visit us at www.evaluaid.consulting or contact us at agieshoff@evaluaid.consulting to learn more about evaluaid and our professional statistical services for you.


[1] Weissgerber et al. eLife 2018;7:e36163

[2] e.g., for t-tests: exact p-values, power, sample size/degrees of freedom, paired/unpaired, variance assumption

[3] O. Nachtwey, R. Schäfer, N. Frei: Politische Soziologie der Corona-Proteste. Universität Basel. Institut für Soziologie. December 2020. (https://doi.org/10.31235/osf.io/zyp3)

Statistics in Action: Measuring Freedom of Press and Information

Reporters Without Borders (RSF), an award-winning international NGO founded in 1985, is dedicated to defending and promoting press freedom. An important part of the NGO's work is the annual publication of the World Press Freedom Index.

This index is based on extensive questionnaire work. Here are some key aspects:

87 qualitative questions arranged in 6 umbrella categories such as transparency, media independence, and legislative framework.

An additional quantitative category measures the level of abuses and violence.

The questionnaire is translated into 20 languages including Chinese and Russian, and completed for 180 countries.

Most questions use 10-point unipolar scales with numeric labels, of which only the endpoints are verbalized. Some questions require yes/no answers, and another group of questions has fully verbalized scales with 4 response options.

According to RSF, the questionnaire is completed by several hundred experts such as journalists, lawyers, scientists, and human rights activists. Indeed, a close look at the questionnaire shows that expert knowledge is evidently required to answer it. Here is an example[1]:

[Image: sample question from the RSF questionnaire]

Rightfully, and as a matter of transparency, RSF clearly states that this survey is not representative according to scientific criteria[2]. Consequently, no inferences are drawn from the results nor are any attempts made to calculate the sampling error.

Finally, the responses are combined in a weighted manner, and the respective formulas are available as well[3].

Despite the deliberate selection of survey participants and the very challenging questions and choice of scales, this survey is capable of providing important insights. This is particularly due to the methodological transparency, which the authors demonstrate in an exemplary manner.

One result of this work is an index that ranges between 0 (best possible score, absolute freedom of the press) and 100 (worst possible score). RSF uses the following classification:

Score class      Interpretation
up to 15.00      Good situation
15.01-25.00      Satisfactory situation
25.01-35.00      Problematic situation
35.01-55.00      Difficult situation
above 55.00      Very serious situation
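The classification above translates directly into a small lookup. The following Python sketch is an illustration only; the function name is hypothetical, and the thresholds are simply those of the classification shown here, not an official RSF implementation.

```python
def classify_score(score):
    """Map an index score (0 = best, 100 = worst) to its class."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    if score <= 15:
        return "Good situation"
    if score <= 25:
        return "Satisfactory situation"
    if score <= 35:
        return "Problematic situation"
    if score <= 55:
        return "Difficult situation"
    return "Very serious situation"


# Hypothetical scores, chosen to fall into each of the five classes
for example_score in (7.5, 15.01, 30, 55, 80):
    print(example_score, "->", classify_score(example_score))
```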

Here is a snapshot of the latest results, published in January 2022[4]:

First 10 (all scores <15, good situation)    Last 10 (all scores >55, very serious situation)
Norway                                       Cuba
Finland                                      Laos
Sweden                                       Syria
Denmark                                      Iran
Costa Rica                                   Vietnam
Netherlands                                  Djibouti
Jamaica                                      China
New Zealand                                  Turkmenistan
Portugal                                     North Korea
Switzerland                                  Eritrea

This country ranking is an important result. What matters most, however, is the concrete situation in which media workers operate in a country, since it determines the score and the ranking. RSF also provides a description of this situation and thus makes an important contribution to improving the working conditions of media professionals.



[1] https://rsf.org/sites/default/files/rsf_survey_en.pdf

[2] https://www.reporter-ohne-grenzen.de/fileadmin/Redaktion/Downloads/Ranglisten/Rangliste_2021/Methodik_Rangliste_der_Pressefreiheit2021_-_RSF.pdf

[3] https://rsf.org/en/detailed-methodology

[4] https://rsf.org/en/ranking

Like an Echo Chamber

We gather a large, if not the largest, part of our information from the internet. News portals and social media are only a mouse click away and provide the information we think we need. This is a very positive side of the internet. A downside, however, is that people may, under certain circumstances, find themselves in echo chambers, where like-minded people group together and listen to information and arguments that point in one direction only and merely replicate and confirm existing knowledge, views, and opinions.

At the heart of all big data analytics projects lies the specification of predictive statistical models. A situation often encountered is that a model fits the data set used to train it well but performs much worse when applied to new data. This phenomenon is called over-fitting. The model merely replicates and confirms data it has already seen and cannot deal with new data. It does not generalize and is useless for predictions.

And this is like a statistical echo chamber.
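A toy example may help to picture this. In the following Python sketch, a "memorizer" (a 1-nearest-neighbour rule) simply repeats the label of the closest training point, including two deliberately noisy labels, while a simple threshold rule ignores the noise. All data and function names are invented for this illustration.

```python
def accuracy(predict, xs, ys):
    """Share of points for which the prediction equals the label."""
    hits = sum(1 for x, y in zip(xs, ys) if predict(x) == y)
    return hits / len(xs)


# Hypothetical data: the true rule is "label 1 if x >= 5", but two training
# labels (at x = 2 and x = 7) are noisy.
train_x = list(range(10))
train_y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
new_x = [x + 0.4 for x in range(10)]
new_y = [1 if x >= 5 else 0 for x in new_x]


def memorizer(x):
    """1-nearest-neighbour: repeat the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def threshold_rule(x):
    """A deliberately simple model with a single cut-off."""
    return 1 if x >= 5 else 0


for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name, accuracy(model, train_x, train_y), accuracy(model, new_x, new_y))
```

On this toy data the memorizer classifies all training points correctly but only 80% of the new points, whereas the simpler rule gets 80% of the (noisy) training labels and all of the new points right. The memorizer has learned the noise along with the signal.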

Of course, statistical science provides techniques to mitigate over-fitting, such as feature removal or cross-validation. Such techniques, however, do not replace the fundamental requirement to use relevant, representative, and high-quality data for the modeling process. A health care data set that contains only the diabetes prescriptions of patients older than 55 years will never be useful for specifying a predictive model of diabetes product consumption in the entire population. This holds irrespective of the statistical models used and the buzzwords surrounding them.
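For readers unfamiliar with cross-validation, here is a minimal Python sketch of the k-fold variant: the data are split into k parts, and each part is held out once while the model is fitted on the rest. The function names, the example values, and the trivial "predict the training mean" model are placeholders chosen for this illustration. The sketch also assumes that the order of the data carries no pattern; in practice the data would usually be shuffled first.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k folds of (almost) equal size."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        ([j for j in range(n) if j not in fold], fold)  # (training, held-out)
        for fold in folds
    ]


def cross_validated_error(ys, k):
    """Mean squared error of a 'predict the training mean' model under k-fold CV."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = mean(ys[j] for j in train_idx)  # fitted without the held-out fold
        errors.extend((ys[j] - prediction) ** 2 for j in test_idx)
    return mean(errors)


values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical observations
print(cross_validated_error(values, k=3))
```

Because every observation is predicted by a model that never saw it, the resulting error is a more honest estimate of performance on new data than the error on the training set. It says nothing, however, about data from a population that the data set does not cover.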

Investments in data quality and relevance give good returns in terms of model quality and relevance.