Machine learning is undoubtedly an essential technology for harnessing the ever-growing data pools. This applies to both scientific and commercial applications.
In a recent paper[1], Sayash Kapoor and Arvind Narayanan draw attention to a problem called “leakage” in ML-based studies. Leakage, whether it occurs in the data or in the features, leads to an overestimation of model performance.
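The effect can be illustrated with a minimal sketch. It uses one simple form of leakage, test rows that also appear in the training data, and it is not taken from the paper. The data are synthetic, the labels are coin flips, and all function names are hypothetical. A 1-nearest-neighbour classifier cannot learn anything from such data, yet its measured accuracy looks perfect as soon as the test rows leak into training.

```python
import random


def nearest_label(x, train):
    """1-nearest-neighbour prediction: label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, test):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(nearest_label(x, train) == y for x, y in test)
    return hits / len(test)


def leakage_demo(seed=0, n_train=300, n_test=200):
    """Return (clean accuracy, leaky accuracy) on pure-noise data."""
    rng = random.Random(seed)
    # The feature carries no signal: every label is a coin flip.
    data = [(rng.random(), rng.randint(0, 1)) for _ in range(n_train + n_test)]
    train, test = data[:n_train], data[n_train:]
    clean = accuracy(train, test)         # test rows never seen in training
    leaky = accuracy(train + test, test)  # test rows leaked into training
    return clean, leaky


clean_acc, leaky_acc = leakage_demo()
print(f"clean: {clean_acc:.2f}, leaky: {leaky_acc:.2f}")
```

The clean estimate stays close to chance level, while the leaky estimate reports a perfect score for a model that has learned nothing.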
In their article, Kapoor and Narayanan provide a list of scientific papers where leakage was detected. A comparable overview is currently not available for commercial ML applications.
The paper by Kapoor and Narayanan is a recommended read for all those interested in ML. It helps the recipients of ML evaluations ask model builders and data scientists the right questions.
[1] Kapoor S., Narayanan A.: Leakage and the Reproducibility Crisis in ML-based Science. Draft Paper. Center for Information Technology Policy at Princeton University. 2022.
Reporters Without Borders (RSF), an award-winning international NGO founded in 1985, is dedicated to defending and promoting press freedom. An important part of the NGO's work is the annual publication of the World Press Freedom Index.
This index is based on extensive questionnaire work. Here are some key aspects:
87 qualitative questions arranged in 6 umbrella categories such as transparency, media independence, and legislative framework.
An additional quantitative category measures the level of abuses and violence.
The questionnaire is translated into 20 languages, including Chinese and Russian, and completed for 180 countries.
Most questions use 10-point unipolar scales with numeric labels and verbalized endpoints. Some questions require yes/no answers, and another group of questions has fully verbalized scales with 4 response options.
According to RSF, the questionnaire is completed by several hundred experts such as journalists, lawyers, scientists and human rights activists. Indeed, a close look at the questionnaire shows that expert knowledge is required to answer it. Here is an example[1]:
Rightfully, and as a matter of transparency, RSF clearly states that this survey is not representative according to scientific criteria[2]. Consequently, no inferences are drawn from the results nor are any attempts made to calculate the sampling error.
Finally, the responses are combined in a weighted manner, and the respective formulas are available as well[3].
Despite the deliberate selection of survey participants and the very challenging questions and choice of scales, this survey is capable of providing important insights. This is particularly due to the methodological transparency, which the authors demonstrate in an exemplary manner.
One result of this work is an index that ranges between 0 (best possible score – absolute freedom of press) and 100 (worst possible score). RSF uses the following classification:
Score class    | Interpretation
up to 15.00    | Good situation
15.01-25.00    | Satisfactory situation
25.01-35.00    | Problematic situation
35.01-55.00    | Difficult situation
>55.00         | Very serious situation
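The classification above can be expressed as a small lookup. This is a minimal sketch based only on the class boundaries listed in the table. The function name `classify_rsf_score` is hypothetical and not part of any RSF tooling.

```python
def classify_rsf_score(score: float) -> str:
    """Map an index score (0 = best, 100 = worst) to its class label."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    # Upper bounds of the classes, taken from the table above.
    classes = [
        (15.0, "Good situation"),
        (25.0, "Satisfactory situation"),
        (35.0, "Problematic situation"),
        (55.0, "Difficult situation"),
    ]
    for upper_bound, label in classes:
        if score <= upper_bound:
            return label
    return "Very serious situation"


print(classify_rsf_score(12.5))   # Good situation
print(classify_rsf_score(60.0))   # Very serious situation
```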
Here is a snapshot of the latest results, published in January 2022[4]:
First 10 (all scores <15, good situation) | Last 10 (all scores >55, very serious situation)
Norway                                    | Cuba
Finland                                   | Laos
Sweden                                    | Syria
Denmark                                   | Iran
Costa Rica                                | Vietnam
Netherlands                               | Djibouti
Jamaica                                   | China
New Zealand                               | Turkmenistan
Portugal                                  | North Korea
Switzerland                               | Eritrea
This country ranking is an important result. What is decisive, however, is the concrete situation in which media workers operate in each country, since it determines the score and the ranking. RSF also describes this situation for each country and thus makes an important contribution to improving the working conditions of media professionals.
Scientific research reports are usually generously peppered with statistics, most often inferential ones. Weissgerber et al., for example, examined all original research articles published in June 2017 in the top 25% of physiology journals. There were 328, of which 85% included either a t-test or an analysis of variance[1]. Inferential statistical methods are (also) used to underpin the objectivity and reliability of studies. This intent is not problematic at all, provided that the information[2] on such statistical methods is documented and allows their assessment and, if necessary, their reproduction. Weissgerber’s contribution is a committed plea for transparency in statistical procedural matters and highly recommended reading overall.
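To make this kind of documentation concrete, here is a minimal sketch that collects several reportable details of an unpaired t-test without an equal-variance assumption (Welch's test). The function name and the example data are hypothetical. The exact p-value is left out because the Python standard library has no t-distribution, so it would have to come from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_report(a, b):
    """Return reportable ingredients of an unpaired Welch t-test."""
    n_a, n_b = len(a), len(b)
    # Squared standard error contribution of each group (sample variance / n).
    se2_a, se2_b = variance(a) / n_a, variance(b) / n_b
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return {
        "test": "Welch t-test (unpaired, unequal variances)",
        "n": (n_a, n_b),
        "t": t,
        "df": df,
    }


# Hypothetical measurements, for illustration only.
report = welch_t_report([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(report)
```

Stating the test variant, the group sizes, the test statistic and the degrees of freedom together lets a reader check the result independently.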
In many cases, the application of inferential procedures requires data from probability samples, in which every population element has a known non-zero chance of being included in the sample. However, in certain research settings probability samples are not possible. This is the case, for example, when the target population is not well known or study participants are not easily accessible for a survey. To gain any insights at all in such a setting, it may be necessary to use convenience sampling: respondents are included based on their accessibility and willingness to collaborate (self-selection). Clearly, this method does not allow inferences to be drawn about the underlying population. The identification and handling of outliers and the presence of biases are further concerns. Convenience samples are nevertheless a workable approach if all survey steps are transparently documented. A high degree of transparency allows the reader to interpret the study results in an informed way.
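The difference between the two sampling approaches can be illustrated with a small simulation. It is a sketch under strong assumptions: the population is synthetic, the attribute of interest is uniformly distributed between 0 and 1, and the willingness to respond is assumed to equal that attribute. All names are hypothetical.

```python
import random
from statistics import mean


def sampling_demo(seed=1, pop_size=20000, n=1000):
    """Return the mean of the population, a random sample, and a convenience sample."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]

    # Probability sample: every element has the same known inclusion chance.
    random_sample = rng.sample(population, n)

    # Convenience sample: a person responds with probability equal to the
    # attribute itself (self-selection), and the first n volunteers are taken.
    volunteers = [x for x in population if rng.random() < x]
    convenience_sample = volunteers[:n]

    return mean(population), mean(random_sample), mean(convenience_sample)


pop_mean, random_mean, convenience_mean = sampling_demo()
print(f"population:  {pop_mean:.3f}")
print(f"random:      {random_mean:.3f}")
print(f"convenience: {convenience_mean:.3f}")
```

Under these assumptions the random sample lands close to the population mean, while the self-selected sample overestimates it noticeably, because people with high values are more likely to take part.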
One social group whose overall size is difficult to quantify and whose members show little interest in scientific knowledge consists of the participants in anti-corona protests, commonly referred to as “Querdenker”. This group is very diverse and includes, among others, corona deniers, supporters of conspiracy theories and esotericists.
It is obviously not easy to conduct a statistical survey in this group. A team of scientists from the University of Basel has nevertheless turned its attention to this heterogeneous group and recently presented first results[3]. The aim of their research was to analyze the motives, values and beliefs of the participants in rallies, actions and demonstrations directed against the corona-related measures in Germany, Switzerland, and Austria. To achieve this goal, the researchers used a convenience sampling approach.
This initial report contains a comprehensive and very precise description of the survey work. It not only presents the challenges that arose in identifying the study population but also discusses aspects of data collection and mentions factors that may have biased the results. The extremely interesting survey results are presented in a neutral way, and hypotheses, interpretations, and conclusions are accordingly drawn very cautiously.
[2] e.g., for t-tests: exact p-values, power, sample size/degrees of freedom, paired/unpaired, variance assumption
[3] O. Nachtwey, R. Schäfer, N. Frei: Politische Soziologie der Corona-Proteste. Universität Basel, Institut für Soziologie. December 2020. (https://doi.org/10.31235/osf.io/zyp3)