[Preprint] Ranking the scores of algorithms with confidence (ESANN 2025)

Adrien Foucart, PhD in biomedical engineering.


In competitions, or in original research papers that compare the results of several algorithms on a given task, the centerpiece is generally the Big Table of Results. The Big Table of Results is where you put a list of algorithms on one axis, a list of metrics on the other axis, and put in bold the algorithm that performs best. In original research papers, it’s where you justify that your method is better than the others, with tables such as the one below. See? It’s in bold!

| Method | Result |
|---|---|
| Old Classic Baseline [1] | 0.71 |
| State-of-the-art from a few years back [2] | 0.82 |
| Previous work from the authors [3] | 0.82 |
| This work [4] | **0.84** |

In a competition, this gives us the leaderboard, which will look something like this:

| Rank | Team | Result |
|---|---|---|
| 1 | Big AI Research group | 0.91 |
| 2 | Big AI Company | 0.90 |
| 3 | Someone with 2 GPUs | 0.87 |
| 4 | Someone with 1 GPU | 0.84 |
| … | … | … |
| 157 | I don’t know what I’m doing | 0.42 |

Needless to say, things are a bit more complicated than that. In our new preprint, accepted at the ESANN 2025 conference and available online (PDF), we argue for a more nuanced approach to ranking where, instead of saying “this is the best method”, we compute confidence intervals on the rankings, based on the assumption that the test set is a random sample from the larger population of “all cases where we may want to apply our algorithm”. We follow the general procedure proposed by S. Holm in 2013¹, which uses the results of pairwise statistical tests to infer the confidence intervals.

Several options for these statistical tests are evaluated using a Monte Carlo simulation on synthetic data. The procedure that appears to be the most robust in our experiments is the following (a code sketch is given after the list):

  1. Perform an Iman-Davenport test comparing the m algorithms. If the null hypothesis (no significant difference between the results) cannot be rejected, we stop here: all algorithms get the same confidence interval [1, m].
  2. Compute pairwise one-sided Wilcoxon signed-rank tests, adjusting the p-values with Holm’s procedure. The rank confidence interval for each algorithm is then [1 + #sba, m − #swa], where #sba and #swa are the numbers of algorithms that are significantly better and significantly worse, respectively.
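
To make the two steps concrete, here is a minimal sketch using scipy. This is not the implementation from the paper or from cirank: the function name rank_confidence_intervals, the default alpha = 0.05, and the choice to Holm-adjust all ordered pairs together are my own assumptions.

```python
import numpy as np
from scipy import stats


def rank_confidence_intervals(results, alpha=0.05):
    """Confidence intervals on ranks for m algorithms evaluated on the same
    test cases. `results` is a list of m arrays of per-case scores
    (higher = better), one array per algorithm."""
    scores = np.asarray(results)              # shape (m, n)
    m, n = scores.shape

    # Step 1: Iman-Davenport test (F-distributed correction of the Friedman
    # statistic). If it is not significant, every algorithm gets the
    # uninformative interval [1, m].
    chi2_f, _ = stats.friedmanchisquare(*scores)
    f_stat = (n - 1) * chi2_f / (n * (m - 1) - chi2_f)
    p_global = stats.f.sf(f_stat, m - 1, (m - 1) * (n - 1))
    if p_global > alpha:
        return [(1, m)] * m

    # Step 2: pairwise one-sided Wilcoxon signed-rank tests over all ordered
    # pairs, with Holm's step-down adjustment of the p-values.
    pairs, pvals = [], []
    for i in range(m):
        for j in range(m):
            if i != j:
                # H1: algorithm i scores higher than algorithm j
                p = stats.wilcoxon(scores[i], scores[j],
                                   alternative='greater').pvalue
                pairs.append((i, j))
                pvals.append(p)

    adjusted = np.empty(len(pvals))
    running_max = 0.0
    for step, idx in enumerate(np.argsort(pvals)):   # Holm adjustment
        running_max = max(running_max, (len(pvals) - step) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)

    n_better = np.zeros(m, dtype=int)   # significantly better competitors
    n_worse = np.zeros(m, dtype=int)    # significantly worse competitors
    for (i, j), p in zip(pairs, adjusted):
        if p <= alpha:                  # i is significantly better than j
            n_better[j] += 1
            n_worse[i] += 1

    return [(1 + n_better[k], m - n_worse[k]) for k in range(m)]
```

With m = 5 algorithms, for example, an algorithm that is significantly beaten by one competitor and significantly beats two others gets the interval [1 + 1, 5 − 2] = [2, 3].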

Alongside the paper, we release the cirank Python library, which you can use to compute the confidence intervals:

from cirank import ci_ranking
import numpy as np  # for generating example data

# example results: per-case scores of 5 algorithms on the same 10 test cases
results = [np.random.random((10,)) for _ in range(5)]

# default method
rankings = ci_ranking(results)
print(rankings)
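
As a rough, purely illustrative version of the kind of Monte Carlo check on synthetic data mentioned above (not the paper’s actual simulation protocol), one can generate scores for algorithms with a known true ordering and count how often each true rank falls inside the computed interval. The snippet below reuses the hypothetical rank_confidence_intervals sketch from earlier, and the score model (a shared per-case difficulty plus a fixed algorithm effect and noise) is my own assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n_trials = 5, 30, 200
true_ranks = np.arange(m, 0, -1)   # algorithm m-1 has the highest mean -> rank 1

covered = np.zeros(m)
for _ in range(n_trials):
    # per-case score = shared case difficulty + algorithm effect + noise
    difficulty = rng.normal(0.0, 0.1, size=n)
    results = [0.5 + 0.05 * k + difficulty + rng.normal(0.0, 0.05, size=n)
               for k in range(m)]
    intervals = rank_confidence_intervals(results)
    for k, (low, high) in enumerate(intervals):
        covered[k] += low <= true_ranks[k] <= high

print("empirical coverage of the true rank:", covered / n_trials)
```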

Both the paper and the library are currently limited in scope, and are likely to be expanded in the future, as discussed in the paper.

Reference:

A. Foucart, A. Elskens, C. Decaestecker. Ranking the scores of algorithms with confidence. ESANN 2025 (accepted).


  1. S. Holm. Confidence intervals for ranks. https://www.diva-portal.org/smash/get/diva2:634016/fulltext01.pdf, 2013.