ChatGPT on evaluation metrics
Adrien Foucart, PhD in biomedical engineering.
Get sparse and irregular email updates by subscribing to https://adfoucart.substack.com. This website is guaranteed 100% human-written, ad-free and tracker-free.
I recently reviewed a paper which I strongly suspect, without certainty, had a little help from ChatGPT for its research. One thing that tipped me off was that it was a bit confused about the evaluation metrics: it reported, for a segmentation task, both the “F1” score and the “Dice” score, which it defined as:
$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$
$$Dice = \frac{2 |P \cap G|}{|P| + |G|}$$
where P and G refer to the predicted and ground truth regions, respectively.
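For reference, since they will matter later, precision and recall can themselves be written in terms of true positives (TP), false positives (FP) and false negatives (FN):

$$precision = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$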
Both of these definitions are fine, but… well, before I explain the problem, let’s see what ChatGPT (version 3.5) has to say about it.
User:
What is the difference between the F1 score and the Dice Similarity Coefficient?
ChatGPT’s answer:
The F1 score and the Dice Similarity Coefficient (also known as Dice coefficient or Dice index) are both metrics commonly used to evaluate the performance of binary classification tasks, particularly in the context of image segmentation or medical imaging. While they serve similar purposes, they have some differences in how they are calculated and what aspects of the classification performance they emphasize.
Differences
In summary, both metrics are valuable tools for evaluating classification performance, but the choice between them depends on the specific requirements and characteristics of the task at hand.
ChatGPT is verbose as always, but the two definitions are mostly correct. Nothing much to say about the F1 score, and for Dice the only real problem I have is with “measures the overlap,” as I’d rather say “is one of the measures of overlap.” The other main one is the Jaccard Index, also known as “Intersection over Union.”
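For completeness, the Jaccard index is computed over the same regions as:

$$Jaccard = IoU = \frac{|P \cap G|}{|P \cup G|}$$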
The “differences” section is where I have far more problems. The F1 score is balanced in terms of precision and recall, but it is not at all appropriate when there is a class imbalance (see Chapter 4 of my thesis for more!). And “the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant” is nonsense.
But that’s still not the main issue. Let’s see if I can get ChatGPT to notice it.
User:
Express both the F1 score and the Dice Similarity Coefficient in terms of True Positives, False Positives and False Negatives.
ChatGPT’s answer:
Sure, here are the formulas for both the F1 score and the Dice Similarity Coefficient expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):
In both cases, TP represents the number of true positive predictions, FP represents the number of false positive predictions, and FN represents the number of false negative predictions. These formulas allow us to calculate both metrics based on the counts of these classification outcomes.
The formulas are still correct! But weirdly, ChatGPT didn’t exactly do as I asked: it still expressed the F1 score in terms of precision and recall, not in terms of TP, FP and FN. Let’s prompt it a bit further.
User:
You put the F1 Score in terms of P and R, give the complete formula for the F1 Score based on TP, FP and FN.
ChatGPT’s answer:
Apologies for the oversight. Here’s the complete formula for the F1 score expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):
F1 Score = (2 * TP) / (2 * TP + FP + FN)
This formula directly uses the counts of true positives, false positives, and false negatives to compute the F1 score, without needing to calculate precision and recall separately.
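The formula checks out: plugging the definitions of precision and recall into the F1 formula and simplifying gives

$$F1 = \frac{2 \times \frac{TP}{TP+FP} \times \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$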
So let’s put the two formulas next to each other:
$$Dice = \frac{2 \times TP}{2 \times TP + FP + FN}$$
$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$
They are the same thing! Dice and F1 are simply two different names for the same metric. The only real difference is in where they are typically used: “F1” is the more common terminology in classification and detection problems, whereas “Dice” is more often used for segmentation problems. But they are the same.
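If you want to convince yourself numerically, here is a minimal sketch in Python (NumPy and the toy masks are my own illustration, not anything from the paper): one function computes Dice from the overlap of two binary masks, the other computes F1 from TP, FP and FN, and they return exactly the same value.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2|P ∩ G| / (|P| + |G|)."""
    intersection = np.logical_and(pred, gt).sum()
    return 2 * intersection / (pred.sum() + gt.sum())

def f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """F1 score: 2TP / (2TP + FP + FN)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, np.logical_not(gt)).sum()
    fn = np.logical_and(np.logical_not(pred), gt).sum()
    return 2 * tp / (2 * tp + fp + fn)

# Toy "segmentation" masks (1 = foreground, 0 = background)
pred = np.array([1, 1, 1, 0, 0, 1, 0, 0])
gt = np.array([1, 1, 0, 0, 1, 1, 0, 0])

print(dice(pred, gt), f1(pred, gt))  # both print 0.75
```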
All the talk about their differences was complete bullshit. But it would have been relatively difficult to spot without already knowing the information, which is always the problem with trying to use ChatGPT as a source of knowledge.
It was not a very good paper, and whether the authors misunderstood evaluation metrics because of ChatGPT or all on their own, I don’t think their study will be published, at least in the journal they were aiming for.
But after more than a year of people trying to convince me that ChatGPT really is usually right, and useful as a source of information… I remain firmly convinced that it only looks “right” if you don’t know the topic well enough to fully fact-check the answer. In this case, the information is easily found on Google (even on Bing!). On Google, the first line I see with the same original prompt is: “Dice coefficient = F1 score,” from a Kaggle notebook.
Sure, “if I upgrade to the paid version” it may be better. But the only thing that really makes ChatGPT-Plus better is the ability to search the web, and I’d rather do that directly on my own, so that I can see the information in its context and critically assess it.