ChatGPT on evaluation metrics
Adrien Foucart, PhD in biomedical engineering.
Get sparse and irregular email updates by subscribing to https://adfoucart.substack.com. This website is guaranteed 100% human-written, ad-free and tracker-free.
I recently reviewed a paper which I strongly suspect, without certainty, had a little help from ChatGPT for its research. One thing that tipped me off was that it was a bit confused about the evaluation metrics: it reported, for a segmentation task, both the “F1” score and the “Dice” score, which it defined as:
$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$
$$Dice = \frac{2 |P \cap G|}{|P| + |G|}$$
where P and G refer to the predicted and ground truth regions, respectively.
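For reference, since they will matter later, precision and recall can themselves be written in terms of true positives (TP), false positives (FP) and false negatives (FN):

$$precision = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$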
Both of these definitions are fine, but… well, before I explain the problem, let’s see what ChatGPT (version 3.5) has to say about it.
User:
What is the difference between the F1 score and the Dice Similarity Coefficient?
ChatGPT’s answer:
The F1 score and the Dice Similarity Coefficient (also known as Dice coefficient or Dice index) are both metrics commonly used to evaluate the performance of binary classification tasks, particularly in the context of image segmentation or medical imaging. While they serve similar purposes, they have some differences in how they are calculated and what aspects of the classification performance they emphasize.
Differences
In summary, both metrics are valuable tools for evaluating classification performance, but the choice between them depends on the specific requirements and characteristics of the task at hand.
ChatGPT is verbose as always, but the two definitions are mostly correct. Nothing much to say about the F1 score, and for Dice the only real problem I have is with “measures the overlap,” as I’d rather say “is one of the measures of overlap.” The other main one is the Jaccard Index, also known as “Intersection over Union.”
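For completeness, the Jaccard index is computed over the same regions as:

$$Jaccard = IoU = \frac{|P \cap G|}{|P \cup G|}$$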
The “differences” section is where I have far more problems. The F1 score is balanced in terms of precision and recall, but it is not at all appropriate when there is a class imbalance (see Chapter 4 of my thesis for more!). And “the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant” is nonsense.
But that’s still not the main issue. Let’s see if I can get ChatGPT to notice it.
User:
Express both the F1 score and the Dice Similarity Coefficient in terms of True Positives, False Positives and False Negatives.
ChatGPT’s answer:
Sure, here are the formulas for both the F1 score and the Dice Similarity Coefficient expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):
In both cases, TP represents the number of true positive predictions, FP represents the number of false positive predictions, and FN represents the number of false negative predictions. These formulas allow us to calculate both metrics based on the counts of these classification outcomes.
The formulas are still correct! But weirdly, ChatGPT didn’t exactly do as I asked: it still expressed the F1 score in terms of precision and recall, not in terms of TP, FP and FN. Let’s prompt it a bit further.
User:
You put the F1 Score in terms of P and R, give the complete formula for the F1 Score based on TP, FP and FN.
ChatGPT’s answer:
Apologies for the oversight. Here’s the complete formula for the F1 score expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):
F1 Score = (2 * TP) / (2 * TP + FP + FN)
This formula directly uses the counts of true positives, false positives, and false negatives to compute the F1 score, without needing to calculate precision and recall separately.
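The formula checks out: plugging the definitions of precision and recall into the F1 formula and simplifying gives

$$F1 = \frac{2 \times \frac{TP}{TP+FP} \times \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$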
So let’s put the two formulas next to each other:
$$Dice = \frac{2 \times TP}{2 \times TP + FP + FN}$$
$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$
They are the same thing! Dice and F1 are simply two different names for the same metric. The only real difference is in where they are typically used: “F1” is the more common terminology in classification and detection problems, whereas “Dice” is more often used for segmentation problems. But they are the same.
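If you want to convince yourself numerically, here is a minimal sketch in Python (NumPy and the toy masks are my own illustration, not anything from the paper): one function computes Dice from the overlap of two binary masks, the other computes F1 from TP, FP and FN, and they return exactly the same value.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2|P ∩ G| / (|P| + |G|)."""
    intersection = np.logical_and(pred, gt).sum()
    return 2 * intersection / (pred.sum() + gt.sum())

def f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """F1 score: 2TP / (2TP + FP + FN)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, np.logical_not(gt)).sum()
    fn = np.logical_and(np.logical_not(pred), gt).sum()
    return 2 * tp / (2 * tp + fp + fn)

# Toy "segmentation" masks (1 = foreground, 0 = background)
pred = np.array([1, 1, 1, 0, 0, 1, 0, 0])
gt = np.array([1, 1, 0, 0, 1, 1, 0, 0])

print(dice(pred, gt), f1(pred, gt))  # both print 0.75
```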
All the talk about their differences was complete bullshit. But it would have been relatively difficult to spot without already knowing the information, which is always the problem with trying to use ChatGPT as a source of knowledge.
It was not a very good paper, and whether the authors misunderstood evaluation metrics because of ChatGPT or all on their own, I don’t think their study will be published, at least in the journal they were aiming for.
But after more than a year of people trying to convince me that ChatGPT really is usually right, and useful as a source of information… I remain firmly convinced that it only looks “right” if you don’t know the topic well enough to fully fact-check the answer. In this case, the information is easily found on Google (even on Bing!). On Google, the first line I see with the same original prompt is: “Dice coefficient = F1 score,” from a Kaggle notebook.
Sure, “if I upgrade to the paid version” it may be better. But the only thing that really makes ChatGPT-Plus better is the ability to search the web, and I’d rather do that directly on my own, so that I can see the information in its context and critically assess it.