Medical tasks are prone to inter-rater variability due to multiple factors such as image quality, professional experience and training, or guideline clarity. Training deep learning networks with annotations from multiple raters is a common practice that mitigates the model’s bias towards a single expert. Reliable models generating calibrated outputs and reflecting the inter-rater disagreement are key to the integration of artificial intelligence in clinical practice. Various methods exist to account for different expert labels. We focus on comparing three label fusion methods: STAPLE, averaging of the raters’ segmentations, and random sampling of one rater’s segmentation at each training iteration. Each label fusion method is studied using both the conventional training framework and the recently published SoftSeg framework, which limits information loss by treating segmentation as a regression task. Our results, across 10 data splittings on two public datasets (spinal cord gray matter challenge, and multiple sclerosis brain lesion segmentation), indicate that SoftSeg models, regardless of the ground truth fusion method, had better calibration and preservation of the inter-rater variability than their conventional counterparts, without impacting the segmentation performance. Conventional models, i.e., models trained with a Dice loss, binary inputs, and a sigmoid/softmax final activation, were overconfident and underestimated the uncertainty associated with inter-rater variability. Conversely, fusing labels by averaging with the SoftSeg framework led to underconfident outputs and an overestimation of the rater disagreement. In terms of segmentation performance, the best label fusion method differed between the two datasets studied, indicating that this choice might be task-dependent. However, SoftSeg yielded segmentation performance systematically superior or equal to that of the conventionally trained models, along with the best calibration and preservation of the inter-rater variability. SoftSeg has a low computational cost and performed similarly, in terms of uncertainty, to ensembles, which require multiple models and forward passes. Our code is available at
Inter-rater variability · Calibration · Segmentation · Deep learning · Soft training · Label fusion
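To make the two non-STAPLE label fusion strategies mentioned in the abstract concrete, here is a minimal sketch: averaging the raters’ binary masks into a soft ground truth, and randomly sampling a single rater’s mask at each training iteration. The function names and the NumPy-based implementation are illustrative assumptions, not the authors’ code.

```python
# Hypothetical sketch of two label fusion strategies (not the authors' implementation).
import numpy as np


def fuse_by_averaging(rater_masks: np.ndarray) -> np.ndarray:
    """Average binary masks of shape (n_raters, H, W) into a soft mask in [0, 1].

    The soft average can be kept as-is and used as a regression target
    (SoftSeg-style training) or binarized at 0.5 for conventional training.
    """
    return rater_masks.mean(axis=0)


def sample_one_rater(rater_masks: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return the mask of one randomly chosen rater.

    Called each time a training sample is drawn, so the network sees a
    different expert annotation for the same image across epochs.
    """
    return rater_masks[rng.integers(rater_masks.shape[0])]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy example: three raters disagree on the extent of a structure in a 4x4 image.
    rater_masks = np.zeros((3, 4, 4))
    rater_masks[0, 1:3, 1:3] = 1  # rater 1: small region
    rater_masks[1, 1:4, 1:4] = 1  # rater 2: larger region
    rater_masks[2, 0:3, 1:3] = 1  # rater 3: shifted region

    soft_gt = fuse_by_averaging(rater_masks)         # values in {0, 1/3, 2/3, 1}
    sampled_gt = sample_one_rater(rater_masks, rng)  # one rater's binary mask
    print(soft_gt)
    print(sampled_gt)
```

As described in the abstract, the conventional framework would binarize such a fused label and train with a Dice loss and a sigmoid/softmax final activation, whereas SoftSeg keeps the soft values and treats segmentation as a regression, which is what preserves the inter-rater variability in the predictions.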
@article{melba:2022:031:lemay,
title = "Label fusion and training methods for reliable representation of inter-rater uncertainty",
author = "Lemay, Andreanne and Gros, Charley and Naga Karthik, Enamundram and Cohen-Adad, Julien",
journal = "Machine Learning for Biomedical Imaging",
volume = "1",
issue = "January 2023 issue",
year = "2022",
pages = "1--27",
issn = "2766-905X",
doi = "https://doi.org/10.59275/j.melba.2022-db5c",
url = "https://melba-journal.org/2022:031"
}
TY - JOUR
AU - Lemay, Andreanne
AU - Gros, Charley
AU - Naga Karthik, Enamundram
AU - Cohen-Adad, Julien
PY - 2022
TI - Label fusion and training methods for reliable representation of inter-rater uncertainty
T2 - Machine Learning for Biomedical Imaging
VL - 1
IS - January 2023 issue
SP - 1
EP - 27
SN - 2766-905X
DO - https://doi.org/10.59275/j.melba.2022-db5c
UR - https://melba-journal.org/2022:031
ER -