About Experience Publications Projects
← Back to projects

Transformer-Based Radiotherapy Dose Prediction

Gijs de Jong, Jakob Kaiser, Macha Meijer, Thijmen Nijdam, Derck Prinzhorn

UNETR architecture for radiotherapy dose prediction

Summary

In radiotherapy, clinicians need to plan exactly how much radiation to deliver to a tumor while minimizing damage to surrounding healthy organs. This planning is time-consuming and requires deep expertise. This project uses deep learning to predict optimal radiation dose distributions for head and neck cancer patients.

We attempted to reproduce a recent transformer-based approach (TrDosePred), but found the original paper lacked sufficient implementation details. Instead, we built on UNETR, a hybrid architecture combining a Vision Transformer (which captures global context in 3D scans) with a U-Net decoder (which produces fine-grained spatial predictions). We then extended it with physics-informed training objectives and a novel sequential prediction strategy that significantly improved results.

Key Contributions

Method

The model takes as input a 3D CT scan of a patient's head and neck (128×128×128 voxels), along with masks indicating where the tumor and critical organs are located. It outputs a predicted radiation dose for every point in the volume. We used data from the OpenKBP Challenge, which includes scans from 340 patients.

The architecture is UNETR: a Vision Transformer processes the full 3D scan to capture long-range spatial relationships (e.g., how the tumor's position relates to distant organs), while a U-Net-style decoder generates the detailed voxel-by-voxel dose prediction. We tested 9-block and 12-block encoder variants.

On top of the standard pixel-wise error (MAE), we explored two physics-informed losses. DVH loss directly optimizes the dose-volume histograms that clinicians review when evaluating plans. Moment loss ensures the predicted dose has the right statistical shape within each organ structure.

For sequential prediction, we compared three strategies. The winning approach uses a convolutional RNN that processes the 3D volume slice-by-slice, building up a running memory of what it has predicted so far, similar to how a human might plan treatment one cross-section at a time.

Results

Configuration DVH Score ↓ Dose Score ↓
TrDosePred (original paper) 2.512 1.658
UNETR 9-block (MAE baseline) 4.252 5.587
UNETR 12-block 4.081 6.405
MAE + DVH loss 4.777 5.730
MAE + Moment loss 4.233 5.706
MAE + RNN autoregression 3.811 4.927

Both scores measure prediction error (lower is better). DVH Score reflects clinical plan quality; Dose Score measures voxel-level accuracy. The RNN-based sequential decoder improved both metrics ~10% over the baseline with no additional computational cost. Physics-based losses alone gave marginal improvement, likely needing more tuning. While the original TrDosePred results could not be reproduced, the UNETR + RNN extension narrows the gap substantially.