Under Review

TRACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation

University of Texas at San Antonio
Grounding Accuracy (IoU>0.5): 90.2%
Change Detection Accuracy: 48.0%
Mean IoU: 0.772
BLEU-4 Score: 0.260
TRACE Comparison

Figure: Comparison of chest X-ray report generation approaches. Single-Image Report Generation generates reports without temporal context or spatial grounding. Temporal Comparison detects disease progression but outputs only classification labels. Grounded Report Generation produces spatially grounded reports but lacks temporal reasoning. TRACE (Ours) is the first to combine explicit temporal change classification with spatial grounding, describing what changed, where, and how.

Abstract

Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and a current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE reaches 90.2% grounding accuracy (IoU > 0.5) and a mean IoU of 0.772, establishing a foundation for this challenging new task.

Key Discovery

Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned. Models with access to both images but without grounding supervision fail entirely, defaulting to predicting "stable" for all samples. This suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.

Example TRACE Output:

"Interval worsening of pneumothorax <box>0.19,0.11,0.52,0.63</box> in right lung."

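The example output above can be consumed programmatically. The following is a minimal sketch, not the authors' code: it assumes the four numbers inside <box> are normalized (x1, y1, x2, y2) corner coordinates, which the example suggests but this page does not state. The names GroundedFinding, parse_grounded_sentence, and CHANGE_STEMS are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Matches one <box>...</box> span and captures the raw coordinate string.
BOX_PATTERN = re.compile(r"<box>(.*?)</box>")

# Hypothetical keyword stems mapped to the three change classes.
CHANGE_STEMS = {"worsen": "worsened", "improv": "improved", "stable": "stable"}


@dataclass
class GroundedFinding:
    change: Optional[str]                   # worsened / improved / stable, or None
    box: Tuple[float, float, float, float]  # assumed normalized (x1, y1, x2, y2)
    text: str                               # sentence with the box tags removed


def parse_grounded_sentence(sentence: str) -> List[GroundedFinding]:
    """Extract the change label and every bounding box from one sentence."""
    lowered = sentence.lower()
    change = next(
        (label for stem, label in CHANGE_STEMS.items() if stem in lowered), None
    )
    # Remove the tags, then collapse the doubled space they leave behind.
    plain_text = " ".join(BOX_PATTERN.sub("", sentence).split())
    findings = []
    for match in BOX_PATTERN.finditer(sentence):
        coords = tuple(float(value) for value in match.group(1).split(","))
        if len(coords) == 4:  # keep only boxes with exactly four values
            findings.append(GroundedFinding(change, coords, plain_text))
    return findings


example = "Interval worsening of pneumothorax <box>0.19,0.11,0.52,0.63</box> in right lung."
for finding in parse_grounded_sentence(example):
    print(finding.change, finding.box)
```

Keeping the label, the box, and the plain text in one record makes it straightforward to score change classification and grounding separately on the same sentence.
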
Key Contributions

  • We introduce grounded temporal change detection, a new task combining temporal comparison, change classification, and spatial localization in a unified framework.
  • We discover an emergent capability: change detection only arises when both temporal comparison and spatial grounding are combined. Without grounding supervision, models fail to detect changes even with access to both images.
  • We construct a large-scale dataset with grounded temporal annotations from MIMIC-CXR and Chest ImaGenome (79,202 training, 22,553 test samples).

Method

TRACE Architecture

Figure: TRACE architecture for grounded temporal change detection. Given a prior and current chest X-ray, the model generates natural language descriptions of interval changes while localizing each finding with bounding box coordinates.

Temporal Input

Takes both prior and current chest X-ray images to enable comparison of disease progression over time.

Spatial Grounding

Localizes each finding with bounding box coordinates, enabling precise identification of where changes occur.

Change Classification

Classifies findings as worsened, improved, or stable to track disease progression and treatment response.

Report Generation

Generates natural language descriptions combining temporal changes with spatial localization.

Results

Main Results

TRACE performance on the test set (22,553 samples):

Category           Metric       Value
NLG Metrics        BLEU-4       0.260
NLG Metrics        METEOR       0.438
NLG Metrics        ROUGE-L      0.494
Clinical Efficacy  RadGraph F1  0.406
Clinical Efficacy  CheXbert F1  0.432
Grounding          Mean IoU     0.772
Grounding          IoU > 0.5    90.2%
Change Detection   Accuracy     48.0%
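
The two grounding metrics are presumably the standard ones: Mean IoU averages the intersection-over-union between each predicted box and its reference box, and IoU > 0.5 is the fraction of boxes whose IoU exceeds 0.5. A minimal sketch of both follows, assuming (x1, y1, x2, y2) boxes and a one-to-one pairing of predictions and references. The function names and the toy boxes are illustrative and are not taken from the TRACE evaluation code.

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height are clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(predictions, references, threshold=0.5):
    """Return (mean IoU, fraction of pairs with IoU above the threshold)."""
    ious = [box_iou(p, r) for p, r in zip(predictions, references)]
    mean_iou = sum(ious) / len(ious)
    accuracy = sum(iou > threshold for iou in ious) / len(ious)
    return mean_iou, accuracy


# Toy data: one perfect match and one partially overlapping box.
predictions = [(0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.5)]
references = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75)]
mean_iou, accuracy = grounding_metrics(predictions, references)
print(f"mean IoU = {mean_iou:.3f}, IoU > 0.5 = {accuracy:.1%}")
```
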

Comparison with Prior Methods

Method         Temporal Acc.  Grounding    Report Gen
CheXRelNet     46.8%          No           No
CheXRelFormer  49.3%          No           No
TRACE (Ours)   48.0%          Yes (90.2%)  Yes

TRACE achieves temporal accuracy comparable to prior methods (48.0% versus 46.8% for CheXRelNet and 49.3% for CheXRelFormer) while providing two additional capabilities: spatial grounding and full natural language report generation.

Per-Class Change Detection

Class          Precision  Recall  F1     Support
Worsening      0.485      0.375   0.423  7,787
Improvement    0.382      0.263   0.311  4,888
Stable         0.505      0.674   0.578  9,878
Macro Average  0.457      0.437   0.437  22,553
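
The Macro Average row gives each class equal weight regardless of its support. As a quick consistency check, the sketch below recomputes that row from the three per-class rows and recomputes each F1 as the harmonic mean of precision and recall. The helper names are illustrative, and because the inputs are already rounded to three decimals, the recomputed F1 values can differ from the table in the last digit.

```python
# Per-class rows copied from the table above.
PER_CLASS = {
    "Worsening":   {"precision": 0.485, "recall": 0.375, "f1": 0.423, "support": 7787},
    "Improvement": {"precision": 0.382, "recall": 0.263, "f1": 0.311, "support": 4888},
    "Stable":      {"precision": 0.505, "recall": 0.674, "f1": 0.578, "support": 9878},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def macro_average(rows, key):
    """Unweighted mean of one metric over all classes."""
    return sum(row[key] for row in rows.values()) / len(rows)


for key in ("precision", "recall", "f1"):
    print(f"macro {key}: {macro_average(PER_CLASS, key):.3f}")

for name, row in PER_CLASS.items():
    recomputed = f1_score(row["precision"], row["recall"])
    print(f"{name}: reported F1 {row['f1']:.3f}, recomputed {recomputed:.3f}")

print("total support:", sum(row["support"] for row in PER_CLASS.values()))
```
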

Per-Anatomy Analysis

Anatomy             Accuracy  Support
Mediastinum         75.5%     339
Cardiac Silhouette  68.8%     1,868
Right Hilar         61.7%     60
Right Lung          45.3%     17,343
Left Lung           44.7%     2,549

Cardiac changes are substantially easier to detect than pulmonary changes (68.8% for the cardiac silhouette versus 45.3% and 44.7% for the right and left lungs), as cardiac changes involve distinct boundary shifts while pulmonary changes are often more diffuse and subtle.

Qualitative Results

Figure: Qualitative results of TRACE on three representative cases: worsening, stable, and improvement. Each row shows the prior image, current image, model prediction with bounding box overlay, and the corresponding ground truth and model output text. The model successfully identifies temporal changes and localizes the affected anatomical regions with accurate bounding box coordinates.

Ablation Study

We conduct ablation experiments to understand the contribution of temporal input and grounding supervision:

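Model         Prior Image  Grounding  BLEU-4  Change Accuracy
TRACE (Full)  Yes          Yes        0.260   48.0%
Single-Image  No           Yes        0.051   43.8%*
No-Grounding  Yes          No         0.240   43.8%*

* 43.8% equals the share of stable samples in the test set (9,878 of 22,553), which is the accuracy obtained by predicting "stable" for all samples (0% recall on worsening/improvement).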

Critical Finding

Despite having access to both prior and current images, the no-grounding variant fails at change detection, achieving only 43.8% accuracy by predicting "stable" for all samples. This suggests that grounding supervision provides a critical inductive bias for temporal comparison. Without explicit spatial localization, the model cannot effectively compare corresponding anatomical regions across time points.
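
The 43.8% figure is exactly what a degenerate predictor scores on this test set. The sketch below rebuilds a label list from the class supports in the per-class table and evaluates an always-"stable" predictor against it. The evaluate helper and the label strings are illustrative and are not part of the TRACE codebase.

```python
from collections import Counter

# Test-set class counts from the per-class change detection table.
SUPPORT = {"worsening": 7787, "improvement": 4888, "stable": 9878}


def evaluate(labels, predictions):
    """Return overall accuracy and per-class recall."""
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    totals = Counter(labels)
    accuracy = sum(correct.values()) / len(labels)
    # A class that is never predicted correctly gets a recall of 0.0.
    recall = {cls: correct[cls] / totals[cls] for cls in totals}
    return accuracy, recall


labels = [cls for cls, count in SUPPORT.items() for _ in range(count)]
predictions = ["stable"] * len(labels)  # degenerate always-stable predictor

accuracy, recall = evaluate(labels, predictions)
print(f"accuracy = {accuracy:.1%}")  # accuracy = 43.8%
print(recall)
```

Because accuracy alone cannot distinguish this behavior from a weak but genuine detector, the per-class recall is the more informative number in the ablation.
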

Language Model Comparison

Comparison of Vicuna-7B and Mistral-7B backbones reveals distinct clinical behaviors:

LLM Backbone  Accuracy  Worsening Recall  Improvement Recall  Stable Recall  IoU>0.5
Vicuna-7B     48.0%     37.5%             26.3%               67.4%          90.2%
Mistral-7B    45.0%     73.6%             57.9%               15.7%          54.5%

Clinical Implications: Vicuna-7B achieves higher accuracy through conservative prediction (high specificity, lower sensitivity): it recalls 67.4% of stable cases but only 37.5% of worsening and 26.3% of improvement cases, which suits screening applications where false alarms are costly. Mistral-7B detects change aggressively (high sensitivity, lower specificity), recalling 73.6% of worsening and 57.9% of improvement cases but only 15.7% of stable cases, and is better suited to high-risk monitoring where missing actual changes is unacceptable.
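
One way to see why the conservative backbone scores higher overall: accuracy is the support-weighted average of per-class recall, and stable is the largest class. The sketch below applies that identity to the recalls in the table. It assumes both backbones were evaluated on the same test split, so the class supports from the per-class table apply to both, and the function name is illustrative.

```python
# Class supports from the per-class table (assumed shared by both backbones).
SUPPORT = {"worsening": 7787, "improvement": 4888, "stable": 9878}

# Per-class recalls copied from the backbone comparison table.
RECALL = {
    "Vicuna-7B":  {"worsening": 0.375, "improvement": 0.263, "stable": 0.674},
    "Mistral-7B": {"worsening": 0.736, "improvement": 0.579, "stable": 0.157},
}


def accuracy_from_recall(recall, support):
    """Overall accuracy as the support-weighted mean of per-class recall."""
    correct = sum(recall[cls] * support[cls] for cls in support)
    return correct / sum(support.values())


for backbone, recall in RECALL.items():
    stable_share = recall["stable"] * SUPPORT["stable"] / sum(SUPPORT.values())
    print(
        f"{backbone}: accuracy ~ {accuracy_from_recall(recall, SUPPORT):.3f}, "
        f"of which stable contributes {stable_share:.3f}"
    )
```

The reconstruction is approximate because it starts from rounded table entries, but it lands within half a percentage point of the reported accuracies for both backbones.
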

BibTeX

@article{aranya2025trace,
  title={TRACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation},
  author={Aranya, OFM Riaz Rahman and Desai, Kevin},
  journal={arXiv preprint},
  year={2025}
}

Acknowledgments

This work builds upon RaDialog, LLaVA, and BioViL-T. We use the MIMIC-CXR and Chest ImaGenome datasets.