Skip to the content.

Proposal Video

1. Introduction / Background

We propose a multimodal ML pipeline to process de-identified urology visit artifacts (scanned reports and images) and train supervised models that generate reports similar in style and content to the originals. The task involves aligning textual reports with corresponding images so that, when given an unseen study, the system can output a coherent, clinically styled draft.

Multimodal medical models are advancing rapidly: Google’s MedGemma is designed for medical text-image comprehension and shows strong benchmark performance [1], [7]. LLaVA-Med [2] and BiomedCLIP [3] demonstrate effective biomedical vision-language alignment, providing open-source starting points. For text-only baselines, ClinicalBERT remains a strong encoder for clinical notes [4].

Some scans contain text burned into images (faxed forms, typed overlays). To make these usable, we apply OCR preprocessing [5], [6].


Dataset

We’re starting with 198 studies, each structured as:

  • /scans/ - PNG images (some with annotations)
  • /ocr/ - OCR outputs from Advanced Urology’s script
  • report.txt - Corresponding urology report

This gives us paired multimodal data: images (plus OCR text) aligned with reports. More studies will be added over time.

2. Problem Definition

Problem: Given urology studies (scans plus optional OCR), we aim to train supervised models that generate clinical reports similar to the references.

Motivation: Radiologists interpret scans and draft detailed reports that urologists rely on for patient care. This process is very time consuming at scale. An assistive system that generates a first pass draft could streamline workflow [8], [9]. Radiologists would only verify and edit, reducing reporting time, improving consistency, and giving urologists faster access to structured reports. This system is assistive only, it’s not a replacement for actual clinicians.

3. Methods

Preprocessing

  • OCR integration - use Advanced Urology’s outputs; optionally re-run with other engines
  • Text normalization - clean OCR text, tokenize, chunk long documents
  • Image prep - resize and normalize scans, optionally extract ROIs
  • Dataset splits - patient-level train/val/test partitions to avoid leakage

Models

  • Text baselines - n-gram or TF-IDF retrieval templates
  • Clinical PLMs - fine-tune ClinicalBERT [4] or similar encoders
  • Multimodal V+L:
    • BiomedCLIP embeddings for image-conditioned generation [3]
    • LLaVA-Med for multimodal captioning [2]
    • MedGemma for long-context multimodal comprehension [1], [7]

Supervised Learning

  • Retrieval baselines (TF-IDF, nearest-neighbor)
  • Transformer encoder-decoder models (BERT2BERT, T5, MedGemma fine-tuning)
  • Evaluation via BLEU, ROUGE, BERTScore against references

Where Unsupervised Helps

  • Quality control - UMAP or k-means on embeddings to detect outliers
  • Exploration - topic modeling (BERTopic) for grouping and curriculum training

Implementations

Hugging Face Transformers, OpenCLIP/BiomedCLIP, LLaVA-Med, MedGemma, long-context retrieval


Safety & Ethics

  • All reports de-identified by Advanced Urology
  • Outputs must be verified by radiologists
  • We are adding guardrails against hallucination with clinician-in-the-loop validation

4. (Potential) Results & Discussion

Metrics

  • Text similarity - BLEU, ROUGE-L, METEOR
  • Embedding-based - BERTScore, CLIPScore
  • Clinical correctness - rule-based key-finding checks plus expert review
  • Efficiency - throughput (reports per second)

Goals

  • Achieve ROUGE-L >= 0.7 (competitive similarity)
  • Capture clinical style and structure
  • Ensure safe deployment - outputs as drafts, not diagnoses

Expected Results

  • Transformer models (ClinicalBERT-conditioned seq2seq) outperform retrieval baselines
  • Multimodal fusion (BiomedCLIP, LLaVA-Med, MedGemma) improves scan-specific detail inclusion
  • Long-context handling reduces truncation issues and improves coherence

References

[1] D. Golden and R. Pilgrim, “MedGemma: Our most capable open models for health AI development,” Google Research, Jul. 09, 2025. https://research.google/blog/medgemma-our-most-capable-open-models-for-health-ai-development/ (accessed Oct. 02, 2025).

[2] C. Li et al., “LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day,” Jun. 2023. doi: https://doi.org/10.48550/arXiv.2306.00890.

[3] S. Zhang et al., “BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs,” Jan. 2025. doi: https://doi.org/10.48550/arXiv.2303.00915.

[4] K. Huang, J. Altosaar, and R. Ranganath, “ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission,” Nov. 2020. doi: https://doi.org/10.48550/arxiv.1904.05342.

[5] E. Hsu, I. Malagaris, Y.-F. Kuo, R. Sultana, and K. Roberts, “Deep learning-based NLP data pipeline for EHR-scanned document information extraction,” JAMIA Open, vol. 5, no. 2, Jun. 2022, doi: https://doi.org/10.1093/jamiaopen/ooac045.

[6] J. K. James et al., “Experience With an Optical Character Recognition Search Application for Review of Outside Medical Records,” Mayo Clinic Proceedings: Digital Health, vol. 2, no. 4, pp. 511–514, Aug. 2024, doi: https://doi.org/10.1016/j.mcpdig.2024.08.001.

[7] G. Preda, “From X-rays to Insights: Exploring MedGemma 4B’s Medical Multimodality,” Medium.com, May 25, 2025. https://medium.com/@gabi.preda/from-x-rays-to-insights-exploring-medgemma-4bs-medical-multimodality-ef620a4d8264 (accessed Oct. 03, 2025).

[8] L. Yang et al., “Advancing Multimodal Medical Capabilities of Gemini,” May 2024. doi: https://doi.org/10.48550/arXiv.2405.03162.

[9] K. Saab et al., “Capabilities of Gemini Models in Medicine,” arXiv.org, May 01, 2024. https://arxiv.org/abs/2404.18416 (accessed Oct. 02, 2025).

Gantt Chart

Here is the Gantt chart

Contribution Table

Contribution Table

We will opt in to the award consideration.