Proposal

Proposal Video

1. Introduction / Background

We propose a multimodal ML pipeline to process de-identified urology visit artifacts (scanned reports and images) and train supervised models that generate reports similar in style and content to the originals. The task involves aligning textual reports with corresponding images so that, when given an unseen study, the system can output a coherent, clinically styled draft.

Multimodal medical models are advancing rapidly: Google’s MedGemma is designed for medical text-image comprehension and shows strong benchmark performance [1], [7]. LLaVA-Med [2] and BiomedCLIP [3] demonstrate effective biomedical vision-language alignment, providing open-source starting points. For text-only baselines, ClinicalBERT remains a strong encoder for clinical notes [4].

Some scans contain text burned into images (faxed forms, typed overlays). To make these usable, we apply OCR preprocessing [5], [6].

Dataset

We’re starting with 198 studies, each structured as:

/scans/ - PNG images (some with annotations)
/ocr/ - OCR outputs from Advanced Urology’s script
report.txt - Corresponding urology report

This gives us paired multimodal data: images (plus OCR text) aligned with reports. More studies will be added over time.

2. Problem Definition

Problem: Given urology studies (scans plus optional OCR), we aim to train supervised models that generate clinical reports similar to the references.

Motivation: Radiologists interpret scans and draft detailed reports that urologists rely on for patient care. This process is very time consuming at scale. An assistive system that generates a first pass draft could streamline workflow [8], [9]. Radiologists would only verify and edit, reducing reporting time, improving consistency, and giving urologists faster access to structured reports. This system is assistive only, it’s not a replacement for actual clinicians.

3. Methods

Preprocessing

OCR integration - use Advanced Urology’s outputs; optionally re-run with other engines
Text normalization - clean OCR text, tokenize, chunk long documents
Image prep - resize and normalize scans, optionally extract ROIs
Dataset splits - patient-level train/val/test partitions to avoid leakage

Models

Text baselines - n-gram or TF-IDF retrieval templates
Clinical PLMs - fine-tune ClinicalBERT [4] or similar encoders
Multimodal V+L:
- BiomedCLIP embeddings for image-conditioned generation [3]
- LLaVA-Med for multimodal captioning [2]
- MedGemma for long-context multimodal comprehension [1], [7]

Supervised Learning

Retrieval baselines (TF-IDF, nearest-neighbor)
Transformer encoder-decoder models (BERT2BERT, T5, MedGemma fine-tuning)
Evaluation via BLEU, ROUGE, BERTScore against references

Where Unsupervised Helps

Quality control - UMAP or k-means on embeddings to detect outliers
Exploration - topic modeling (BERTopic) for grouping and curriculum training

Implementations

Hugging Face Transformers, OpenCLIP/BiomedCLIP, LLaVA-Med, MedGemma, long-context retrieval

Safety & Ethics

All reports de-identified by Advanced Urology
Outputs must be verified by radiologists
We are adding guardrails against hallucination with clinician-in-the-loop validation

4. (Potential) Results & Discussion

Metrics

Text similarity - BLEU, ROUGE-L, METEOR
Embedding-based - BERTScore, CLIPScore
Clinical correctness - rule-based key-finding checks plus expert review
Efficiency - throughput (reports per second)

Goals

Achieve ROUGE-L >= 0.7 (competitive similarity)
Capture clinical style and structure
Ensure safe deployment - outputs as drafts, not diagnoses

Expected Results

Transformer models (ClinicalBERT-conditioned seq2seq) outperform retrieval baselines
Multimodal fusion (BiomedCLIP, LLaVA-Med, MedGemma) improves scan-specific detail inclusion
Long-context handling reduces truncation issues and improves coherence

References

[1] D. Golden and R. Pilgrim, “MedGemma: Our most capable open models for health AI development,” Google Research, Jul. 09, 2025. https://research.google/blog/medgemma-our-most-capable-open-models-for-health-ai-development/ (accessed Oct. 02, 2025).

[2] C. Li et al., “LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day,” Jun. 2023. doi: https://doi.org/10.48550/arXiv.2306.00890.

[3] S. Zhang et al., “BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs,” Jan. 2025. doi: https://doi.org/10.48550/arXiv.2303.00915.

[4] K. Huang, J. Altosaar, and R. Ranganath, “ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission,” Nov. 2020. doi: https://doi.org/10.48550/arxiv.1904.05342.

[5] E. Hsu, I. Malagaris, Y.-F. Kuo, R. Sultana, and K. Roberts, “Deep learning-based NLP data pipeline for EHR-scanned document information extraction,” JAMIA Open, vol. 5, no. 2, Jun. 2022, doi: https://doi.org/10.1093/jamiaopen/ooac045.

[6] J. K. James et al., “Experience With an Optical Character Recognition Search Application for Review of Outside Medical Records,” Mayo Clinic Proceedings: Digital Health, vol. 2, no. 4, pp. 511â€“514, Aug. 2024, doi: https://doi.org/10.1016/j.mcpdig.2024.08.001.

[7] G. Preda, “From X-rays to Insights: Exploring MedGemma 4B’s Medical Multimodality,” Medium.com, May 25, 2025. https://medium.com/@gabi.preda/from-x-rays-to-insights-exploring-medgemma-4bs-medical-multimodality-ef620a4d8264 (accessed Oct. 03, 2025).

[8] L. Yang et al., “Advancing Multimodal Medical Capabilities of Gemini,” May 2024. doi: https://doi.org/10.48550/arXiv.2405.03162.

[9] K. Saab et al., “Capabilities of Gemini Models in Medicine,” arXiv.org, May 01, 2024. https://arxiv.org/abs/2404.18416 (accessed Oct. 02, 2025).

Gantt Chart

Here is the Gantt chart

Contribution Table

We will opt in to the award consideration.