Introduction
How can we teach an AI to see the connections between a patient’s heart rhythms and their clinical notes? In healthcare, critical information is spread across different data types – a patient’s ECG waveform might show subtle signs that only make sense in light of the doctor’s written notes or the recorded diagnoses. In this article, we explore how our team built a multi-modal contrastive learning system that links electrocardiograms (ECGs) with both structured data and free-text clinical notes. By leveraging our one-of-a-kind dataset – millions of ECGs each paired with rich electronic health records – we created models that learn a shared representation between these modalities. This approach allows us to retrieve relevant patient reports for a given ECG (and vice versa), improve diagnostic classifiers, and unlock insights that single-modal models miss. Let’s dive into how it works.
Multi-Modal Learning: ECG + EHR (Structured & Text)
This work introduces a family of three multi-modal models, each aligning ECG signals with EHR data in different combinations:
- ECG–sEHR model: pairs an ECG waveform with the patient’s structured EHR data (diagnosis codes, procedure codes, and medication names).
- ECG–Text model: pairs an ECG waveform with unstructured text from the patient’s records (clinical notes and reports).
- sEHR–ECG–Text model: a three-way model that uses all three modalities (ECG, structured EHR, and text) simultaneously.
The core idea is to use contrastive learning – the model learns to pull the representations of related data (e.g. an ECG and its corresponding report) closer together and push unrelated pairs farther apart. By doing so, we force the model to capture the underlying connections between what a clinician sees on an ECG strip and what’s written in the patient’s chart.
Why is this powerful? In cardiology, many conditions that affect the heart’s electrical signals (ECG) are also described in the medical record. For instance, an ECG may be subtly different if a patient has low ejection fraction – but detecting that requires context from echocardiogram reports or notes. Our multi-modal approach lets the model read the ECG in the context of diagnoses and notes, essentially learning clinically relevant ECG features that align with those conditions.
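To make the contrastive idea concrete, here is a minimal PyTorch sketch of a symmetric (CLIP-style) contrastive loss of the kind described above. It is an illustration rather than our training code: the temperature value is a placeholder, and it assumes both batches of embeddings have already been projected into the shared space.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(ecg_emb, ehr_emb, temperature=0.07):
    """Row i of ecg_emb is paired with row i of ehr_emb (a matched ECG/record pair)."""
    ecg = F.normalize(ecg_emb, dim=-1)            # unit-norm embeddings
    ehr = F.normalize(ehr_emb, dim=-1)
    logits = ecg @ ehr.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(ecg.size(0), device=ecg.device)  # matched pairs sit on the diagonal
    # Pull matched pairs together, push every other pairing in the batch apart
    loss_ecg_to_ehr = F.cross_entropy(logits, targets)
    loss_ehr_to_ecg = F.cross_entropy(logits.t(), targets)
    return (loss_ecg_to_ehr + loss_ehr_to_ecg) / 2
```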
Pairing Data Across Modalities: How We Did It
The first challenge was creating meaningful pairs (and triplets) of data. Thanks to our extensive dataset, we could link each ECG to the wealth of information in the patient’s EHR:
- Structured EHR pairing: For a given ECG, we gathered all the structured codes (ICD diagnoses, procedures, etc.) recorded in the one-year window before and after the ECG. These codes were sorted chronologically to form a sequence (the input for our sEHR-BERT encoder discussed in blog post 1). This gives a snapshot of the patient’s conditions and interventions around the time of the ECG.
- Unstructured text pairing: We also pulled in clinical notes and reports around each ECG. We included a wide range of texts – ECG interpretation reports, echocardiography reports, radiology reports, microbiology reports, clinical notes, and surgical notes. However, the total volume of text could be huge, and not all of it is relevant. We tackled this by filtering and timing:
  - We used NLP techniques to pick notes containing keywords or entities of interest (diseases, symptoms, meds, etc.).
  - We tried two strategies to concatenate notes: Report Concatenation (simply gluing together all note texts within ±1 month of the ECG) and Entity Concatenation (extracting just the pertinent medical entities from notes within ±1 year of the ECG). The entity-based approach yielded a more compact summary (~266 tokens on average vs. 354 for full text).
  - Outcome: We found the entity-centric method was more effective, presumably because it retained the key clinical concepts without the fluff. So for our main model, each ECG’s paired “note” is a concatenation of important clinical terms from the year surrounding that ECG.
By constructing these pairs, we created two parallel training datasets: one of ECG+sEHR pairs and one of ECG+Text pairs (and even triplets for the three-way model). It’s worth noting that assembling this required robust data engineering – aligning multimodal data by patient and time – which our platform is uniquely equipped to do.
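As a rough illustration of that pairing logic (not our actual pipeline), the sketch below builds (ECG, code sequence, note entities) triplets from hypothetical `ecgs`, `codes`, and `notes` tables. The column names and the `extract_entities` helper are stand-ins for our real schema and clinical NLP step.

```python
import pandas as pd

def build_training_triplets(ecgs, codes, notes, extract_entities):
    """Hypothetical schemas: ecgs(ecg_id, patient_id, ecg_time),
    codes(patient_id, code, code_time), notes(patient_id, note_text, note_time)."""
    triplets = []
    for row in ecgs.itertuples():
        lo, hi = row.ecg_time - pd.DateOffset(years=1), row.ecg_time + pd.DateOffset(years=1)

        # Structured EHR: all codes within +/- 1 year of the ECG, in chronological order
        cmask = (codes.patient_id == row.patient_id) & codes.code_time.between(lo, hi)
        code_seq = codes[cmask].sort_values("code_time").code.tolist()

        # Text: clinical entities pulled from notes within +/- 1 year of the ECG
        nmask = (notes.patient_id == row.patient_id) & notes.note_time.between(lo, hi)
        entities = [e for text in notes[nmask].note_text for e in extract_entities(text)]

        triplets.append({"ecg_id": row.ecg_id,
                         "code_sequence": code_seq,
                         "note_entities": " ".join(entities)})
    return pd.DataFrame(triplets)
```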
Model Architecture: Aligning the Modalities
So how do we train a model to bring these data streams together? We built on a contrastive learning framework where each modality has its own encoder, and their outputs are compared in a shared embedding space:
- ECG encoder: a 1-D convolutional neural network (CNN) tailored for time-series ECG signals. It ingests the raw ECG waveform (we use 5-second 12-lead segments) and outputs a feature vector (think of this as the model’s understanding of the heartbeat patterns). A toy sketch of such an encoder appears just after this list.
- Structured EHR encoder: our newly minted sEHR-BERT (described in the first blog post). It takes the sequence of coded entries (diagnoses, procedures, etc.) and produces a representation vector capturing the patient’s medical history around ECG acquisition.
- Text encoder: a transformer-based language model for clinical text. We leveraged GatorTron, a large medical text BERT from the University of Florida (pre-trained on 80B clinical words). By using this off-the-shelf model (the text counterpart to sEHR-BERT), we saved time – it already speaks the language of doctors’ notes.
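For intuition, here is a toy version of the ECG branch. The layer sizes, embedding dimension, and assumed 500 Hz sampling rate (5 s × 12 leads, so a 12 × 2500 input) are placeholders, not the configuration used in the paper.

```python
import torch.nn as nn

class ECGEncoder(nn.Module):
    """Toy 1-D CNN over a 5-second, 12-lead ECG segment (assumed 500 Hz -> 2500 samples)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(12, 32, kernel_size=7, stride=2, padding=3), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                   # collapse the time axis
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):                              # x: (batch, 12, 2500)
        h = self.features(x).squeeze(-1)               # (batch, 128)
        return self.fc(h)                              # (batch, embed_dim)
```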
Each encoder produces an embedding for the ECG, structured EHR, and text respectively. We then add a small neural network (MLP projection head) on each, which maps these modality-specific embeddings into a shared latent space where comparisons are made. The training objective pulls the representations of matching pairs closer together in this space and pushes the representations of non-matching pairs apart.
Fine- and Coarse-Grained Alignment: One insight from prior research (MultiModal Versatile Networks by Alayrac et al., 2020) was that not all modalities should be treated equally. For example, an ECG and a set of diagnosis codes might have a very tight correlation (fine-grained: an explicit diagnosis often directly affects the ECG), whereas an ECG and a text note could be more loosely related (coarse-grained: a note might summarize many aspects, not all visible on ECG). We adopted their FAC (Fine-and-Coarse) approach: we actually use two embedding sub-spaces – one where ECG & structured EHR are aligned, and another where ECG & text are aligned. During training, the model optimizes a contrastive loss across both spaces. This gave us the best of both worlds: preserving the precise links between ECG and codes, while also learning the broader connections between ECG and textual observations.
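Below is a minimal sketch of this two-sub-space setup: one set of projection heads for the fine-grained ECG–sEHR space, another for the coarse ECG–Text space, and a total loss that sums a symmetric contrastive term from each. The dimensions are illustrative, and the exact projection layout in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (same form as the earlier sketch)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

class FACProjection(nn.Module):
    """Two embedding sub-spaces: fine-grained ECG-sEHR and coarse ECG-Text.
    Encoder output dims and proj_dim are placeholders, not the paper's values."""
    def __init__(self, ecg_dim=256, sehr_dim=768, text_dim=1024, proj_dim=128):
        super().__init__()
        def mlp(d):
            return nn.Sequential(nn.Linear(d, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim))
        self.ecg_to_fine, self.sehr_to_fine = mlp(ecg_dim), mlp(sehr_dim)
        self.ecg_to_coarse, self.text_to_coarse = mlp(ecg_dim), mlp(text_dim)

    def forward(self, ecg_emb, sehr_emb, text_emb):
        # Fine-grained space: ECG vs. structured EHR codes
        loss_fine = clip_loss(self.ecg_to_fine(ecg_emb), self.sehr_to_fine(sehr_emb))
        # Coarse space: ECG vs. clinical text
        loss_coarse = clip_loss(self.ecg_to_coarse(ecg_emb), self.text_to_coarse(text_emb))
        return loss_fine + loss_coarse
```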
(If you peek at Figure 1 in our paper, you’ll see this architecture visualized – three encoders feeding into fine-grained ECG–sEHR and coarse ECG–Text embedding spaces, with contrastive loss applied in each.)
Learning to Read ECGs in Context
After training, our multi-modal models developed some remarkable capabilities:
- An ECG–Text embedding space where an ECG tracing and the corresponding cardiologist’s report end up close together. This means the model has learned a form of “ECG language” – certain waveform patterns correlate with certain words or diagnoses. For example, an ECG showing atrial fibrillation will sit near the embeddings of reports that mention “atrial fibrillation”.
- Similarly, an ECG–sEHR space where ECGs align with relevant diagnosis codes. If a patient’s EHR mentions they have pulmonary hypertension, the model will pull that code’s representation toward the patient’s ECG embedding, even without explicit labels. In effect, it’s self-supervising by using the EHR as a teaching signal for the ECG encoder.
Crucially, these models are trained entirely on weakly labeled data – we don’t need any manual annotation of ECGs. The supervision comes for free from the pairing of ECGs with EHR content. With over 4.5 million such sEHR–ECG–Text triplets seen during training, the models learn a rich representation of cardiac function in the context of general health data.
What Can We Do With This?
The true power of our multi-modal ECG representations emerged when we applied them to various tasks (which we’ll detail in the next post). But to name a few exciting outcomes:
- We can perform zero-shot retrieval: ask the model to find ECGs in our database that match a given text query (like “ECG showing signs of prior myocardial infarction”) – without having explicitly trained it for that task. Because ECGs and text share an embedding space, we just embed the query text and look for the nearest ECG vectors to retrieve the matching ECGs. A retrieval sketch appears just after this list.
- The models set new state-of-the-art results on ECG classification tasks by transferring knowledge from EHR data. For instance, an ECG–Text pre-trained model, when fine-tuned on a small ECG arrhythmia dataset, outperforms models trained from scratch with randomly initialized weights.
- They also yield better out-of-distribution detection: if an input ECG looks very different from those in the training data, the model’s embedding reflects that. This is important for safety – flagging signals that the system might not understand.
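Here is the promised retrieval sketch: a hypothetical helper that embeds a free-text query with the text branch and ranks a bank of pre-computed ECG embeddings by cosine similarity. `text_encoder`, `text_head`, and `ecg_bank` are placeholders for the trained text branch, its projection head, and a matrix of already-projected ECG embeddings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_ecgs(query_text, text_encoder, text_head, ecg_bank, top_k=10):
    """ecg_bank: (N, d) tensor of pre-computed, projected ECG embeddings."""
    q = text_head(text_encoder(query_text))     # embed the free-text query, shape (1, d)
    q = F.normalize(q, dim=-1)
    bank = F.normalize(ecg_bank, dim=-1)
    scores = bank @ q.squeeze(0)                # cosine similarity to every stored ECG
    return torch.topk(scores, k=top_k).indices  # indices of the best-matching ECGs
```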
All these benefits stem from one idea: learning from the multimodal richness of real clinical data. By tying together what a clinician writes and what an ECG waveform shows, we imbue our model with a more holistic “understanding” of cardiac conditions.
Conclusion: A New Frontier for Clinical AI
Our contrastive learning approach to combining ECGs with textual and structured EHR data demonstrates innovation at the intersection of medical signal processing and natural language processing. We’re pushing the boundaries of what AI can do in healthcare. The team is excited to continue this journey, and we welcome those who are passionate about multi-modal AI to collaborate with us. For a deep dive into the methods, see our paper “ECG Representation Learning with Multi-Modal EHR Data,” published in TMLR in November 2023. It covers the model architecture and training process in detail.