
Building sEHR-BERT: A Custom Language Model for Structured Electronic Health Records – Adapting transformer architecture for the unique challenges of medical coding and structured health data

January 22, 2026

Why Standard BERT Isn’t Enough for Structured EHRs

BERT revolutionized natural language processing, but electronic health records (EHRs) present unique challenges that standard language models weren’t designed to handle. While BERT excels at understanding sentences and paragraphs, structured EHRs are fundamentally different: they’re sequences of medical codes and categorical data that don’t follow natural language patterns.

Consider a patient timeline: [I48.91, I49.9, I48.3, I42.0, 82435, 84295, 82565, 84132, 36415, 80051, …]

This isn’t a sentence—it’s a coded medical narrative where each element has precise meaning, temporal relationships matter enormously, and the vocabulary is highly specialized. Standard BERT models trained on Wikipedia and books simply don’t understand this medical “language.”

That’s why we built sEHR-BERT: a transformer model specifically designed to understand and represent structured electronic health records. This work is detailed in our paper, “ECG Representation Learning with Multi-Modal EHR Data,” published in TMLR in November 2023.

The Medical Coding Challenge

Before diving into architecture, let’s understand what we’re working with. Structured EHRs contain several types of medical codes:

ICD Codes (International Classification of Diseases): The backbone of medical diagnosis coding
●   ICD-9: Older standard, still present in historical data
●   ICD-10: Current standard, more granular (e.g., I26.9 = “Pulmonary embolism without acute cor pulmonale”)

CPT Codes (Current Procedural Terminology): Describe medical procedures
●   Example: 93654 = “Ventricular Tachycardia Ablation”

Medication Names: Both generic and brand names, often with dosage information

The challenge? These codes don’t follow linguistic patterns. There’s no grammar, no sentence structure—just meaningful sequences that represent a patient’s medical journey over time.
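To make this concrete, here is a minimal, hypothetical sketch of how a structured patient record might look before tokenization; the class and field names are illustrative only, not our production schema.

```python
# A toy representation of a coded patient timeline (illustrative, not our schema).
from dataclasses import dataclass
from datetime import date


@dataclass
class CodedEvent:
    code: str         # e.g. an ICD-10, CPT, or medication code
    code_type: str    # "icd10", "cpt", "medication", ...
    event_date: date  # when the code was recorded


# Hypothetical patient timeline mixing a diagnosis, a procedure, and a medication.
patient_timeline = [
    CodedEvent("I48.91", "icd10", date(2021, 3, 2)),
    CodedEvent("93654", "cpt", date(2021, 4, 15)),
    CodedEvent("metoprolol", "medication", date(2021, 4, 16)),
]

# Sorting by date recovers the chronological "coded narrative" the model sees.
patient_timeline.sort(key=lambda e: e.event_date)
```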

Architecture Design: Beyond Standard BERT

Our sEHR-BERT architecture builds on BERT’s transformer foundation but introduces several medical domain-specific innovations:

Specialized Vocabulary Construction
Patient histories were represented as chronological sequences of codes. We sampled random slices of up to 512 consecutive codes from patient timelines. On average, each training sequence contained ~168 codes, covering months or years of care.
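As a rough illustration of that slicing step, the sketch below samples one contiguous window of up to 512 codes from a chronologically ordered timeline; the function and variable names are ours for illustration, not the exact training pipeline.

```python
import random

MAX_SEQ_LEN = 512  # maximum number of consecutive codes per training sequence


def sample_slice(timeline, max_len=MAX_SEQ_LEN, rng=random):
    """Sample one random contiguous slice of up to `max_len` codes from a
    chronologically ordered patient timeline (a list of code strings)."""
    if len(timeline) <= max_len:
        return list(timeline)
    start = rng.randrange(0, len(timeline) - max_len + 1)
    return timeline[start:start + max_len]


# Example: a toy timeline of 1,000 codes yields a 512-code training sequence.
toy_timeline = [f"CODE_{i}" for i in range(1000)]
training_sequence = sample_slice(toy_timeline)
assert len(training_sequence) == 512
```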

Each medical code isn’t just represented by a single embedding vector. Instead, we use three complementary embedding types:

Medical Code Embedding: The primary representation of the code itself (similar to token embeddings in standard BERT).

Time Embedding: Captures temporal patterns by grouping codes within non-overlapping 7-day windows. This is crucial because medical events that happen close in time are often related.

Medical Code Type Embedding: Distinguishes between different types of codes (diagnoses, procedures, medications, etc.). This helps the model understand that a diagnosis code and a procedure code play different roles in a patient’s story.

The final input representation is the sum of all three embeddings, a technique that allows the model to capture what happened, when it happened, and what type of medical event it represents.
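A minimal PyTorch sketch of this three-way embedding sum is shown below. The layer names, the 768-dimensional hidden size, and the final LayerNorm (following standard BERT practice) are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn


class SEHRInputEmbedding(nn.Module):
    """Sketch of the three-way input embedding: code + time bucket + code type,
    summed into a single vector per position (sizes are illustrative)."""

    def __init__(self, vocab_size, n_time_buckets, n_code_types, hidden_size=768):
        super().__init__()
        self.code_embed = nn.Embedding(vocab_size, hidden_size)      # what happened
        self.time_embed = nn.Embedding(n_time_buckets, hidden_size)  # when it happened
        self.type_embed = nn.Embedding(n_code_types, hidden_size)    # what kind of event
        self.norm = nn.LayerNorm(hidden_size)  # standard BERT-style normalization (assumption)

    def forward(self, code_ids, days_since_start, type_ids):
        # Group events into non-overlapping 7-day windows relative to the
        # start of the sampled sequence.
        time_bucket = (days_since_start // 7).clamp(max=self.time_embed.num_embeddings - 1)
        x = self.code_embed(code_ids) + self.time_embed(time_bucket) + self.type_embed(type_ids)
        return self.norm(x)
```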

Model Architecture & Training
sEHR-BERT uses the BERT transformer architecture. The model was initialized from scratch (random weights) and trained with the standard Masked Language Modeling (MLM) objective: we ask sEHR-BERT to “fill in the blanks” for masked-out codes in the sequence, teaching it the contextual relationships between medical events (just as BERT learns to predict missing words in a sentence). Using our dataset of over 2.4 million patient sequences, we trained sEHR-BERT for 100 epochs with a batch size of 512, using the AdamW optimizer. During training, 15% of the tokens (medical codes) were randomly masked, and the model had to infer these from context.
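For readers who want to see the mechanics, here is a sketch of BERT-style masking applied to code sequences. The 80/10/10 split among [MASK], random-code, and unchanged tokens follows the standard BERT recipe and is an assumption here; the text above only specifies the 15% masking rate.

```python
import torch


def mask_for_mlm(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
    """BERT-style masking over medical-code tokens: select 15% of positions as
    prediction targets; of those, 80% become [MASK], 10% a random code, and
    10% stay unchanged. Labels are -100 (ignored by the loss) elsewhere."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    probs = torch.full(input_ids.shape, mlm_prob)
    # Never mask special tokens such as [CLS]/[SEP]/[PAD].
    special = torch.isin(input_ids, torch.tensor(list(special_ids)))
    probs.masked_fill_(special, 0.0)
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100  # loss is only computed on masked positions

    # 80% of masked positions -> [MASK] token.
    replace_with_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace_with_mask] = mask_token_id

    # Half of the remainder (10% overall) -> a random code from the vocabulary.
    replace_random = (
        torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace_with_mask
    )
    input_ids[replace_random] = torch.randint(vocab_size, input_ids.shape)[replace_random]

    return input_ids, labels
```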

sEHR-BERT in Action
After training, sEHR-BERT produces dense vector representations for any given list of a patient’s medical codes. In our experiments, this model became a cornerstone for multi-modal learning. Embeddings of structured EHR data are paired with other modalities (like ECG waveforms) to learn ECG representations using contrastive learning – effectively teaching an AI to connect a patient’s diagnoses/procedures with the patterns seen in their ECGs. (We’ll dive into that in the next blog post!)
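As a preview, the sketch below shows one common way such contrastive pairing is formulated (a symmetric, CLIP-style InfoNCE loss). The temperature value and the symmetric form are assumptions for illustration, not necessarily the exact objective used in the paper.

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(ehr_emb, ecg_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired (EHR, ECG) embeddings:
    each patient's EHR embedding should match their own ECG embedding
    and not those of other patients in the batch."""
    ehr = F.normalize(ehr_emb, dim=-1)
    ecg = F.normalize(ecg_emb, dim=-1)
    logits = ehr @ ecg.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over rows (EHR -> ECG) and columns (ECG -> EHR).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```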

Conclusion
With sEHR-BERT, we’ve shown that investing in custom models for our unique dataset yields powerful tools for multi-modal AI. Our ability to train a model on millions of EHR sequences – incorporating diagnoses, procedures, and medications – is a direct result of the depth of data at our disposal. These are the kind of ambitious projects that make our company a top destination for ML engineers and researchers. If you’re excited by the prospect of training novel models on truly unique healthcare data and want to push the boundaries of AI in medicine, we’d love to have you on our team. There’s a lot more to build, and the next breakthrough might be yours to create!
