Technical 29 May 2026

Captioning a medical conference: why accuracy on clinical terms matters

Medical captioning is harder than general corporate captioning because of drug names, anatomy, accented international speakers, and multi-session scale. That gap in difficulty is why CART (human real-time captioning) is the default for audited CPD content, not a premium add-on.

By Studio AV team

Most AV vendors will tell you captioning is captioning. Run the AI, attach it to the stream, done. At a corporate town hall with a single presenter speaking plain English, that is probably fine. At a cardiology society’s annual scientific meeting with eleven concurrent sessions, an international faculty, and CPD-archivable recordings, it is not close to fine.

The accuracy gap between general-purpose automated captioning and a properly briefed human CART reporter is real and it is specific to medical content. Drug names, anatomical terms, gene identifiers, trial acronyms. These are exactly the words an automated system is most likely to get wrong, and they are the words that matter most in the transcript.

Medical and pharma conference production carries a captioning requirement that most other formats do not. This article covers why, and how to scope it properly.

Why medical captioning is harder

General-purpose speech-to-text engines achieve 92 to 95 per cent accuracy on clean speech from a native English speaker at a consistent pace with no jargon. For most corporate content, 95 per cent is good enough. A thousand words with fifty wrong is annoying but intelligible.

A clinical presenter is a different input. You have:

Drug names and brand names. Pharmaceutical names are engineered to be distinctive, which also makes them unlike most words the model has been trained on. An AI system that cannot recognise a drug name does not produce a blank. It produces a confident wrong word, often another real drug or a common English word that sounds similar. The research is specific on this: a transcript showing 99 per cent overall accuracy can still contain medication name errors that swap drugs with entirely different therapeutic uses. That is not an accuracy problem at the level of “the chair” versus “a chair.” It is a factual error in a transcript that will be archived as clinical reference material.
Anatomical and procedural terms. Subspecialties run deep. A cardiothoracic surgeon’s session will include terms that a general medical vocabulary model has rarely seen. A dermatology session uses a different set. An oncology session a different set again. The model generalises poorly across subspecialties.
Gene and trial identifiers. BRCA2, HER2, KEYNOTE-522. These are alphanumeric strings. Automated systems frequently mangle them, especially when spoken quickly or with any variation in stress.
Accented international speakers. Medical conferences routinely feature faculty from across Asia, Europe, and the Americas. The same AI engine that handles a Sydney-born presenter at 95 per cent accuracy can drop to 70 to 80 per cent on an accented speaker with a fast delivery pace. Error rates compound badly here.
Dense, fast-paced delivery. Presenters in scientific sessions often read prepared text at speed. The AI needs time and context to resolve ambiguous phonemes. At 180 words per minute from a technical vocabulary, it does not have either.
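The point in the first item, that a high overall score can hide a wrong drug name, is easy to show with a toy example. The sketch below is illustrative only: the function name `accuracy` is hypothetical, the scoring assumes substitution errors only (both transcripts are the same length), and the drug pair hydralazine and hydroxyzine is a well-known sound-alike pair chosen for illustration, not taken from any specific study.

```python
def accuracy(reference, hypothesis, only=None):
    """Share of reference words reproduced exactly, position by position.

    Assumes both word lists are the same length (substitution errors only).
    If `only` is given, score just the reference words in that set.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("this simple scorer needs equal-length transcripts")
    pairs = [(r, h) for r, h in zip(reference, hypothesis)
             if only is None or r in only]
    if not pairs:
        raise ValueError("no words to score")
    return sum(r == h for r, h in pairs) / len(pairs)


# A 100-word stretch where the only error is the drug name.
reference = ["the"] * 99 + ["hydralazine"]
hypothesis = ["the"] * 99 + ["hydroxyzine"]
glossary = {"hydralazine"}

overall = accuracy(reference, hypothesis)                  # 99 of 100 words correct
on_terms = accuracy(reference, hypothesis, only=glossary)  # 0 of 1 drug names correct
print(f"overall {overall:.0%}, drug names {on_terms:.0%}")
```

Scoring the glossary terms separately from the overall transcript is the design choice that matters here: the headline number and the number that matters clinically can be very far apart.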

The compound effect: a specialist scientific session can have an effective captioning accuracy of 75 to 80 per cent with general-purpose AI captioning, even if the same system handles a standard corporate keynote at 94 per cent. The captions are technically present but they are not reliable.
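To put those percentages into words on screen, here is a rough calculation. It assumes a 20-minute talk (an assumed length, not a figure from any real programme) at the 180 words per minute mentioned above, and treats accuracy as a simple share of words correct. The function name `expected_errors` is hypothetical.

```python
def expected_errors(words_per_minute: int, minutes: int, accuracy: float) -> int:
    """Expected number of wrong words at a given word-level accuracy."""
    total_words = words_per_minute * minutes
    return round(total_words * (1.0 - accuracy))


# Assumed talk: 20 minutes at 180 wpm, i.e. 3,600 words.
scenarios = [
    ("corporate keynote at 94 per cent", 0.94),
    ("specialist session at 80 per cent", 0.80),
    ("specialist session at 75 per cent", 0.75),
]
for label, accuracy in scenarios:
    print(f"{label}: about {expected_errors(180, 20, accuracy)} wrong words")
```

On those assumptions the keynote carries a couple of hundred wrong words and the specialist session several hundred to nine hundred, in a single talk.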

CART vs AI: where each fits

CART (Communication Access Realtime Translation) is live captioning produced by a human stenographer trained in real-time transcription on a phonetic steno keyboard. A professional CART reporter producing court-grade output consistently achieves 98 to 99 per cent accuracy. The critical difference from AI is not just the number but the failure mode.

When an AI system is uncertain, it guesses and displays the guess. When a CART reporter is uncertain, they pause, listen again, and either resolve or briefly mark the uncertainty. The errors in CART output tend to be omissions or brief gaps. The errors in AI output tend to be confident wrong words. For a scientific transcript, omissions are recoverable. Confident wrong words are not.

CART reporters can also be prepared. Send a briefing pack before the event covering the speaker faculty, the session topics, the drug names and trial acronyms in the programme, and the specific vocabulary expected for each session. A prepared reporter enters the event having already logged the hard words. Their accuracy on specialist terms is markedly higher than an unprepared reporter’s, and far higher than any AI system’s on the same input.

Where AI captioning is appropriate at a medical conference:

Internal logistics sessions, committee meetings, welcome addresses, and other non-scientific content where accuracy on specialist terms is not material.
A rapid first-pass transcript that will be human-reviewed and corrected before archiving.
Sessions where the CPD body does not require audited captioning and the organiser has confirmed that AI accuracy meets the access need.
As a backup system running in parallel with CART, giving a second output if the CART feed has a technical disruption.

Where CART is the right call:

Any scientific session being archived as CPD-accredited content.
Any session with an accented international faculty member presenting dense clinical material.
Any session where the recording will be used as a reference document rather than just a memory aid.
External broadcasts to a deaf or hard-of-hearing remote audience who are relying on captions as their primary access to the content.
Any content governed by a regulatory requirement or funding-body accessibility standard.
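The two lists above reduce to a simple rule: if any of the CART criteria applies to a session, book CART; otherwise AI captioning may be acceptable. A minimal sketch of that rule follows. The `Session` fields and the `caption_mode` function are hypothetical names for illustration, and a real scoping decision would still need the organiser's and the CPD body's sign-off.

```python
from dataclasses import dataclass


@dataclass
class Session:
    title: str
    cpd_archived: bool = False             # archived as CPD-accredited content
    accented_dense_clinical: bool = False  # accented faculty, dense clinical material
    used_as_reference: bool = False        # recording will serve as a reference document
    remote_dhh_audience: bool = False      # remote deaf or hard-of-hearing audience relies on captions
    regulated: bool = False                # regulatory or funding-body accessibility standard applies


def caption_mode(session: Session) -> str:
    """Return "CART" if any CART criterion applies, otherwise "AI"."""
    needs_cart = any([
        session.cpd_archived,
        session.accented_dense_clinical,
        session.used_as_reference,
        session.remote_dhh_audience,
        session.regulated,
    ])
    return "CART" if needs_cart else "AI"


programme = [
    Session("Welcome address"),
    Session("Scientific session (hypothetical)", cpd_archived=True),
]
for session in programme:
    print(f"{session.title}: {caption_mode(session)}")
```

The rule is deliberately one-sided: a single CART criterion is enough, because the cost of under-scoping falls on the archive and the audience, not on the budget.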

The cost difference is real but not prohibitive. Human CART typically adds $400 to $800 per event day for a single captioning feed. For a conference with four concurrent scientific rooms and two days of content, that is eight feed-days, so the CART uplift across all rooms at those rates is $3,200 to $6,400. Against a total AV scope of $100,000 or more, that is a small line item to protect the integrity of the primary deliverable.
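The multiplication is worth running against your own room count. A minimal sketch, counting only the per-feed day rate quoted above and assuming one feed per room per day. A real quote may bundle other items (SRT clean-up, travel, a second reporter for long days) that are not modelled here, and the function name `cart_uplift` is hypothetical.

```python
def cart_uplift(rooms: int, days: int, rate_low: int = 400, rate_high: int = 800) -> tuple:
    """Low and high CART cost for one feed per room per day."""
    feed_days = rooms * days
    return feed_days * rate_low, feed_days * rate_high


low, high = cart_uplift(rooms=4, days=2)
print(f"${low:,} to ${high:,}")   # $3,200 to $6,400

# Share of an assumed $100,000 total AV scope at the high end.
print(f"{high / 100_000:.1%} of total scope")
```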

Three places captions have to appear

A medical conference has three distinct caption outputs, and they have different technical and timing requirements. Treating them as one job is how captions end up missing from one of them on the day.

1. The in-room screen. Live captions displayed on a secondary screen or strip at the bottom of the main presentation screen, visible to in-room delegates. This is real-time output with latency of around two to five seconds (CART) or less (AI). The in-room caption display serves delegates who are deaf or hard of hearing, delegates whose first language is not English, and anyone following a dense technical session who benefits from reading the text as they hear it. The display format matters: font size, contrast, placement relative to the presentation content, and whether the caption window interrupts the slides. Plan the display position at the screen layout stage, not on the day.

2. The webcast caption track. Live captions embedded in or overlaid on the streaming output. Platforms like Zoom Webinars, Vimeo, and custom RTMP destinations handle this differently. Some accept an external CART feed via stenography software output. Some only support their own integrated AI captioning. Know which applies to your platform before you brief the captioning provider, because the technical integration (CART software to streaming encoder to platform) has to be tested before the event. A caption feed that works perfectly in-room but is not reaching the stream is invisible to the home audience. In broadcast-grade live streaming production, this integration is part of the production spec, not an afterthought.

3. The recording sidecar (SRT file). The post-event SRT file delivered alongside the video recording is the long-term accessibility asset. It is also the document CPD bodies and accessibility auditors are most likely to actually inspect. An SRT derived from the live CART feed, cleaned in post, is the highest-quality output. An SRT generated by auto-transcribing the video file after the event is a lower-quality fallback. For CPD-grade archives, the SRT should be human-reviewed before delivery. The recording and post-production workflow should include SRT review as a line item, with a clear handoff from the captioning provider to the post team.
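For anyone who has not opened one, an SRT file is plain text: a running cue number, a start and end time in `HH:MM:SS,mmm` form separated by `-->`, the caption text, and a blank line between cues. The sketch below writes that layout from a list of timed segments. The function names and the sample cues are illustrative assumptions, not output from any particular CART software.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)  # one blank line between cues


# Hypothetical cues for illustration.
cues = [
    (0.0, 2.5, "Good morning and welcome."),
    (2.5, 6.0, "We begin with the KEYNOTE-522 update."),
]
print(to_srt(cues))
```

Because the format is this simple, a human reviewer can correct a drug name or trial acronym directly in the file without touching the timing.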

Coordinating all three outputs through one captioning provider who understands the technical chain is substantially simpler than splitting the in-room feed from the webcast feed from the post-production SRT. Brief the captioning provider on all three requirements at the same time.

Glossaries and pre-event prep

The fastest way to improve captioning accuracy without changing the captioner or the technology is to prepare better. A CART reporter who has ingested the event vocabulary before the session starts will outperform the same reporter working cold by a meaningful margin.

What a good pre-event glossary pack includes:

Speaker names and affiliations. Full names and institutional names are often unusual or easily misheard. Provide a list, with phonetic guides for any names whose pronunciation is not obvious from the spelling.
Drug names and trial names. Every drug mentioned in the programme, both generic and brand names. Every clinical trial acronym. Dosage forms and administration routes where they are likely to appear.
Procedural and anatomical terms specific to the specialty. A cardiac surgery session and an ophthalmology session have almost no vocabulary overlap. The glossary is specialty-specific, not generically medical.
Session titles and breakout names. The exact titles as they appear in the programme, so the reporter can match transcript segments to session records.
Presenter abstracts. Where available, the abstract or slide deck summary gives the reporter context for the session before it begins.

Most professional CART providers will ask for this material. If yours does not, send it anyway. The standard practice for conference captioning at this level is for the captioner to review and load the glossary into their steno software in advance, so specialist terms are pre-programmed as single-stroke outputs rather than resolved phonetically on the fly.
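One way to keep the pack consistent across sessions is to hold it as structured data, one record per session, and export a flat term list for the captioner. The sketch below is an assumption about how that could look, not a format any CART provider requires. The class and field names are hypothetical, the speaker is a placeholder, and the drug and trial entries are examples only.

```python
from dataclasses import dataclass, field


@dataclass
class SessionGlossary:
    session_title: str                             # exact title as printed in the programme
    specialty: str
    speakers: list = field(default_factory=list)   # dicts: name, affiliation, phonetic
    drugs: list = field(default_factory=list)      # dicts: generic, brand
    trials: list = field(default_factory=list)     # trial acronyms
    terms: list = field(default_factory=list)      # anatomical, procedural, gene terms

    def term_list(self) -> list:
        """Flat, de-duplicated list of every term the captioner should pre-load."""
        names = set(self.trials) | set(self.terms)
        for drug in self.drugs:
            names.update([drug["generic"], drug["brand"]])
        names.update(speaker["name"] for speaker in self.speakers)
        return sorted(names, key=str.lower)


# Placeholder session for illustration.
glossary = SessionGlossary(
    session_title="Early breast cancer: trial updates",
    specialty="oncology",
    speakers=[{"name": "Dr A. Example", "affiliation": "Example University",
               "phonetic": "eg-ZAM-pul"}],
    drugs=[{"generic": "pembrolizumab", "brand": "Keytruda"}],
    trials=["KEYNOTE-522"],
    terms=["BRCA2", "HER2"],
)
print("\n".join(glossary.term_list()))
```

Keeping one record per session also makes programme changes easier to handle: a new speaker or session means a new record, not a hunt through a shared document.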

For multi-day conferences, update the glossary if the programme changes. Speakers or sessions added after the initial brief should trigger a glossary update, not just a schedule update.

Accessibility standards and why CPD bodies default to CART

Accessibility in Australia sits under the Disability Discrimination Act 1992 and, for public-facing organisations, the Web Content Accessibility Guidelines (WCAG). For medical societies, the obligation is broader than the legal minimum: the membership typically includes clinicians and researchers with hearing loss, and the CPD content being archived is an educational resource that has to be genuinely accessible.

The practical standard that Australian medical colleges and research funding bodies apply is a captioning accuracy threshold of 98 per cent or higher for archived CPD content. General-purpose AI captioning on technical medical content does not reliably reach that threshold. CART does.
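How a body measures that 98 per cent is worth asking, because the method is not always stated. One common approach is word error rate: align the caption text against a verified reference transcript, count the substitutions, insertions, and deletions, and divide by the length of the reference. The sketch below assumes accuracy is defined as one minus word error rate, which is an assumption rather than any college's published method. The function names are hypothetical.

```python
def word_errors(reference: list, hypothesis: list) -> int:
    """Minimum substitutions, insertions, and deletions (word-level edit distance)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from captions
                current[j - 1] + 1,      # extra word in captions
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference_text: str, caption_text: str) -> float:
    """Accuracy as 1 minus word error rate, floored at zero."""
    reference = reference_text.lower().split()
    captions = caption_text.lower().split()
    if not reference:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1.0 - word_errors(reference, captions) / len(reference))


def meets_threshold(reference_text: str, caption_text: str, threshold: float = 0.98) -> bool:
    return word_accuracy(reference_text, caption_text) >= threshold
```

A short usage example: `meets_threshold("the patient received pembrolizumab", "the patient received pembrolizumab")` returns `True`, while a transcript with one wrong word in four scores 75 per cent and fails. An overall score like this should sit alongside a separate check of drug and trial names, since a transcript can clear 98 per cent and still get the critical terms wrong.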

The pattern for audited content is consistent: human CART for live scientific sessions, human review of the CART output before the SRT is delivered with the archive, and AI captioning only for non-audited supporting content. Some funding bodies and conference organisers are explicit about this in their accessibility policies. Others apply it as an unwritten standard that the AV team is expected to know.

For societies with international members or federally funded research activity, the requirement may also extend to remote access: the webcast caption track must meet the same accuracy standard as the in-room captions, not a lower bar because it is “just a stream.” Planning the caption scope around the highest applicable standard, rather than the minimum, avoids having to retrofit accessibility compliance after the event.

The production implication is straightforward: caption scope for a medical conference should be confirmed in the brief at the same time as the recording scope, with the same level of specificity. Which rooms need CART. Which sessions are CPD-archivable and therefore need SRT delivery. Which platform the webcast is running on and whether it accepts an external caption feed. All of that is pre-production planning, not day-of logistics.
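Those questions can be held as a small scope record and checked for gaps before the brief is signed off. The sketch below is one possible shape for that record, with hypothetical class and field names. It flags only the three items named above and is not a complete accessibility checklist.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoomScope:
    room: str
    cart: bool                     # human CART booked for this room
    cpd_archivable: bool = False   # sessions here go into the CPD archive
    srt_delivery: bool = False     # reviewed SRT is in the post-production scope


@dataclass
class CaptionScope:
    webcast_platform: str
    accepts_external_feed: Optional[bool] = None   # None means not yet confirmed
    rooms: List[RoomScope] = field(default_factory=list)


def scope_gaps(scope: CaptionScope) -> List[str]:
    """List the open questions that should be closed in pre-production."""
    gaps = []
    if scope.accepts_external_feed is None:
        gaps.append(
            f"Confirm whether {scope.webcast_platform} accepts an external caption feed"
        )
    for room in scope.rooms:
        if room.cpd_archivable and not room.cart:
            gaps.append(f"{room.room}: CPD-archivable sessions but no CART booked")
        if room.cpd_archivable and not room.srt_delivery:
            gaps.append(f"{room.room}: CPD-archivable sessions but no SRT delivery scoped")
    return gaps


# Hypothetical two-room scope for illustration.
scope = CaptionScope(
    webcast_platform="Zoom Webinars",
    rooms=[
        RoomScope("Plenary", cart=True, cpd_archivable=True, srt_delivery=True),
        RoomScope("Room 2", cart=False, cpd_archivable=True),
    ],
)
for gap in scope_gaps(scope):
    print(gap)
```

An empty list from `scope_gaps` means the three questions have answers on paper. It does not replace testing the caption chain end to end before the event.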

Get the captioning scope right early, and the technical chain runs cleanly across in-room display, webcast track, and archive SRT. Leave it until bump-in, and you are solving an accessibility gap in real time while the programme is running.

If you are scoping a medical or scientific conference and want to work through the captioning requirements against the session programme and CPD spec, contact us and we will walk through the scope with you.