Development of reports preparation system for colonoscopy using speech-to-text technology

Appendix A: technical description of the speech-to-text (STT) system

The speech-to-text (STT) system employed in this study was developed as a server–client-based cloud architecture, optimized for real-time transcription of medical procedures such as colonoscopy. The workflow begins with the transcription of pre-recorded audio data (database audio files) collected during procedures. These transcriptions serve as the foundation for subsequent model training processes.

The acoustic model is first constructed through phoneme-level training. Initially, a hidden Markov model (HMM) is trained using a Gaussian mixture model (GMM) for each phoneme. This step provides the basis for training a deep neural network (DNN): the frame-level state alignments produced by the HMM-GMM serve as supervised targets. If a previously trained acoustic model exists, the DNN weights are fine-tuned using the newly obtained transcription data.
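The two-stage training described above can be sketched in miniature. In this illustration the HMM-GMM alignment step is replaced by fabricated frame labels, and the "DNN" is reduced to a single softmax layer; none of the dimensions or hyperparameters below are the system's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 200 frames of 13-dimensional acoustic features, 4 phoneme states.
frames = rng.normal(size=(200, 13))
n_states = 4

# Stage 1 stand-in: in the real pipeline, forced alignment with the trained
# HMM-GMM assigns each frame a phoneme state; here the labels are fabricated.
state_labels = rng.integers(0, n_states, size=200)
targets = np.eye(n_states)[state_labels]          # one-hot supervised targets

# Stage 2: train (or fine-tune) a minimal softmax "DNN" on those targets.
W = rng.normal(scale=0.01, size=(13, n_states))   # would be pre-trained weights
b = np.zeros(n_states)

def forward(x):
    logits = x @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def xent(probs):
    return -np.mean(np.log(probs[np.arange(len(state_labels)), state_labels]))

init_loss = xent(forward(frames))
for _ in range(50):                               # plain gradient descent
    probs = forward(frames)
    grad = (probs - targets) / len(frames)        # softmax cross-entropy gradient
    W -= 0.5 * frames.T @ grad
    b -= 0.5 * grad.sum(axis=0)
final_loss = xent(forward(frames))
```

Fine-tuning an existing model follows the same loop, starting from the previously trained weights instead of a near-zero initialization.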

To complement the acoustic model, an n-gram-based language model is developed. This language model is constructed using a corpus extracted from the transcribed data and is combined with the trained acoustic model to create a comprehensive speech recognition system.
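A minimal version of such n-gram training, using add-one smoothing and English stand-in tokens in place of the Korean corpus, might look like:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Build an add-one-smoothed bigram model from a tokenized corpus."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    V = len(vocab)
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

# Hypothetical transcribed fragments (English stand-ins for the Korean data).
corpus = [["polyp", "found", "in", "cecum"],
          ["polyp", "removed"],
          ["no", "polyp", "found"]]
p = train_bigram_lm(corpus)
```

During decoding, such conditional probabilities are combined with the acoustic model's scores to rank candidate transcriptions.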

The final system integrates several specialized modules to enhance performance in noisy and complex clinical environments:

A spacing module that corrects word boundaries in Korean, in which spacing between words is often inconsistent or omitted.

An inverse text normalization (ITN) module that converts recognized Korean speech into written forms appropriate for English terms and numeric expressions (e.g., rendering spoken numbers as digits).

Pronunciation models for processing out-of-vocabulary (OOV) terms, especially those common in medical jargon.

A noise-silence module designed to exclude background noise and silent segments, improving recognition in busy endoscopy suites.
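As one concrete illustration of the last module, a minimal energy-based voice-activity gate can be sketched as follows; the frame size and threshold are illustrative assumptions, not the system's actual parameters.

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold_db=-35.0):
    """Flag frames whose RMS energy (in dB relative to the loudest frame)
    exceeds a threshold -- a crude stand-in for the noise-silence module."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    return db > threshold_db

# Synthetic test clip: 0.5 s of silence, 0.5 s of a 440 Hz tone, 0.5 s of silence.
sr = 16000
t = np.arange(sr // 2) / sr
clip = np.concatenate([np.zeros(sr // 2),
                       0.5 * np.sin(2 * np.pi * 440 * t),
                       np.zeros(sr // 2)])
voiced = energy_vad(clip)
```

A production system would use a far more robust detector (the surgical suite adds non-stationary noise), but the gating principle is the same: only frames flagged as voiced are passed to the recognizer.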

For speech signal processing, the system applies acoustic feature extraction to the incoming audio.
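The specific extraction methods are not enumerated in this excerpt; log-mel filterbank energies are a common choice in DNN-based speech recognition and serve here purely as an illustrative assumption.

```python
import numpy as np

def log_mel_features(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Log-mel filterbank energies -- an assumed (common) feature choice."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fbank.T + 1e-10)

feats = log_mel_features(np.random.default_rng(1).normal(size=16000))
```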

These extracted features are fed into the DNN-based acoustic model. The model architecture comprises a convolutional neural network (CNN) encoder that captures local spectral and temporal patterns in the voice data, followed by a bidirectional long short-term memory (BLSTM) network or gated recurrent unit (GRU) for sequential decoding. Alternative network architectures can also be employed, depending on the characteristics of the data.
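The encoder-plus-recurrent-decoder arrangement can be sketched with toy dimensions; the single unidirectional GRU cell and all layer sizes below are illustrative assumptions, and the deployed model is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernel):
    """Valid 1-D convolution over time: (T, F) with (k, F, C) -> (T-k+1, C)."""
    k, _, _ = kernel.shape
    T = x.shape[0] - k + 1
    windows = np.stack([x[t:t + k] for t in range(T)])       # (T', k, F)
    return np.tensordot(windows, kernel, axes=([1, 2], [0, 1]))

def gru_step(h, x, Wz, Wr, Wh):
    """One gated-recurrent-unit step (biases omitted for brevity)."""
    sig = lambda a: 1 / (1 + np.exp(-a))
    z = sig(np.concatenate([h, x]) @ Wz)                     # update gate
    r = sig(np.concatenate([h, x]) @ Wr)                     # reset gate
    h_tilde = np.tanh(np.concatenate([r * h, x]) @ Wh)       # candidate state
    return (1 - z) * h + z * h_tilde

T, F, C, H = 98, 40, 32, 64          # frames, mel bands, conv channels, hidden
feats = rng.normal(size=(T, F))
encoded = np.maximum(conv1d(feats, rng.normal(scale=0.1, size=(5, F, C))), 0)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(H + C, H)) for _ in range(3))
h = np.zeros(H)
for x in encoded:                    # sequential decoding over encoded frames
    h = gru_step(h, x, Wz, Wr, Wh)
```

A bidirectional variant would run a second recurrence over the frames in reverse and concatenate the two hidden states at each step.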

The STT system differs from general-purpose Korean speech recognition engines in that it is specifically tailored to medical recording contexts, which often involve complex terms—typically English medical vocabulary—combined with Korean grammar and sentence structures. To address this, a domain-specific language model was constructed by training on large-scale colonoscopy reports and real-world medical speech data collected during actual procedures. This model was further adapted by combining it with a general Korean language model to balance technical accuracy and linguistic fluency.
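The combination of the two language models can be illustrated as simple linear interpolation; the mixing weight and toy probabilities below are assumptions for illustration, not reported values.

```python
def interpolated_prob(p_domain, p_general, lam=0.7):
    """Linear interpolation of a domain LM with a general Korean LM.
    lam is a hypothetical mixing weight; the paper does not state one."""
    return lambda prev, word: (lam * p_domain(prev, word)
                               + (1 - lam) * p_general(prev, word))

# Toy stand-in models: the domain LM favors a medical term,
# the general LM favors an everyday word.
p_dom = lambda prev, w: {"cecum": 0.4, "house": 0.01}.get(w, 0.001)
p_gen = lambda prev, w: {"cecum": 0.001, "house": 0.2}.get(w, 0.01)
p = interpolated_prob(p_dom, p_gen)
```

With a weight favoring the domain model, medical vocabulary is preferred during decoding while the general model still keeps everyday Korean fluent.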
