The Problem
How much does background noise degrade speech-to-text accuracy, and does the type of noise matter? This was the research question driving an HCI experiment. Answering it required a reliable way to present text excerpts to participants, play ambient noise conditions simultaneously, capture their spoken reading, transcribe it, and compute accuracy metrics, all within a single synchronized, repeatable workflow.
Doing this manually would introduce timing inconsistencies and transcription variability. The experiment needed an automated instrument.
The Tool
The Speech Transcription Trializer 3000 is a multi-threaded Python desktop application purpose-built for this experiment. It synchronizes four concurrent operations: displaying a text excerpt for the participant to read aloud, playing an ambient noise clip (library quiet, busy cafe, or crowd noise), recording the participant’s voice, and logging performance metrics, all coordinated through Python’s threading infrastructure.
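A minimal sketch of how such a trial might be coordinated with Python's threading primitives: each operation runs on its own thread, and a shared event releases them all at once so playback, recording, and display start together. The operation names are illustrative, and the actual audio and display work is stubbed out.

```python
import threading
import time

def run_trial(operations):
    """Release all trial operations on one shared signal and wait for them."""
    start = threading.Event()          # one signal releases every thread
    log = []
    lock = threading.Lock()

    def worker(name):
        start.wait()                   # block until the trial begins
        with lock:
            log.append((name, time.monotonic()))
        # ... actual playback / recording / display work would go here ...

    threads = [threading.Thread(target=worker, args=(op,)) for op in operations]
    for t in threads:
        t.start()
    start.set()                        # begin the trial: all threads proceed
    for t in threads:
        t.join()
    return log

log = run_trial(["display_text", "play_noise", "record_audio", "log_metrics"])
```

Using a single `Event` rather than starting the threads sequentially keeps the inter-operation timing skew small and repeatable across trials.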
After each trial, the tool sends the audio recording to the Google Speech-to-Text API for transcription, then automatically computes word error rate (WER) by comparing the transcription to the source text, and words per minute (WPM) from the recording duration. Results are written to a structured CSV file for downstream statistical analysis.
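The two metrics can be sketched in a few lines. WER is the word-level edit distance between transcription and source text, divided by the source word count; WPM is word count over minutes. This is a standard formulation, not necessarily the tool's exact implementation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance over the reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

def words_per_minute(text, duration_seconds):
    """Reading speed from the recording duration."""
    return len(text.split()) * 60 / duration_seconds

# One substitution plus one deletion over five reference words -> WER 0.4
wer = word_error_rate("the quick brown fox jumps", "the quick red fox")
```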
The tool manages the full experimental matrix, cycling through text excerpts and noise conditions, with a GUI that lets the researcher monitor progress, review individual trial results, and restart trials if needed.
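The experimental matrix described above amounts to a full crossing of excerpts and noise conditions. A sketch of how it might be generated, with shuffling to randomize presentation order per participant (the excerpt names here are placeholders, not the study's actual materials):

```python
import itertools
import random

excerpts = ["excerpt_1", "excerpt_2", "excerpt_3"]
noise_conditions = ["library_quiet", "busy_cafe", "crowd_noise"]

# Every excerpt paired with every noise condition: 3 x 3 = 9 trials
trials = list(itertools.product(excerpts, noise_conditions))
random.shuffle(trials)  # counterbalance order effects across participants
```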
Results
The automated pipeline collected data across all participants and noise conditions with consistent timing and methodology. The statistical analysis revealed several patterns in how ambient noise affects transcription accuracy.
Crowd noise produced the highest median word error rates, but the most striking finding was the influence of speaking speed: very slow speech consistently yielded the worst transcription accuracy across all noise conditions. This suggests that the speech-to-text API's language model performs better with natural-cadence input than with halting, deliberate speech.
Graduate coursework, Research Methods, RIT · Fall 2024