The Problem
How much does background noise degrade speech-to-text accuracy, and does the type of noise matter? This was the research question driving an HCI experiment. Answering it required a reliable way to present text excerpts to participants, play ambient noise conditions simultaneously, capture their spoken reading, transcribe it, and compute accuracy metrics, all within a single synchronized, repeatable workflow.
Doing this manually would introduce timing inconsistencies and transcription variability. The experiment needed an automated instrument.
The Tool
The Speech Transcription Trializer 3000 is a multi-threaded Python desktop application purpose-built for this experiment. It synchronizes four concurrent operations: displaying a text excerpt for the participant to read aloud, playing an ambient noise clip (library quiet, busy cafe, or crowd noise), recording the participant’s voice, and logging performance metrics, all coordinated through Python’s threading infrastructure.
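A minimal sketch of how such a trial might be coordinated with Python's threading primitives: each operation runs on its own thread, and a shared event releases them all at once so playback, recording, and display start together. The operation names are illustrative, and the actual audio and display work is stubbed out.

```python
import threading
import time

def run_trial(operations):
    """Release all trial operations on one shared signal and wait for them."""
    start = threading.Event()          # one signal releases every thread
    log = []
    lock = threading.Lock()

    def worker(name):
        start.wait()                   # block until the trial begins
        with lock:
            log.append((name, time.monotonic()))
        # ... actual playback / recording / display work would go here ...

    threads = [threading.Thread(target=worker, args=(op,)) for op in operations]
    for t in threads:
        t.start()
    start.set()                        # begin the trial: all threads proceed
    for t in threads:
        t.join()
    return log

log = run_trial(["display_text", "play_noise", "record_audio", "log_metrics"])
```

Using a single `Event` rather than starting the threads sequentially keeps the inter-operation timing skew small and repeatable across trials.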
After each trial, the tool sends the audio recording to the Google Speech-to-Text API for transcription, then automatically computes word error rate (WER) by comparing the transcription to the source text, and words per minute (WPM) from the recording duration. Results are written to a structured CSV file for downstream statistical analysis.
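The two metrics can be sketched in a few lines. WER is the word-level edit distance between transcription and source text, divided by the source word count; WPM is word count over minutes. This is a standard formulation, not necessarily the tool's exact implementation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance over the reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

def words_per_minute(text, duration_seconds):
    """Reading speed from the recording duration."""
    return len(text.split()) * 60 / duration_seconds

# One substitution plus one deletion over five reference words -> WER 0.4
wer = word_error_rate("the quick brown fox jumps", "the quick red fox")
```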
The tool manages the full experimental matrix, cycling through text excerpts and noise conditions, with a GUI that lets the researcher monitor progress, review individual trial results, and restart trials if needed.
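The experimental matrix described above amounts to a full crossing of excerpts and noise conditions. A sketch of how it might be generated, with shuffling to randomize presentation order per participant (the excerpt names here are placeholders, not the study's actual materials):

```python
import itertools
import random

excerpts = ["excerpt_1", "excerpt_2", "excerpt_3"]
noise_conditions = ["library_quiet", "busy_cafe", "crowd_noise"]

# Every excerpt paired with every noise condition: 3 x 3 = 9 trials
trials = list(itertools.product(excerpts, noise_conditions))
random.shuffle(trials)  # counterbalance order effects across participants
```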
Results
The automated pipeline collected data across all participants and noise conditions with consistent timing and methodology. The statistical analysis revealed several patterns in how ambient noise affects transcription accuracy.
Crowd noise produced the highest median word error rates, but the most striking finding was the influence of speaking speed: very slow speech consistently yielded the worst transcription accuracy across all noise conditions. This suggests that the speech-to-text API's language model performs better with natural-cadence input than with halting, deliberate speech.
Graduate coursework, Research Methods, RIT · Fall 2024