Voice Transcription Research Tool


A Python research instrument for measuring speech-to-text accuracy under varying ambient noise conditions, automating data collection, transcription, and statistical analysis for an HCI experiment.

The Speech Transcription Trializer 3000 application showing the list of experimental trials and a text excerpt displayed for the participant to read aloud.
Role: Lead Developer
Context: Graduate Coursework, Research Methods, RIT
Timeline: Nov 2024
Duration: ~3 weeks

What was built

Experiment Automation Tool: Multi-threaded Python application synchronizing transcript display, ambient noise playback, voice capture, and real-time metrics logging
Statistical Analysis: Word error rate and words-per-minute analysis across noise conditions, with visualizations

The Problem

How much does background noise degrade speech-to-text accuracy, and does the type of noise matter? This was the research question driving an HCI experiment that needed a reliable way to present text excerpts to participants, play ambient noise conditions simultaneously, capture their spoken reading, transcribe it, and compute accuracy metrics, all in a synchronized, repeatable workflow.

Doing this manually would introduce timing inconsistencies and transcription variability. The experiment needed an automated instrument.

The Tool

The Speech Transcription Trializer 3000 is a multi-threaded Python desktop application purpose-built for this experiment. It synchronizes four concurrent operations: displaying a text excerpt for the participant to read aloud, playing an ambient noise clip (library quiet, busy cafe, or crowd noise), recording the participant’s voice, and logging performance metrics, all coordinated through the standard-library threading module.
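The coordination pattern can be sketched with a `threading.Barrier` that releases the noise-playback and voice-capture workers at the same instant. This is a minimal illustration, not the tool's actual code: the function and event names are hypothetical, and `time.sleep` stands in for real audio I/O.

```python
import threading
import time

def run_trial(noise_clip, excerpt, duration_s=2.0):
    """Coordinate noise playback and voice capture for one trial (sketch)."""
    start = threading.Barrier(3)  # two worker threads plus the main thread
    log = []

    def play_noise():
        start.wait()                       # block until every thread is ready
        log.append(("noise_started", noise_clip))
        time.sleep(duration_s)             # stand-in for audio playback

    def capture_voice():
        start.wait()
        log.append(("recording_started", excerpt))
        time.sleep(duration_s)             # stand-in for microphone capture

    threads = [threading.Thread(target=play_noise),
               threading.Thread(target=capture_voice)]
    for t in threads:
        t.start()
    start.wait()                           # release both workers simultaneously
    t0 = time.monotonic()
    for t in threads:
        t.join()
    log.append(("trial_elapsed_s", round(time.monotonic() - t0, 2)))
    return log
```

Starting both workers through one barrier keeps the noise onset and the recording onset aligned to within thread-scheduling jitter, which is what makes trial timing repeatable.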

Results dialog showing the transcribed text, duration, words per minute, and word error rate for a completed trial
Post-trial results showing the transcription, timing metrics, and word error rate computed against the source text.

After each trial, the tool sends the audio recording to the Google Speech-to-Text API for transcription, then automatically computes word error rate (WER) by comparing the transcription to the source text, and words per minute (WPM) from the recording duration. Results are written to a structured CSV file for downstream statistical analysis.
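The two metrics are standard: WER is the word-level Levenshtein distance between reference and hypothesis divided by the reference word count, and WPM is word count over minutes elapsed. A minimal sketch (assumed implementation, not the tool's actual code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def words_per_minute(transcript: str, duration_s: float) -> float:
    """Speaking speed from word count and recording duration."""
    return len(transcript.split()) / (duration_s / 60.0)
```

For example, a one-word substitution in a four-word excerpt yields a WER of 0.25.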

The tool manages the full experimental matrix, cycling through text excerpts and noise conditions, with a GUI that lets the researcher monitor progress, review individual trial results, and restart trials if needed.
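Cycling through the full crossing of excerpts and noise conditions amounts to a Cartesian product. A sketch of how such a trial matrix might be generated; the excerpt identifiers here are placeholders, and the noise condition names come from the experiment described above:

```python
import itertools

EXCERPTS = ["excerpt_1", "excerpt_2", "excerpt_3"]       # hypothetical IDs
NOISE_CONDITIONS = ["library_quiet", "busy_cafe", "crowd"]

def build_trial_matrix():
    """One row per trial: every excerpt crossed with every noise condition."""
    pairs = itertools.product(EXCERPTS, NOISE_CONDITIONS)
    return [{"trial": i, "excerpt": e, "noise": n}
            for i, (e, n) in enumerate(pairs, start=1)]
```

Driving the GUI from a precomputed matrix like this makes it easy to track progress, re-run a single trial, or resume after an interruption.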

Results

The automated pipeline collected data across all participants and noise conditions with consistent timing and methodology. The statistical analysis revealed several patterns in how ambient noise affects transcription accuracy.

Scatter plot of word error rate versus words per minute, color-coded by noise type (library, busy cafe, crowd), showing higher error rates at lower speaking speeds across all conditions
Word error rate by speaking speed and noise condition. Slower speakers showed higher error rates regardless of ambient noise type.
Box plot showing word error rate distributions for library, busy cafe, and crowd noise conditions, with crowd noise showing the highest median WER
Heatmap of average word error rate by audio clip type and words-per-minute category, showing very slow speech producing the highest error rates
WER distribution by noise condition (left) shows crowd noise producing the highest median error rates. The heatmap (right) reveals that very slow speech consistently yields the worst transcription accuracy across all noise types.

Crowd noise produced the highest median word error rates, but the most striking finding was the influence of speaking speed: very slow speech consistently yielded the worst transcription accuracy across all noise conditions. This suggests the speech-to-text API’s language model performs better with natural-cadence input than with halting, deliberate speech.


Graduate coursework, Research Methods, RIT · Fall 2024

Methods

Statistical Analysis

Tech Stack

Python · Google Speech-to-Text API