Evaluate ASR accuracy
Evaluate the ASR accuracy of your application using Speechly CLI.
Overview
Evaluating ASR accuracy helps you verify that your test dataset is transcribed correctly. When building large-scale voice services, continuously evaluating your application's ASR accuracy becomes an important part of the workflow, helping you identify weaknesses in your training data.
Evaluation consists of four steps:
- Define a test dataset
- Transcribe the test dataset
- Prepare ground truth transcripts
- Compute ASR accuracy
In this guide we'll be using snippets from The Speechly Podcast as an example.
Define test dataset
A test dataset consists of audio recordings of your users speaking to your application. You can record the audio yourself, or use real recordings from your own database. See Supported audio formats for more information.
Put the audio files in a folder and create a JSON Lines file with each audio on its own line, using the following format:
{"audio": "podcast1.wav"}
{"audio": "podcast2.wav"}
{"audio": "podcast3.wav"}
{"audio": "podcast4.wav"}
Transcribe test dataset
Transcribe your test dataset by running:
speechly transcribe test-dataset.jsonl -a YOUR-APP-ID > ground-truths.jsonl
The resulting ground-truths.jsonl contains the test dataset as transcribed by Speechly Cloud:
{"audio": "podcast1.wav", "hypothesis": "welcome to another episode of the speechly podcast where you can expect conversations express"}
{"audio": "podcast2.wav", "hypothesis": "and so in the original siri there were about i don't know fifty different partners and"}
{"audio": "podcast3.wav", "hypothesis": "this concept of voice being an expert ui could you might be unpack that concept a little bit"}
{"audio": "podcast4.wav", "hypothesis": "well again thanks so much for the time adam really enjoyed the discussion"}
If you have a large dataset, reading the raw JSON Lines text can be difficult. You can use jq to make viewing large files easier. Remember to convert the file back to the compact format before sending the ground truths for evaluation.
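For example, you can pretty-print the file for review and re-compact it afterwards. A minimal sketch, assuming jq is installed; ground-truths-pretty.json is just an example name for the intermediate file:
# Pretty-print each JSON object for easier reading
jq . ground-truths.jsonl > ground-truths-pretty.json
# Convert back to compact JSON Lines, one object per line
jq -c . ground-truths-pretty.json > ground-truths.jsonl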
Using Speechly On-device
Transcribe the test dataset using Speechly On-device by specifying a model bundle file:
speechly transcribe test-dataset.jsonl -m YOUR_MODEL_BUNDLE.ort.bundle > ground-truths.jsonl
On-device transcription is available on Enterprise plans.
Prepare ground truth transcripts
Go through the contents of ground-truths.jsonl and fix all transcription errors that you might find:
-{"audio": "podcast1.wav", "transcript": "welcome to another episode of the speechly podcast where you can expect conversations express"}
+{"audio": "podcast1.wav", "transcript": "welcome to another episode of the speechly podcast where you can expect conversations"}
{"audio": "podcast2.wav", "transcript": "and so in the original siri there were about i don't know fifty different partners and"}
-{"audio": "podcast3.wav", "transcript": "this concept of voice being an expert ui could you might be unpack that concept a little bit"}
+{"audio": "podcast3.wav", "transcript": "this concept of voice being an expert ui could you maybe unpack that concept a little bit"}
{"audio": "podcast4.wav", "transcript": "well again thanks so much for the time adam really enjoyed the discussion"}
In this example, podcast1.wav and podcast3.wav had minor transcription errors that needed fixing. The other transcripts were correct and required no changes.
The results from the transcribe command are returned as (audio, hypothesis) key-value pairs, while ground truths must be defined as (audio, transcript) key-value pairs.
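One way to rename the key in bulk is with jq. A minimal sketch, assuming jq is installed; ground-truths-renamed.jsonl is just an example output name:
# Rewrite each line, renaming the hypothesis key to transcript
jq -c '{audio: .audio, transcript: .hypothesis}' ground-truths.jsonl > ground-truths-renamed.jsonl
You still need to review the renamed file and correct the transcripts by hand.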
Compute ASR accuracy
Compute ASR accuracy by running:
speechly evaluate asr YOUR-APP-ID ground-truths.jsonl
The command outputs:
Audio: podcast1.wav
└─ Ground truth: welcome to another episode of the speechly podcast where you can expect conversations********
└─ Prediction: welcome to another episode of the speechly podcast where you can expect conversations express
Audio: podcast3.wav
└─ Ground truth: this concept of voice being an expert ui could you may***be unpack that concept a little bit
└─ Prediction: this concept of voice being an expert ui could you might be unpack that concept a little bit
Word Error Rate (WER): 0.04 (3/68)
WER is computed by comparing your ground truth transcripts against the transcripts generated by Speechly Cloud. Speechly Cloud uses the latest version of your application.
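WER is the number of word-level errors in the predictions (substitutions, deletions and insertions) divided by the total number of words in the ground truth transcripts:
WER = (S + D + I) / N
In the example above there are 3 erroneous words across 68 ground truth words, giving 3/68 ≈ 0.04.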
Improving ASR accuracy
Transcription errors can often be mitigated by ensuring that the misrecognized utterances appear in your training data. Text adaptation will get you quite far, but many tricky cases benefit from Audio adaptation.
After fixing the issues, redeploy your application. You can then run the evaluation again with the same ground-truths.jsonl to see if the ASR accuracy improved.