Evaluate ASR accuracy

Evaluate the ASR accuracy of your application using Speechly CLI.

Overview

Evaluating ASR accuracy helps you verify that your test dataset is transcribed correctly. When building large-scale voice services, continuously evaluating the application's ASR accuracy becomes an important part of the workflow, helping you identify weaknesses in your training data.

Evaluation consists of four steps:

  1. Define a test dataset
  2. Transcribe the test dataset
  3. Prepare ground truth transcripts
  4. Compute ASR accuracy

In this guide we'll be using snippets from The Speechly Podcast as an example.

Define test dataset

A test dataset consists of audio recordings of your users speaking to your application. You can record your own audio files, or use real recordings from your own database. See Supported audio formats for more information.

Put the audio files in a folder and create a JSON Lines file with each audio file on its own line, using the following format:

test-dataset.jsonl
{"audio": "podcast1.wav"}
{"audio": "podcast2.wav"}
{"audio": "podcast3.wav"}
{"audio": "podcast4.wav"}

Transcribe test dataset

Transcribe your test dataset by running:

speechly transcribe test-dataset.jsonl -a YOUR-APP-ID > ground-truths.jsonl

The resulting ground-truths.jsonl contains the test dataset as transcribed by Speechly Cloud:

{"audio": "podcast1.wav", "hypothesis": "welcome to another episode of the speechly podcast where you can expect conversations express"}
{"audio": "podcast2.wav", "hypothesis": "and so in the original siri there were about i don't know fifty different partners and"}
{"audio": "podcast3.wav", "hypothesis": "this concept of voice being an expert ui could you might be unpack that concept a little bit"}
{"audio": "podcast4.wav", "hypothesis": "well again thanks so much for the time adam really enjoyed the discussion"}
Tip

If you have a large dataset, reading the raw JSON Lines text can be difficult. You can use jq to make viewing large files easier. Remember to convert back to compact JSON Lines before sending the ground truths for evaluation.
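
For example, assuming jq is installed:

# Pretty-print for easier reading and editing
jq . ground-truths.jsonl > ground-truths-pretty.json
# After editing, convert back to compact JSON Lines
jq -c . ground-truths-pretty.json > ground-truths.jsonl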

Using Speechly On-device

Transcribe the test dataset using Speechly On-device by specifying a model bundle file:

speechly transcribe test-dataset.jsonl -m YOUR_MODEL_BUNDLE.ort.bundle > ground-truths.jsonl
Enterprise feature

On-device transcription is available on Enterprise plans.

Prepare ground truth transcripts

Go through the contents of ground-truths.jsonl, fix any transcription errors you find, and rename the hypothesis key to transcript:

ground-truths.jsonl
-{"audio": "podcast1.wav", "transcript": "welcome to another episode of the speechly podcast where you can expect conversations express"}
+{"audio": "podcast1.wav", "transcript": "welcome to another episode of the speechly podcast where you can expect conversations"}
{"audio": "podcast2.wav", "transcript": "and so in the original siri there were about i don't know fifty different partners and"}
-{"audio": "podcast3.wav", "transcript": "this concept of voice being an expert ui could you might be unpack that concept a little bit"}
+{"audio": "podcast3.wav", "transcript": "this concept of voice being an expert ui could you maybe unpack that concept a little bit"}
{"audio": "podcast4.wav", "transcript": "well again thanks so much for the time adam really enjoyed the discussion"}

In this example, podcast1.wav and podcast3.wav had minor transcription errors that needed fixing. The other transcripts were correct and required no action.

Gotcha

The results from the transcribe command are returned as (audio, hypothesis) key-value pairs. Ground truths must be defined as (audio, transcript) key-value pairs.
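
If you prefer not to rename the key on every line by hand, one way to do it in bulk with jq (the output filename fixed.jsonl is just an example):

jq -c '{audio, transcript: .hypothesis}' ground-truths.jsonl > fixed.jsonl

You still need to correct the transcription errors themselves manually.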

Compute ASR accuracy

Compute ASR accuracy by running:

speechly evaluate asr YOUR-APP-ID ground-truths.jsonl

The command outputs:

Audio: podcast1.wav
└─ Ground truth: welcome to another episode of the speechly podcast where you can expect conversations********
└─ Prediction: welcome to another episode of the speechly podcast where you can expect conversations express

Audio: podcast3.wav
└─ Ground truth: this concept of voice being an expert ui could you may***be unpack that concept a little bit
└─ Prediction: this concept of voice being an expert ui could you might be unpack that concept a little bit

Word Error Rate (WER): 0.04 (3/68)

WER is computed by comparing your ground truth transcripts against Speechly Cloud generated transcripts. Speechly Cloud uses the latest version of your application.
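
WER is the number of word-level errors in the predictions (substitutions, deletions and insertions) divided by the number of words in the ground truth. The asterisks in the output above mark alignment padding at the points where the two differ: the extra express in podcast1.wav and the maybe versus might be mismatch in podcast3.wav add up to 3 errors over 68 ground truth words, giving 3/68 ≈ 0.04.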

Improving ASR accuracy

Transcription errors can be mitigated by ensuring that the misrecognized utterances appear in your training data. Text adaptation will get you quite far, but many tricky cases benefit from Audio adaptation.

After fixing the issues, redeploy your application. You can run the evaluation again with the same ground-truths.jsonl to see if the ASR accuracy improved.