
Evaluate NLU accuracy

Evaluate the NLU accuracy of your application using Speechly CLI.


Evaluating NLU accuracy helps you verify that your test utterances are recognized and annotated correctly. When building large-scale voice applications, continuously evaluating the application's NLU accuracy becomes an important part of the workflow and helps you identify weaknesses in your training data.

Evaluation consists of four steps:

  1. Define your test utterances
  2. Annotate the test utterances
  3. Prepare ground truth annotations
  4. Compute NLU accuracy

In this guide we'll be using a simple clothing store as an example.

Define test utterances

Test utterances are written versions of things your users say to your application. You can write your own test utterances, or use real user utterances available from the User data tab in Speechly Dashboard or by using the utterances command.

Create a text file with a single utterance on each line without any annotations:

can i see blue sneakers
i want two of those
my address is one twenty three imaginary road
delivery tomorrow
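
Before annotating, it can help to check that the file really contains one plain utterance per line. The following is a minimal sketch, not part of Speechly CLI: the function name find_problem_lines is hypothetical, and it assumes that the characters used by the annotation syntax (*, [, ], (, ), |) never appear in a plain utterance.

```python
import re

# Characters used by the annotation syntax shown in this guide.
# Assumption: plain test utterances never contain them.
ANNOTATION_CHARS = re.compile(r"[*\[\]()|]")


def find_problem_lines(text):
    """Return (line_number, reason) pairs for lines that look wrong."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "empty line"))
        elif ANNOTATION_CHARS.search(line):
            problems.append((number, "contains annotation characters"))
    return problems
```

An empty result means every line passed both checks. To check a file, pass its contents, for example find_problem_lines(open("test-utterances.txt").read()).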

Annotate test utterances

Annotate your test utterances by running:

speechly annotate YOUR-APP-ID -i test-utterances.txt > ground-truths.txt

The resulting ground-truths.txt contains the test utterances as annotated by Speechly Cloud:

*search can i see [blue|blue](color) [sneakers|sneakers](product)
*add_to_cart i want two of those
*set_address my address is [one twenty three imaginary road|123 Imaginary Rd.](address)
*set_delivery_date delivery [tomorrow|2023-07-18](delivery_date)

The annotations are given in a syntax similar to SAL, with the exception that each entity has two values separated by a | character. The first value is the original transcript, while the second is the post-processed value, which depends on the Entity data type. If an entity has type string, the transcript and the post-processed value are equal.
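
To make the two-value entity syntax concrete, here is a minimal parsing sketch. It is not Speechly's own parser: the function name parse_annotation and the regular expression are assumptions based only on the example lines above, and they cover just the *intent prefix and [transcript|value](entity) form shown in this guide.

```python
import re

# Matches [transcript|value](entity_name) as it appears in the examples above.
ENTITY = re.compile(r"\[([^|\]]*)\|([^\]]*)\]\(([^)]*)\)")


def parse_annotation(line):
    """Split one annotated line into its intent and a list of entities."""
    # The first token is the intent, prefixed with "*".
    intent, _, rest = line.strip().partition(" ")
    entities = [
        {"transcript": transcript, "value": value, "type": name}
        for transcript, value, name in ENTITY.findall(rest)
    ]
    return intent.lstrip("*"), entities
```

For the set_address line above, this returns the intent set_address and one entity whose transcript is "one twenty three imaginary road" and whose value is "123 Imaginary Rd.".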

Using reference dates

Entities with type date let users utter dates relative to the current date. For example, the utterance “today” normally resolves to the current date. The post-processed value is therefore affected by the day on which you annotate the utterance.

You can specify a reference date when annotating to fix the resolved dates:

speechly annotate YOUR-APP-ID -i test-utterances.txt --reference-date <YYYY-MM-DD>
# ...
*set_delivery_date delivery [tomorrow|<YYYY-MM-DD>](delivery_date)
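
The effect of a reference date can be illustrated with a small sketch. This is only an illustration of the idea, not how Speechly resolves dates internally: the function resolve_relative_date and the three supported words are assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical mapping from a relative-date word to an offset in days.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_relative_date(word, reference):
    """Resolve a relative-date word against a fixed reference date."""
    resolved = reference + timedelta(days=RELATIVE_OFFSETS[word])
    return resolved.isoformat()


# The same word resolves differently depending on the reference date.
print(resolve_relative_date("tomorrow", date(2023, 7, 17)))  # 2023-07-18
print(resolve_relative_date("tomorrow", date(2023, 8, 17)))  # 2023-08-18
```

With a fixed reference date the resolved value stays the same no matter when the annotation is run, which is what keeps ground truths comparable over time.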

Prepare ground truth annotations

Go through the contents of ground-truths.txt and fix all errors in intents or entities that you might find:

*search can i see [blue|blue](color) [sneakers|sneakers](product)
-*add_to_cart i want two of those
+*add_to_cart i want [two|2](amount) of those
*set_address my address is [one twenty three imaginary road|123 Imaginary Rd.](address)
*set_delivery_date delivery [tomorrow|2023-07-18](delivery_date)

In this example the word "two" was not recognized as an amount entity, so the annotation needs to be added by hand. The other entities are correctly annotated and require no action.

Compute NLU accuracy

Compute NLU accuracy by running:

speechly evaluate nlu YOUR-APP-ID ground-truths.txt

The command outputs:

Line: 2
└─ Ground truth: *add_to_cart i want [two|2](amount) of those
└─ Prediction: *add_to_cart i want *two*********** of those

Line: 4
└─ Ground truth: *set_delivery_date delivery [tomorrow|2023-07-18](delivery_date)
└─ Prediction: *set_delivery_date delivery [tomorrow|2023-08-18](delivery_date)

Accuracy: 0.50 (2/4)

Accuracy is computed by comparing your ground truth annotations against the annotations generated by Speechly Cloud, which uses the latest version of your application. If a prediction doesn't match the ground truth exactly, it is counted as a miss, even if the entity itself was recognized (as on line 4, where only the resolved date differs).
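
The exact-match scoring described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the CLI's implementation: the function name nlu_accuracy is hypothetical, and predictions are assumed to be available as annotated lines in the same syntax as the ground truths.

```python
def nlu_accuracy(ground_truths, predictions):
    """Return (accuracy, missed_line_numbers) using exact line matching."""
    if len(ground_truths) != len(predictions):
        raise ValueError("ground truths and predictions must have equal length")
    if not ground_truths:
        raise ValueError("at least one utterance is required")

    # A line counts as a hit only if the whole annotation matches exactly.
    misses = [
        number
        for number, (truth, prediction) in enumerate(
            zip(ground_truths, predictions), start=1
        )
        if truth.strip() != prediction.strip()
    ]
    hits = len(ground_truths) - len(misses)
    return hits / len(ground_truths), misses
```

Applied to the four example utterances, with the amount entity missing on line 2 and a different resolved date on line 4, this yields an accuracy of 0.5 and the missed lines 2 and 4.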

Improving NLU accuracy

In the first case the prediction is missing the amount entity. This type of error can be mitigated by ensuring that the utterance appears in your training data and is properly annotated.

In the second case the delivery_date entity was correctly recognized (from the word "tomorrow") but the resolved date didn't match. Since relative dates are resolved against the current date by default, it is a good idea to annotate all relative dates against a fixed reference date.

Text adaptation will get you quite far, but many tricky cases benefit from Audio adaptation.

After fixing these issues, redeploy your application. You can run the evaluation again with the same ground-truths.txt to see if the NLU accuracy improved.