Transcribe Audio and Identify Speakers

When data sources are audio files (e.g. depth interviews, customer calls, focus groups), listening to each file to take notes and identify key items takes a huge amount of time. The manual process is not only time-consuming but also does not scale across projects. Read more about the audio use case here.

Relevance AI's Transcribe Audio and Identify Speakers workflow provides you with:

  • Audio transcription
  • Speaker diarization
  • Utterance extraction

When the workflow finishes successfully, a new dataset is added to your account.

How to use Audio Core

  1. Upload your audio files to the Relevance AI platform using Upload media. Alternatively, if you have already stored your audio files on the web (i.e. you can access them via an http... URL), include their URLs in a CSV file (a sample here) and upload the CSV to Relevance AI.

Note: When uploading your media files to Relevance AI, if your audio files are large, allow time for the upload process to finalize.

As a result, you will have a dataset in which each entry represents an audio file including a URL to where the file is uploaded.
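
If your audio files are already hosted on the web, the CSV mentioned in step 1 can be generated with a short script. The following is a minimal sketch only: the column name audio_url and the example.com URLs are assumptions made for illustration, not names required by Relevance AI. Whatever column name you use is the field you later select in the setup wizard.

```python
import csv
import io


def build_url_csv(urls, column="audio_url"):
    """Return CSV text with a header row and one row per audio URL.

    The column name is a hypothetical choice, not a platform requirement.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column])
    for url in urls:
        # Only web-accessible URLs make sense here, so reject anything else.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not a web URL: {url!r}")
        writer.writerow([url])
    return buffer.getvalue()


# Placeholder URLs for illustration only.
csv_text = build_url_csv([
    "https://example.com/audio/interview_01.mp3",
    "https://example.com/audio/interview_02.mp3",
])
print(csv_text)
```

Write csv_text to a file ending in .csv and upload that file to Relevance AI.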

  2. Once your dataset is ready, locate "Transcribe Audio and Identify Speakers" under Workflows.
Relevance AI - Access to Transcribe Audio and Identify Speakers workflow

Follow the steps in the setup wizard:

  • Select the field that contains the URLs to your audio file(s)
  • Select "Utterance" as the analysis mode
  • Enter a name under which the resulting dataset will be saved
Relevance AI - Audio Core setup wizard

  • Click on Run workflow

Note: You can track progress under workflow history, or wait for the email notification that is sent when the workflow is finalized.

Section markers are a very useful tool for processing focus group data. The audio is broken into sections according to the identifier phrases that moderators use.

Simply enter each identifier phrase and the corresponding Topic for the section.
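
To make the idea of section markers concrete, here is a small illustrative sketch of how identifier phrases could map utterances to topics. It is not the platform's implementation: the function label_sections, the "Topic" key, the default label and the simple case-insensitive substring match are all assumptions chosen for the example.

```python
def label_sections(utterances, markers, default_topic="Unlabelled"):
    """Tag each utterance with the topic of the most recent marker phrase.

    utterances: list of dicts with a "Text" key, in spoken order.
    markers: dict mapping an identifier phrase to its topic.
    """
    current = default_topic
    labelled = []
    for utterance in utterances:
        text = utterance["Text"].lower()
        for phrase, topic in markers.items():
            # A marker phrase starts a new section from this utterance on.
            if phrase.lower() in text:
                current = topic
                break
        labelled.append({**utterance, "Topic": current})
    return labelled


# Made-up focus group excerpt for illustration.
utterances = [
    {"Speaker": "A", "Text": "Welcome everyone."},
    {"Speaker": "A", "Text": "Let's move on to pricing."},
    {"Speaker": "B", "Text": "I think it is too expensive."},
    {"Speaker": "A", "Text": "Let's move on to packaging."},
    {"Speaker": "C", "Text": "The box is hard to open."},
]
markers = {
    "move on to pricing": "Pricing",
    "move on to packaging": "Packaging",
}
topics = [u["Topic"] for u in label_sections(utterances, markers)]
print(topics)
```

In this sketch, every utterance before the first marker phrase keeps the default label, and each marker phrase applies to the utterance that contains it and to all following utterances until the next marker.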

Audio Core workflow outputs

After you complete the setup wizard and the workflow runs successfully, a new dataset appears in your account, named either <Original-Dataset-Name>_utterance or <the new entered name in the setup>_utterance. For instance, if the original dataset is named audio_dataset, a new dataset named audio_dataset_utterance is added to your account.

This new dataset includes the following main fields/columns:

  • Text: transcription of the audio file(s)
  • Speaker: A, B, C, etc. labels assigned to the voices heard in the audio
  • Start: time in the audio indicating the beginning of a spoken piece (Utterance)
  • End: time in the audio indicating the end of a spoken piece (Utterance)
  • File Name: original file name

Note 1: Transcription can take a long time depending on the size of your audio file (for example, 1.5 hours of audio takes around 20 minutes).
Note 2: You might need to refresh the page for <Original-Dataset-Name>_utterance to appear under Datasets.



Although the audio analysis results are also written back to the original dataset, use the new dataset, <Original-Dataset-Name>_utterance, for further processing of your data.

Common questions

What do I do next?

After the workflow runs successfully, the audio is transcribed and chunked in three different ways. The results are saved as datasets under your account. You can treat the data as text and apply workflows such as AI Clustering and AI Tagging.

Can I split focus groups into their component sections?

If moderators use a section identifier, the "Label topics in focus-group transcription" workflow can mark the sections for you.

What are your tips for the most efficient way to get high-level themes?

The chapter dataset includes Gist and Summary fields, which provide a high-level analysis of the themes. For a deeper understanding of the data, we recommend applying AI Clustering and AI Tagging. Don't forget to switch to the <Original-Dataset-Name>_utterance dataset.

What is the best tool to study and visualize the results?

Relevance AI's Explorer is a great tool that provides a variety of configurable data views as well as search and filtering on your data.

How do I extract the transcription?

This can be done directly through an export workflow or on the Explorer dashboard. For the latter, set up a categorical view (you can choose any category, such as the Speaker field), then use export, which generates a downloadable CSV file containing the fields you select.

Note 1: Utterances in the downloaded file might not be in order. We recommend including the Start field in your export, so you can sort the data accordingly.

Note 2: If you are working with multiple audio files, you can use the File Name field to separate the transcriptions.
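
The two notes above can be combined in a short post-processing script. This is a sketch under stated assumptions: it presumes the exported CSV uses the column headers File Name, Speaker, Start and Text exactly as the fields are named in this guide, and that Start is a plain number (for example seconds). Check your own export and adjust the names or the parsing if they differ. The sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def transcripts_by_file(csv_text):
    """Group exported utterances by file and order them by start time."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["File Name"]].append(row)

    transcripts = {}
    for file_name, rows in grouped.items():
        # Assumes Start parses as a number; rows may arrive out of order.
        rows.sort(key=lambda r: float(r["Start"]))
        transcripts[file_name] = [f'{r["Speaker"]}: {r["Text"]}' for r in rows]
    return transcripts


# Made-up export with rows deliberately out of order.
sample_export = """File Name,Speaker,Start,End,Text
call_1.mp3,B,4.2,6.0,Fine thanks.
call_2.mp3,A,0.0,1.5,Hello.
call_1.mp3,A,0.5,3.9,How are you?
"""

result = transcripts_by_file(sample_export)
for file_name, lines in result.items():
    print(file_name)
    for line in lines:
        print("  " + line)
```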