How to Transcribe a Video with 97% Accuracy Using Python

Jacob Naryan - Full-Stack Developer

Posted: Sat Aug 05 2023

Last updated: Wed Nov 22 2023

Transcribing videos can be a time-consuming task, especially if you have a lot of content to go through. Fortunately, you can use Python and some open-source libraries to automate the process and achieve high accuracy rates. In this tutorial, we’ll show you how to transcribe a video with 97% accuracy using just 15 lines of Python.

Prerequisites

Before we get started, you’ll need to have Python installed on your computer, as well as a few libraries that we’ll be using. To install the necessary libraries, run the following commands in your terminal:

pip install SpeechRecognition
pip install pydub

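SpeechRecognition is a library that allows you to perform speech recognition on audio files, and pydub is a library that allows you to work with audio files in a variety of formats. Note that pydub relies on ffmpeg (or libav) to decode anything other than WAV, so ffmpeg must also be installed and available on your PATH before pydub can read an MP4 file.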
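One way to confirm that the external decoder is available before running the rest of the tutorial is a quick lookup on the PATH. This is a minimal sketch using only the standard library; the helper name `find_decoder` is hypothetical, and it assumes the binary is called `ffmpeg` or `avconv`, the two names pydub looks for.

```python
import shutil


def find_decoder(candidates=("ffmpeg", "avconv")):
    """Return the path of the first decoder binary found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    decoder = find_decoder()
    if decoder is None:
        print("No ffmpeg/avconv binary found on PATH; pydub cannot decode MP4 files.")
    else:
        print(f"Decoder found at: {decoder}")
```

If the check prints that nothing was found, install ffmpeg with your system's package manager and rerun it.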

Transcribing the Video

The first step in transcribing a video is to extract the audio from the video file. For this tutorial, we’ll be using an MP4 file, but you can use other formats as well. The code to extract the audio and convert it to a WAV file is as follows:

import speech_recognition as sr
from pydub import AudioSegment
import os

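# Load the audio track from the video file (pydub decodes MP4 through ffmpeg)
video = AudioSegment.from_file("video.mp4", format="mp4")
# Convert to mono, 16 kHz, 16-bit samples
audio = video.set_channels(1).set_frame_rate(16000).set_sample_width(2)
# Export the result as a WAV file
audio.export("audio.wav", format="wav")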

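In this code, we’re using the AudioSegment class from pydub to load the video file and extract the audio. We’re then setting the audio to mono, 16 kHz, and 16-bit, a compact format that works well for speech recognition. SpeechRecognition does not strictly require these settings; its AudioFile class reads PCM WAV, AIFF, and FLAC files, so the part it really needs is the WAV export. Finally, we’re exporting the audio as a WAV file.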
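To confirm that the exported file really has the intended format, the WAV header can be read back with Python's built-in wave module. This is a small sketch, not part of the original script; the function name `wav_properties` is hypothetical, and it assumes the file is an uncompressed PCM WAV like the one exported above.

```python
import os
import wave


def wav_properties(path):
    """Return (channels, sample_rate_hz, bits_per_sample) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth() * 8


if __name__ == "__main__":
    # "audio.wav" is the file name used in this tutorial.
    if os.path.exists("audio.wav"):
        channels, rate, bits = wav_properties("audio.wav")
        print(f"{channels} channel(s), {rate} Hz, {bits}-bit")
```

For the file produced in this tutorial, the three values should be 1 channel, 16000 Hz, and 16 bits.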

Now that we have the audio file, we can use the SpeechRecognition library to transcribe it. The code to do this is as follows:

# Initialize the recognizer
r = sr.Recognizer()

# Open the audio file and read its entire contents into memory
with sr.AudioFile("audio.wav") as source:
    audio_text = r.record(source)

# Recognize the speech using the Google Web Speech API (requires an internet connection)
text = r.recognize_google(audio_text, language='en-US')

In this code, we’re initializing the Recognizer class from SpeechRecognition and opening the audio file. We’re then using the record method to read the whole file and store the audio data in the audio_text variable (despite its name, it holds audio, not text). Finally, we’re using the recognize_google method, which sends the audio to Google’s Web Speech API and therefore needs an internet connection, to transcribe the audio and store the result in the text variable.
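Sending a long recording in a single request can be slow or fail, so one possible variation is to transcribe the audio in shorter pieces and handle recognition errors per piece. The sketch below is an assumption-laden illustration, not the method used above: the helper names `chunk_bounds` and `transcribe_in_chunks` and the 30-second chunk length are hypothetical choices, and the WAV file is assumed to be mono, as exported earlier. Cutting at fixed intervals can split a word in two, so treat the result as a rough draft.

```python
def chunk_bounds(total_ms, chunk_ms=30_000):
    """Return (start, end) millisecond pairs that cover total_ms without gaps."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


def transcribe_in_chunks(wav_path, chunk_ms=30_000, language="en-US"):
    """Transcribe a mono WAV file piece by piece using the Google Web Speech API."""
    # Imported here so chunk_bounds can be used without the audio libraries.
    import speech_recognition as sr
    from pydub import AudioSegment

    audio = AudioSegment.from_wav(wav_path)
    recognizer = sr.Recognizer()
    pieces = []
    for start, end in chunk_bounds(len(audio), chunk_ms):  # len() is in milliseconds
        chunk = audio[start:end]  # pydub slices by milliseconds
        data = sr.AudioData(chunk.raw_data, chunk.frame_rate, chunk.sample_width)
        try:
            pieces.append(recognizer.recognize_google(data, language=language))
        except sr.UnknownValueError:
            # Raised when no speech could be understood in this piece
            pieces.append("[unintelligible]")
        except sr.RequestError as error:
            # Raised when the speech service cannot be reached or rejects the request
            raise RuntimeError(f"Speech service request failed: {error}") from error
    return " ".join(pieces)
```

Calling `transcribe_in_chunks("audio.wav")` would return one string, with a placeholder wherever a piece could not be understood.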

Saving the Transcript

The final step is to save the transcript to a file. The code to do this is as follows:

# Save the transcript to a text file
file_name = "transcription.txt"

with open(file_name, "w", encoding="utf-8") as file:
    # Write to the file
    file.write(text)

# Open the file for editing by the user (Windows only; use "open" on macOS or "xdg-open" on Linux)
os.system(f"start {file_name}")

In this code, we’re creating a new file named transcription.txt and writing the transcript to it. We’re then using the os library to open the file for editing by the user. The start command is specific to Windows, so you will need to adjust this line on other operating systems: macOS uses open and most Linux desktops use xdg-open.
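If the script needs to run on more than one operating system, the choice of command can be made at runtime. This is a minimal sketch under the assumption that the standard launchers are present (start on Windows, open on macOS, xdg-open on Linux); the function names `viewer_command` and `open_for_editing` are hypothetical.

```python
import platform
import subprocess


def viewer_command(path, system=None):
    """Build the command that opens path in the default application."""
    system = system or platform.system()
    if system == "Windows":
        # The empty string is the window title argument expected by "start"
        return ["cmd", "/c", "start", "", path]
    if system == "Darwin":  # macOS
        return ["open", path]
    return ["xdg-open", path]  # assumed default for Linux desktops


def open_for_editing(path):
    """Open path with the default application for the current platform."""
    subprocess.run(viewer_command(path), check=False)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.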

Conclusion

That’s it! With just 15 lines of Python, we’ve transcribed a video with 97% accuracy. Of course, the accuracy you get will depend on the recording itself. For the best results, use clear audio without a lot of layered sounds or background noise.
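To check the accuracy on your own recordings, you can compare the generated transcript against a hand-corrected reference. A common measure is word error rate (WER): the number of word substitutions, insertions, and deletions divided by the number of words in the reference, with accuracy often quoted as roughly 1 minus WER. The sketch below is one simple way to compute it; the function name `word_error_rate` is hypothetical, and it only lowercases and splits on whitespace, so punctuation differences count as errors.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the quick brown fox", "the quick brown box")
    print(f"WER: {wer:.2f}")  # one wrong word out of four
```

On this measure, 97% accuracy corresponds to about three word errors per hundred words of reference text.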


Thank you for reading. If you liked this blog, check out my personal blog for more content like this.

Need a developer?

Hire me for all your Web Development needs.