Debian 12 Linux Transcription server (Python, VOSK, Script)
When you need to convert a large number of audio files into text, you can use Vosk, an offline speech recognition toolkit that ships as a Python package and offers models for more than 20 languages. This conversion is called transcription. Search engines index audio and video files by their descriptions and tags rather than by their content, so it is often useful to publish a text version alongside them. In this guide, all commands are run as root; if you are working as a regular user, prefix them with sudo.
Most frequently asked questions:
We want all the data to stay with us. Can you set all of this up on our equipment?
Yes, you can order the installation and configuration of this setup on your equipment using the link.
1. Install the necessary packages
apt update
apt install python3 python3-pip ffmpeg unzip rename
apt install python3.11-venv
python3 -m venv .venv
2. Install the Vosk utility
source .venv/bin/activate
pip3 install vosk
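Before moving on, it can help to confirm that the tools installed in steps 1 and 2 are reachable from the current shell. The following is a minimal sketch; the helper name require_cmd is hypothetical and is not part of Vosk.

```shell
#!/bin/sh
# Hypothetical helper: report whether each named command is on PATH.
# Returns 0 only if every command was found.
require_cmd() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "OK:      $cmd"
        else
            echo "MISSING: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

With the virtual environment active, it could be called as: require_cmd python3 ffmpeg unzip rename vosk-transcriber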
3. Download the model
cd /opt
wget https://alphacephei.com/vosk/models/vosk-model-ru-0.42.zip
unzip vosk-model-ru-0.42.zip
Note 1: We chose the full model. If the server has less than 8 GB of RAM, this model will not work and will fail with an error.
In that case, download the small version of the model, which is less demanding on server resources:
cd /opt
wget https://alphacephei.com/vosk/models/vosk-model-small-ru-0.22.zip
unzip vosk-model-small-ru-0.22.zip
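The choice described in Note 1 can be sketched as a small check of the installed RAM. This is only an illustration: the function names are hypothetical, the 8 GB threshold is taken from Note 1, and the model paths assume both archives were unpacked in /opt as shown above.

```shell
#!/bin/sh
# Sketch: choose a model directory from the amount of installed RAM.
# The 8 GB threshold comes from Note 1; the function names are hypothetical.

# Print total RAM in kB from a meminfo-style file (default: /proc/meminfo).
total_ram_kb() {
    awk '/^MemTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Print the model path suited to the given amount of RAM (in kB).
choose_model() {
    threshold_kb=$((8 * 1024 * 1024))   # 8 GB expressed in kB
    if [ "$1" -ge "$threshold_kb" ]; then
        echo "/opt/vosk-model-ru-0.42"
    else
        echo "/opt/vosk-model-small-ru-0.22"
    fi
}
```

For example: choose_model "$(total_ram_kb)". A machine sold as 8 GB usually reports slightly less than that in MemTotal, so this sketch would pick the small model there; adjust the threshold if the full model turns out to run on your server.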
4. Check that it works using the following command syntax
vosk-transcriber -i audio_file -o text_file -m path_to_model
vosk-transcriber -i /root/test.mp3 -o /root/test.txt -m /opt/vosk-model-ru-0.42
For the small model, respectively:
vosk-transcriber -i /root/test.mp3 -o /root/test.txt -m /opt/vosk-model-small-ru-0.22
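One simple way to confirm that the test run produced something is to look at the output file itself. The sketch below assumes only that a successful run leaves a non-empty text file; the helper name check_transcript is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the transcript exists and is not empty,
# and print how many words it contains.
check_transcript() {
    if [ ! -s "$1" ]; then
        echo "Transcript missing or empty: $1" >&2
        return 1
    fi
    echo "$1: $(wc -w < "$1" | tr -d ' ') words"
}
```

For the test file above this would be: check_transcript /root/test.txt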
If you have several files, the following script, which we found on the Internet, processes an entire folder of audio files and transcribes each of them automatically; you only have to wait for processing to complete:
touch transcribe.sh
chmod +x transcribe.sh
nano transcribe.sh
#!/bin/bash
errmsg="USAGE: sh transcribe.sh SRCPATH DSTPATH VOSKMODELPATH"
## check that all three arguments were given
if [ -n "$1" ]; then
    srcpath=$1
    echo "SOURCE PATH: $srcpath"
else
    echo "No source path entered" >&2
    echo "$errmsg"
    exit 2
fi
if [ -n "$2" ]; then
    dstpath=$2
    echo "DESTINATION PATH: $dstpath"
else
    echo "No destination path entered" >&2
    echo "$errmsg"
    exit 2
fi
if [ -n "$3" ]; then
    modelpath=$3
    echo "VOSK MODEL PATH: $modelpath"
else
    echo "No VOSK language model path entered" >&2
    echo "$errmsg"
    exit 2
fi
startdate=$(date)
mkdir -p "$dstpath" ## create the destination directory if it does not exist
## replace spaces with underscores in file names in the source directory
find "$srcpath" -maxdepth 1 -name "* *" -type f | rename 's/ /_/g'
## count the files at the top level of the source directory
famount=$(find "$srcpath" -maxdepth 1 -type f | wc -l)
echo "Found $famount files"
i=0
for f in "$srcpath"/*; do
    [ -f "$f" ] || continue ## skip subdirectories and an empty source directory
    i=$(( i + 1 ))
    echo "Transcribing ${f##*/} ($i/$famount)"
    vosk-transcriber -m "$modelpath" -i "$f" -o "$dstpath/${f##*/}.txt" >/dev/null 2>&1
    ## remove >/dev/null 2>&1 to display recognition status
    ## rm "$f" ## uncomment to remove the original file
done
fready=$(find "$dstpath" -type f | wc -l)
echo "DONE. Transcribed $fready of $famount files"
enddate=$(date)
echo "STARTED AT: ${startdate}"
echo "ENDED AT: ${enddate}"
Syntax for using the script:
sh transcribe.sh path_to_sources_files path_to_result path_to_model
sh transcribe.sh /root/audio/ /root/text/ /opt/vosk-model-ru-0.42/
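The script names each transcript by appending .txt to the full file name, so test.mp3 becomes test.mp3.txt. If you would rather get test.txt, the mapping could be sketched as below; transcript_path is a hypothetical helper, not something the original script contains.

```shell
#!/bin/sh
# Hypothetical helper: map a source audio path to its transcript path,
# dropping the audio extension (test.mp3 -> test.txt).
# $1 = source file, $2 = destination directory
transcript_path() {
    name=${1##*/}      # strip the directory part
    name=${name%.*}    # strip the last extension, if any
    echo "${2%/}/$name.txt"
}
```

Inside the loop it could replace the -o argument, for example: -o "$(transcript_path "$f" "$dstpath")". Note that two sources with the same base name, such as a.mp3 and a.wav, would then write to the same transcript.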
Note 2: If you log in to the server via ssh again, the vosk-transcriber command becomes available only after you activate the Python virtual environment, so the sequence for starting a transcription is as follows:
source .venv/bin/activate
vosk-transcriber -i /root/test.mp3 -o /root/test.txt -m /opt/vosk-model-ru-0.42/
To process an entire folder:
sh transcribe.sh /root/audio/ /root/text/ /opt/vosk-model-ru-0.42/
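Commands that pip installs into a virtual environment can also be run by their full path without activating it, which is convenient for cron jobs and other non-interactive runs. The sketch below assumes the environment lives in /root/.venv (the guide creates .venv in the current directory, which is /root when working from root's home); VENV_DIR and find_transcriber are hypothetical names.

```shell
#!/bin/sh
# Sketch: locate vosk-transcriber without activating the virtual environment.
# VENV_DIR is an assumption; point it at the directory created by "python3 -m venv".
VENV_DIR=${VENV_DIR:-/root/.venv}

# Print the path of the transcriber: the venv copy if present,
# otherwise whatever is found on PATH.
find_transcriber() {
    if [ -x "$VENV_DIR/bin/vosk-transcriber" ]; then
        echo "$VENV_DIR/bin/vosk-transcriber"
    else
        command -v vosk-transcriber
    fi
}
```

A single file could then be processed with: "$(find_transcriber)" -i /root/test.mp3 -o /root/test.txt -m /opt/vosk-model-ru-0.42/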
More language models are available via the link; download and unpack them in the same way as described in step 3.
The server for transcribing audio files into text is ready to use. The quality of the transcription depends on the clarity of pronunciation and the quality of the recording itself. For example, it is noticeably higher for an audio fragment from an interview than for a song with music.