I few years ago, Google embedded speech recognition into the Google Chrome and the Chromium browser . Both implementation uses a hidden API, but anyone is able to access and utilize the API as a Speech To Text service.
The following is a short outline on how to use the Google Speech API.
Requirements
- A FLAC file that stores the recorded speech or a mp3 file
- Curl
installed (installed using e.g.
sudo apt-get install curl
)
Prepare the FLAC
In case you got your audio stored in a mp3 file or another audio format, then you’ll need sox to convert the file to a FLAC file.
Here is the command line I used to convert the first 15 seconds of an mp3 file into a FLAC file.
sox ~/speech.mp3 speech.flac trim 0 15
For some reason the Google Speech API only allows FLAC files upto 15 seconds.
Query the Google Speech To Text API
curl -v -i -X POST -H "Content-Type:audio/x-flac; rate=16000" \
-T speech.flac \
"https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US&maxresults=10&pfilter=0&xjerr=1"
The result looks like this:
{
"status":0,
"id":"f2661df1f2661df1f2661df1f2661df1124-2",
"hypotheses": [
{ "utterance":"this is a test for speech recognition","confidence":0.7654833},
{ "utterance":"this is a fest for speech recognition"}
]
}
Query Parameters
-H "Content-Type:audio/x-flac; rate=16000"
This tells the Speech API that we send a FLAC file with the bitrate of 16000 Hz.
-T speech.flac
This attaches the content of the speech.flac file to the HTTP POST
client
The client’s name you’re connecting from. For spoofing purposes,
let’s use chromium
lang
Speech language, for example, da-DK
for Danish, or en-US
for U.S. English
maxresults
Maximum results to return for the utterance.
pfilter
The porn filter ;-). Google (by default) censors the results, leading
to “Please search for ###” (pfilter!=0) instead of “Please search for
s e x” (pfilter=0).
xjerr
Tell speech recognition server to return errors as part of the JSON
response and not HTTP error codes