Just how much progress have computer systems made in understanding the human voice? I don’t mean recognizing specific commands, like the hands-free Bluetooth in your car. I’m looking for a tool an author can use to write a book without typing. I want to know how well a computer can understand people when they just talk to it. And the answer is, “not very well.”
State of the Art
When I started my research, I was surprised to learn that there are few players in the field. The only commercial product is Nuance’s Dragon Dictate for Mac (or NaturallySpeaking for Windows), which costs just over $100. The Mac has Dictation Speech to Text, built into the operating system since Mountain Lion in 2012. Windows has a built-in program called Speech Recognition. There are also several free online voice-to-text services, which RJ Crayton dealt with on Indies Unlimited last September.
I had a pragmatic (read: selfish) reason for this research. I am working on a project recording and publishing stories told by seniors. I was hoping technology would save me hours of transcribing the hundred or so stories I will collect. People who know anything about this subject will start chuckling about now, because I was dreaming.
As a test piece, I recorded my brother talking to me about an event from our childhood to see how it would transcribe. Then I read the passage aloud myself, using Dragon Dictate, which I had already initialized to understand my voice. Then I read the same passage into my Mac with Dictation running. Last, I played the original recording from my Zoom H5 digital recorder into both the Mac program and Dragon Dictate.
The original passage (punctuation added later):
Yeah, Gord, this story I’m going to tell was actually published in the Province newspaper. They used to have a little program for kids to write in from different parts of the province telling about what they were doing. This is around the time when I was 6 and you were 5 and it was in the spring of the year and we were living at home down by Highway 16, not in sawmill camp or tie hacking camp with our parents.
Trial 1: created by me, using Dragon Dictate after initializing the program to work for my voice, a process that involved reading aloud several short paragraphs given to me by the setup process.
Yeah, Gord, this story and going to tell was actually published in the Providence newspaper. They used to have a little program for kids to write in from different parts of the province telling about what they were doing. This is around the time when I was six and your five and it was in the spring of the year and we were living at home down by Highway 16 not in the sawmill account or tie hacking With her parents.
I type at a decent speed (most people talk at about 140 wpm; I type around 60), but I suspect that using this transcription and fixing the errors would take me about as long as keyboarding it.
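To put rough numbers on that hunch, here is a minimal Python sketch of the break-even arithmetic. It uses the two speeds quoted above (140 wpm speaking, 60 wpm typing). The function name and the idea of a "correction budget" are my own illustration, and the sketch assumes you can actually dictate at full talking speed, which is optimistic.

```python
def correction_budget_seconds(word_count, typing_wpm=60, speaking_wpm=140):
    """Seconds you can spend fixing errors before dictating
    takes longer than simply typing the same passage.

    Assumes dictation runs at normal talking speed (an assumption,
    not a measurement).
    """
    typing_minutes = word_count / typing_wpm
    speaking_minutes = word_count / speaking_wpm
    return (typing_minutes - speaking_minutes) * 60


# Budget for a 100-word passage, in seconds
print(round(correction_budget_seconds(100)))
```

With these figures, dictation buys a bit less than a minute of correction time per 100 words. A handful of errors that each need finding, selecting, and retyping can use that up quickly.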
Trial 2: my voice, using the Dictation Speech to Text software in the latest Mac OS, El Capitan.
This story I’m going to tell was actually published in the province newspaper. They used to have a little program for kids to write in from different parts of the province telling about what they were doing. This is a around the time when I was six and you were five and it was in the spring of the year and we were living at home Down by highway 16 not in sawmill Camp or Thai hacking camp with our parents.
Not too bad, although it couldn’t deal with “Yeah, Gord,” at all. More on “Thai hacking” later. Still not good enough to save me any time compared to entering it via keyboard. This program came free on my computer, so it was definitely cost-effective.
Trial 3: my brother’s voice with the Mac program, sent directly from the digital recorder to the computer.
This story to tell was Archie published in the province newspaper list of a little program for kids to right in front of the parts the province holder doing. This is around the time when I was six and you were five and I was the spring of year and we’re living at home Dumbo I was 16 not in the sawmill camper kayaking camp and with her parents.
Recognizable as the same text (barely), but for my purposes, completely useless.
Trial 4: from the same digital recording of my brother, using Dragon, not initialized for his speech.
So this store so that is published promised to stand for some of you will strengthen the “proximal other small talk, however is the smallest way of the surrounds are 65 that is here and the whole town is.
Even worse. One can only hope that the program will learn over time, but that becomes less cost-effective the more you have to mess around with it. I don’t plan to run 30 or 40 people through the initialization process.
I didn’t test Speech Recognition for Windows. It is Microsoft’s own built-in tool, and from the descriptions in other reviews, it sounds like a lighter version of what NaturallySpeaking offers.
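For anyone who wants a number rather than my impressions of the four trials, a common way to score a transcript is word error rate: the number of word substitutions, deletions, and insertions needed to turn the transcript back into the original, divided by the number of words in the original. What follows is a minimal Python sketch, not a polished tool. The function names are my own, it ignores punctuation and capitalization (I added the punctuation later anyway), and it treats “6” and “six” as different words unless you normalize them yourself.

```python
import re


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation and capitalization."""
    text = text.lower().replace("\u2019", "'")  # treat curly and straight apostrophes alike
    return re.findall(r"[a-z0-9']+", text)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference text has no words")
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j words of the transcript
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# The last phrase of the passage, as spoken and as Trial 2 heard it
original = "not in sawmill camp or tie hacking camp with our parents"
trial_2 = "not in sawmill Camp or Thai hacking camp with our parents"
print(f"{word_error_rate(original, trial_2):.0%}")
```

Pasting in the full original passage and each trial’s output would give one score per trial. The score can go above 100 percent when the transcript contains many extra words.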
Improvement Over Time
I wondered whether I could train the computer to stop writing “Thai hacking” or “kayaking” in place of the unusual expression “tie hacking” if I repeated the expression a few times in different contexts. This is how it went…
Hacking ties Thai hacking hack time Half time hacking ties hat ties hat ties hack ties hack ties hacking tires Thai hacking hat Pack hack hack hat Pack hat hi tie tack packing ties it’s Okay Josh
(That last bit was to my dog, who was worried because his master was shouting at the computer. Obviously the learning experience has to go both ways.)
Be warned: when you start trying to input sound from devices other than your computer’s built-in microphone, you run into technical problems. Several reviewers mentioned that a good mic helps in getting accurate results, and the built-in Mac microphone seems to be pretty good. I wanted to input pre-recorded stories from my Zoom H5 digital recorder, and that required a different input adapter for a newer iMac. My old MacBook Pro had no trouble once I figured out how to set up the Sound settings in System Preferences. A direct line from the digital recorder helps transfer good-quality sound, but of course the quality of the original recording (type of recorder, type of microphone, ambient noise, etc.) makes a difference. In my experience, you just have to keep trying different equipment until the product is good enough for your purposes. Or not, as the case may be.
The Learning Curve
As I worked with each program, I learned to speak more slowly and clearly, and was able to improve my (or the program’s) accuracy. Likewise, I assume the program will learn my voice and vocabulary better over time. Dragon has a “correct” function that allows you to go back and say “correct ‘Thai’,” which presumably also goes into the memory bank. I didn’t experiment with that.
Basically, Dragon’s program does more, but it is more complex to set up and learn. Mac’s is more intuitive and easier to use. Most reviewers say that both these programs improve as they get used to your voice and your usual vocabulary. I noted that under most conditions the computer could handle simple vocabulary like “This is around the time when I was six and you were five.”
I suspect that over time both the Dragon program and the Mac program could be taught to do a decent job on the voice of an individual. I think it’s a matter of your individual preferences. My other brother, who is a great storyteller but doesn’t like writing, swears by the Dragon program, because it allows him to just talk. I learned to type in Grade 10 because my handwriting was illegible, so I’m pretty well conditioned to letting my fingers do the walking. I don’t see myself switching.
In general, I think the software designers have a long way to go before a person can jump into a driverless taxi and say, with a strong foreign accent, “I wanna go to the nearest good sushi joint.”
And I’m going to be manually transcribing a lot of stories in the next few months.