By David Shamah

June 9, 2011, Updated September 12, 2012

Invented by a renowned pianist, this Israeli technology can add expressiveness to computerized voices and adjust it, making them sound happy, sad, angry or calm – just like the real thing.

Gershon Silbert used his experience as a pianist to create natural-sounding computerized voices.

The phone company AT&T brought many innovations to the world – phone operators, Princess telephones, the “Baby Bells” – and also those monotone computer-generated voices you hear on phones, computers and toys, sounding as unnatural as you could imagine. The name AT&T gave to this product, Natural Voices Text-to-Speech, is either someone’s idea of a joke or a case of wishful thinking.

“Those voices are not easy to take in long doses,” says Gershon Silbert, CEO of Israeli startup Vivotext, which is developing a more “human” approach for text to speech (TTS).

“There are many areas where more natural-sounding voices will have a major impact on the market, such as in interactive games, speech-enabled websites and audiobooks,” says Silbert.

Today, just 1.2 percent of all books are recorded. “This could be a multibillion-dollar industry, but you need expressiveness in order to make a book understandable, and currently the only way to provide that expressiveness is to hire humans to read and record books.”

Existing TTS technologies from top players — AT&T, Nuance, Loquendo and others — just wouldn’t cut it with listeners. Nor would those phony voices impress game players, toy makers and motorists who want to listen to their email while driving.

Like comparing a Model-T to a Caddy

Enter Vivotext, Silbert’s expressive solution to natural-sounding artificial voices. “Our text-to-speech technology is based on a multidisciplinary approach drawing on expertise from the fields of music performance analysis, phonetics, syntax, lexicography and digital signal processing [DSP],” says Silbert. “We have patents pending to cover our proprietary approach to expressiveness and the use of voice sample libraries.”

Comparing Vivotext to AT&T Natural Voices or Nuance RealSpeak is like comparing a Model-T to the latest Cadillac, says Silbert. The older model is scratchy, bumpy and barely functional, while the newer one is smooth, rides like a whisper and features the latest technology.

“Vivotext voices sound human because they are expressive. Our technology can add and adjust expressiveness to computerized voices, making them sound happy, sad, angry, calm, inquisitive – just like human voices,” says Silbert.

Capturing every nuance of expression

The secret lies in music: methods for converting musical scores into human-like expressive performances are applied to converting written text into natural-sounding speech. Just as variations in tempo, articulation and dynamics contribute to the effectiveness of a musical performance, speech attributes such as pitch, duration and amplitude are at the core of effective TTS, and are critical to conveying the full meaning of words and sentences.

Silbert knows music; he is a professional pianist who has cut several highly regarded albums. “We apply methods developed for music performance, called MOR (music objects recognition) to speech synthesis, and the result is highly intelligible enunciation and natural flow in a variety of speaking styles.”

Based on that technology, Vivotext has developed a large library of samples, applicable to any language, that allows programmers to load a range of emotions into the voice. Vivotext derives basic expression automatically from the phonetic, semantic and syntactic analysis of the text, determining, for example, whether the sentence is a statement or a question, simple or complex.

The analysis also takes into consideration additional expressive instructions provided by the use of punctuation, italics, underlining and capital letters. Expression is then determined by a speaking-style preference chosen from a menu. For example, the user can choose a “deliberate” style for news or an “enthusiastic” style for announcing the launch of a new product.
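For readers curious how such a pipeline might look in code, here is a rough Python sketch of the general idea described above: read cues from the text (end punctuation, capitalized words) and combine them with a chosen speaking style to produce a simple prosody plan. It is not Vivotext's method, which is proprietary. The function names, the `Prosody` structure and all numeric values are invented for illustration; only the style names “deliberate” and “enthusiastic” come from the article.

```python
from dataclasses import dataclass

# Hypothetical style presets. The style names come from the article;
# the numbers are invented purely for illustration.
STYLE_PRESETS = {
    "deliberate": {"rate": 0.9, "pitch_range": 0.8},
    "enthusiastic": {"rate": 1.1, "pitch_range": 1.3},
}


@dataclass
class Prosody:
    sentence_type: str   # "statement", "question" or "exclamation"
    rate: float          # relative speaking rate (1.0 = neutral)
    pitch_range: float   # relative pitch variation (1.0 = neutral)
    final_contour: str   # "rising" or "falling" at the end of the sentence
    emphasized: list     # words written in capitals, to be stressed


def classify_sentence(text: str) -> str:
    """Guess the sentence type from its final punctuation mark."""
    stripped = text.strip()
    if stripped.endswith("?"):
        return "question"
    if stripped.endswith("!"):
        return "exclamation"
    return "statement"


def emphasized_words(text: str) -> list:
    """Treat fully capitalized words of two or more letters as emphasized."""
    words = [w.strip(".,;:!?\"'") for w in text.split()]
    return [w for w in words if len(w) > 1 and w.isalpha() and w.isupper()]


def plan_prosody(text: str, style: str = "deliberate") -> Prosody:
    """Combine text cues with a style preset into a simple prosody plan."""
    preset = STYLE_PRESETS[style]
    kind = classify_sentence(text)
    # Simplification: every question gets a rising ending. Real speech is
    # subtler (many "who/what/where" questions end with a falling pitch).
    contour = "rising" if kind == "question" else "falling"
    return Prosody(kind, preset["rate"], preset["pitch_range"],
                   contour, emphasized_words(text))


if __name__ == "__main__":
    print(plan_prosody("Is this REALLY a computer?", style="enthusiastic"))
```

A real system would go much further, analyzing syntax and meaning and then driving pitch, duration and amplitude for each sound. The sketch only shows how cues in the text and a style chosen from a menu can feed into the same decision.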

Major deals in the works

Studies have shown that the more closely consumers can relate to the artificial voices that speak from GPS devices, phone information services, websites, games, cell phones and remote controls, the higher their opinion of the product – and the more likely they are to spend money on the product or service associated with that voice.

“The market for this is huge, with many billions at stake,” says Silbert. “We are the only ones who have succeeded in developing such an extensive human-like TTS, and the market is taking notice. Anyone who uses artificial voices in their products absolutely loves what we are doing.”

Vivotext has two deals pending – one with a large US toy manufacturer and another with a major US audiobook publisher.

Based in the Mofet B’Yehuda incubator near Jerusalem, Vivotext is funded by the incubator and is working on a funding deal with several independent investors. The management team consists of Silbert, CTO Dr. Yossef Ben-Ezra and chairman Samuel H. Solomon.