I did my first Power of Sound presentation in 1998. Not using any visual cues was unnerving to begin with, but practice and various memory training tricks helped make it fairly slick. By the time I’d delivered it maybe a dozen times, I was offered 30 minutes in front of the marketing director of a big non-radio-spending insurance brand. I was well rehearsed, but as I entered the boardroom at Capital he promptly announced that something had come up and I now had just 5 minutes to convince him of the creative potential of sound and radio. Branding is hugely important in industries where product features are easily copied. Brands are defined by how we feel towards them - the emotional connection. Sound is how we receive most of the emotional information in our lives. I looked at my playlist and said I wanted to play just two audio clips, back to back, to prove that point:
It worked; he agreed. In under 5 minutes sound had managed to take him to the opposite poles of human emotion. Genuine human emotion, delivered through the instrument we’ve evolved for that specific role: our voice.
There is a lot of chat around voice. It’s the hot topic of the moment, particularly with the impressive Duplex demo in the Google I/O keynote earlier this week.
Just after the UK launch of the Amazon Echo in the autumn of 2016, I wrote a blog post titled "Why it’s good to talk, trust, think and feel", in which I explored the origins of human speech and the potential for synthetic voices, and linked to WaveNet, the work of DeepMind. They have been part of Google since 2014 and are undoubtedly behind many of the impressive aspects of Duplex. It’s funny: as an audio creative I’ve always been drawn to natural, emotive vocal delivery, trying to distil and replicate its impact in my own presentations, and yet when it comes to the production of ads we often remove the imperfections, the umms, arrhs and breaths, unless it’s dialogue of course. But why shouldn’t they remain in some announcement-style, single-voice scenarios? If they need to be added to enable trust in the delivery of a synthetic voice, then perhaps we should be more forgiving of them in other circumstances.
The other noteworthy recent development in this area was the synthetic recreation of JFK’s voice to deliver the speech he never gave in Dallas in 1963, on the day he was assassinated. This was the work of the Edinburgh-based text-to-speech specialist CereProc.
There are some really interesting applications for this technology with A Million Ads, starting with simply testing how dynamic scripts might sound within our Studio pre-production, right through to voicing huge lists of store locations, retargeted product catalogues or all known first names, and even entire campaigns. The key creative aspects of believable synthetic voices are the same ones we deal with when ensuring that dynamic campaigns using human voices sound indistinguishable from non-dynamic, broadcast-style ads, particularly making sure dynamic edit points are compatible with the way we naturally merge sounds when we speak. Longer term, however, the idea of being able to synthetically sample and recreate people’s voices could have a profound effect on voice talent. I used Stuart, the CereProc synthetic voice we considered the most believable, for this Nissan Leaf pitch demo, which highlights the lack of emotional engagement.
Of course, synthetic voices currently lack genuine emotional delivery, but it would be naive not to consider their eventual improvement through artificial intelligence to the point where we can’t tell them apart from a human voice in certain circumstances. So we’re intrigued to experiment with synthetic voices to fully understand their capabilities as they develop. The future could involve applications for recreated synthetic voices of well-known people who have consented to such use. We can licence a David Bowie song for an ad campaign; will we eventually be able to have it voiced dynamically by Sir John Hurt?