
When technology unleashes creativity: the renaissance of the digital spoken word.

Video might have killed the radio star, but the spoken word is alive and thriving. A (re)discovery path between technology and fresh ideas.

SECTION III Technology drives the product palette and blurs the lines among traditional ways we define media artefacts.

Taxonomies do not fully capture the evolution of the digital spoken word. A couple of simple schematics are here to help. In the first one 👇, I segment voice-led editorial experiences into three categories:

  1. Pre-recorded & standalone: the “traditional” palette, including audiobooks, podcasts & co., with typical lean-back consumption and passive listening.
  2. Interactive & personalised: everything you can build by combining pre-recorded content (often atomised) into products users can interact with (e.g. meditation apps or Alexa skills).
  3. Live and social: despite the excessive hype around Clubhouse, this is the next big thing and will develop considerably in the coming months.

Technology also transforms the most traditional spoken-word products: for instance, narrated news articles are rapidly becoming the new consumption standard within mobile news apps.

The next table carries one key message: the lines between short, mid and long-form content, a taxonomy that belongs to the media tradition, are blurring. In the showcase section, I give examples of each of these:

SECTION IV A showcase.

None of the following examples is the first of its kind, but together they show how many more possibilities there are, and may inspire yours:

  1. Auditorial: voice and sounds for a more accessible web.
  2. Blurring the lines between podcasts and audiobooks.
  3. Narrated articles for extending your audience.
  4. Give content a second life with audio editions.
  5. Long-form journalism for the ears: Audm, Curio & Co.
  6. Atomising the radio news: the Swedish radio app.
  7. Atomic audio for the good vibes: Penguin Random House.
  8. Going live and social: what’s up after Clubhouse?


When we talk about the “accessible web”, we usually think of screen-reading software for the visually impaired. The Guardian Lab worked with Google and the Royal National Institute of Blind People (RNIB) to develop better formats, where spoken words, sounds and metadata provide a more compelling experience.

The resulting work, published in May 2021, is a 15-minute feature exploring the devastating effects of the climate crisis and human-induced environmental destruction through the sounds of the natural world. Metadata – ‘alt’ tags, for instance – are written more extensively than usual, to give screen-reading browsers something consistent with the prose of the text.

Voice recordings and sounds, together with the web text, build a podcast-like narrative that the screen-reading browser activates. More than the multimedia story itself, the relevant outcome is a set of lessons on how to use voice recordings in articles more effectively and how to treat metadata more carefully.


All the recent talk about podcasts sometimes overshadows the fact that the audiobook market is three times bigger than podcasting.

In the listener’s ears, however, the two products are coming closer. Whoever delivers the most interesting experience will also set the bar for consumers’ expectations. You do not need a traditional audio studio to design, edit and elegantly mix audio.

Here is the opportunity for audiobooks to go beyond the reading narrator and include other voice recordings (like many podcasts do), plus sounds and musical background.

Malcolm Gladwell, journalist, author and co-founder of the podcast studio Pushkin, published an outstanding example in April 2021. His latest title, The Bomber Mafia, was born as an audiobook and then adapted into a book/ebook.

The opening of the audiobook is impactful: sirens, buzzing airplanes, bomb explosions. It is the true story of a group of engineers, scientists and military leaders whose work led to one of the deadliest nights of the Second World War, the bombing of Tokyo in 1945, which many post-war analysts called a war crime. Gladwell drew on plenty of material from the military archives to develop this doc-fiction audiobook.

The Bomber Mafia is a sophisticated creation that lasts 5 hours and 15 minutes – the audio version of a binge-watching night. It is probably not something all publishers can afford yet: 95% of audiobooks still use only human voice recording, and 5% add minimal sound effects, to keep production costs to a minimum.

But the future might take two diverging directions: on one side, automated narration made with Amazon Polly and the like (as mentioned above); on the other, premium audio-cinematic productions like The Bomber Mafia.


Narrated articles are becoming a new standard for news consumption. They can live either in your favourite news apps, attached to the original article, or in separate sections or apps. They can be read by their authors, by professional speakers or automated with synthetic speech. The German publisher Axel Springer (Welt, Bild) is so committed to audio that it recently developed its own text-to-speech application, named Aravoices (launched in March 2021). The goal: to develop proprietary “brand voices” for all Axel Springer products and, in the future, to sell the solution to other media and brands as well.

You can serve many use cases with narrated articles. It is not only about mobile consumption or solving readability issues. 

The Canadian Globe and Mail has addressed a relevant use case. Its audio section is available in several languages (English, French and Mandarin Chinese) so as to serve – with automated translations and text-to-speech – a broader audience across Canada and overseas. Globe and Mail users can select their preferred language, choose a female or male voice and save audio articles for later, building personal playlists.

The Globe and Mail was one of the first big publishers to use the Amazon Polly voices, in 2019 (the latest was The Washington Post, in May 2021).


Making narrated articles is an investment in user convenience. It is also a chance to prolong the life cycle of feature content. The Economist developed its audio edition – an editorially curated selection of narrated articles for each weekly print edition – to solve a customer pain point: many readers could not cope with the daily abundance of content, often long reads, provided to them. And, while only 10% of The Economist app users listen to audio (source: Digiday), they represent the most loyal segment (source: Twipe Mobile). The audio edition, rolled out some years ago, proved its value in 2020: it saw a record number of streams/downloads and unique listeners.


Audm (US), Curio (UK) and Articly (DE) are platforms that turn long-form journalism into audio content. They license content from top-quality publishers and aggregate it into mobile apps offered on a subscription model (priced between 5 and 7 euros monthly).

Audm made news last year when The New York Times bought it. The service does not build on an automated text-to-speech solution, relying instead on actual voice actors. Its content lineup is quality-driven and ranges from The Atlantic to The New Yorker. New content goes online every day, making available the kind of pieces you would otherwise read days or weeks later: investigative journalism, background stories, narrative journalism.

For most publishers, syndicating to these aggregators is a field test to evaluate how to develop their own audio offerings: what content do people listen to most? And, if you take the road of audio-ready long-form journalism – again, blurring the lines with podcasts – how should your newsroom write those pieces?

Even assuming that narrated long-form articles can be an effective means to drive subscriptions and retention, this kind of journalism was not written to be read aloud. As podcasters know, sentences written for the ears are shorter, and concepts are explained more conversationally, so that the audience does not need to go back and re-read or re-listen.

That is something that Zetland, a Danish membership-based news startup, discovered early. The newsroom at Zetland tackled the issue from the beginning: it developed an original approach and optimised its writing both for the eyes and for the ears. The result is outstanding: Zetland reports a 90% completion rate for its audio stories!


Every time Big Tech comes out with its latest developments, publishers jump on the bandwagon, hoping to do good business there or to promote their products.

A good example of an Alexa app is Good Vibes by Penguin Random House. It provides a daily dose of inspiration with motivational quotes from bestselling books and authors. Users can listen to three quotes a day, learn more about the featured books, and receive book recommendations via email. Simple. Clever. Low-cost.

But there are also publishers that pursue more platform-independent approaches, like the state-owned Swedish Radio. When developing their new radio news app, they designed a completely new product based on micro-news clips (max 80 secs): atomised, interactive and powered by Artificial Intelligence.

The first thing users meet in the Sveriges Radio App is one playlist containing the latest 15 clips with the most relevant news from Sweden and the world.

The list updates continuously, so when you land there, you always get the latest news. The “playlist” format allows listeners to skip segments and navigate from one to the other.

More playlists are available, segmented by topic, geography, etc. To put together and dynamically update the lists, Sveriges Radio developed a simple algorithm.

Editors rate every new clip in the CMS according to three criteria: how “big” the story is, how relevant it is as public service, and its lifespan. This editorial assessment fuels the engine that selects which clips to add to or remove from the playlists at every update.
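The mechanics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sveriges Radio's actual algorithm: the `Clip` fields, the 1-5 rating scale, the linear freshness decay and the way the ratings are combined in `score` are all hypothetical. The only things taken from the description are the three editorial criteria and the idea of a playlist that is recomputed at every update.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Clip:
    clip_id: str
    published: datetime
    magnitude: int        # editor rating: how "big" the story is (assumed scale 1-5)
    public_service: int   # editor rating: public-service relevance (assumed scale 1-5)
    lifespan: timedelta   # editor estimate of how long the clip stays relevant


def score(clip: Clip, now: datetime) -> float:
    """Hypothetical score: editorial weight, fading linearly over the clip's lifespan."""
    age = now - clip.published
    if age < timedelta(0) or age >= clip.lifespan:
        return 0.0  # not yet published, or past its lifespan
    freshness = 1.0 - age / clip.lifespan  # 1.0 when new, approaching 0.0 at expiry
    return (clip.magnitude + clip.public_service) * freshness


def build_playlist(clips, now, size=15):
    """Return the ids of the top-scoring live clips, best first."""
    live = [c for c in clips if score(c, now) > 0.0]
    live.sort(key=lambda c: score(c, now), reverse=True)
    return [c.clip_id for c in live[:size]]
```

Calling `build_playlist` again at each update, with the current time, drops expired clips and lets fresh, highly rated ones rise to the top; the default of 15 mirrors the size of the main playlist mentioned earlier.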

Swedish Radio is building an entirely new editorial system: it promotes a new way to create content, short and standalone, and it generates assets potentially applicable to many other applications, including smart speakers. Future fit. And successful: over 200,000 people listen to the SR app every week, consuming around 3.5 million clips weekly. The editorial work has generated first lessons on listener retention: the first eight seconds matter. By that time, listeners decide whether to keep listening, skip the clip or abandon the playlist.


Buzzy and loved by venture capitalists, Clubhouse was the talk of the town in winter 2020/21. Celebrated as a novelty, it soon ran into problems when many sessions turned racist or sexist. Regardless of all this, Clubhouse has provided definitive proof that the time is ripe for a new dimension of social media: social audio.

To conclude this (re)discovery path into the digital spoken word, I share with you some considerations on social audio based on this thesis: social audio is nothing disruptive, but an evolution that many, if not most, current media can embrace.

Let us start from this: some forms of social audio have existed for years. The gaming community has long been using audio chats in parallel with its live gaming sessions. The gaming chat app Discord is more popular and has raised more money than Clubhouse by providing gamers with a well-designed product that translates spoken chats into text (solving a fundamental issue for video gamers, who have both hands busy while playing).

For most Millennials, ‘social network’ is not a synonym for Facebook but for WhatsApp audio messaging. The actual value of Clubhouse is to have magnified the attention of mass media and attracted big money, thus making room for a new gold rush for the next big thing in audio.

But, shall we expect one next big thing, or go for many focused, more vertical, more private and creator-led ways to bring live audio into our lives?

The newest developments speak in favour of the latter. Take Locker Room. Born as a live sports audio app, it recalls the concept of community forums: the backbone of the social Internet 1.0. After its acquisition by Spotify, the app, renamed Greenroom, aims to connect creators with fans across sports, music and culture.

“Cappuccino” is built on a different promise: the app takes voice recordings from a closed group of friends or family and mixes them into a downloadable podcast, delivered to you the morning after. Okay, this seems a typical “lockdown” use case, hopefully doomed to future extinction.

However, this case suggests that social audio can live in the same space as podcasts, webinars and other live events. Together, they can create an extensive palette of hybrid experiences, mixing live and recorded, audio and video, text with audio and so on.

Think about how many virtual events need a video connection vs. how many could alternatively be “listened to/spoken with”. Or how many live sessions you could record, transcribe and make available in your customers’ preferred format: as videos, podcasts or automated text summaries.

Do virtual stand-up meetings need a daily Zoom call when you can run them on audio and automatically convert them into short written minutes for your digital Kanban/Trello/dashboard?

The same goes for gigs: some of my best memories of 2020 are of listening to my favourite Klezmer bands live and online, with the chance to talk after the concerts with the artists and with others as passionate about Klezmer music as I am. Social audio would revive the old habit of conversing with the artist (here we come into the space of creators and their communities – aka the passion economy) while lowering the barrier to participation that video still poses (the cognitive and psychological burden known as Zoom fatigue).

I stop here, and I leave you with a couple of additional suggestions for taking a 10,000-foot view on the future of social audio:
