[AI] Boring conversation? Let your computer listen for you

Sanjay ilovecold at gmail.com
Fri Jun 18 04:11:17 EDT 2010

          Smart new software tells you who's said what, how they've said it
          - and who really needs to shut up

by Colin Barras

MOST of us talk to our computers, if only to curse them when a glitch
destroys hours of work. Sadly the computer doesn't usually listen, but
new kinds of software are being developed that make conversing with a
computer rather more productive.

The longest established of these is automatic speech recognition
(ASR), the technology that converts the spoken word to text. More
recently it has been joined by subtler techniques that go beyond what
you say, and analyse how you say it. Between them they could help us
communicate more effectively in situations where face-to-face
conversation is not possible.

ASR has come a long way since 1964, when visitors to the World's Fair
in New York were wowed by a device called the IBM Shoebox, which
performed simple arithmetic calculations in response to voice
commands. Yet people's perceptions of the usefulness of ASR have, if
anything, diminished.

"State-of-the-art ASR has an error rate of 30 to 35 per cent," says
Simon Tucker at the University of Sheffield, UK, "and that's just
very annoying." Its shortcomings are highlighted by the plethora of
web pages poking fun at some of the mistakes made by Google Voice,
which turns voicemail messages into text.
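The error rate Tucker quotes is conventionally measured as word error rate (WER): the word-level edit distance between the recogniser's output and a reference transcript, divided by the length of the reference. A minimal sketch in Python (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by reference length, via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Two errors in a six-word reference: roughly the 30-35 per cent
# error rate Tucker describes for state-of-the-art ASR.
print(word_error_rate("call me back after the meeting",
                      "call be back after meeting"))
```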

What's more, even when ASR gets it right the results can be
unsatisfactory, as simply transcribing what someone says often makes
for awkward reading. People's speech can be peppered with repetition,
or sentences that just tail off.

"Even if you had perfect transcription of the words, it's often the
case that you still couldn't tell what was going on," says Alex
Pentland, who directs the Human Dynamics Lab at the Massachusetts
Institute of Technology. "People's language use is very indirect and
idiomatic," he points out.

Despite these limitations, ASR has its uses, says Tucker. With
colleagues at Sheffield and Steve Whittaker at IBM Research in
Almaden, California, he has developed a system called Catchup,
designed to summarise, almost in real time, what has been said at a
business meeting so that latecomers can... well, catch up with what
they missed. Catchup identifies the important words and
phrases in an ASR transcript and edit out the unimportant ones. It
does so by using the frequency with which a word appears as an
indicator of its importance, having first ruled out a "stop list" of
very common words. It leaves the text surrounding the important words
in place to put them in context, and removes the rest.
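Frequency-based selection of this kind can be sketched in a few lines of Python. Note the simplification: Catchup works at the word level, keeping the text around important words, whereas this sketch scores and keeps whole utterances; the stop list and the sample meeting are illustrative assumptions, not Catchup's own data.

```python
from collections import Counter

# A tiny illustrative stop list of very common words to ignore.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is",
              "it", "we", "i", "you", "that", "this", "so", "on", "for"}

def summarise(utterances, keep_ratio=0.5):
    """Frequency-based extractive summary in the spirit of Catchup:
    utterances rich in frequent non-stop words are kept, the rest cut."""
    words = [w for u in utterances for w in u.lower().split()
             if w not in STOP_WORDS]
    freq = Counter(words)

    def score(u):
        toks = [w for w in u.lower().split() if w not in STOP_WORDS]
        return sum(freq[w] for w in toks) / max(len(toks), 1)

    n_keep = max(1, int(len(utterances) * keep_ratio))
    ranked = sorted(range(len(utterances)),
                    key=lambda i: score(utterances[i]), reverse=True)[:n_keep]
    # Preserve original order so the summary still reads as a conversation.
    return [utterances[i] for i in sorted(ranked)]

meeting = [
    "so the budget review is next week",
    "right the budget numbers need updating",
    "um I think it was raining this morning",
    "let's update the budget before the review",
]
print(summarise(meeting))  # the off-topic weather remark is cut
```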

A key feature of Catchup is that it then presents the result in audio
form, so the latecomer hears a spoken summary rather than having to
plough through a transcript. "It provides a much better user
experience," says Tucker.

In tests of Catchup, its developers reported that around 80 per cent
of subjects were able to understand the summary, even when it was less
than half the length of the original conversation. A similar
proportion said that it gave them a better idea of what they had
missed than they could glean by trying to infer it from the portion of
the meeting they could attend.

One advantage of the audio summary, rather than a written one, is that
it preserves some of the social signals embedded in speech. A written
transcript might show that one person spoke for several minutes, but
it won't reveal the confidence or hesitancy in their voice. These
signals "can be more important than what's actually said", says
Steve Renals, a speech technologist at the University of
Edinburgh, UK, who was one of the developers of the ASR technology
used by Catchup.

An audio record cannot, of course, convey the wealth of social signals
that are available in face-to-face conversation - a raised eyebrow,
for example, or a nod of the head - and as meetings are increasingly
conducted by phone or online, participants in remote locations suffer.
So Pentland and colleagues at MIT have been analysing individual
speaking styles, and using the results to fill the gap. This kind of
speech analysis could, he claims, improve the quality of audio
conference calls by helping participants in a distributed meeting to
feel socially connected.

Pentland's work in this area is based on years of studying the
non-verbal signals embedded in speech patterns. Those studies have
revealed, for example, correlations between how interested someone is
in what's being said and how loudly they talk, or the frequency with
which they switch from talking to listening.
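One of the signals mentioned above, the frequency of switching between talking and listening, is easy to compute once speech has been segmented by speaker. A sketch under an assumed diarisation format of chronological (speaker, start, end) tuples, with invented names and timings:

```python
def turn_switch_rate(segments, speaker):
    """How often a given speaker switches between talking and
    listening, per minute of meeting time. `segments` is a
    chronological list of (speaker, start_s, end_s) tuples --
    a hypothetical diarisation format assumed for illustration."""
    switches = 0
    talking = False
    for who, start, end in segments:
        now_talking = (who == speaker)
        if now_talking != talking:
            switches += 1
            talking = now_talking
    duration_min = (segments[-1][2] - segments[0][1]) / 60
    return switches / duration_min

segments = [("alice", 0, 10), ("bob", 10, 25), ("alice", 25, 30),
            ("bob", 30, 55), ("alice", 55, 60)]
print(turn_switch_rate(segments, "alice"))  # 5 switches in one minute
```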

Working with PhD student Taemie Kim, Pentland has begun to use
some of these findings in a device to improve social signalling in
distributed meetings. Their "Meeting Mediator" measures how much time
each of four people, split between two locations and taking part in an
audio conference, spends talking. If one of them hogs the
conversation, all four see that in graphical form on a screen in front
of them.
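The quantity the Meeting Mediator displays, each participant's share of total talk time, can be sketched as follows; the segment format, names and text-bar rendering are assumptions for illustration, not the actual device's interface.

```python
from collections import defaultdict

def talk_shares(segments):
    """Fraction of total speaking time per participant -- the kind of
    balance figure that flags anyone hogging the conversation.
    `segments` is a list of (speaker, start_s, end_s) tuples,
    an assumed format."""
    totals = defaultdict(float)
    for who, start, end in segments:
        totals[who] += end - start
    grand = sum(totals.values())
    return {who: t / grand for who, t in totals.items()}

def render_bars(shares, width=20):
    """Crude text stand-in for the on-screen graphical feedback."""
    return "\n".join(
        f"{who:>8} |{'#' * round(share * width):<{width}}| {share:.0%}"
        for who, share in sorted(shares.items()))

segments = [("ana", 0, 30), ("ben", 30, 40), ("carl", 40, 50), ("dee", 50, 60)]
shares = talk_shares(segments)
print(render_bars(shares))  # ana's bar is three times everyone else's
```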

This had a big impact on their behaviour, Kim and Pentland found. The
average speech segment - a measure of the time an individual spoke
before inviting others to take over - fell from 11.2 seconds to 9.2
seconds.

The system also discouraged participants from splitting into groups
and beginning separate conversations. "The feedback was designed to
encourage balance and interactivity," says Kim. Just having that "in
their face" helped achieve this, she says. By extending such systems
to display on-screen variation in interest level as well, participants
phoning in to a meeting could get a better sense of the social signals
they are missing.

Pentland says that such tools, which move beyond mere recognition of
words, will help improve conference-call meetings. "'Reading' the
people rather than 'reading' the words can be a real game-changer for
collaboration," he says.
