
January 27, 2020 53 mins

How do all those smart speakers actually work? From basic speaker technology to the complicated science behind voice recognition and natural language processing, we find out!



Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:04):
Welcome to TechStuff, a production of iHeartRadio's
How Stuff Works. Hey there, and welcome to TechStuff.
I'm your host, Jonathan Strickland. I'm an executive producer with
iHeartRadio and I love all things tech, and guys,
stick with me. I am fighting off a cold. You'll
be able to hear it in my voice. I have

(00:25):
no doubt. But you know, I wanted to get you
guys a brand new episode. So we're gonna fight on
because the show must keep going. I think that's the saying. Oh no, this cold medicine is good though.
All right, Anyway, I thought that we would do an
episode about smart speakers because I wanted to kind of

(00:47):
start this whole episode off with with an old man observation,
you know, get off my lawn kind of thing. And
this is from our resident old man, old man Strickland.
That meaning me. So, when I was young, speakers
were dumb. Now I don't. I don't mean that speakers
were useless, or that they were terrible, or that they
were incapable of replicating certain frequencies or volumes of sound,

(01:12):
or that they were limited in some other way other
than they didn't quote unquote think. They didn't connect to
any sort of computational engine in a meaningful way. You
might have a set of speakers plugged into a computer,
but that was just a one way communications tool, right.
It was just a way to provide an outlet for
sound that your computer was generating, nothing more than that.

(01:33):
But contrast that with today, when we have numerous smart
speakers on the market. These speakers act as a user
interface between us and the Internet at large, often facilitated
by a virtual assistant of some kind. Now with these speakers,
we don't just listen to stuff like music and podcasts
and the radio and you know, other traditional audio content.

(01:57):
We use them to find out information. We might link
them to our calendars so that we can get reminders
for upcoming appointments. We probably use them to ask about
the weather report. I use mine at home for that
all the time, or even more often than that, if
you're at my house, you'll hear us use it to
find out which foods are safe for us to feed

(02:17):
to our dog. My doggie, Tibolt, absolutely loves our smart
speaker because it frequently gives us permission to spoil him
with a carrot or a piece of banana. But how
do these smart speakers work, How are they able to
respond to our requests? And what are their limitations? How
safe are they? That's the sort of stuff we're gonna

(02:38):
be looking into in this episode of tech Stuff, and
we'll start off with the basics, which means we have
to start off with how speakers work in general. Now,
this is something that I've covered before on tech Stuff,
but I want to go over it again from a
high level because well, I just find it fascinating that
people figured out how to harness electricity to drive a

(02:58):
motor so that it could in turn cause components to
replicate a recorded or transmitted sound. And really, motor is being too generous, but to drive an element to create vibrations that could replicate a sound that was made into another component,
that whole thing just boggles my mind that people are
smart enough to figure that out. Okay, So to understand

(03:20):
how speakers work, it first helps to understand how sound
itself works. Sound is a physical phenomenon. Do do do do?
Sound is all about vibrations, and typically we experience sound
when we pick up on changes in air pressure that
enter through our ear canal and then affect the tympanic
membrane or ear drum. So it's all about these changes

(03:44):
of air pressure, all about air molecules transmitting
vibrations from a source outward in a radiating pattern from
that source. So let's think of someone knocking on
a door. For example, you're inside a house, someone's knocking
on your door. When that person's hand hits the door,
it causes the door to vibrate, and that vibration transmits

(04:07):
to the surrounding air molecules on the other side of
the door. They get pushed through that vibration and then
pulled when the wood is vibrating back towards its
original position. So the air molecules vibrate, those air molecules
cause the next surrounding layer of air molecules to vibrate
as well, and so on and so forth. It's like

(04:29):
a cascade or domino effect. You get these little pockets
of high and low air pressure that travel outward from
that door. It spreads further as it goes towards you know,
any distance, and if you are close enough so that
you can still detect those changes in air pressure, you

(04:49):
experience this by hearing the knocking on the door. Those
vibrating air molecules lose a bit of energy as they
move outward. Right, as they vibrate to the next layer, you start to lose a bit of energy with each transmission of that. So the sound gets quieter the further away you are, because there's not as many air molecules vibrating; its amplitude has decreased. So if you are

(05:13):
in hearing range, you can pick up on those changes of air pressure as they encounter the tympanic membrane in your ear canal. Those changes in pressure will cause a reaction in your middle and inner ear that will ultimately get picked up by your brain, which interprets it as sound. Now,
the frequency at which those fluctuations occur relates to the

(05:34):
pitch that we hear, so faster vibrations are higher pitches,
higher frequencies, higher notes. If you think of a musical scale,
we perceive the force of the changes as volume, so
lower forces lower volume right, and higher forces higher volume.

(05:55):
The human ear can hear a pretty decent range of frequencies, from twenty hertz, which means twenty cycles or twenty waves per second past a given point of reference, to twenty kilohertz. That's twenty thousand cycles or waves per second.
So yeah, the cycle refers to the frequency of the

(06:15):
sound wave. The lower the frequency, the lower the sound.
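To make the frequency and amplitude ideas concrete, here is a rough Python sketch (an illustration only, nothing from the episode): it writes short tones to WAV files, where frequency_hz sets the pitch and amplitude sets the volume. The specific numbers are arbitrary choices within the roughly twenty hertz to twenty kilohertz range of human hearing.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100  # audio samples per second

def write_tone(path, frequency_hz=440.0, amplitude=0.5, seconds=1.0):
    """Write a pure sine tone: frequency controls pitch, amplitude controls volume."""
    frames = bytearray()
    for i in range(int(SAMPLE_RATE * seconds)):
        t = i / SAMPLE_RATE
        sample = amplitude * math.sin(2 * math.pi * frequency_hz * t)
        frames += struct.pack("<h", int(sample * 32767))  # 16-bit signed sample
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # two bytes per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))

# A low, quiet hum and a high, louder tone.
write_tone("low_quiet.wav", frequency_hz=100.0, amplitude=0.2)
write_tone("high_loud.wav", frequency_hz=2000.0, amplitude=0.8)
```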
All right, and then our brain has to make meaning
of all this, Right, it's not just that it's picking
up on it. Our brain interprets this and we experience
it as a sound we have heard. So it either
matches this perceived sound with one we've encountered before, and

(06:36):
then we say, oh, I know what that is. That's
someone knocking at the door, or they might be Holy Cala,
I've never heard that sound in my life. I have
no idea what it is. If the sound is language,
then our brains have to derive the meaning from the
perceived sound. We've heard someone say words such as you're
hearing me say this. Then our brains have to take

(07:01):
that collection of sounds and say, what does that actually mean?
What is the context, what is the intent?
What is the message here? Otherwise it would just be
you know, random noises that I'm making with my mouth. Alright,
so we have a basic understanding behind the physics of sound.
Now to talk about speakers and microphones and the reason

(07:21):
I'm going to talk about both of them is that
the devices complement one another. You can think of one
as being the other in reverse. Plus smart speakers we
have to talk about microphones anyway, because smart speakers have
microphones as well as the speaker element. So you can
think of this as one long process of taking the
physical phenomena of sound waves, transforming that physical phenomena into

(07:46):
an electrical signal, taking the electrical signal, and changing it
back into something that can produce the sound waves that
started the whole thing. So you're replicating the original sound
waves with this end device, which in this case is
a loudspeaker. So the microphone is the part of the
process where you take the sound and you turn it
into an electrical signal, and the speaker is where you take

(08:08):
the electrical signal and you turn it back into actual sound.
That's the simple way. But what's actually happening? Well, let's talk about it on a physical level. Sound waves go into a microphone. So you've got these fluctuations in air pressure
that encounter a microphone. I'm speaking into a microphone right now,
so this is happening right now. Inside the microphone is

(08:30):
a very thin diaphragm, typically made out of a very
flexible plastic, and it's sort of like the skin of
a drum. So as the changes in air pressure encounter
the diaphragm, they cause the diaphragm to move back and forth. Well.
Attached to the diaphragm is a coil of conductive wire,
and that coil wraps either around or near a permanent magnet.

(08:54):
Magnets have magnetic fields. They have a north pole and
a south pole, and there's a magnetic field that surrounds
the magnet. And the electromagnetic effect means that if
you move a coil of conductive wire through a magnetic field,
it will produce a change in voltage in that coil,

(09:14):
otherwise known as electromotive force, and that means electrical current
will flow through the coil. Now, if you have the
end of that coil attached to a wire, a conductive
wire for that current to flow through, you can send
that current onto other components. So for our purposes, the
component in question would be an amplifier, and I'll get

(09:37):
to explaining why that is in just a moment, but
first let's talk about loudspeakers, and the way a loudspeaker works is essentially the reverse of a microphone. You've
got your permanent magnet around or near which is a
coil of conductive wire. The wire is connected to a diaphragm,
one much larger and typically made out of stiffer material

(09:59):
than the plastic you'd find in a microphone. This is
the element inside a speaker that will vibrate, that will
push air and pull air as it moves either outward
or inward. The electrical signal comes from a source such
as the microphone we were just using a second ago
that comes into the loudspeaker and it flows through the coil. Now,

(10:22):
when you have an electrical current flowing through a conductive coil,
you generate a magnetic field because of the laws of electromagnetism. You've got the electromagnetic field generated as a result. Now that field will interact with the magnetic field of the permanent magnet. The permanent magnet always has a magnetic field. The coil only has one when electric current

(10:46):
is flowing through it. And as I said, magnets have a north pole and a south pole.
And we also know that when we bring two magnets
with their north poles together, they'll push against each other,
right because like repels like, But if we turn one
of those magnets around so that now it's a south
pole and a north pole, they attract one another, you know,

(11:08):
opposites attract. So by having this magnetic field being generated by the coil, it starts to generate interactions
with the magnetic field of the permanent magnet, so they
start to push and pull against each other. Well, the
coil is attached to that diaphragm, so it in turn

(11:32):
drives the diaphragm to either push outward or pull inward.
That causes air molecules to vibrate, just as it would
with any other you know, source of sound, and it
emanates outward from the loudspeaker, so you get a representation of the same sound. The sound that was going into the microphone

(11:52):
got converted into an electrical current. The electrical current then
was passed through a coil and next to a permanent
magnet to create the same sort of movement. It replicates
the movement of the original diaphragm in the microphone and
generates the sound. So you get the replication of the
sound that was made in the other location. It's pretty cool, I think.
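If it helps to see that microphone-to-speaker chain, and the amplifier step the host turns to next, laid out as data, here's a toy Python sketch. Everything is just lists of numbers, and the sensitivity, gain, and drive values are made up for illustration, not real device characteristics.

```python
import math

def microphone(pressure_samples, sensitivity=0.001):
    """Turn air-pressure samples into a very weak voltage signal."""
    return [p * sensitivity for p in pressure_samples]

def amplifier(voltage_samples, gain=800.0):
    """Boost the signal's amplitude (volume) without changing its frequency (pitch)."""
    return [v * gain for v in voltage_samples]

def speaker(voltage_samples, drive=2.0):
    """Turn the amplified voltage back into air-pressure samples."""
    return [v * drive for v in voltage_samples]

# A toy 'knock': a few cycles of a 200 Hz vibration sampled 8000 times a second.
original = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
replica = speaker(amplifier(microphone(original)))

# The output has the same shape as the original wave (same pitch),
# just scaled to a different amplitude (different volume).
print(replica[5] / original[5])  # the same constant scale factor for every sample
```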

(12:15):
Now, I did mention earlier that you would
need an amplifier. And the reason you need an amplifier
is that the electrical signal generated by a microphone is
far too weak to drive a loudspeaker's diaphragm. You just
wouldn't have the juice to do it. It would be
much much less, uh powerful than what the speaker would need.

(12:36):
So chances are the diaphragm would either not move at
all because it would just be too stiff, it would
resist the movement too much, or it would move so
weakly as to generate little to no sound, so it
wouldn't do you any good. So the signal from the
microphone has to first pass through an amplifier, which, as
the name implies, takes an incoming signal and increases the

(12:56):
amplitude of that signal, the volume, in other words. So,
it doesn't affect pitch, but it does affect the signal
strength and consequently the volume. And I've done episodes about amplifiers,
including explaining the difference between amplifiers that use vacuum tubes
and ones that use transistors, so I'm not going to
go into that here. Besides, it doesn't really factor into

(13:18):
our conversation about smart speakers anyway. It's just important for
it to work in a microphone and speaker setting. Now, over the years, engineers have paired microphones and speakers in
lots of stuff. You've got telephones, you've got intercom systems,
public address systems, handheld radios, all sorts of things, so
that technology was well and truly mature. Before we ever

(13:41):
got our first smart speaker, there wasn't much call to
incorporate microphones into home speaker systems for many years. I mean,
what would you actually use a microphone embedded in a
speaker for before smart speakers? Typically you would have your speakers, like I'm talking about, like sound system speakers.
You would have them hooked up to some other dumb

(14:02):
as in, not connected to a network, technology. So it
might be a sound system or home entertainment set up
with a television as the focal point, or maybe even
you know, a computer for the purposes of playing more
dynamic sounds for like video games and things like that. Um.
But for a very long time, these were all thought
of as one way communications applications, right, Like, the sound

(14:25):
was coming from a source and it would get to
us through the speakers, but we weren't meant to send
sound back through those same channels. The information was just
coming to you. You weren't sending anything back. But that would all change in time. Now, one thing to keep
in mind about smart speakers is that they are the
product of several different technologies and lines of innovation and

(14:46):
development that all converged together. The microphone and speaker technology
is one of the oldest ones that we can point
to as far as the fundamental underlying technology is concerned,
the stuff that's been around since the late nineteenth century.
Now there is one other we'll talk about that's even older.
But I don't want to spoil things. I'll just mention

(15:06):
there is an even older line of development that goes
into smart speakers than the microphone speaker stuff of the
nineteenth century. Most of the other components, however, are much
younger than that. One big one is speech or voice recognition.
Creating computer systems that could detect noise was relatively simple. Right.

(15:28):
You could have a computer connected to microphones and they
could monitor the input from those microphones and any incoming
signal could be registered. Right, they could record an incoming
signal that would indicate the microphone had detected a noise.
That's child's play. That's easy to do. But teaching computers
how to analyze those signals and decipher them so that

(15:49):
the computer could display in text or otherwise act upon
that sound in a meaningful way, that was much
more difficult. There was an IBM engineer named William C.
Dersch of the Advanced Systems Development Division, who created an
early implementation of voice recognition. It was a very limited application,

(16:11):
but it proved that the ability to interact with computers
by voice was more than just science fiction. Within IBM.
It was called the Shoebox. Dersch worked on this project
in the early nineteen sixties and what he produced was
a machine that had a microphone attached to it. The
machine could detect sixteen spoken words, which included the digits

(16:34):
of zero to nine plus some command indicators like plus, minus, total, subtotal. You get the idea. So you
could speak a string of numbers and then commands to
this device, then ask it to total everything and it
would do so. So it was more or less a
basic calculator with some voice interpretation incorporated into it. Now

(16:58):
there's a great newsreel piece about this shoebox. There's a
demonstration of it, and it came out in nineteen sixty one,
and I love that newsreel because it has that great
music you would hear in the background of those old
industrial and business films. Anyway, there's also a helpful chart
that hangs in the background of that video where Dersch

(17:19):
is actually explaining how it works. You can see a
little bit behind him what is actually being
analyzed and uh he broke the words down into phonemes
and syllables, so phonemes being specific sounds that make up words. So,
for example, the digit one is a single syllable word

(17:40):
with a vowel sound right at the front. But you
also have the word eight that's another single syllable word
with a vowel sound right at the front, but it's
different from one phonetically in that eight also has a
plosive and has that hard t at the end. So
the shoebox was limited not just in what words it

(18:02):
could recognize, but also the types of voices it could recognize.
Get someone who has a different dialect or manner of speech,
and the machine might not be able to understand them
because they're not pronouncing the words the same way that
Dersch did. This would be a big challenge in speech
recognition moving forward, and it's also an example of where

(18:24):
we find bias creeping into technology. And it's not necessarily
a conscious thing, but if you have people designing a
system and they're designing it based off their own uh,
you know, speech patterns, their own pronunciations, their own dialects,
then it may be that the system they create works

(18:44):
really well for them and less well for anyone who
isn't them, And the further away you are from their
manner of speaking, the more frustration you will encounter as
you try to interact with that technology. That's an example
of bias, and in fact, if you read the histories of speech recognition and, as we'll get to later, natural

(19:06):
language processing, you'll see a lot of people say it
works great if you happen to be a white man,
because the people who were designing it were primarily white men who were typically aiming for what is considered a non-accented American dialect somewhere in, you know, the Eastern

(19:31):
seaboard side. But that meant that if you did have
an accent or a dialect, or you had a different vernacular,
that it was harder for the systems to actually understand
what you were saying. That's an example of bias. Well, the general strategy was again to break up speech into constituent sound units, you know, those phonemes, and then

(19:52):
to suss out which words were being spoken based on those phonemes, and that was done by digitizing the voice, transforming it from sound into data that represented stuff like the sound's frequency or pitch, and then matching up specific signal signatures with specific phonemes. So generally the
idea was that the computer system would monitor incoming sound,

(20:15):
convert the sound into digital data, compare the data it had received with information stored in a database, in an effort to look for matches. The Shoebox database was just sixteen words in size. Later ones would be much larger,
but pretty quickly people realized this was not an efficient
way of doing speech recognition because the bigger the vocabulary,

(20:37):
the more work-intensive it was to build out
those databases. So it wasn't something that people thought would
be sustainable for very large vocabularies. But the Shoebox marked
the beginning of a serious effort to create machines that
could accept audio cues as actual input, and as we'll see,
that's one important component for these smart speaker systems. I've
got a lot more to say, but before I get

(20:59):
into the next part, let's take a quick break. Now,
obviously we didn't jump right into full voice recognition right
after IBM's Shoebox innovation. The challenges related to building
automated speech recognition systems were numerous, even for just a

(21:21):
single language, because, as I said, you can have accents
and dialects. One voice can have a very different tonal
quality from another, people speak at different speeds. Teaching machines
how to recognize speech when the phonemes and pacing of
that speech aren't consistent from speaker to speaker, that's really hard.
This kind of gets back to the same sort of

(21:43):
challenges you have when you're teaching machines how to recognize images.
You know, you teach a human what a coffee mug is.
I always use this example, but you teach a human
what a coffee mug is, and pretty soon they can
extrapolate from that example and understand that coffee mugs can
come in all different sizes and colors, and you know

(22:04):
different designs and textures. We get it. Like, you see a couple of coffee mugs, you understand. Machines, though, they aren't able to do that. Machines, you know, you
have to give them lots and lots and lots of
different examples before they can start to pick up on
what things actually make a coffee mug. Same sort of

(22:25):
thing with speech, right, So if you don't have consistency
between speakers, it makes it very hard for machines to
learn what people are saying. Now, it didn't take long
for the tech industry at large to really dive into
trying to solve this problem. In the nineteen seventies, DARPA, that's the
Research and Development division of the United States Department of Defense,

(22:45):
got behind speech recognition in a big way. Now, remember
DARPA itself doesn't do research. The organization's purpose is to invite organizations to pitch projects that align with whatever DARPA's goals are, and DARPA would provide funding to
the winning organizations to see these projects to completion if possible.

(23:07):
So DARPA is really more of a vetting and funding organization. Anyway, in nineteen seventy one, DARPA created a five year program called Speech Understanding Research, or SUR. The initial
goal was pretty darn ambitious considering the capabilities of the
technology at the time. The project director, Larry Roberts, wanted
a system that would be capable of recognizing a vocabulary

(23:30):
of ten thousand words with less than ten percent error.
After holding a few meetings with some of the leading
computer engineers of the day, Roberts adjusted that goal significantly.
After that adjustment, the target was going to be a
system capable of recognizing one thousand words, not ten thousand.

(23:50):
Error levels still had to be less than ten percent,
and the goal was for the system to be able
to accept continuous speech, as opposed to very deliberate speech
with pauses between each pair of words, which would not be really that useful. One person who was skeptical about

(24:13):
the potential success of this project was John R. Pierce
of Bell Labs. He argued that any success would be
limited so long as machines remained incapable of understanding the words,
not just recognizing a word based on phonemes, but
understanding what the word is. That is, Pierce felt that
the machines needed some way to parse the language to

(24:34):
get to the meaning of what was being said. That's
an important idea that we will come back to in
just a bit now. Among the companies and organizations that
landed contracts with DARPA were Carnegie Mellon University, BBN, which actually played a big part in developing ARPANET,
the predecessor to the Internet, Lincoln Laboratory, and several more
and very smart people began to create systems intended to

(24:56):
recognize speech and meaningful ways. The names of the programs
were a lot of fun. There was HWIM, that stood for Hear What I Mean, as in hear, as in listen, hear what I mean. That one
was from BBN. CMU introduced Hearsay, which was later designated as Hearsay One, and then they came out with Hearsay Two.

(25:17):
They also would demonstrate another one called Harpy. Oh, and
there was a professor at CMU named Dr James Baker
who would design a system called Dragon in nineteen seventy
five that he would later leverage into a company with
his wife, Dr Janet M. Baker in the nineteen eighties,
and they had a very successful business with speech recognition software. Now,

(25:40):
I'm not going to go into each of those programs
in deep detail, but rather just mention that they all
helped advance the cause of creating systems that can recognize speech.
One of the big developments that came out of all
that work was a shift to probabilistic models, which would
also play a really important part in another phase of
developing the smart speaker. So what do I mean when

(26:00):
I say probabilistic? Well, as the name indicates, it all
has to do with probabilities. Essentially, systems would analyze incoming
phonemes and make guesses as to what was being said
based on the probability of it being a given word
or part of a word. The systems typically go with
whatever word has the highest probability of being the correct one.
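As a toy version of that probabilistic matching (my own sketch, not how any of the systems mentioned here actually worked), imagine scoring a few candidate words against an incoming phoneme sequence and keeping whichever has the highest probability. The phoneme labels and probabilities are invented for illustration.

```python
# Invented 'database': probability of each word given a crude phoneme sequence.
CANDIDATES = {
    ("W", "AH", "N"): {"one": 0.90, "won": 0.85, "wand": 0.05},
    ("EY", "T"):      {"eight": 0.92, "ate": 0.88, "hate": 0.10},
}

def recognize(phonemes):
    """Return the word with the highest probability for this phoneme sequence."""
    scores = CANDIDATES.get(tuple(phonemes), {})
    if not scores:
        return None  # nothing in the database matched
    return max(scores, key=scores.get)

print(recognize(["EY", "T"]))  # -> 'eight', the highest-probability candidate
```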

(26:23):
Even with that approach, there are nuances to language that
are difficult to account for with a machine. So, for example,
you have homonyms, in which you have two words that
sound the same but have very different meanings and potentially
spellings like right as in to write a sentence, or
right as in am I right? Or am I wrong?
Or you could have a pair of words that sound

(26:46):
like a single word and have confusion there, such as
a door. You can say a door, meaning a single door, a door to go into a building, or you might say adore, as in I adore this podcast you're doing, Jonathan. That's sweet of you, thank
you for saying that. So computer scientists were hard at
work advancing both the capability of machines to make correct

(27:10):
guesses at individual phonemes and then full words, as
well as figuring out a way to teach machines to
adjust guesses based on context. That requires a deeper understanding
of the language within which you're working. If you're aware
of certain idioms, you can make a good guess at
a word or phrase even if you didn't get a
clean pass at it right. So, for example, the phrase

(27:33):
it's raining cats and dogs just means it's raining a lot.
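Here's a small sketch of how that kind of context can tip the guess. Each candidate phrase gets an acoustic score (how well it matched the sounds) multiplied by a context score (how often it follows "it's raining"); all of the numbers are invented for illustration.

```python
# Acoustic score: how well each candidate matched the audio.
# Context score: how likely it is to follow the phrase "it's raining".
candidates = {
    "cats and dogs": {"acoustic": 0.60, "follows_its_raining": 0.30},
    "bats and hogs": {"acoustic": 0.65, "follows_its_raining": 0.0001},
}

def best_guess(options):
    """Keep the phrase with the highest combined acoustic-times-context score."""
    return max(options, key=lambda phrase: options[phrase]["acoustic"]
                                           * options[phrase]["follows_its_raining"])

print(best_guess(candidates))  # -> 'cats and dogs', despite the weaker acoustic match
```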
And if a system included a database that indicated the
phrase cats and dogs sometimes follows the phrase it's raining,
then the system is more likely to guess the correct
sequence of words instead of guessing something that sounded similar
but it's wrong. For example, if it said, oh, they

(27:55):
must have said it's raining bats and hogs, that would
not make sense. So the systems estimate the probability that any given sequence of sounds within the database matches what the systems have just quote unquote heard. Progress in this
area was steady, but slow, and I'd argue that it
was also a reminder that concepts like Moore's law do

(28:18):
not apply universally across technology. Rapid development in one particular
domain of technology is not necessarily an indicator that the
same sort of progress will be observed in all other
areas of tech. We often get into the mistaken habit
of believing that Moore's law applies to everything. Alright. So
a related concept to voice recognition is something called natural

(28:42):
language processing, and this relates back to how we humans
tend to process information compared to the way machines tend
to do it. So we humans formulate ideas, we shape
those ideas into words and sentences. We communicate them in
some way to other people through that language. It may
be through speed you maybe through text. It may even
be through a nonverbal or non literary way, but we

(29:06):
communicate those ideas. Machines typically accept input, they perform some
process or sequence of processes on that input, and then
they supply an output of some sort. Machines do this
in machine language. That's a code that's far too difficult
for humans to process easily. Binary is an example of
machine language. Binary is represented as zeros and ones, which

(29:30):
when grouped together can represent all sorts of stuff. But
if you just looked at a big block of zeros
and ones, it would mean nothing to you. It's not
easy for humans to use, and then machines in turn
are not natively able to understand human language, so there's
a language barrier there. Because of that, people created different
programming languages. These languages provide layers of abstraction from the

(29:54):
machine language. They make it easier to create programs or
directions that the computer should follow. So the person
who's doing the programming is using a programming language that's
easy for humans to use that then gets converted into
machine language that the computers understand. But what if you
could send commands to a computer using natural language, not

(30:14):
even programming language. You could just speak in plain vernacular,
whether it's English or any other language, the way humans
communicate with one another. What if a computer could extract
meaning from a sentence, understand what it was you wanted
the computer to do, and then respond appropriately. So imagine
how much time you could save if you could just

(30:36):
tell your computer what you wanted it to do, and
it took care of the rest. If you had a
powerful enough computer system with strong enough AI, maybe you
could even potentially do something like describe a game that
you would love to be able to play, like, not a game that exists, a game in your head,
and you could describe it to a computer and the

(30:56):
computer could actually program that game. Well, we're definitely not anywhere close to that yet, but we've made enormous
progress with natural language processing. Now, the history of natural
language processing isn't exactly an extension of voice recognition. It's
actually more like a parallel line of investigation. And that's
because natural language processing doesn't require voice recognition. You can

(31:20):
have an implementation in which you just write commands in natural language, you know, you type them out on a keyboard and the machine then carries out those instructions.
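To picture that typed-command version, here's a deliberately crude Python sketch: a program that accepts a plain-English command and acts on it through simple keyword matching. Every command and canned response in it is made up, and real natural language processing is far more sophisticated than this.

```python
def handle(command):
    """A crude stand-in for natural language handling, using keyword matching."""
    words = command.lower().split()
    text = " ".join(words)
    if "weather" in words:
        return "Fetching the weather report..."
    if "play" in words and "song" in text:
        return "Playing a song..."
    if "lights" in words and ("on" in words or "off" in words):
        return "Adjusting the lights..."
    return "Sorry, I don't know how to do that."

print(handle("What's the weather like today?"))    # weather report
print(handle("Please play a song I like"))         # music
print(handle("Turn the lights off"))               # smart-home command
print(handle("Build the game that's in my head"))  # falls through to the apology
```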
So much of the early work in natural language processing
was in text based communication rather than in speech. The
history of natural language processing includes stuff like the Turing test,

(31:41):
named after Alan Turing. So the most common interpretation of
the Turing test these days is that you've got a
scenario in which a person is alone in a room
with a computer terminal, they can type whatever they like
into the computer terminal, and someone or something is responding
to them in real time. Now it might be another person,
or it might be a computer system that's responding to

(32:04):
that person. You run a whole bunch of test subjects
through this process, and if the computer system is able
to fool a certain percentage of those test subjects, like
say thirty percent of them, that it is in fact
another human and not a computer, it is said to
have passed the Turing test, And typically we use that

(32:24):
to mean the machine has given off the appearance of
possessing intelligence similar to the one that we humans possess.
That gets beyond our scope for this episode, but it
helps point out that stuff like speech recognition and natural
language processing are both closely related to the field of
artificial intelligence. In fact, they really belong within the artificial

(32:45):
intelligence domain. The Turing test was more of a hypothetical.
It was a bit of a cheeky way of saying, Hey,
if you can't tell whether or not something is intelligent,
it makes sense to treat it as if it actually
is intelligent. After all, we assume that every human with
whom we interact possesses some level of intelligence, based on

(33:06):
those interactions. So why should we not extend the same
courtesy to machines. Now, natural language processing would prove to
be another super challenging problem to solve. In computer science.
Early work was done in translation algorithms, and these were
programs that attempted to take phrases written in one language
and translate those automatically into a second language. At first,

(33:29):
that seemed pretty straightforward, but you realize that's also pretty tricky, really.
For one thing, you can't just translate word for word
and keep the same order from one language to another.
The syntax, or the rules that the language follows,
they could be different from language to language. In one language,
you might use an infinitive such as to record, in

(33:51):
the middle of a sentence, while another language might put
all the infinitives at the end of a sentence. So
in one language, I might say I'm going to record
a podcast in the studio right now, but in another
language it might come out as I'm going a podcast
in the studio right now to record. It starts to
sound like Yoda. There was initial excitement around machine translation,

(34:13):
but once computer scientists and linguists began to see the
scope of this challenge, their excitement faded a bit. Also,
there was a lot of other stuff going on in
the nineteen sixties and seventies that was demanding a lot
of attention, such as the Space race. So for a while,
this branch of computer science was given less attention than
other branches, and by less attention, I really mean funding. Now,

(34:37):
when we come back, we'll talk a bit more about
the advances that were necessary to support natural language processing,
and we'll move on to how this would be another
important component in smart speakers. But first, let's take another
quick break. Okay, so early enthusiasm for natural language

(35:00):
processing created a bit of a hype cycle that ultimately
crashed into the telephone pole of unmet expectations. That was
a really bad metaphor. Anyway, natural language processing went through
something similar to what we saw with virtual reality in
the nineteen nineties. You know, people saw what was actually achievable,

(35:23):
and then they compared that to what they thought they
were going to get, and those two things didn't match
up at all, and that really pulled the rug out
of funding for natural language processing, which meant of course,
that progress slowed way down. It kept going, but it
was definitely on the back burner for a lot of projects.

(35:43):
When interest renewed in the nineteen eighties, there had been
a shift in thinking around natural language processing. Computer scientists
were starting to look at statistical approaches similar to what
was going on with speech recognition, building up probabilistic models
in which a computer can start making what amounts to
educated guesses at the meaning of a command or a phrase.

(36:06):
Machine learning became an important component on the back end
of these systems, and later artificial neural networks became an
important part as well. A neural network processes information in
a way that's sort of analogous to how our brains
do it. You have nodes or neurons that connect to
other nodes, and each node affects incoming data in a

(36:29):
certain way, performing some sort of operation on it, and
the degree to which they do that in one way
versus another is called the weight of that node. Computer
scientists apply weights across the nodes in an effort to
get a specific result in order to train these models.
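Here's a bare-bones sketch of that weight-adjusting idea, which the host walks through next with the feed-a-command-and-check-the-result loop and the Plinko analogy. It's a single artificial neuron on a toy problem, nothing like a production speech or language model, and every number in it is arbitrary.

```python
# One artificial 'neuron': fire (output 1) if the weighted sum clears a threshold.
# Training nudges the weights a little every time the output is wrong.
examples = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)]  # (inputs, wanted)
weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.1

def predict(inputs):
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

for _ in range(1000):                       # many passes over the training examples
    for inputs, wanted in examples:
        error = wanted - predict(inputs)    # 0 if correct, +1 or -1 if wrong
        for i, x in enumerate(inputs):      # adjust each weight toward a better answer
            weights[i] += learning_rate * error * x
        bias += learning_rate * error

print([predict(inputs) for inputs, _ in examples])  # -> [1, 0, 1, 0], matching 'wanted'
```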
So you might feed a specific command into such a system,

(36:50):
and you let it go through the computational process from
the beginning of the neural network through to the end,
and then you look at the result, and if the
result is correct, well, that just means the system is
already working as you intended it, which honestly is not
likely to happen early on. But if it's not correct,
then you start adjusting the weights on those nodes in

(37:12):
order to affect the outcome. I almost think of it
as like Plinko or pachinko, where you've got the little
coin and you drop it down and it bounces on
all the pegs and sometimes you're like you might think,
all right, well, this time it's going to go right
for that center slot, but it doesn't, and you think, well,
maybe if I remove some of these pegs or I
shift these pegs over a little bit, I can drop

(37:33):
it in that same spot and get it to hit the center.
It's kind of like that, except you're talking about data,
not physical moving parts. So you have to do this
a lot, like up to like millions of times in
order to try and train a system so that it responds
appropriately to commands. And once it's trained, you can then
test new commands on the system to see if it

(37:55):
can parse them and respond appropriately. And in this way,
the system quote unquote learns over time how to respond
to commands. And then we have another component that's important
with smart speakers, and that's speech generation. So it's one
thing to have a machine either broadcast or play back
a recording of speech. It's another thing for a machine

(38:18):
to generate brand new speech. In computer science, we call
it speech synthesis. Now, this is the really old technology
I was alluding to at the beginning of this episode,
speech synthesis. If you want to be really, you know,
kind of technical about it, it actually predates every other
technology I've mentioned up to this point, at least in

(38:40):
its most rudimentary implementations. You have to go way back
to the eighteenth century, the seventeen seventies. That's when a Russian smarty pants named Christian Kratzenstein was building a device that used acoustic resonators, these reeds that would vibrate,
and it was in an attempt to replicate basic vowel sounds. Now,

(39:01):
even with such a working device, it would be really
difficult to communicate anything meaningful unless you were, I guess, speaking whale like Dory in Finding Nemo. But it would
be an early example of how people tried to create
mechanical systems that could replicate speech or elements of speech.
Another inventor named Wolfgang von Kempelen built an acoustic mechanical

(39:23):
speech machine, and that used reeds and tubes and a
pressure chamber, and it was all meant to replicate various
speech sounds. He had other elements to create sounds like plosives,
those hard sounds that I mentioned earlier in the episode.
So he had all these different elements that, working together,
could create parts of the sounds that we humans make

(39:47):
when we speak. He also built a supposed chess playing machine,
and it turned out that the chess playing part was
a hoax. So unfortunately, because that device was a hoax,
a lot of people dismissed his other work, which was legitimate.
So by fudging on one thing, he kind of cast

(40:08):
doubt on everything he had ever done. Skipping ahead quite
a bit, we get to Homer Dudley, which is a
fantastic name. He unveiled the Voder, or Voice Operating Demonstrator,
device at the New York World's Fair in nineteen thirty nine.
It consisted of a complex series of controls and it

(40:29):
sort of reminds me of something like a musical instrument,
kind of like a synthesizer, but with extra controlling units.
Like there was like a wrist element, there was a pedal.
There's a lot of stuff that made it very complex,
and with a lot of practice, you could create specific
sounds from this synthesizer. You could even create words or

(40:51):
full sentences, though from what I understand, it was incredibly
challenging to do. It was a very high learning curve,
but it demonstrated the possibility of electronic synthesized speech. Now.
There was a lot of work done in this field
by lots of different talented scientists and engineers, and someday

(41:12):
I'll have to do a full episode on the history
of speech synthesis. It's really fascinating, but it's far too
big a topic to cover in its entirety in this episode.
By the late nineteen sixties we had our first text
to speech system, and by the late nineteen seventies and
early nineteen eighties, the state of the art had progressed
quite a bit and we were starting to get to

(41:33):
a point where we could create very understandable computer voices.
They weren't natural, they didn't sound like people, but you
could understand what they were saying. And finally, something else
that would enable smart speakers and virtual assistants was the
pairing of improved network connectivity and cloud computing. That removes

(41:53):
the need for the device that you're interacting with to
do all the processing on its own. So, if you
think about the history of computing, we used to do mainframes with dumb terminals that attached to the mainframe,
so the terminal wasn't doing any computing. It was just
tapping into the mainframe computer, which was sending results back
to the terminal. Then you get to the era of

(42:13):
personal computers, where you had a device sitting on your
desk that did all the computing and it didn't connect
to anything else. Then we get up to networking and
the Internet, where we suddenly had the capability of having
really powerful computers or grids of computers that were able
to take on the processing. Uh, and you just

(42:35):
send the request out to the Internet and you get
the response back. That's the basis of cloud computing. So
your command or message or whatever relays back to
servers on the cloud that then process it and send
the proper response to whatever device you're interacting with, and
then you get the result. So with the case of

(42:57):
the smart speaker, it might be playing a specific song or giving you a weather report or whatever it
might be. Now, if the speakers were doing some of
that computation themselves, that would be an example of edge computing,
where the processing takes place at least in part, at
the edge of a network at those end points. But
for now, most of the implementations we see send data

(43:20):
back to the cloud to get the right response, so
you have to have a persistent Internet connection. These devices
are not useful without that connection. You do have some
smart speakers that can connect to another device like a
smartphone via Bluetooth, so you could do things that way,
but without those connections, the smart speaker turns into, you know,

(43:41):
just a dumb speaker, or sometimes just a paperweight. Now,
this collection of technologies and disciplines are what enabled Apple
to introduce Siri in two thousand and eleven, and Siri is a virtual assistant. Siri's origins actually trace back to the Stanford Research Institute and a group of guys, Gruber, Adam Cheyer, and Dag Kittlaus, who had been working on

(44:04):
the concept since the nineteen nineties, and when Apple launched
the iPhone in two thousand seven, they saw the iPhone
as a potential platform for this virtual assistant that they
had been building, and they thought, well, this is perfect
because the iPhone has a microphone, so the assistant can
respond to voice commands, and a speaker, so it could
communicate back to the user, it could do all sorts

(44:26):
of stuff. We can tap into the interoperability of apps
on the device. It's a perfect platform for us to
deploy this. So they developed an app once the opportunity
arose because apps were not available for development immediately when
Apple launched the iPhone, and once they did launch that app,

(44:46):
uh within a month, less than a month, Steve Jobs
was on the phone calling them up and offering to
buy the technology, which of course they would agree to
and it would become an integrated component in Apple's iPhone
line afterward. And that's where voice assistants kind of lived
for a few years. They mostly lived on smartphones like
the iPhone. But in November two thousand fourteen, Amazon introduced

(45:10):
the Amazon Echo smart speaker, which was originally only available
for Prime members, and it had its own virtual assistant
named Alexa, and thus the smart speaker era officially began. Now,
there are plenty of other smart speakers that are on
the market these days. There are products from Google like
Google Home. Uh, there are Sonos speakers that can

(45:31):
connect to services like Amazon's Alexa or Google's Assistant, and
we're probably going to see a ton more, both from
companies that piggyback onto services from the big providers like
Google and Amazon, and maybe some that are trying to
make a go of it with their own branded virtual
assistants and services. Smart speakers respond to commands after they

(45:52):
quote unquote hear a wake up word or phrase. Now,
I'm gonna make up a wake up phrase right now
so that I don't set off anyone's smart speaker or
smart watch or smartphone or smart car or whatever it
might be. So this is just a fictional example of
a wake up phrase. So let's say I have a
smart speaker and the wake up phrase for my smart

(46:15):
speaker happens to be hey there, Genie. Well, my smart
speaker has a microphone, so it can detect when I
say that, but really it's constantly detecting all sounds in
its environment. The microphone is always active. It has to
be in order to be able to pick up on
when I say the wake up phrase. So the microphone

(46:38):
is always active on most smart speakers. There are some where you
can program it so that it will only activate if
you first touch the speaker and that wakes it up.
There's some that you can do that with, But for
the most part, they're always listening. While the speaker can
quote unquote hear everything, it's not listening to everything. In other words, it's not monitoring the specific things

(47:01):
being said. At least that's what we've been told. And honestly,
that makes a ton of sense from an operational standpoint.
And the reason I say that is that the sheer
amount of information that would be flooding in from all
the microphones on all the smart devices from any one
provider that happened to be deployed all over the world,
that would be an astounding amount of data. And sifting

(47:23):
through all that data to find stuff that's useful would
take an enormous amount of effort and time and
processing power. So while you could have all the microphones
listening in all over the place, finding out who to
listen to at what time would be a lot trickier
and probably not worth the effort it would take to
pull something like that off. So what these speakers and

(47:46):
other devices are actually doing is looking for a signal
that matches the one that represents the wake phrase. So
when I say, hey, they're Genie, the microphone picks up
my voice, which the mic then try inslates into an
electrical signal which gets digitized and compared against the digital
fingerprint of the predesignated wake up phrase. And in this case,

(48:09):
the two phrases match. It's like a fingerprint matching something
that was left at a site. So that turns the
speaker into an active listener rather than a passive one.
It's ready to accept a command or a question and
to respond to me. But if I didn't say, hey,
there, Genie, then the speaker would remain in passive mode

(48:31):
because it wouldn't have a digital fingerprint that matches the
one of the wake up phrase. Everything stays at the
local level, and none of my sweet secret speech gets
transmitted across the internet. It's all staying right there.
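As a very loose sketch of that local matching step (real devices use trained acoustic models, not anything this crude), here the "fingerprint" is just a coarse loudness profile, and nothing leaves the device unless the stored wake-phrase fingerprint matches. The sample values and the idea of streaming to "the cloud" afterward are placeholders for illustration.

```python
# Toy stand-in for on-device wake-phrase spotting.

def fingerprint(samples, buckets=8):
    """Reduce an audio clip to a short list of average loudness values."""
    step = max(1, len(samples) // buckets)
    return [sum(abs(s) for s in samples[i:i + step]) / step
            for i in range(0, step * buckets, step)]

def matches(candidate, reference, tolerance=0.2):
    """True if the two fingerprints are close enough, bucket by bucket."""
    return all(abs(c - r) <= tolerance for c, r in zip(candidate, reference))

WAKE_PRINT = fingerprint([0.0, 0.5, 0.9, 0.4, 0.1, 0.6, 0.8, 0.2])  # stored locally

def on_audio(samples):
    if matches(fingerprint(samples), WAKE_PRINT):
        print("Wake phrase detected: start streaming the request to the cloud.")
    else:
        print("No match: the audio stays local and is discarded.")

on_audio([0.0, 0.45, 0.85, 0.4, 0.1, 0.65, 0.75, 0.2])  # close enough, so it wakes up
on_audio([0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9])      # background noise, ignored
```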
At least that's what we've been told. And again I
don't have any reason to disbelieve this, but it is
something to keep in mind. You are talking about devices

(48:53):
that have microphones. Of course, if you have a smartphone,
you've already got one of those or a cell phone.
In general, you've got a device with a microphone on
it near you pretty much all the time. Now,
once I do make a request with my smart speaker,
the speaker then sends that request up to the cloud
where it gets processed, It's analyzed, uh, and then a

(49:14):
proper response is returned to me, whether that is playing
a song or giving me information I've asked for, or
maybe even interacting with some other smart device in my home,
such as adjusting the brightness of my smart lights in
my house. Now, if the system is not sure about
whatever it was I just said, it will probably return

(49:34):
an error phrase. So maybe I'm too far away from the speaker, so it couldn't quote unquote hear me really well. Or maybe I've got a mouthful of peanut butter or something, as I am wont to do. Then
I'm going to get something like I'm sorry, I don't
know how to do that, or I'm sorry I didn't
understand you, and then I'd have to repeat it. Now,
smart speakers are pretty cool. However, they do represent another

(49:57):
piece of technology that you have to network to other devices,
including your own home network, and as such that means
that they represent a potential vulnerability in a network. It
doesn't mean they're automatically vulnerable, but it means that every
time you are connecting something to your network, then you're

(50:18):
creating another potential attack vector for a hacker, right? Now, if everything is super strong, it doesn't really effectively
change your safety in any meaningful way. But if one
of those things that you connect to your network is
less strong than the others, you're looking at the weakest

(50:38):
link situation where a hacker with the right know-how and tools could potentially target that part of your network
to get entry into everything else. And when you're talking
about a smart speaker, you're talking about a device that has
an active microphone on it. So potentially, if someone were
able to compromise a smart speaker, they would be able

(50:59):
to listen in on anything that was within range of that smart speaker's microphone. So that's why you have to at
least be cognizant of that, do your research, make sure
the devices you're connecting to your network are rated well
from a security standpoint, when you're setting things up
and you have to create passwords, create strong passwords that

(51:22):
are not used anywhere else. The harder you make things, the more likely hackers will just pass you by, not because you're too tough to crack. Never get it into your head that you're too strong to be hacked,
but rather, if there's someone who's weaker, then the hackers
are going to go after that person instead. So just

(51:43):
don't be the weak person. Practice really good security behaviors,
and you're more likely to discourage attackers, and they'll go on to someone else. Um, especially if you're talking about newbies who don't really know their way around, they're just using tools that other people have designed. They get
discouraged very quickly. They'll move on to someone else because

(52:05):
there's always another potential target. I'm curious about you guys,
whether or not you have any smart speakers in your life,
and uh if you find them useful. I find mine
pretty useful. I use it for a very narrow range
of things. I don't tend to use it. I definitely
don't use it to its full potential. I know that

(52:26):
because once in a blue moon, I'll just try something
and I'm amazed at what happens when I get
a response. But for the most part, I'm asking about
the weather, what I can feed my dog, whether or not it can turn on the lights, and that's about it. Or occasionally playing a song. Um, but
I'm curious what you guys are using them for. Reach

(52:47):
out to me on social networks, on Facebook and I'm on Twitter, and the handle for both of those is TechStuff HSW. Also use those handles
if you have suggestions for future episodes. If you've got,
you know, an idea for either a company or a
technology or a theme in tech you'd really like me
to tackle, let me know there and I'll talk to

(53:08):
you again really soon. TechStuff is a production of iHeartRadio's How Stuff Works. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,
or wherever you listen to your favorite shows.
