Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
This is Advice from a Call Center Geek, a weekly podcast with a focus on all things call center.
We'll cover it all, from call center operations, hiring, culture, technology and education.
We're here to give you actionable items to improve the quality of your and your customers' experience.
This is an evolving industry with creative minds and ambitious people like this guy.
(00:21):
Not only is his passion call center operations, but he's our host.
He's the CEO of Expivia Interaction Marketing Group and the call center geek himself, Tom Laird.
Speaker 2 (00:30):
Welcome back, everybody, to another episode of Advice from a Call Center Geek, the call center and contact center podcast.
We try to give you some actionable items to take back to your contact center to improve the overall quality and improve the agent experience.
Hopefully you can improve the customer experience as well.
Oh, how's everybody doing?
My name is Tom Laird.
Most of you guys know me by now.
I would hope.
(00:51):
It's been a while since we've done a podcast.
I have been slammed with Expivia things and with AutoQA really launching and really starting to take off with lots of companies, so I get a lot of questions on AutoQA.
I wanted to answer as many of them as I possibly can from my list here and talk about everything that we basically
(01:14):
have learned in the last 13, 14, 15 months of starting from scratch with an AI tool, doing it for our internal customers first and then moving it to a product that customers are actually paying for or using.
It's been really awesome.
So I'm live on LinkedIn, live on TikTok.
(01:34):
If you guys have any questions, please, please, please ask away.
I'm here to answer anything.
We'll do this as we normally do, as a full AMA, but let's start with some things that I didn't think about at first, some interesting things, starting with the models that we use.
So when ChatGPT 4 came out is really when we started to say,
(01:59):
okay, I think we could do something with an AI tool.
So, as most of you guys know that are following me, we initially built AutoQA for our internal customers, and that's who we kind of tested it on and kind of learned what we needed to do with.
We thought it was going to be pretty easy.
There was nothing easy about it.
I know a lot of people just say, well, all you guys are doing
(02:21):
is building a GUI on top of a large language model, and I guess, yeah, that's kind of what we're doing, but there are so many intricacies, so many things you need to do with prompting, so many things that our system prompt has to do, that we learned a ton.
The main thing that we learned is that ChatGPT 4 and GPT-4o are
(02:45):
not good models for what we are using this for.
I don't know if people disagree with this.
Nobody's really talked to me about what models they're using in this kind of auto QA, CX space, right.
Everybody's kind of hush-hush on that.
I'm pretty open with it.
Although it's cheaper, right, when 4o came out, it cut our costs in half, but our accuracy was also cut in half.
So very quickly the excitement of 4o turned into, man, this
(03:11):
thing is almost worse than 4.
So that's when we started to say, all right, well, what other models are out there?
And we tested pretty much all of them.
The only ones that we have found to be absolutely the best for our use case have been Anthropic's products, right.
So we use Haiku to go do summarizations, right.
(03:32):
So if somebody has 150 forms, we want to summarize the four things that they've done, you know, well or poorly, on those 150 calls that we've evaluated, and we just send it out to Haiku.
That does a really good job of just kind of summarizing very basic stuff.
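To make that concrete, here's a minimal sketch of what routing a summarization job to a cheaper model can look like, assuming the official Anthropic Python SDK and an API key in your environment. The model ID, prompt wording, and function are illustrative only, not our actual AutoQA code.

```python
# Illustrative sketch: send a low-stakes summarization task to a cheap, fast
# model (Haiku) while reserving the stronger model for the evaluations themselves.
# Assumes the official "anthropic" Python SDK and ANTHROPIC_API_KEY in the
# environment; the model ID and prompt are placeholders, not AutoQA's real code.
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

def summarize_evaluations(evaluation_notes: list[str]) -> str:
    """Condense many per-call QA notes into a short coaching summary."""
    joined = "\n".join(evaluation_notes)
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # cheap model, fine for summaries
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Summarize the top things this agent did well and the top "
                "things they did poorly across these QA evaluations:\n\n" + joined
            ),
        }],
    )
    return response.content[0].text
```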
We initially started with Sonnet.
We thought Opus was going to be too expensive.
Sonnet was a huge step up from 4o, but it still did not give
(03:56):
us the same consistency.
And it's kind of an oxymoron to say, well, if something's accurate but it's not consistent, is it really accurate?
And I mean, the answer is no.
But we would see, you know, out of 10 questions that we asked, Sonnet would be right on eight or nine, but then one time, you know, it would just give us something weird.
So we worked through a lot, and, to be honest, that's where
(04:19):
we learned a lot of things from a prompting standpoint that would help with consistency.
Accuracy is basically just the core prompt: if we're not accurate, then we have a prompting problem.
We're telling it to do something wrong, there's wording wrong, there's something there.
But when it came to just the consistency of scoring, that's
(04:41):
where some really cool things came out of it.
We have seven kind of checks within our system prompt to make sure that we are being as accurate as possible with our answers.
So I think you guys have heard me talk about this a little bit, but we reiterate every single question.
So even if it costs us more in tokens, you know, whatever we're
(05:02):
doing, our mission is to be the most accurate and the most consistent auto CX, auto QA platform on the market.
So the one thing that we found is that if we take a question like, did the agent use the proper greeting, we will
(05:24):
ask that question and have the large language model answer it, but then hold that answer, like, in the back of its AI head, and then we ask the question again, and if both of those match, then boom, we give the answer, yes or no or whatever we're scoring.
If we still have some type of inconsistency, it's a yes and a no, then we say, for a third time, don't ask the question, but go
(05:45):
back into the transcript, search for where you need to find that answer, right, I'm paraphrasing here, and then whatever is closest, score that.
And we found that to be extremely consistent too.
So we do about six or seven of these checks on every single question, with some different techniques based on what we have
(06:06):
found to give us the most consistent answers.
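Here's roughly what that re-ask check looks like as a simplified sketch in Python. The `ask` callable stands in for whatever LLM call you use, and the wording of the tie-breaker instruction is illustrative, not our actual system prompt.

```python
# Simplified sketch of the re-ask consistency check described above (paraphrased,
# not the actual system prompt). `ask` is any callable that sends one QA question
# plus the call transcript to an LLM and returns "yes" or "no".
from typing import Callable

def consistent_answer(ask: Callable[[str, str], str],
                      question: str, transcript: str) -> str:
    """Ask the same question twice; if the answers agree, use them.
    Otherwise run a third, evidence-grounded pass and take that answer."""
    first = ask(question, transcript)
    second = ask(question, transcript)
    if first == second:
        return first
    # Tie-breaker: send the model back into the transcript to find the passage
    # that settles the question before it answers.
    evidence_prompt = (
        f"{question}\n"
        "Do not answer from memory. Quote the exact part of the transcript "
        "that settles this question, then answer yes or no."
    )
    return ask(evidence_prompt, transcript)

# Toy usage with a stand-in "model":
# consistent_answer(lambda q, t: "yes", "Did the agent use the proper greeting?", "...")
```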
So it's been really interesting to kind of learn the prompting techniques and some of those things.
I guess you can just be so creative with this.
And I think being a contact center doing a tool like this, we know what we want, right, we're not just kind of looking
(06:28):
for specific keywords.
I love the phrase that I've been using for this, which is intent and outcomes.
Right, we're looking for agent intent and we're looking for, did the proper outcome take place?
And that's where we can score this thing at a little different level than just, again, looking for specific keywords, although we're really good at that, like if you have a disclosure and these specific keywords have to be said exactly.
(06:49):
Obviously, AI can do that.
But AI can also, quote unquote, think, right, and get deeper into empathy, get deeper into, did the statement that the agent made when the customer said this have an impact on the overall sentiment of the call?
Right?
You can go a little bit deeper into some of the things that we
(07:09):
couldn't do with just analytics and just looking for keywords.
And I think a lot of the models right now are trying to kind of mesh these two things, because they see, with the amount of overhead that some of these larger companies have, that just using the large language models can be expensive for them, you know, where
(07:33):
for us it's not that bad, right, because I don't have a ton of overhead.
We are doing this the way a contact center would do it, not the way a software company would do it, which I don't know if that's good or bad.
But I think those are some of the things that we have noticed make a lot of difference in the actual prompting of the question.
So we have found that, obviously, the system prompt is absolutely
(07:54):
vital to tell it the exact outputs that we want.
We have found that we can't trust any of the large language models to actually do the scoring, so we do the scoring.
We kind of programmatically do it on the back end.
Once the answer comes back, yes or no, then we score it ourselves with kind of programming.
So we know that it's always going to be, you know, spot on
(08:17):
from there.
We tried to do it with the language models but, again, they're okay at math, they're not great at math, so we had some inconsistencies there early on that we tried to work through.
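To illustrate the point about keeping the math out of the model: the LLM only returns a yes or no per question, and the form total is computed in ordinary code. This is a minimal sketch with made-up question names and point values, not our actual scoring engine.

```python
# Minimal sketch: the model answers each question "yes" or "no", and the
# arithmetic happens in plain code, so totals are always exact.
# Question names and point values below are made up for illustration.
def score_form(answers: dict[str, str], weights: dict[str, int]) -> float:
    """Return the percentage score for one evaluated call."""
    earned = sum(weights[q] for q, a in answers.items() if a.lower() == "yes")
    possible = sum(weights.values())
    return round(100 * earned / possible, 1) if possible else 0.0

weights = {"proper_greeting": 10, "verified_identity": 20, "resolved_issue": 30}
answers = {"proper_greeting": "yes", "verified_identity": "no", "resolved_issue": "yes"}
print(score_form(answers, weights))  # 66.7
```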
The new model that just came out yesterday, as I'm recording this on the 21st of June 2024, is Claude 3.5 Sonnet, and for us it
(08:44):
was like Christmas.
We didn't know it was coming.
We have been using Opus.
I guarantee you that no other auto QA platform is using Opus, and everybody knows that Opus is the best.
It's been beating up GPT-4o for a really long time, but
(09:04):
no one's using it because it's so expensive.
The core model that we've been using for everyone is Opus, right, the most expensive model, but we're getting the best results, and we just figured that that'll take care of everything else.
(09:24):
So 3.5 just came out yesterday, and it's about half the cost, actually a little bit more than half the cost, of Opus, but what we're finding is that it is better than Opus as well.
Now, we've only been testing this for a day, running the models.
So how we calibrate these calls, and maybe I'll get into the calibration, is a customer will send us 10 scored calls, so we'll have the call and we'll have, maybe it's a
(09:48):
PDF or it's an Excel spreadsheet, whatever it is, of how the call was scored.
We will put that into an Excel spreadsheet as kind of our base.
And then what we've been doing normally is just running Opus as the Auto brain, and then we'll just put in, for each question, what did the human score and what did Auto score, also known as Claude 3 Opus, right?
(10:13):
Well, how did it score?
And then we would calibrate from there.
Right, we would tag it blue if we thought that Auto was a little bit more correct, and we tag it red if we thought the human was a little bit more correct.
If the human was a little more correct, then we just kind of change and tweak how we're prompting, how we're asking that question.
A lot of times it's not a matter of being right or wrong, it's being right or wrong compared to the scoring culture
(10:36):
of the company.
Right, they might be really strict on certain things and have more leeway on other things where Auto initially will be really strict on something.
So we just kind of dial that back, and we can then match the scoring culture of every organization.
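As a simplified illustration of that calibration step, you can line up the human answer and the model answer for each question on the seed calls, compute an agreement rate, and surface the mismatches for a person to tag. The data shapes and numbers here are invented, and the blue-or-red judgment stays with a human.

```python
# Illustrative sketch of the calibration comparison: line up the human's answer
# and the model's answer per question, compute the agreement rate, and list the
# mismatches for a human to review (deciding who was "more correct" stays manual).
def calibration_report(human: dict[str, str], model: dict[str, str]):
    mismatches = [q for q in human if model.get(q) != human[q]]
    agreement = 100 * (len(human) - len(mismatches)) / len(human)
    return round(agreement, 1), mismatches

human_scores = {"greeting": "yes", "empathy": "no", "disclosure": "yes"}
model_scores = {"greeting": "yes", "empathy": "yes", "disclosure": "yes"}
pct, to_review = calibration_report(human_scores, model_scores)
print(pct, to_review)  # 66.7 ['empathy'] -> review, then tweak the prompt or accept the model
```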
But what we started doing yesterday is we'd have a row for the human, a row for what we call Auto 3.0, meaning Claude 3 Opus, and
(10:57):
then Auto 3.5, meaning Claude 3.5 Sonnet, or Sonnet 3.5, depending on how you say it, and it's been really interesting.
A lot of times we have seen that with Claude 3 Opus we match the human a little bit more,
(11:20):
but with 3.5 we're actually finding it to be more correct.
So we're finding more, quote unquote, human error in the calls when we're using 3.5.
So it looks like we're moving to 3.5 as kind of our core model; we're going to test this for another day or two.
It's been easier to kind of tweak, and also, I'll say this: we
(11:43):
have a very large customer where we just ran the first 10 calls with 3.5, and without any tweaking we were at about 93% or 94% accurate.
So with our tweaking we'll be able to get them to 98% of how they're scoring, how the humans on their team are scoring.
(12:06):
So I'm really excited about that.
I think that's been a little bit of a difference too: that initial run is so much closer now that we have less tweaking, which means less cost for us, because we're not running as many kind of calibration runs of the AI.
So I think that's been really, really interesting.
(12:29):
I think you're going to see a war now, or a battle or a sprint or a race, whatever you want to call it, between Anthropic and OpenAI.
You know, I would bet you're going to see a ChatGPT 5 within the next month, right, I guarantee they're pushing it up.
You know, Opus was the best, but it was so expensive that I don't think anybody was really using it in the fashion that we
(12:53):
were.
So now to have that cost, well, it's still a little bit more expensive than GPT-4o, but it's comparable and it's way better, at least for our use case.
I'm not saying 4o is bad, right, but for the use case that we're using it for, it's not even close, right.
We almost can't even use 4o with how poorly it has been able
(13:17):
to ask questions of this data.
So, getting into some of the tricks and some of the things that we have found: calibration has been something that we're doing and spending a lot of money on for every client, to get it to the point where we've been fully calibrated.
I think it's been really cool to learn some of the prompting
(13:39):
tricks, right, like reiteration.
You know, Anthropic likes to use XML tags a lot more than ChatGPT 4 does.
So, you know, learning that there is a full kind of prompt engineer, or a prompt helper, I guess, on the Anthropic workbench for Claude has been useful, but again, I
(14:13):
think it goes a little bit too in depth sometimes.
We found that the most success we've had is with the human brain when it comes to tweaking a prompt, unless we're way off or something's kind of going crazy, but getting the format right, with the XML tags, I think has been pretty helpful.
With that, there are just so many more tools in the Anthropic pool that I think are cool.
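For anyone who hasn't seen the XML-tag style, here's a generic example of how a QA question can be wrapped in tags for Claude. The tag names and wording are hypothetical, just to show the format, not our actual system prompt.

```python
# Generic example of the XML-tag prompt style that Anthropic's models respond
# well to. Tag names and wording are hypothetical, not the real system prompt.
qa_prompt = """
<transcript>
{transcript_text}
</transcript>

<question>
Did the agent use the proper greeting, including their name and the company name?
</question>

<instructions>
Answer "yes" or "no" inside <answer> tags, and quote the supporting line from
the transcript inside <evidence> tags.
</instructions>
"""
```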
One of the things that we're moving to now, because vision is
(14:35):
starting to come out and to be good on both of the models, is to, you know, kind of use the forms: feed those forms in and let the model see the form that a customer has at the beginning, to start our initial build-out.
The longest period of time that we have, or the longest phase,
(14:57):
is the actual initial build-out, right?
So a customer will send us, like, their rubric.
They'll send us all the questions: how do you score these questions, here's some background on this, right?
So to build those out takes a little bit of time, and we do have an AI tool that we're utilizing to speed that up, but that's something that we're looking at to make this even quicker.
We're currently at about 10 days to two weeks on the build-out; we've been longer for some customers, shorter for others.
(15:19):
We won't go to proof of concept until we're at least 96% calibrated.
So I think there's a lot to be said for the tools that are coming out and for how smart the models are getting at being able to start to take other pieces of data input.
That will help us as we move forward.
(15:40):
I'm very excited.
The next big step, I think, from the model standpoint, that we are excited about and preparing for: the one thing that we stink at is that we can't see the screen, right.
We thought that 4o was going to be the panacea, that we were going to be able to take the screen recordings and QA what an
(16:02):
agent is actually doing on the screen as well, and while we can get data from that, right, we haven't really been able to get it to the point where it can QA it.
I think that's coming very shortly.
That was a little frustrating.
That kind of was the, oh my God, 4o is here, oh my God, I can do this kind of video and this imaging,
(16:24):
this is unbelievable.
But it's just the processing power, I think, that it needed.
It just never could get to that point where we could put it to use; it's been a little bit of a struggle there.
So we're still sticking with audio.
So voice, email, chat, SMS, help desk tickets: those are all the
(16:47):
things that we're really good at QAing.
I think the other thing is customers have been extremely helpful with how we should roadmap this thing out.
I'm glad that we went out early.
You know, there are certain things that customers want, a lot of different changes to the dashboard that we
(17:09):
will be able to see, you know, starting even today, with being able to show differentiation in how much better or worse an agent is getting.
I think those basic types of reporting are important.
You know, auto-fails were not easy to do, right?
Auto-fails are something that we're now doing programmatically.
So for a section of a QA form, we will just click it into being an
(17:32):
auto-fail section.
So if any of those questions are answered yes, then we will still score the call, but it'll be a zero for the overall call, although you'll still be able to see the sections.
So we've been able to tweak some things there.
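Here's a simplified sketch of that auto-fail behavior: section-level results stay visible, but if any question in an auto-fail section comes back yes, the overall score is forced to zero. The field names and the "yes means the violation happened" convention are just for illustration.

```python
# Simplified sketch of the auto-fail logic described above. Field names are
# illustrative. In auto-fail sections, a "yes" answer means the violation happened.
def score_call(sections: list[dict]) -> dict:
    """Each section: {"name": str, "auto_fail": bool, "answers": {question: "yes"/"no"}}."""
    section_results = {}
    auto_failed = False
    for s in sections:
        yes = [q for q, a in s["answers"].items() if a == "yes"]
        if s.get("auto_fail"):
            section_results[s["name"]] = "FAILED" if yes else "passed"
            auto_failed = auto_failed or bool(yes)
        else:
            section_results[s["name"]] = round(100 * len(yes) / len(s["answers"]), 1)
    graded = [v for v in section_results.values() if isinstance(v, float)]
    overall = 0.0 if auto_failed else round(sum(graded) / max(len(graded), 1), 1)
    # The overall score zeroes out on an auto-fail, but the per-section detail
    # is still returned so reviewers can see what happened.
    return {"overall": overall, "sections": section_results}
```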
Rated calls we're getting much better at.
(17:52):
I feel very comfortable now going out with customers who say, hey, I don't want a yes or no, I want a scale of one through 10.
We have one customer that is scoring this zero, one, two or three, right, and they get points, you know, based on those different levels, and I think we're very close and
(18:13):
very accurate with those guys.
So I think that customer is pretty happy with where things are going, because that's a little bit more subjective, right, the difference between a two and a three, like they did this, but they did it a little better, so it's a three, compared to a yes or no, where they either did it or they didn't.
There's a big difference in how the AI needs to think that through.
It's a little bit more subjective, but there are some prompting techniques that we've been able to utilize for that as
(18:35):
well.
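One way to tame that subjectivity, as a sketch: give the model an anchored definition for every level and make it quote evidence before it picks one. The rubric wording below is invented for illustration; the points still get applied outside the model.

```python
# Sketch of an anchored rubric for a scaled (0-3) question. The level
# definitions and tag names are invented for illustration; point values are
# applied to the form outside the model, as with the yes/no questions.
empathy_rubric = """
<question>Rate the agent's empathy on this call.</question>

<levels>
0 = No acknowledgement of the customer's frustration at any point.
1 = Acknowledged the issue but only with scripted or dismissive language.
2 = Acknowledged the issue and personalized the response at least once.
3 = Consistently acknowledged, personalized, and checked back with the customer.
</levels>

<instructions>
Quote the strongest supporting lines in <evidence> tags, then give a single
digit from 0 to 3 in <score> tags.
</instructions>
"""
```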
So, yeah, it's been a lot of fun.
We are getting better and better and faster and faster at this every single day.
I'd love to build some of this stuff out for you guys.
Again, remember, no contracts right now, and there's still no setup cost.
I don't know how much longer we're going to be able to do that for, but right now there's still no setup cost.
(18:56):
It's a total usage model, right?
So no seat licenses or any of that.
It doesn't cost you anything to run this, from us building your form out to a full proof of concept where you're able to do 100 to 200 calls on your own to check it out.
So we do all the work for you.
All you do is sit back and, at the end, score some calls,
(19:17):
answer some questions if we have any, and then hopefully we can really do some cool things for you from a cost standpoint.
As you know, the cost of this, compared to a human being, is pennies.
So I think that, you know, you can utilize your teams in a lot of different ways that are more productive as well.
So, thank you guys.
(19:39):
Again, check us out: autoqa.com, expiviausa.com.
Love to talk with you guys, either about your outsourcing or about our AutoQA platform as well.