
December 2, 2025 · 49 mins

Chris and Daniel unpack how AI-driven document processing has rapidly evolved well beyond traditional OCR, with many technical advances that fly under the radar. They explore the progression from document structure models to language-vision models, all the way to the newest innovations like DeepSeek-OCR. The discussion highlights the pros and cons of these various approaches, focusing on practical implementation and usage.


Sponsors:

  • Shopify – The commerce platform trusted by millions. From idea to checkout, Shopify gives you everything you need to launch and scale your business—no matter your level of experience. Build beautiful storefronts, market with built-in AI tools, and tap into the platform powering 10% of all U.S. eCommerce. Start your one-dollar trial at shopify.com/practicalai
  • Fabi.ai - The all-in-one data analysis platform for modern teams. From ad hoc queries to advanced analytics, Fabi lets you explore data wherever it lives—spreadsheets, Postgres, Snowflake, Airtable and more. Built-in Python and AI assistance help you move fast, then publish interactive dashboards or automate insights delivered straight to Slack, email, spreadsheets or wherever you need to share it. Learn more and get started for free at fabi.ai
  • Framer – Design and publish without limits with Framer, the free all-in-one design platform. Unlimited projects, no tool switching, and professional sites—no Figma imports or HTML hassles required. Start creating for free at framer.com/design with code `PRACTICALAI` for a free month of Framer Pro.



Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Jerod (00:04):
Welcome to the Practical AI podcast, where we break down the real world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're

(00:24):
in the right place. Be sure to connect with us on LinkedIn, X, or Bluesky to stay up to date with episode drops, behind the scenes content, and AI insights. You can learn more at practicalai.fm.
Now onto the show.

Sponsor (00:39):
Well, friends, when you're building and shipping AI products at scale, there's one constant: complexity. Yes. You're wrangling models, data pipelines, deployment infrastructure, and then someone says, let's turn this into a business. Cue the chaos. That's where Shopify steps in, whether you're spinning up a storefront for your AI powered app or

(01:00):
launching a brand around the tools you built.
Shopify is the commerce platform trusted by millions of businesses and 10% of all US ecommerce, from names like Mattel and Gymshark to founders just like you. With literally hundreds of ready to use templates, powerful built in marketing tools, and AI that writes product descriptions for

(01:21):
you, headlines, even polishes your product photography, Shopify doesn't just get you selling, it makes you look good doing it. And we love it. We use it here at Changelog. Check us out at merch.changelog.com. That's our storefront, and it handles the heavy lifting too. Payments, inventory, returns, shipping, even global logistics. It's like

(01:42):
having an ops team built into your stack to help you sell. So if you're ready to sell, you are ready for Shopify.
Sign up now for your $1 per month trial and start selling today at shopify.com/practicalai. Again, that is shopify.com/practicalai.

Daniel (02:17):
Welcome to another fully connected episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined by Chris Benson, my cohost, who is a principal AI research engineer at Lockheed Martin. And in these fully connected episodes where it's just Chris and I, we try to dig into a few topics or deep dive into some

(02:40):
learning resources that will help you level up your AI and machine learning game. Looking forward to this one, Chris.
I think, reflecting before the episode, both of us are going into American Thanksgiving, which is tomorrow as we're recording this, but going in with a lot of gratitude for the year.

(03:01):
Just a lot happens in life, and it's a nice time to kinda reflect and see the blessings that we have at Thanksgiving. And, yeah, what a blessing to just keep doing this show for going on eight years now.

Chris (03:21):
It's been a moment.

Daniel (03:24):
Having a lot of fun, stepping on a few minds along the way, but having fun generally. And I think, yeah, thankful to our listeners as well, just to take a moment to say thank you for sticking with us all these years. Chris and I have a lot of cool plans for the coming

(03:45):
year, and there's energy behind the show, lots of ideas going on that we'll talk about soon. But yeah, thank you to our listeners for sticking with us.

Chris (03:55):
Couldn't say it better. Thank you to the listeners for sticking with us. And I gotta say, these fully connected shows in a lot of ways are so much fun. They're among my very favorites, because we get to talk to these most amazing guests in a typical episode, you know, where you're talking to some of the smartest people in the world, and being able to kind of understand how they see it and learn. And I

(04:17):
know our listeners go along for the ride on that.
But I also love when we just, you know... it's the Wednesday afternoon before Thanksgiving for you and me as we're recording this. I know people will be listening to it just after Thanksgiving. But it's a lot of fun just to jump into the conversation. And I know we have some fun things to hit today. So I'm relaxed and looking forward to it, Daniel.

Daniel (04:37):
Yeah, yeah, for sure. And I don't know a more exciting topic for the Thanksgiving dinner table than document processing, which is what I kind of brought forward today. I guess what I was realizing, Chris, is we talk a lot about large language models. We have talked a lot

(05:00):
about computer vision type of things on the show, maybe not as much recently, but over the years. We've talked about all the kind of chatbot stuff and all of that, but I think kind of lurking below the surface of a lot of work in industry is

(05:20):
document processing.
As the years have gone along and we've kind of entered into the generative AI revolution, there has also been this kind of stream of innovations in relation to processing documents in an automated way with models. And of course that reaches very practical places in terms of

(05:47):
everyday business work, right? I think often the most valuable workflows that people have day to day, or maybe the most annoying ones, is, you know, this person sends me an email with this document, I've gotta extract this or do this or create a summary of that. Or, I have new documents that are, you

(06:12):
know, regulations related to compliance, and I need to process them and get them, you know, into somewhere.
And that's really kind of at the center of a lot of what happens in businesses day to day. So, yeah, I thought, you know, as we hopefully aren't yet in a coma after eating too much

(06:33):
turkey, we could use this time when we're alert to talk about some of that.

Chris (06:39):
Great point there. And, like, I kind of hate the name, document processing. I think, before everyone out there tunes us out and goes, oh my God, they're talking about document processing, and goes to sleep... this is pretty cool stuff. And it's

Daniel (06:54):
Important, because it's wise.

Chris (06:57):
Yeah, absolutely. And it is productive, and we pride ourselves, you know, on bringing that practical, productive, and accessible approach to it. And I think that's really important. I think one of the differences in the conversations we have on this show versus some other shows is the other ones tend to chase the headlines and the glam and stuff. And we're really focused

(07:19):
on, like, getting people into this technology so that they can use it day to day in a fun way. And yeah, so, before you turn off and go, oh, I'm gonna tune out for turkey instead of document processing...
This is pretty cool stuff. As Daniel said, this has been going on, but it just doesn't get the headlines anymore like it used to. And so it's really worth diving into and saying, hey,

(07:42):
look at what's possible now versus the last time we talked about it.

Daniel (07:45):
Yeah. And probably what initially prompted this is, of course, I mean, we've been working with some of these models internally, but also DeepSeek did release a DeepSeek OCR model, which people have been talking a lot about, which represents at least part of this stream of work that's been going

(08:06):
on around document processing models. Now, just so people kind of have, I guess, a little bit of background or jargon, kind of where we're headed, my thought is we really need to pick apart some of these different kinds of modeling, how they fit in and where they're practical, maybe where they're

(08:27):
not practical. And in particular, there is OCR, which has been around for the longest, I guess, in terms of the things that we'll talk about, which is optical character recognition. That's what that stands for.
Then there are language vision models, or LVMs, which is something that has happened since. Then there are, I guess, document

(08:53):
structure type of models, kind of like a Docling, people might've heard of Docling. And then finally, there's kind of this latest model, DeepSeek OCR, which is different from kind of what people might think of in terms of OCR. And so there are these different kinds of categories or families of

(09:14):
methodologies here. And there's really, like you say, Chris, a lot happening in these different areas, but that's kind of where we're headed in the conversation, I guess, for those listening, as we kind of pick apart some of these things. I don't know, Chris, how long... I mean, I kind of remember OCR

(09:35):
happening for a very long time. I mean, neither one of us, I think, grew up with computers, at least that had OCR on them, or computers in general. I do remember in grad school, you know, processing some papers or other things and applying some

(09:57):
type of OCR maybe in some tools on these. But yeah, what's your history there?

Chris (10:03):
Yeah, well, I mean, early OCR was really not very good, and this was, you know, certainly before kind of the current generation of AI. And I'm using generation very broadly here, like the last fifteen years. And it's come a long way with these new technologies and stuff. I know when I was younger, some of the kind of pre AI OCR technologies just were

(10:25):
like... I remember trying them when I was younger and kind of going, this is really not working, it's almost costing me more effort than it's worth. So things have changed dramatically.
I mean, it's so good now, and there are so many approaches to it, as we're gonna dive into.

Daniel (10:41):
Yeah, yeah. And I think that maybe a good starting point for that, if we just start with OCR, is really thinking about the processing pipeline and the different components that are involved in it, because that really drives what compute is needed, how fast it is, how performant it is, you know, and

(11:01):
it kind of distinguishes it as a category. So if we just start with OCR, I think we could do that. Now, just by way of reference, in terms of how things are processed through a kind of quote classical OCR model or a typical OCR model, these would be things like Tesseract or PaddleOCR, these sorts of

(11:22):
technologies that we're thinking of. What happens is an image is input, and then ideally kind of text or characters are output. If we just contrast that, because everyone's talking about LLMs now, the typical processing pipeline with LLMs is, you know, not images come in, but text comes in. That text is split

(11:45):
apart into tokens. Those tokens are assigned indices, like within a vocabulary. That kind of array of indices is embedded into a dense representation by a transformer based model, often. And then what is predicted on the output side is an array of

(12:11):
probabilities corresponding to different tokens, such that you can know what is the most probable next token coming out of the model. So you kind of have text come in, that text is split apart into tokens, that's embedded, and then the output is these probabilities of the next token.
So if we just contrast that with the OCR model, first of all, we

(12:33):
have a different type of input, right? We have an image, and that image is made of pixels. And often... so we have this image, it's made of pixels, and the output actually is not dissimilar to the LLM. There is an output of probabilities at the end. It's just an output of kind of probabilities of characters.

(12:55):
So what happens is, if you look at a big image, it might have regions of characters in it or words. And what happens in the OCR model is you take that big image with a lot of characters. There might be some preprocessing on the image, like a resizing or something. But then there is one kind of model that

(13:19):
detects the areas or regions where there are kind of characters or text, text regions. And then you take each of these text regions and you put it through, like, a convolutional neural network or an LSTM. And then that outputs, through a sequence model, a probability of characters, or the

(13:43):
probability of what character corresponds to that region, right?
So essentially that OCR model, it's really just looking at that big image, determining where there are characters or text regions. And then for each of those, predicting what that character or text region is, right? So that's how the

(14:04):
processing goes, which in some ways kind of seems almost brute force, right? You're splitting it apart into all of these regions, right?

Chris (14:16):
As you were talking though, I was also thinking back over the history of the show, and we're talking like, this is the first time I think you've said LSTM, you know, in

Daniel (14:25):
a while

Chris (14:26):
in a bit. Yeah. How many years has it been since we talked about that, and recurrent neural networks, you know, which were also involved, and then kind of transformers also starting to bridge the gap there. Wow. Taking us back a little ways there.

Daniel (14:43):
We're good. Taking us back. Yeah. This is really, in a lot of ways, a brute force type of thing. You're really splitting apart that image into these different regions, and then for each kind of trying to detect which character it is.
Now, similar to what you were saying, we're talking about maybe convolutional models or architectures, maybe LSTMs,

(15:08):
which is long short term memory, a recurrent type of network. Traditionally, in these tools, like the OCR tools, these are rather small models by today's standards. And as such, even though you're kind of brute forcing all of these characters, they are fairly efficient in

(15:30):
terms of where you can run them. So I can run one easily on my laptop. I can run it on a CPU. I don't have to have a large GPU.
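
To make that pipeline concrete, here is a minimal sketch using Tesseract via the pytesseract wrapper (engines like PaddleOCR follow a similar flow). The file name is a placeholder and the exact output fields can vary by Tesseract version, but it shows the detect-regions-then-recognize loop, and it runs fine on a CPU.

```python
# Classical OCR flow: detect word regions in a page image, then recognize the
# characters in each region. Assumes Tesseract is installed locally and the
# pytesseract wrapper is available (pip install pytesseract pillow).
from PIL import Image
import pytesseract

page = Image.open("scanned_page.png")  # placeholder input scan

# image_to_data runs detection + recognition and returns, per detected word
# region, its bounding box, the predicted text, and a confidence score.
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

for text, conf, x, y, w, h in zip(
    data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
):
    if text.strip() and float(conf) > 0:  # skip empty or rejected regions
        print(f"({x},{y},{w},{h}) conf={conf}: {text}")

# Or, if you only need the flat text and don't care about layout:
print(pytesseract.image_to_string(page))
```

Note that the output is just words and boxes; any notion of headings, tables, or reading order is up to you, which is exactly the gap the approaches discussed next try to fill.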

Chris (15:38):
True. You know, what's interesting is how that evolution and the different kind of branches of possibility, in terms of how you might approach the problem, have developed. Do you have any kind of thoughts around, as we went from LSTMs and got to convolutional networks, and then transformers started making an impact on that? You know,

(16:00):
maybe after we come out of the break, we can talk a little bit about kind of how those evolved and why the different selections became kind of primary over time.

Sponsor (16:15):
Well, friends, it is time to let go of the old way of exploring your data. It's holding you back. But what exactly is the old way? Well, I'm here with Mark DePuy, co founder and CEO of Fabi, a collaborative analytics platform designed to help data explorers like yourself. So, Mark, tell me about this old way.
So the old way, Adam, if you're a product manager or

(16:36):
a founder and you're trying to get insights from your data, you're wrestling with your Postgres instance or Snowflake or your spreadsheets, or maybe you don't even have the support of a data analyst or data scientist to help you with that work. Or if you are, for example, a data scientist or engineer or analyst, you're wrestling with a bunch of different tools, local Jupyter Notebooks, Google Colab,

(16:58):
or even your legacy BI to try to build these dashboards that, you know, someone may or may not go and look at. And in this new way that we're building at Fabi, we are creating this all in one environment where product managers and founders can very quickly go and explore your data regardless of where it is. It can be in a spreadsheet, it can be in Airtable, it can be in Postgres, Snowflake. It's really easy to do everything from an ad hoc

(17:21):
analysis to much more advanced analysis if, again, you're more experienced.
With Python built in right there and an AI assistant, you can move very quickly through advanced analysis. And the really cool part is that you can go from ad hoc analysis and data science to publishing these as interactive data apps and dashboards, or better yet, delivering insights as automated

(17:45):
workflows to meet your stakeholders where they are in, say, Slack or email or spreadsheets. If this is something that you're experiencing, if you're a founder or product manager trying to get more from your data, or for your data team today you're just underwater and feel like you're wrestling with your legacy, you know, BI tools and notebooks, come check out the new way and come try out Fabi.
There you go. Well, friends, if you're trying to get

(18:06):
more insights from your data, stop wrestling with it. Start exploring it the new way with Fabi. Learn more and get started for free at fabi.ai. That's fabi.ai.
Again, fabi.ai.

Daniel (18:25):
Yeah, Chris. So you were just kind of getting into, I guess, maybe why. Assuming we have OCR, right? That does work in the sense that you can predict characters, you can pick out these text regions. So OCR models have obviously gotten better over the years. So why is there a need for something else?

(18:48):
Why is there a transition to maybe other architectures or other things? So what I would say is, if you think about that process of the image coming in and you splitting apart those text regions, you kind of end up with all of this kind of plain text output, and any sort of logic

(19:10):
around the reconstruction of that document, especially related to the layout of the document, is problematic, I would say. And I would say these are often highly dependent on the actual quality of the pixels that are input. Remember, the

(19:32):
pixels are input here, and often the images are kind of resized on the input to these models, or they need to be, just in terms of the input size. So you've got kind of this combination of problems of not having an understanding of the layout, but also requiring kind of clean scans of the documents, if you

(19:54):
will, which is definitely a drawback of this approach, I would say.

Chris (20:01):
Yeah, I mean, I can remember back in the day with the traditional OCR, I mean, that was not just a problem, but it was constant. You would use OCR on a document and you had to pretty meticulously go through the document afterwards to correct a lot of the errors in it. That didn't change really until we got past the traditional approaches into more of the vision based models. So definitely seeing the

(20:25):
progression there.

Daniel (20:26):
Yeah, yeah. And I mean, that kind of naturally transitions us into one of the things that is now a part of our world and helps with at least a part of that problem, the structure and layout problem, which are what are called document structure models. One of the most

(20:48):
popular of these is called Docling. And there are different families of these. Docling, it might be confusing, because there are some models that are kind of labeled as Docling models.
There's also a toolkit called Docling that IBM released, which isn't actually just one model. It's a whole series of pipelines and options around document processing. But one of

(21:11):
the core concepts here, whether it's in use in that library or in reference to a model, is that a document structure model, in terms of what it does or how it differs, actually doesn't do any OCR. It doesn't detect text. It doesn't convert images

(21:36):
to text and this sort of thing.
What it does is it tries to predict the structure of the document, or a structured representation of the document. Because remember, with OCR, we don't really have that, right? We just have the prediction of these characters in these different, you know, croppings of the image. And so with

(22:01):
Docling or a similar document structure model, what happens is you have that document that's input, a document or an image. And then a kind of parser extracts layout primitives. So that might be like rectangles or certain

(22:24):
shapes or vectors or fonts.
And then a layout model, again, kind of part of this document structure model, then makes predictions for what those regions should be classified as. So things like titles or paragraphs or headings or tables, etcetera. And then the

(22:48):
output of the model, rather than predicting characters... again, I'm not getting text out of this, I'm not getting characters or text. What I'm getting is a structured output representation of the document, usually in kind of JSON, markdown, or HTML format, which basically tells me, okay, you put in this document, over

(23:10):
here is a table, over here is a title.
This region corresponds to a heading. There's a paragraph over here. And that way, when you have these more complex documents, maybe two column papers or white papers with a bunch of tables or data sheets or that sort of thing, you kind

(23:30):
of have this structure laid out. You have the classification of that structure. And so actually a Docling model or this type of document structure model is often used in combination with an OCR model.
And it would kind of go like: document comes in, you detect all the structure of the document, right? Over here's a

(23:53):
table, here's a paragraph, here's a title. Okay, well now let me send that title bit into an OCR model and then actually get the text associated with the title, right? And so now you've overcome a little bit of that limitation of the raw OCR by applying this structure on top, and you can reconstruct the

(24:14):
document as a markdown document with all the tables and titles and that sort of thing.

Chris (24:20):
It's funny, as you kind of describe going through the process there, as a very loose analogy, it reminds me somewhat of... for those of us in the audience who are programmers like you and I, it reminds me a little bit of the way programming languages are compiled into this tree structured format. It's called an abstract syntax tree, an

(24:43):
AST, you know, where it kind of captures, regardless of what the originating language is, the essence of what the program is before it's, you know, compiled into machine code or whatever your target is. But it kind of feels like Docling is doing, at a higher level obviously, a little bit of a similar thing in

(25:03):
terms of capturing all that structure out of the doc.

Daniel (25:06):
Yeah. Yeah. It would be like: the OCR model has an output of character probabilities, right? The LLM has an output of token probabilities.
The document structure model actually has an output of this tree structure, or the tree representation of the structure of the document. So it's that kind of processing pipeline

(25:29):
where you pick apart these layout primitives and then you classify each one. So really the kind of main piece of this is the classification piece, for each of these elements, and then assembling that into this tree structure, which, yeah, is certainly very useful. I think it's worth noting that this does

(25:52):
help you handle more complicated documents. Again, though, it doesn't solve the text extraction piece.
So you still kind of need to add that piece in. And often this is more computationally heavy than just raw OCR, which can run on CPUs often. I think I've run Docling models also on CPU or

(26:17):
in constrained environments. I think Hugging Face released a small Docling model, which is also geared towards that side of things. Obviously, you have the same trade offs with any kind of model size.
The smaller ones maybe don't have the same level of performance, but will run in more constrained environments.

(26:39):
The larger ones maybe have higher performance, but they might need a GPU to run.
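
As a rough illustration of how this looks in practice, here is a minimal sketch of the Docling toolkit's converter; the file path is a placeholder and the exact API may differ slightly between Docling versions.

```python
# Document structure extraction with the Docling toolkit: a layout model plus
# optional OCR produce a structured document rather than a flat text stream.
# Assumes `pip install docling`; the PDF path is a placeholder.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("datasheet.pdf")  # PDF or image input

doc = result.document
print(doc.export_to_markdown())    # headings, paragraphs, and tables as markdown
structured = doc.export_to_dict()  # tree-like representation of the layout
```

The markdown or dictionary export is the "tree" Daniel describes: the layout and reading order are preserved, and the text inside each element comes from whichever OCR or parsing step the pipeline is configured to use.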

Chris (26:44):
As we talk about this, would you say that Docling is still a very modern and current way of doing things, given the fact that Hugging Face is releasing models? And are there use cases where you would not necessarily wanna go to this, in your view? Like, I get the benefits that we've talked about. Where

(27:04):
might we say it's not the right fit?

Daniel (27:07):
Yeah. I would say that you really kind of want to use this when you need to preserve the structure of the documents that are input and you maybe have complex structures, again, like the data sheets or multi column or a mix of columns and

(27:27):
other things. This is really useful at that point. But if you just have, like, a raw scan that's relatively clean and all of it's just text and you need to detect all of that text, then maybe an OCR model is totally sufficient, and the structure model is overkill, right? But yeah, I would say this is still very

(27:49):
much in widespread use now and quite powerful.
We've used it on a few different projects as well with good success. It is still a model that, even though it's a little bit more computationally expensive than OCR (we'll talk about language vision models and DeepSeek OCR

(28:13):
here in a second), is not at the level of computation of those types of models, which means you could still embed it kind of within your application or something, maybe run it on a commodity GPU, that sort of thing. So it is still really useful in those ways as well.

Chris (28:33):
Thinking a little bit about different use cases: we still today, like if you go and use different types of office tools, and I don't necessarily mean Microsoft Office, but that genre of productivity tools, you're doing file format changes and stuff across them. I know recently, I think about a week

(28:54):
ago, I was trying to move a Keynote just into a PowerPoint context. And you would think in 2025 we would have gotten past that. I didn't. Do you think this is something that is either used at some level, or could be used, in terms of trying to capture that kind of complex structure and get it into a different format without losing the gist of what the

(29:17):
communication was?
Is that... am I on target there?

Daniel (29:20):
Yeah. Yeah. I think the limitation, I guess, is in how rich that description is, right? Like, you might get these labels like heading, title, paragraph, table, etcetera.
But ultimately, if you were to need to reconstruct that, you have to decide how you are going to render a table, how you are

(29:42):
going to render a title, which may be very different than the original Keynote, let's say the Keynote presentation, and you're going and putting it in Google Slides or something like that. So actually, I think that rendering piece is still a quite challenging one. What I would say, and maybe this is a

(30:02):
generalization, because we've actually used Docling models in other ways than what I'm about to say, but one of the very frequent uses of these models is for the processing of documents that are feeding into, let's say, RAG, a retrieval augmented generation pipeline. Why would that be?
It's sort of because the cleaner and more context relevant you

(30:27):
can make those chunks of text going into your RAG system, the better results you're gonna get in the responses from the RAG system. And so if you're just processing your documents that have some complex structure using OCR, all the text might get jumbled up,

(30:47):
and thus the knowledge and the context in the document is kind of jumbled up. Even though all the pieces are there, they might be out of order or something like that. In the case of RAG, you actually don't need to render anything. You just need to parse it well and preserve the structure, right?
So actually, I think Docling or these document structure models

(31:08):
are a really good way to do that document processing for input to RAG pipelines, because there you probably just need things to be represented well in markdown or some similar text format, not in a cool PDF that you recreate or something like that.

Chris (31:26):
Yeah. I'm just thinking, it wasn't too far back, a year, a year and a half ago, and RAG was all new at the time, and now it is so embedded into our workflows at lots and lots of organizations out there.

Daniel (31:41):
Yes.

Chris (31:41):
And I'm thinking about the fact that, I wonder how many people out there are using Docling in that capacity, you know, as that input to that workflow. And having the contextual aspect of the information saved structurally in that way would probably... yeah, I agree with you. I mean, that makes perfect sense intuitively,

(32:04):
that you would definitely have a RAG system able to give you better answers on that. Have you seen that use case much out there, or is that very much a one off?
What's your gut feeling about that?

Daniel (32:15):
Yeah, definitely. I would say in particular toolkits like Docling, the toolkit, and there are other ones like MarkItDown, which I think is a toolkit from Microsoft. We've used those over and over in RAG systems, and I know other people

(32:36):
do as well. Certainly people also use vision models, which we'll talk about here in a second. But I would say, again, in the RAG system, you wanna preserve that structure.
You don't want things out of order, but you really don't care how they're rendered necessarily. You just need to preserve the structure and ordering. And so that works out really well for RAG systems.
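
To make that concrete, a common pattern is to convert each document to markdown with a structure-aware tool and then chunk along that structure before embedding. A rough sketch, assuming the same Docling converter as above and a deliberately naive heading-based chunker:

```python
# Structure-aware preprocessing for a RAG pipeline: convert to markdown with
# Docling (MarkItDown works similarly), then chunk along headings so each
# chunk stays contextually coherent. The file name and chunking rule are
# illustrative only.
from docling.document_converter import DocumentConverter

markdown = DocumentConverter().convert("policy_update.pdf").document.export_to_markdown()

chunks, current = [], []
for line in markdown.splitlines():
    if line.startswith("#") and current:  # start a new chunk at each heading
        chunks.append("\n".join(current).strip())
        current = []
    current.append(line)
if current:
    chunks.append("\n".join(current).strip())

# Each chunk would then be embedded and stored in your vector database.
for i, chunk in enumerate(chunks[:3]):
    print(f"--- chunk {i} ---\n{chunk[:200]}")
```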

Sponsor (33:11):
So most design tools lock you behind a paywall before you do anything real. And Framer, our sponsor, flips that script. With Design Pages, you get a full featured professional design experience, from vector workflows, 3D transforms, and image exporting, and it's all completely free. And for the uninitiated, Framer has already built the fastest way to publish

(33:34):
beautiful production ready websites, and now it is redefining how we design for the web. With their recent launch of Design Pages, which is a free canvas based design tool, Framer is more than a site builder.
It is a true all in one design platform, from social media assets to campaign visuals to vectors to icons, all the way

(33:54):
down to a live site. Framer is where ideas go live, start to finish. So if you're ready to design, iterate, and publish all in one tool, start creating for free today at framer.com/design and use our code PRACTICALAI for a free month of Framer Pro. Again, framer.com/design.

Daniel (34:18):
All right, Chris. Well, there are a couple of, I guess, variations on the next types of models. Maybe it would be helpful to talk about language vision models, or vision language models, first, and then talk about DeepSeek OCR, which is kind of a different kind of animal. It's not OCR like we

(34:41):
talked about before. It's not a vision model like we're about to talk about.
But the vision model is actually kind of diff... or it's more similar to the LLM than the OCR model, I think. So a language

(35:01):
vision model, what that means is that the input to the model can actually be an image and a text prompt. And so this is often how it works. Like if you go into a multimodal kind of chat thing and you upload an image and say, hey, what's going on in here? Who is this in this photo?
Or what product is this in this photo? Or all of those sorts of

(35:24):
things. You want to ask about the image, or you want to give it as extra context to the language model. So the language vision model actually takes an image and/or text. And then the output is similar, though, to the large language model, in the sense that it's just going to output a stream of

(35:46):
probable tokens.
So in one sense, this is not document processing, but it could be used for that. But it doesn't have to be used for it. So it could be used just to enhance the chat experience, or to have a multimodal experience, or to reason over images, right? Or to even classify images. It's kind

(36:09):
of a general purpose reasoner over images.
And what happens is you kind of take a large language model and you add kind of a vision transformer into the mix. And the vision transformer takes the image and converts it into an embedding. The transformer piece of the LLM takes your text and

(36:34):
converts it into an embedding. And then you smash both of those embeddings together into a vision plus text embedding. And that's what's used to generate the probability of the tokens on the output.
So again, image or text coming in, text going out the other end. And where this plugs into document processing is I could

(36:56):
upload an image of a document, right? And just say as my prompt, hey, reconstruct the table in this image, right? And maybe that works. And it actually works quite well depending on, of course, the model and what image you put in and that sort of thing.
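
In practice, many vision language models are served behind an OpenAI-compatible chat API, so prompting one with a document image can look like the sketch below; the base URL, model name, and image file are placeholders, not a specific recommendation.

```python
# Prompting a vision language model with a page image plus text: the model
# returns a stream of tokens (text), not characters tied to image regions.
# Assumes an OpenAI-compatible endpoint serving a VLM such as a Qwen 2.5 VL
# variant; endpoint, model name, and file are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("invoice_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5-vl-7b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Reconstruct the table in this image as markdown."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)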

Chris (37:17):
I'm kind of curious, as you're kind of going through that fusion process between the text and the image thing. Do you have any insight on whether those operate kind of in parallel and come together at some point? Or, like, how that fusion is able to generate the better outcome? Is it one of those things we just know it does? Or do you have any...
Daniel (37:40):
Yeah, I think the key thing here is that, at least in my understanding, and our listeners can correct me if I'm spewing nonsense here, but in my understanding, part of it is that, yes, there are these two pieces. And so the image input goes through the vision transformer. The text goes through different layers of a transformer network. Those

(38:04):
embeddings are generated, they're smashed together, but that whole system is jointly trained together towards the output, right?
So it's not like you train the one...

Chris (38:16):
That makes sense.

Daniel (38:16):
And then you train the other one and then you hope they work well together. It's kind of like you join them together at the hip to start with. You train the whole system on many, many, many of these kinds of inputs and outputs. And obviously it's not interpretable in the sense of knowing how or

(38:40):
why it outputs certain things, but it is able to recreate that probable output. And that would be, I would say, a major contrast with something like using Docling plus OCR, because then you actually do get a human observable structure of a document out, and text corresponding to that.

(39:02):
With the language vision model, you toss an image and text in, and text comes out. And there's no real interpretable connection between the structure or content of that text on the output and any region in the images or specific characters in the images. It's all just related via the semantics of those

(39:27):
embeddings, not any sort of structure or anything like that.

Chris (39:31):
It's fascinating. I'm just kind of once again thinking back over the whole conversation and the maturity that's evolving in this capability. And so I guess, as we've kind of hit that point, what's the next step in it? Where do you see things going?

Daniel (39:51):
Well, I think the, or at least a, next step, it might not be where everything is going, but I think a next step is kind of represented by what DeepSeek has done with DeepSeek OCR. So there are many language vision models or vision language models, I've heard it both ways. There's the one we use, the kind of

(40:16):
Qwen 2.5 vision language model. We've used that one quite a bit, really great model.
I mean, the reality is the best of these are all coming out of China, at least at the moment of this recording, in terms of the vision language model side of things. So there have been these models over time, but they have limitations in the sense that most of these vision language models still assume a fixed

(40:38):
resolution of the input of that image. And they still require huge training data sets and that sort of thing. But I think one of the main limitations is this fixed resolution size, right? So no matter the size of your document, how it's structured,

(40:59):
all of that, you're gonna get this fixed resolution, which often does kind of create problems.
And so what DeepSeek OCR has kind of done is that they actually have kind of a different processing pipeline. It doesn't take the whole image as a whole

(41:23):
image, but actually what happens is it takes the input image and then it splits it apart into these kind of image tokens, if you will. So small vision tokens that are kept at their

(41:44):
higher resolution, and they're combined with the kind of big, the whole image, right? So you take the whole image, you combine it with these vision tokens, or a global full resolution view, I think they call it. So you get this global page plus these tiles. And each of these tiles, which are kind of

(42:08):
vision tokens, are smashed together with the global page.
And the idea is that you actually don't lose... it's a way of kind of representing this image or this document in a kind of compact token sequence where you are not limited by the resolution of your document. And so what that means is that

(42:32):
DeepSeek OCR, at least in terms of how it seems right now, does a good job at preserving certain shapes of characters, line breaks, alignments, very tiny mathematical equations or notation, right? You get sort of little dots or a caret above mathematical notation. And so

(42:56):
really, DeepSeek OCR is kind of taking some of these ideas to the next level and preserving a lot of that information from the larger document in these kind of full resolution tiles, which can then be processed through the model.

Chris (43:13):
Could you talk a little bit about, when we're talking about resolution, could you kind of level set: what does resolution mean in this context? As we're talking about specific resolutions and then a multi resolution thing, can you kind of clarify what that is?

Daniel (43:28):
Yeah, yeah. So, just kind of reducing it to thinking about a single page, right? If I have a single page of a document and I represent that as an image, it might be however many pixels by however many pixels, right? Let's say 1,000 pixels by 1,500 pixels, right? But in a vision

(43:53):
language model, typically, regardless of what image you input, it's going to resize it to whatever, 256 by 256.
And if you imagine taking that larger page, smashing it down into 256 by 256, you're gonna lose little handwriting or diagrams or code or equations or little tiny

(44:16):
fonts or footnotes, etcetera, all of that stuff. And so what DeepSeek is saying is, well, let's not lose all of that context, but let's also not have to keep everything in the same resolution. Let's tile this image. And now

(44:36):
we have the original resolution of the document within each tile, but we also don't lose the ordering or the context of where that tile fits, because we have the global view of the page.
And so it's kind of like when we put text into a transformer, we actually don't lose the ordering either, right? We

(45:00):
understand where text is related to other text. And this is kind of a similar concept, where we're not losing any of the resolution, but we're also not losing the structure of where these kind of tiles are placed, if you will.
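
The tiling idea is easier to see in code. The sketch below is only a conceptual illustration of "small global view plus full-resolution tiles" using PIL; it is not DeepSeek-OCR's actual preprocessing, and the sizes are made up.

```python
# Conceptual illustration: keep a small global view of the page for layout and
# ordering context, plus full-resolution tiles that preserve fine detail such
# as tiny fonts, equations, and line breaks. NOT DeepSeek-OCR's real pipeline;
# tile and view sizes here are arbitrary.
from PIL import Image

page = Image.open("paper_page.png")   # e.g. roughly 1000 x 1500 pixels

# Global view: the whole page squashed down; cheap, but fine detail is lost.
global_view = page.resize((256, 256))

# Tiles: crop the original page into fixed-size patches at full resolution,
# so small characters stay legible; (row, col) indices preserve ordering.
TILE = 512
tiles = []
for top in range(0, page.height, TILE):
    for left in range(0, page.width, TILE):
        box = (left, top, min(left + TILE, page.width), min(top + TILE, page.height))
        tiles.append(((top // TILE, left // TILE), page.crop(box)))

print(f"1 global view {global_view.size} + {len(tiles)} full-resolution tiles")
# A model in this family encodes a combination like this into a compact
# sequence of vision tokens instead of one fixed low-resolution image.
```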

Chris (45:16):
That makes perfect sense. And so it's kind of the natural progression. If we're going back a few years and talking about the way convolutional neural networks work, you were constantly having to reduce the size down, but that created problems in terms of doing analysis of what was in the pictures, you know, identification of whatever, and the lack of resolution

(45:40):
could sometimes make that a challenge. Yes. And this solves that in a particular way.

Daniel (45:45):
Yeah. Yeah. Which the kind of last, or I don't know, current or last, I don't know what generation we're in, but the bulk of vision language models at the moment do not solve that, because they still force this kind of fixed resolution. Now, at the same time, DeepSeek OCR is also a larger model. It does

(46:09):
require GPUs to run, but this is only the kind of first generation of these.
Similar to vision language models and large language models, I'm sure there will be a gradual shrinking of these models at higher performance as more and more people train them. And who knows if this is the right approach, kind of quote, right approach, to go down. But it is interesting. One of the things I

(46:34):
find interesting here, Chris, is we talk a lot about large language models, and for the most part, they all operate the exact same way. And we've been talking about them operating the exact same way for some time.
But if you look at the progression of these models, these multimodal models, as we've gone through this conversation, they all operate in quite different ways.

(46:57):
And so there's a lot of, to your point at the beginning, from my perspective, maybe from a nerdy perspective, document processing is very much not boring, because there's actually such diversity and such innovation going on here, with much more diversity on the model side and the technical side than what you

(47:20):
see in large language models.

Chris (47:21):
And not only that, but our listeners have come through this with us. This is probably not something most of them have been hitting on lately. And so not only have they earned, if they're in the US at least, their Thanksgiving meal for tomorrow by the time they've done this, but maybe coming out of the holidays, they can go back into the office and

(47:45):
kind of give an upgrade to the RAG system and be wizards at how effective RAG is being for their organization. Because I definitely learned a bit along the way here about that. I have a whole bunch of use cases in mind now.
I'm thinking, oh gosh, we can go back and do this and that and the other. So fantastic explanation of these

(48:05):
different approaches and kind of the timeline of how they developed. So thanks for doing that.

Daniel (48:11):
Yeah, of course. And happy Thanksgiving again, Chris. Happy Thanksgiving to all our listeners. Hope you enjoy your Tofurky.

Chris (48:20):
There you go. And even if you're outside the US, we are thankful for you listening in. And looking forward, whatever holidays you celebrate, we hope they're very good over the next few months here.

Jerod (48:41):
Alright, that's our show for this week. If you haven't checked out our website, head to practicalai.fm and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show. Check them out at predictionguard.com.

(49:03):
Also, thanks to Breakmaster Cylinder for the beats, and to you for listening. That's all for now, but you'll hear from us again next week.