Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:02):
Hello, friends. We are back with episode 194
of the R Weekly Highlights podcast.
If you're new, this is the weekly podcast where we talk about the excellent highlights and additional resources that are shared every single week at rweekly.org.
My name is Eric Nantz, and I'm delighted you joined us from wherever you are around the world.
And as always, even in this February, I'm joined by my same cohost, and that's Mike Thomas. Mike, how are you doing today?
(00:29):
Doing pretty well, Eric. It was kind of a long January here in The US, and it seems like we're in for an even longer February.
But
happy to be on the highlights today, and,
may your datasets continue to be available.
Let's certainly hope so.
I will say on Saturday, I had a good little diversion from all this,
(00:50):
stuff happening.
I was with about 70,000
very enthusiastic
fans at
WWE's
Royal Rumble right here in the Midwest, and that was a fun time. My voice has finally come back.
Lots of fun surprises,
some not-so-fun surprises, but that's why we go to these things, so we can voice our pleasure or displeasure depending on the storyline. But it was an awesome time. I've never been to
(01:19):
what the WWE has as their, quote unquote, premium events. It used to be called pay per views at a stadium as big as our Lucas Oil Stadium here in Indianapolis. So I I had a great time and,
yeah. I'm slowly coming back to the real world now, but it was well worth the price of admission.
That is super cool. That must have been an awesome experience. Lucas Oil Stadium, is that a dome?
(01:45):
It is. Yep. That's the home of the Indianapolis Colts.
It's been around for about
fifteen years, I believe now.
Last time I was there, I was at a Final Four, our NCAA
basketball tournament, way back when, where we saw
one of my favorite college basketball teams, Michigan State. Unfortunately, we lost to Butler that year, but it was a good time nonetheless.
(02:06):
So we won't be laying the smackdown on R for this. We're gonna put over R, as they say in the business, and that is an emphatic yeet, if you know what I mean.
Yeet.
But speaking of enthusiastic,
I am very excited that this week's issue
is the very first issue curated by the newest member of the R Weekly team, Jonathan Kidd.
(02:34):
Welcome, Jonathan. We are so happy to have you on board the team. And as always, just like for all of us our first time of curation, it's a lot to learn, but he had tremendous help from our fellow R Weekly team members and contributors like all of you
around the world with your pull requests and suggestions. And Jonathan did a spectacular job
with his first issue, so we're gonna dive right into it with arguably still one of the hottest topics in the world of data science and elsewhere in tech,
(03:03):
and that is how in the world can we leverage
the newer large language models, especially in our R and data science development workflows.
We have sung the praises
of recent advancements on the R side of things, i.e., with Hadley Wickham's ellmer package, which I've had some experience with. And now we're starting to see
(03:25):
kind of an ecosystem
start to spin up around this foundational package for
setting up those connections to hosted or even self hosted
large language models and APIs.
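For reference, here's a minimal sketch of what an ellmer connection can look like; the provider, model name, and prompt are just illustrative assumptions, not anything specific from the episode.

# Minimal ellmer sketch (model name and prompt are illustrative assumptions)
library(ellmer)

# Connect to a hosted provider; ellmer also supports self-hosted backends,
# for example via chat_ollama(), if you prefer to keep everything local.
chat <- chat_openai(
  model = "gpt-4o-mini",
  system_prompt = "You are a concise assistant for R developers."
)

# Send a prompt; the response streams back into the console
chat$chat("Suggest a tidyverse approach to summarise missing values by column.")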
And in particular,
one of his fellow Posit software engineers,
Simon Couch
from the Tidymodels
(03:45):
team.
He
had the pleasure
of enrolling in one of Posit's
internal
AI hackathons that were being held last year.
And he learned about, you know, the ellmer package and shinychat, along with others, for the first time.
And he saw tremendous
potential on how this can be used
(04:06):
across different realms of his workflows.
Case in point,
it is about twice a year that the Posit team or the Tidyverse team, I should say,
undergoes
spring cleaning of their code base. Now what does this really mean? Well, you can think of it a lot of ways, but in short, it may be updating
some code that the packages are using. Maybe it's using an outdated dependency
(04:31):
or a deprecated
function from another package, and here comes the exercise of making sure that's up to date with, say, the newest,
you know, blessed versions of that said function and whatnot,
such as
the cli package having a more robust version of the abort functionality when you wanna throw an error
(04:52):
in your function
as opposed to what rlang was exposing in years before.
It's one thing if you only have a few files to replace, right, with that stop() syntax or abort() syntax from rlang.
Imagine if you have hundreds of those instances. And imagine if it's not always so straightforward as that find-and-replace that you might do in an IDE such as RStudio or Positron.
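To make that spring-cleaning task concrete, here is the kind of before-and-after swap being described, using a made-up check function rather than any real tidyverse code:

# Before: error thrown with rlang::abort() and manual string pasting
check_positive_old <- function(x) {
  if (any(x <= 0)) {
    rlang::abort(paste0("`x` must be positive, not ", min(x), "."))
  }
  invisible(x)
}

# After: the same check with cli::cli_abort(), which adds inline markup
# like {.arg} and {.val} plus glue-style interpolation in the message
check_positive_new <- function(x) {
  if (any(x <= 0)) {
    cli::cli_abort("{.arg x} must be positive, not {.val {min(x)}}.")
  }
  invisible(x)
}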
(05:17):
Well, that's where Simon
comes in. As a result of participating in this hackathon, he created a prototype package
called clipal,
which will let you highlight certain sections of your R script in RStudio
and then run an add-in function call
to convert that to, like, a newer syntax, and in this case, that abort syntax
(05:43):
going from rlang to the cli package's version of that.
The proof of concept worked great, but it was obviously a very specific case.
Yet, he saw tremendous potential here
so much so that he has spun up not one, not two,
but
three new packages
all wrapping the ellmer functionality
(06:05):
combined
with some interesting integrations
with the rstudioapi package
to give that within-editor
type of context
to these different packages.
So I'll lead off with Simon's summary here on where the state is on each of these packages
with the first true successor
(06:25):
to clipal,
which is called
pal. This comes with built-in prompts, if you will,
that are tailored to the developer,
i.e., the package developer.
If you think of the more common things that we do in package development,
it's, you know, building unit tests or
(06:47):
doing roxygen documentation
or
having robust messaging via the cli package. Those are just a few things.
But pal was constructed
to have that additional
context
of the package
i.e., the functions you're developing in that package.
So you could say, highlight a snippet
(07:08):
and say, give me the roxygen
documentation
already filled out
with that function that you're highlighting. That's just one example.
You could also, like I said, build in cli calls, or convert to those cli calls from other calls, if you wanna do aborts or messages or warnings and whatnot.
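As a rough illustration (not Simon's actual output), the roxygen-flavored prompt is aimed at turning a bare helper like the first function below into the documented version underneath; the wording of the generated docs is invented here, and you'd still review and edit it:

# Before: a bare helper you might highlight in the editor
rescale <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

# After: the kind of roxygen2 skeleton such a prompt aims to draft for you

#' Rescale a numeric vector to the unit interval
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be ignored when computing the range?
#'
#' @returns A numeric vector the same length as `x`, rescaled to lie in [0, 1].
#' @export
rescale <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}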
(07:29):
And that already has saved him an immense amount of time with his package development, especially in the spring cleaning
exercise.
He does have plans to put pal on CRAN in the coming weeks,
but he saw tremendous potential here.
That's not all, Mike. He didn't wanna stop there with pal, because there are some other interesting use cases
(07:53):
that may not always fit in that specific package development workflow or the type of assumptions
that pal gives us. So why don't you walk us through those? Yeah. So there's two more that I'll walk through, and one is more on the package development side, and then one is more on the
analysis side for, you know, day-to-day R users and not necessarily package developers. So the first
(08:14):
of which is called ensure,
spelled e-n-s-u-r-e. And one of the interesting things about
ensure
is that
it actually
does a couple of different things that pal does not do. And
pal sort of assumes that all of the context that you need is in the selection and the prompt that you provide it. But when we think about the example that Simon gives here, writing unit tests, it's actually really important to have additional pieces of context that may not be in just the single file that you're looking at, the prompt that you're writing, or, you know, the highlighted selection
(08:54):
that you've chosen.
You may actually need to have access to package datasets. Right?
That you'll need to, you know, include in that unit test, that maybe aren't necessarily
in the script or the snippet of code that you're
focusing on at the moment. So ensure, you know, goes beyond the context that you have highlighted or is showing on screen and actually sort of looks at the larger universe, I believe, of all of the scripts and items that are included in your package.
(09:23):
And it looks like, you know, unit testing here is probably the the biggest use case for ensure in that you can,
leverage a particular
function, like a function in a .R file within your R/ directory.
And if you want to scaffold or really create, I guess, a unit test for that,
(09:44):
it's as easy,
I believe, as, you know, highlighting the lines of code that you're looking to write a unit test for.
And, just a hot key shortcut that will actually
spin up a brand new test-whatever
testthat file. It'll stick that file
(10:05):
in the appropriate location under, you know, tests/testthat for those of us that are R package developers out there. And it will start to write
those unit tests on screen for you in that testthat file. And there's a nice little GIF here that shows sort of the user experience. And it's pretty incredible,
(10:26):
that we have the ability to do that, and it looks really, really cool. So I think that's, you know, really the main goal of the ensure package.
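To picture the output, here is the sort of tests/testthat file such scaffolding might draft for the hypothetical rescale() helper sketched earlier; the expectations are illustrative, not ensure's actual output:

# tests/testthat/test-rescale.R (illustrative sketch, not generated by ensure)
library(testthat)

test_that("rescale() maps values onto the unit interval", {
  x <- c(2, 4, 6, 10)
  out <- rescale(x)
  expect_equal(min(out), 0)
  expect_equal(max(out), 1)
  expect_length(out, length(x))
})

test_that("rescale() keeps missing values when na.rm = TRUE", {
  x <- c(1, NA, 3)
  expect_true(anyNA(rescale(x, na.rm = TRUE)))
})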
Then the last one I wanna touch on is called Gander. And again, I think this one is a little bit more,
you know, day to day data analysis
friendly.
The,
functionality here is that you are able to
(10:47):
highlight a specific
snippet of text, or, it also looks like Simon mentions, you know, you can also not highlight anything and it'll actually take a look at, you know, all of the code that is in the script
that you currently have open. And by pressing, you know, a quick keyboard shortcut, it looks like,
you can leverage this add in, which will pop up sort of like a modal that will allow you
(11:12):
to enter a prompt.
And in this example, you know, there's a dataset on screen. Simon just highlights the name of that dataset.
I think it's the Stack Overflow dataset, but it's just like iris or gapminder.
And he highlights it, you know, the modal
pops up and he says, you know,
create a scatter plot. Right? And all of a sudden,
(11:34):
the
selection on screen is replaced by ggplot2 code that's going to create this scatter plot.
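The generated code itself is just ordinary ggplot2; a prompt like that might hand back something along these lines, where the dataset and aesthetics below are stand-ins rather than Simon's actual example:

# Illustrative stand-in for what a "create a scatter plot" prompt might return
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel efficiency versus weight"
  )

# Follow-up prompts like "jitter the points" or "format the x axis as dollars"
# just layer on more ggplot2 code, e.g. geom_jitter() or
# scale_x_continuous(labels = scales::label_dollar()).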
And he can continue to do that and iterate
on the code by saying, you know, jitter the points or, you know, make the x axis formatted in dollars,
things like that. And it's it's really, really cool how quickly he is able to,
(11:56):
really create this customized ggplot with formatting, with faceting,
all sorts of types of different things,
in a way that is obviously much quicker and more efficient,
even if you are having to do some minor tweaks
to what the LLM is going to return at the end of the day, than if you were going to just, you know, completely write it from scratch. So, pretty incredible here. There's another GIF that goes along with it
(12:21):
demonstrating this. It looks like not only in this pop up window
is there an input for the prompts that you wanna give it, but there is also
another
option called interface,
which I believe allows you to control whether you wanna replace the code that you've highlighted,
or I would imagine whether you wanna add on to the code that you've highlighted instead of just replacing it, you know, if you wanna create sort of a new line,
(12:48):
with the output of the LLM. So
really cool couple of packages
here that are definitely
creative ways to leverage this new large language model technology to try to
use,
you know, provide us with some AI-assisted coding tools. So big thanks to Simon
and the team for developing these and sort of the creativity that they're having around
(13:14):
leveraging these LLMs to help us in our day to day workflows.
Yeah. I see immense potential here and the fact that, you know, with these being native R solutions
inside our R sessions,
grabbing the context not just potentially from that snippet highlighted,
but the other files in that given project, whether a package or a data science type project
(13:38):
with the information on the datasets themselves.
Like, that is immense value
without you having to really be in a separate window or browser, in, say, ChatGPT or whatnot, trying to give it all the context you can shake a stick at and hope that it gets it. Not always the case. So there are a lot of interesting extensions here that, again, are made possible by ellmer.
(14:02):
And, you know, like like I said, immense potential here. Simon is quick to stress that these are still experimental.
He he sees, obviously,
some great advantages to this paradigm of
literally the bot, if you will, that you're interacting with injecting directly into the R file that you're writing at the time.
(14:24):
Again, that's a good thing. It may sometimes be not so good a thing if it is going on a hallucination or something, perhaps, who knows. So
my tip here is, if you're gonna do this day to day,
if you're not using version control, you really should. In case it goes completely nuts on you, you don't want that in your commit history, and somebody asking you in code review, what on earth were you thinking there? Oh, it wasn't me. Really? Well, it kinda was, when you're using the bot anyway. But nonetheless,
(14:55):
having version control, I think, is a must here. But I do see Simon's point
that other frameworks in this space, he mentions it kind of in the middle of the post,
are leveraging more general kinda back ends to interact with your IDE or your
(15:19):
Git-like functionality
of showing you the difference between what you had before and what you have now after the AI
injection. So you could review that kind of quick before you say, I like it. Yep. Let's get it in, or
not so much. I wanna get that stuff out of here and try again. So
I would imagine, again, this is Eric putting on his speculation hat here, that with the advancements in Positron and leveraging the more general, you know, VS Code
(15:49):
extension ecosystem, that there might be an even more robust way to do this down the road on that side of it. But the advantage of the rstudioapi package he's leveraging
is that thanks to some shims that have been created on the Positron side, this works in both the classic RStudio and in Positron. And I think that's, again, tremendous value at this early stage for those that are, you know,
(16:12):
preferring, say, RStudio as of now over Positron, but still giving flexibility for those who wanna stay on the bleeding edge to leverage this tech as well. So I think there's a lot to watch in this space, and
Simon definitely does a tremendous job with these packages at this early stage.
That's for sure. Yeah. I appreciate sort of the Posit team's
(16:36):
attention to UX because, again, I think that's sort of the most important thing here as we bring in, you know,
tools that create very different workflows than maybe what we're necessarily used to. I think it's important that,
we meet
developers and and data analysts and data scientists,
you know, in the best place possible.
(16:57):
And I mentioned at the outset that Simon is part of the tidymodels
ecosystem team. I will put some quick plugs in the show notes because he is, I should say, writing a new book called Efficient Machine Learning with R, which he first announced at the R/Pharma conference last year with an excellent talk. So he's been really knee-deep into figuring out the best ways to optimize his development, both from a code-writing perspective
(17:25):
and from an execution perspective in the tidymodels ecosystem. So, Simon, I hope you get some sleep, man, because you're doing a lot of awesome work in this space.
I was thinking the same thing. I don't know how he does it.
And speaking of someone else that we wonder how on earth did they pull this off at the time they have,
(17:50):
our next highlight is, you might say, revisiting
a very influential
dataset
that made tremendous
waves in the data
storytelling and visualization
space,
but with one of the new Quarto tools to make it happen.
And longtime contributor to R Weekly highlights,
(18:11):
Nicola Rennie,
is back again, on the highlights
as she has drafted
her first use
of the Closeread Quarto extension
applied to Hans Rosling's famous Gapminder
dataset visualization.
If you didn't see, or I should say, if you didn't hear our previous year's highlights, we did cover the Closeread Quarto extension that was released
(18:38):
by Andrew Bray and James Goldie.
In fact, there was a talk about this at posit::conf last year, which we'll link to in the show notes. But Closeread, in a nutshell,
gives you a way to have that interactive,
what you might call scrollytelling, kind of enhanced
web based reading of a report, visualizations,
(19:00):
interactive visualizations.
You've seen this crop up from time to time from, say, the New York Times blog that relates to data science.
Other
reporting
companies or startups out there have leveraged similar
interactive visualizations.
Like, I even saw an article on ESPN of all things that was using this kind of approach. So it's used everywhere now. But now, us
(19:24):
adopters of Quarto,
we can leverage this without having to reinvent the wheel in terms of all the HTML styling and other fancy enhancements we'd have to make. This Closeread extension
makes all of it happen free of charge.
So what exactly is this tremendous,
report that Nicola has drafted here? She calls it Gapminder:
(19:47):
How the World Has Changed.
And right off the bat, on the cover of this report
is basically a replica of the famous animation visualization,
plotting the GDP or gross domestic product per capita
with life expectancy on the y axis
with the size of the bubbles,
(20:09):
pertaining to the population
of each country
represented there.
So once you start scrolling the page, and again, we're an audio podcast. Right? We're gonna do the best we can with this. She walks through those different components. First, with the GDP with some nice,
looks like line plots that are faceted by the different regions of the world, getting more details on gross domestic product,
(20:35):
and then getting to how that also is affected by population growth.
Again, another key
parameter in this life expectancy
calculation,
which gets to the life expectancy
side of it. And as you're scrolling through it, the plot neatly
transitions as you navigate to that new text on the left sidebar.
(20:56):
It is silky smooth, just really, really top notch,
user experience here.
And then she isolates what one of these years looks like. She calls it the world in 2007,
showing that kind of four-quadrant section you have when you look at low versus high GDP
(21:17):
and low versus high life expectancy.
And as she's walking through this plot, she's able to zoom in on each of these quadrants as you're scrolling through it to look at, like I said, these four different areas,
and that's leveraging the same visualization. It's just using these clever tricks to isolate
these different parts of the plot. Again, silky
(21:39):
smooth. This is
really, really interesting to see how she walks through those four different areas
and then closing out with the animation once again that goes through each year from the 1950s all the way to the early 2000s
with, again, links to all of her code, GitHub repository, and whatnot.
(22:00):
But for a first
pass
at Closeread, this is a top-notch product, if I dare say so myself.
And boy, oh, boy, I am really interested in trying this out.
There was actually a Closeread contest that was put together by Posit
late last year, and I believe the submissions closed in January,
(22:23):
this past month.
But if you want to see how others are doing in this space, in addition to Nicola's
visualization here, we'll have a link in the show notes to Posit Community and all the posts that are tagged with this Closeread contest, so you can kinda see what other people are doing in this space. And maybe we'll hear about the winners
later on. But this one will have a good chance of winning, if I dare say so myself. So I am super impressed with Closeread here, and Nicola's
(22:52):
very quick learning of it. Yeah. It's pretty incredible. And I was going through
the Quarto
document that sort of lives behind this, and it actually seems pretty
easy to get up to speed with this scrollytelling
concept.
It's pretty incredible.
I think there's a couple different specific tags
(23:14):
that,
you know, allow you
to do this, or maybe to do it easily. It looks like
there is a .scale-to-fill
tag that I believe probably
handles a lot of the,
zoom-in, zoom-out sort of aspect ratio of the plots, or GIFs in Nicola's case, that are being put on screen. Because
(23:40):
in her visualization, it's almost like there's this whole left-hand sidebar, right, that has a lot of the context and the narrative
text, that goes along with the visuals on the right side of the screen.
You know, some of the pretty incredible things that
I thought were interesting here is,
you know, not only was she able to, you know, fit a lot of these plots in a nice aspect ratio on the right side of the screen, but there's also actually a section of the scrollytelling
(24:08):
visualization
where she zooms in across four different slides, if you will, on four different quadrants of the same plot,
to
tell the story of these four different quadrants,
you know, one being low GDP per capita and low life expectancy, low GDP per capita and high life expectancy, and, you know, the other two as well, vice versa.
(24:33):
And it's pretty easy, it's pretty awesome, I guess, how the visualization
sort of nicely slides from one quadrant to the other
as you scroll to the next slide, if you will.
So this is, for any of the data vis folks out there, data journalism folks out there, I imagine that in order to accomplish something like this in the past, it was probably a lot of D3
(24:58):
JS type of work, and
the end product here compared to
the Quarto code that I'm looking at, it's pretty incredible.
And it just sort of gives me the idea that
a lot of the heavy lifting has been done
for us in the ability to create these Quarto-based scrollytelling
(25:20):
types of visualizations.
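If you want a feel for the source, here is a stripped-down sketch of the Closeread pattern as I understand it from the extension's documentation; this is not Nicola's code, and the div and attribute names are from memory, so double-check them against the Closeread docs:

---
title: "Gapminder scrollytelling sketch"
format:
  closeread-html: default
---

::::{.cr-section}

As you scroll, this narrative text advances while the sticky plot stays put. @cr-gdp

Later narrative blocks can re-trigger or zoom the same sticky element. @cr-gdp

:::{#cr-gdp .scale-to-fill}
```{r}
library(ggplot2)
library(gapminder)
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = continent)) +
  geom_point(alpha = 0.6) +
  scale_x_log10()
```
:::

::::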
So I'm super excited about this.
You know, it made me go into the wayback machine a little bit on this.
I'm gonna bring Shiny into this, because I love to bring Shiny into almost all my conversations. But back in 2020, of all things,
I remember I had the good fortune of presenting
at the poster session at the conference, and my topic was kinda highlighting the latest innovations in the Shiny community, and I was
(25:49):
trying to push for, could we ever have something like a shinyverse or whatnot of these community extensions.
And to do this poster, I didn't wanna just do, you know, PowerPoint or anything. Come on now. You know me. But I leveraged
our good friend, John Coene.
He had a development package way back in the day called fullPage,
(26:10):
which was a way to create kind of a Shiny app that had these scrollytelling-like elements.
But I will say he was probably too far ahead of his time on that. I won't say it was that easy to use. And, frankly, he would probably acknowledge that too.
Here's my idea. I still have the GitHub repo of this, you know, poster I did.
(26:30):
I would love to have my hand at converting that to Closeread
and, wait for it, somehow
embedding a Shinylive app inside of it. Can it be done?
I think it can too. I think you'd be breaking some new ground, Eric. But,
if anybody's up for that challenge, I know it's you. How did I just nerd-snipe myself? Like, how does that happen, Mike? What,
(26:54):
you must be hypnotizing me or something without even saying anything. I have no idea.
Peer pressure.
Now you may be wondering out there, yeah, the Gapminder data, we are fortunate that we have a great R package
(27:20):
that literally gives us this kind of data. So once Nicola has this package loaded, she's able to, you know, create this awesome Closeread, you know, scrollytelling type of report.
Well, there are many, many other sources of data that can surface very similar important domains
such as what we saw in the Gapminder
set. And you may be wondering, where can I get my hands on some additional data like this so I can do my own, you know, reporting? Maybe with Closeread or Shiny or Quarto, what have you. Our last highlight is giving you another terrific resource of data
(27:55):
for these kind of situations.
This last highlight comes to us from Kenneth Tay, who is an applied researcher at LinkedIn,
and he has a blog where his latest post
is talking about some recent advancements in this portal called
Our World in Data,
which I have not seen before this highlight, but it is
(28:16):
a, I believe, a nonprofit
organization
whose mission
is to create accessible research and data to make progress against the world's largest problems. So you might think of, say, poverty, life expectancy, some of the other issues that, say, the Gapminder set highlighted.
But they wanna make sure that anybody that has the desire and the skill set to use, say, a language like R or whatever else
(28:42):
to produce visualizations
to really start to summarize and explore these data, that there is as little friction as possible to access these.
And, yes, you could access their portal. You could download the data manually on their website,
but it was earlier in 2024
that this group had exposed an API
to access these data.
(29:03):
So Kenneth, in his blog post here,
walks through what it's like to use this,
this new API,
particularly what they call a public chart API, because it is
the basis for, I believe, some interactive visualizations
that their web portal
is exposing here.
But because there is an API now,
(29:25):
he brings back a little bit of old-school flavor here, the httr, or "hitter," package.
That was one of those cases where I'd been spelling it out all this time, but in the httr2 README, Hadley literally says how it's pronounced. So thank you, Hadley. I wish all our package authors would do that.
(29:45):
In case the baseball player didn't give it away.
Exactly. So great, great job on the new package itself. So
back to Kenneth's exploration here, he shows us how, with the old-school httr along with a little tidyverse magic and jsonlite
loaded into the session.
He needs all three of those because,
(30:07):
first, it's one thing to access the data itself, which apparently
is exposed as CSV files on the back end, but the API lets you grab these
directly.
But the metadata comes back in JSON format. So he wants to use jsonlite to help massage
some of that too.
So the first exploration of this, and the snippet on the blog post, is looking at the average monthly surface temperature around the world. So once he's got the URL
(30:36):
of the dataset,
then he assembles the query parameters, which,
again, in the world of APIs, you might have some really, really robust documentation.
Maybe some other times you have to kind of guess along the way. It's kind of a roll of the dice, isn't it?
Yeah. I find the the latter to be the case more often, especially in
(30:56):
professional settings, unfortunately, which seems to make no sense.
Who would ever think that? But yet, I feel seen when you say that. Yes. Even as of this past week. My goodness. Don't get me started.
So luckily for this, there is a healthy mix here, I would say. So he's got some query parameters.
So look at the version of the API,
(31:17):
the type of CSV to return, which can be the full set or a filtered set, which I'll get to in a little bit,
and whether to use long or short column names in the dataset that's returned back.
And then, also, he does a similar thing for the metadata.
That's another GET request
as well,
and then he parses that content directly
(31:40):
into JSON format.
So the metadata comes back as a list because most of the time when you return JSON, it is basically a big nested list,
and that gives some high level information
on the dataset that is returned. So you get basically a list of each
character string of the variable name and the description of that variable. So that's great.
(32:02):
Now the data directly,
again, setting up the query parameters in a similar way.
This time, he's gonna demonstrate what it's like to bring a filtered version of that data right off the bat.
And that is where there's a little guessing on that because he went through the web portal of this,
played with the interactive filters that this web portal gives him,
(32:25):
and looked at the end of the URL. So if you're new to the way
requests are made to an API, you might say GET requests, where you're wanting to grab something from the API.
More often than not, you'll attach different flags or different variables at the end of the URL
often in, like, key-value type pairs
(32:46):
with an ampersand separating the different parameters. So once he explored this web portal, he was able to grok that, oh, yeah. There is a parameter for selecting the country.
So I'm gonna, you know, put that in the query parameters and feed in the direct value.
And then once he does the GET request on that, this is important here,
(33:06):
the contact,
the content, I should say, that's coming back can usually be one of three different flavors:
the raw,
you might say binary, representation of that value,
the textual value of it, or the parsed JSON or XML version of it.
In this case, it was a text value coming back because it's literally the CSV content, as if you just had the CSV open in a new file on your computer.
(33:33):
That's how the text is coming back. So he feeds that into read_csv()
directly. And lo and behold, you've got yourself a tidy dataset as a result of that.
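Roughly, the httr workflow he describes looks like the sketch below; the endpoint and parameter names are paraphrased from the post and from the Our World in Data docs as I remember them, so treat them as assumptions:

library(httr)
library(readr)
library(jsonlite)

# Chart endpoint for one indicator (URL pattern is an assumption here)
base_url <- "https://ourworldindata.org/grapher/average-monthly-surface-temperature.csv"

# Query parameters: API version, full versus filtered CSV, column-name style,
# plus the country filter he discovered by inspecting the web portal's URL
resp <- GET(
  base_url,
  query = list(
    v = "1",
    csvType = "filtered",
    useColumnShortNames = "true",
    country = "USA"
  )
)

# The body comes back as CSV text, so parse it straight into a tibble
temps <- read_csv(content(resp, as = "text"))

# The companion metadata endpoint returns JSON, which parses into a nested list
meta_resp <- GET(paste0(sub("\\.csv$", "", base_url), ".metadata.json"))
meta <- fromJSON(content(meta_resp, as = "text"))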
And then with that, he just does a simple plot
of year versus the surface temperature
across the USA
just to show that that's exactly how you would bring that data in. And there's a lot more
(33:58):
you can do with this type of data. But, again, it's a good example of, first, going to the documentation where it's available. But then when things maybe aren't as well documented,
yeah, nothing beats a little trial and error. Right? Sometimes that's the best bet we get, and that's how he was able to do that filtered
dataset pull. But, nonetheless, if you're looking for inspiration, or looking at similar data as we covered in the second highlight, but across a wide range
(34:25):
of world-specific types of data. I think this portal has a lot of potential.
And, yes,
R is your friend. Again, we're grabbing these data from almost any source you can imagine. So, really great blog post, straight to the point. You could take this code and run with it today. And, in fact, a good exercise would be: what would you do to convert that to the httr2 syntax, which shouldn't be too much trouble.
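For that exercise, a hedged sketch of the httr2 version might look like this, with the same caveats about the exact endpoint and parameters:

library(httr2)
library(readr)

resp <- request("https://ourworldindata.org/grapher/average-monthly-surface-temperature.csv") |>
  req_url_query(
    v = "1",
    csvType = "filtered",
    useColumnShortNames = "true",
    country = "USA"
  ) |>
  req_perform()

# httr2 separates performing the request from extracting the body
temps <- read_csv(resp_body_string(resp))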
(34:48):
But, nonetheless, you've got a great example to base your explorations off of here.
Yeah. I think it's just a good reminder
in general, especially for, you know, junior data science folks who are just starting out that
your data isn't always going to be in a CSV format. Yes, I know that Our World in Data allows you to export that. But a question that you should be asking, you know, in order to try to automate things as much as possible for yourself is,
(35:15):
often, you know, is there an API,
right, for this dataset, or is there an underlying database that we can connect to, so that I can just run my code directly
against that, run my script with one click as opposed to having to go someplace and download the data to a CSV first,
before I do my analysis. So, you know, if you can sort of automate a recurring
(35:39):
script that you have against data that might be just updating, but in the same schema, on some particular basis.
I think, yeah, this is a fantastic example of leveraging Our World in Data's API to do that, some really nice
base plotting, some really nice ggplot-ing as well, a pretty cool mix here that's been put together. And like you said, Eric, a great example of dealing with what's called a GET request, which is where you're actually just modifying the suffix of the URL,
(36:08):
in order to filter the dataset that's going to get returned here.
So it's a really great example of doing that with a couple of different parameters that are being managed. I guess one parameter being,
tab=chart, another one specifying the time or the date range
that we're looking to get data back within. And then the last one being the two countries here, in the case of this last example, where we're plotting the average monthly temperature
(36:35):
for the entire world and then, for Thailand as well. So, you know, two items in the legend
here. As you said, a great walkthrough blog post of using a publicly available
API
to
wrangle some data and make it pretty.
Yeah. The limit's only your imagination at this point. So like I said earlier, you could take
(36:59):
what Nicola made with her Closeread example, apply it to this kind of data, and go to town with a great learning journey.
Great for a blog post such as this, you know.
And again, maybe, like you said, speaking to the data scientists out there that are looking to get into an industry or a data science type of role,
it never hurts, well, if you've got the time and the energy
(37:23):
to build a portfolio of things like this because you never know
just how useful that will be as you're trying to showcase what you find and what skill set you have to generate insights
from data like this. Because, not to pull the old
back in my day,
syntax here, but we didn't have access to these types of data when I was looking for a job earlier in my career. So take advantage of it, folks. It is here for the taking.
(37:50):
Speaking of what else you need to take advantage of, you need to take advantage of R Weekly, folks, because if this isn't bookmarked for reading every single week, you are missing out, because this issue has well more than what we just talked about here in these highlights.
We got a great batch of additional tutorials,
new packages that have been released, new events coming up. It's the full gamut.
(38:11):
So we'll take a couple minutes for our additional finds here. And, leveraging what we talked about at the outset of the show with Simon's explorations of interacting with ellmer,
a very common problem across many, many different industries and organizations
is
dealing with data that, I'm gonna
(38:31):
go out on a limb here, is kind of trapped
in image or PDF format.
Because whether you like it or not, there's gonna be some team out there that says, you know what? We have this great set of data here, and they act like everything is perfectly accessible. And then you as a data scientist say, oh, yeah? Where are the CSVs? Where are the Parquet files, if they're really up to date? Oh, no. No. They're in these PDFs.
(38:55):
Oh, gosh. Okay. What do I do now?
Yes. There are things like OCR
that can help you to an extent,
but with the advent of AI, there might be an even easier way to do that. So frequent contributor to the R Weekly
highlights and elsewhere, Albert Rapp, has another post in his Three Minute Wednesday
(39:16):
series
on how he was able to leverage ellmer
to extract
text from both an image file of an invoice as well as a PDF version
of that image, and to be able to grab, you know,
certain numeric quantities like number of billable hours, time period, and whatnot.
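I haven't replicated Albert's exact code, but the general shape of structured extraction with ellmer looks something like the sketch below; the field names are invented for illustration, and the method names are as I remember them from ellmer's docs, so they may have evolved:

library(ellmer)

# Describe the structured fields we want back (field names invented here)
invoice_type <- type_object(
  billable_hours = type_number("Total billable hours on the invoice"),
  period = type_string("Billing period, e.g. 'December 2024'"),
  total_due = type_number("Total amount due")
)

chat <- chat_openai(model = "gpt-4o")

# Pass the invoice image alongside the schema and get a named list back
extracted <- chat$extract_data(
  content_image_file("invoice.png"),
  type = invoice_type
)
extracted$billable_hours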
(39:37):
I think this is a very relatable issue that, again,
many organizations,
big or small, are gonna have to deal with at some point. And I've seen projects being spun up at the hashtag day job where they're looking at ways of building this from scratch. Well, if you're an R user, maybe ellmer with its image extraction functionality
(39:58):
might get you 90% of the way there. Hashtag just saying. So excellent post, Albert, and I may be leveraging this sooner rather than
later. No, that's awesome. We have some projects that are doing the same thing with some of these self-hosted open-weights
models to be able to take a look at a PDF and extract
very particular pieces of information that we want from it, and we can tell it, you know, give us that
(40:21):
back in JSON form, and it allows us to, you know, leverage it downstream. Of course, you have to build a bunch of guardrails
around that to make sure it's not hallucinating
because it's a... Absolutely. ...black box.
Yep. But it's pretty powerful stuff, and the accuracy that we're seeing is
pretty shocking, pretty awesome.
But what I want to call out is an article by the USGS, which is the US Geological Survey
(40:45):
on mapping water insecurity in R with tidycensus.
They just always do an absolutely beautiful job,
with data visualization.
All the code is here. A lot of these visuals actually deal with
households that were lacking plumbing in 2022
in the US, and then changes,
(41:06):
via,
I guess, barbell plots, they're called. I don't know if there are any other names for them. Lollipop plots?
Yeah. I've seen them thrown around interchangeably.
Yep. Yep. To take a look at,
improvements
in plumbing facilities,
particularly in New Mexico and Arizona, which were the two states, based upon the 2022 census, that I think had the lowest rates
(41:30):
of household plumbing. So it's, you know, it may be
a niche topic for some, for lack of a better word.
But the data visualizations that they have here on these choropleth maps are really, really nice. I love the color palettes that they use. I really love the walkthrough that they provide on the website
(41:51):
in terms of the code and the narrative around how they made the decisions that they made
to go from dataset
to visuals. I think it's a great job. You know, on the the East Coast here,
water scarcity is
not something that we really are concerned about. But I know on the West Coast, because we do a lot of our work in agriculture,
(42:13):
it's it's quite a big deal in terms of water rights and water access and things like that.
So I really appreciate
the work that the USGS is doing on this particular,
you know, niche.
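For anyone wanting to poke at similar data, a rough tidycensus starting point might look like this; the ACS variable code below is a placeholder you would confirm with load_variables() first, not necessarily the one USGS used:

library(tidycensus)
library(ggplot2)

# Assumes a Census API key is already configured via census_api_key().
# Placeholder variable: look up the real plumbing-facilities code with
# load_variables(2022, "acs5") before relying on this for anything serious.
plumbing <- get_acs(
  geography = "county",
  state = c("NM", "AZ"),
  variables = "B25047_003",
  year = 2022,
  survey = "acs5",
  geometry = TRUE
)

# Simple choropleth of the county-level estimates
ggplot(plumbing, aes(fill = estimate)) +
  geom_sf(color = NA) +
  scale_fill_viridis_c(direction = -1) +
  labs(fill = "Households lacking\ncomplete plumbing")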
Yeah. And I have a soft spot for the great work they're doing.
My wife actually was fortunate early in her career to have an internship at the USGS and,
(42:36):
albeit this was a day when R wasn't quite as readily used as it is now, but it's great to see this group in particular being really modern with their approaches. And, again,
top-notch narrative, top-notch visualization,
so really
exciting to see. And I believe we featured this group on previous highlights, so you wanna check out the back catalog for some of the excellent work they've been doing in this space, previously. So excellent
(43:01):
excellent find, Mike, and there are a lot more finds than just those. So, again, we invite you to check out the rest of the R Weekly issue at rweekly.org.
We, of course, have a direct link to this particular issue in the show notes, but, also, you wanna check the back catalog
about both the issue as well as this humble podcast itself, because we've got so many great things to talk about here, so many great things to learn. As you heard, I've basically nerd-sniped myself into a new project, hopefully this year, that I can work on with Shiny and Closeread. So we'll see what happens there. But, yeah, if you wanna see what else is happening, and if you want to be a part of what's happening here in terms of what the readers are gonna see every week,
(43:40):
We value your contributions,
and the best way to do that is, again, head to rweekly.org.
You'll see in the top right corner a link to the upcoming issue's draft,
where you can send a pull request to tell us about that great new package, that great new visualization, that great new use case of Shiny or AI or other technologies
(44:00):
that you can see in this data science community. We'd love to hear it. Again, all marked down all the time.
I would stress again, someone told me years ago, if you can't learn Markdown in five minutes, he would give you $5, and he didn't have to give away any money for it. So there you go, folks.
And, also, we love hearing from you. You can get in touch with us via the contact page in the episode show notes. You can also send us a fun little boost with the modern podcast app. Those details are in the show notes as well. And you can also get in touch with us on the social medias.
(44:33):
I am now on Mastodon these days with @rpodcast@podcastindex.social.
I am also on Bluesky as well, where I am @rpodcast.bsky.social.
I believe that's how to say it.
Mike, where can the listeners find you?
Yes. I am on Bluesky for the most part these days at mike-thomas
(44:56):
dot bsky dot social, so @mike-thomas.bsky.social.
Also on Mastodon a little bit, @mike_thomas@fosstodon.org.
And you can check out what I'm up to on LinkedIn if you search Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K.
And a bit of a shout-out here, a self-plug that we are looking for a DevOps
(45:19):
expert. If you are somebody who has expertise in Docker,
a little Kubernetes,
Azure preferred, but it doesn't really matter because we're all spinning up Linux servers at the end of the day,
we could use some help managing ours and our clients' ShinyProxy
environment. So any DevOps folks out there, please feel free to reach out.
(45:39):
I'm sure there are many of you out there. So, yeah, take up Mike on this tremendous opportunity.
I'm still learning the DevOps ropes. We share many stories about that in our adventures there. So that's a great, great plug, Mike. And I'm also on LinkedIn as well. But, yeah, we'll
add that little call-out to the show notes as well if you're interested in pursuing that. Nonetheless, we're gonna close up shop here for episode 194
(46:03):
of R Weekly Highlights.
Before I go, I wanna send a very hearty congratulations
to Chris Fisher and the team at Jupiter Broadcasting. You recently had episode 600
of Linux Unplugged.
Tremendous achievement, folks. You'll be seeing a boost from me in the coming days.
I don't know if we'll ever get there, Mike, but, nonetheless, that's a huge number for a podcast
(46:25):
of that size. So congrats to them. Well, we'll get to 200 at least, and we'll see what happens after that. Alright. We've got no plans to stop.
Well, yeah. Me either. Yep. We'll see what happens, buddy. But, nonetheless,
we hope you enjoyed episode 194
of R Weekly Highlights, and we'll be back with episode 195 next week.