
February 21, 2025 • 52 mins
Our candid takes on the state of CRAN's role in light of recent package archival events, how creative use of LLMs could greatly streamline your next literature review, and a few great illustrations of lazy being a good thing in your R session.
Episode Links
Supplement Resources
Supporting the show
Music credits powered by OCRemix


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:02):
Hello, friends. We're back with episode 196
of the R Weekly Highlights podcast.
Oh my goodness. That's four away from the big 200. It's sneaking up on us, folks. But in any event, this is the weekly show where we talk about the awesome highlights and additional resources that are shared
every single week at rweekly.org. My name is Eric Nantz, and,

(00:24):
I'm feeling a little decent now, but I will admit I've been under the weather lately. So I'll do my best to get through it, but I'm feeling good enough right now. But this is definitely one of those times where, more than ever, I am so glad I don't do this alone, because I've got my awesome cohost,
Mike Thomas, to pick up the pieces when I fall apart. Mike, how are you doing? I'm doing pretty good, Eric. I feel about 50% healthy. I'm fighting it too. So, hopefully, your 50% and my 50% can add up to a hundred, if that math checks out.

(00:54):
I do want to do a shameless plug, and we didn't even talk about this pre-show, but were you on another podcast
recently that anybody can check out?
I will be. Unfortunately,
that host got sick too, so we had to postpone it. So
much for talking pre-show. No. No. It's all good. It's all good. I will definitely link that out when it's published, but that'll be a fun episode on the Dakota radio program. So stay tuned on my social feeds for

(01:21):
when that gets out. But, yeah. Apparently, it's going everywhere, even
across the country too.
So, nonetheless,
we're good enough for today. And, let me
you know? Gosh, it's been such a whirlwind week. I gotta figure out who curated this issue. Oh, wait. Oh, wait. Look in the camera, Eric. Yeah. It is me. It was me that curated this week's issue.

(01:43):
As always,
great to give back to the project, and this was a very tidy selection.
And as always, I had tremendous help from my fellow R Weekly team members, and all of you out there with your great pull requests and suggestions. We got those merged in, and some of those ended up becoming a highlight this week. So
without further ado, we're gonna lead off today with arguably one of the more spicy takes we've had on the show this year at least, and maybe even

(02:11):
maybe even the past year as well.
And this is authored by,
I'll just say, a good friend of mine from the community, Ari Lamstein, who I've
met at previous Posit conferences. And in fact,
I have fond memories: he was actually one of the students at one of my previous Shiny production workshops. So he's been
always, you know, trying to be on the cutting edge of what he does, and he does consulting. He's done all sorts of things.

(02:37):
But this came into my feed,
around the time of the curation this week, and he
has this provocative title:
is CRAN,
CRAN being the Comprehensive R Archive Network,
holding R back?
So let's give a little context here, and then Mike and I are gonna have a little banter here about the the pros and cons of all this. So

(03:00):
Ari has authored a very successful package in the spatial, you know, mapping space
called choroplethr. I hope I'm saying that right.
But he's maintained that for years and years.
He would say that it is in a stable state. He did rapidly iterate on it over the years in the initial development,

(03:21):
especially when it was literally part of his job at the time to work on this package.
But, yeah, it's been in a stable state, you know, along those things where if it ain't broke, don't fix it.
Well, it was about a month or so ago that he was informed
that his package
was going to be archived.

(03:42):
Meaning, for those who aren't aware,
when CRAN archives a package, they will not support
the binary installation
of that package with the typical install.packages() function out of the box,
because of some issue that has been raised
by an R CMD check,
or maybe another issue that the CRAN maintainers think should be resolved

(04:05):
and ends up not being resolved.
Now you may ask, well, what was the issue with the package?
It wasn't with Ari's package. It was with a dependency package
called acs,
which I've not heard about until now.
Apparently, that got a warning.
And what was the warning about?
Well, if you go digging into the acs package's

(04:28):
page on CRAN, where it gives a link to the R CMD check results,
try this one on for size, folks. It was not a warning. It was a note.
Right.
And it says,
"configure:
/bin/bash is not portable."
Mike, when did R include a bash shell?

(04:48):
Do you know? Did I miss something here?
I must have missed it too. I mean, I have to imagine, right, this is when the checks run, so it probably has something to do with the runner
itself and the Linux box where this is being executed. This may be above my pay grade, but that is not an informative

(05:09):
message.
And it literally has
nothing to do with R itself, or even another language that a package could be based on, such as C++ or Fortran or the like.
So
Ari, you know,
he did not like this, and, frankly, I can understand where he's coming from here. And he decided that,

(05:32):
well,
choroplethr
is not at fault here.
And you know what? I'll let the chips fall where they may,
and it is now archived. So his package is now archived
because of the acs package being archived.
I personally don't know who authored acs, not that it really matters for the purpose of this discussion.

(05:54):
It has been archived, and now choroplethr
is archived as a result.
This,
so Ari, in the rest of the post, has some pretty candid viewpoints
on
the state of CRAN at this time.
I will admit he's not alone in some of the recent discourse I've seen online. I've heard community members like Josiah Parry, for whom I have great respect, voice some qualms about some recent

(06:18):
notes he's received from CRAN as he's been trying to either add a new package or
update an existing package.
And Ari, you know, really is being provocative here about
what impact CRAN is having now on the future growth
of R itself.
And then
Ari has lately in the last, I would say, a few years,

(06:41):
been working on additional projects in Python. So he does compare and contrast
how CRAN compares
to Python's
arguably default
package
repository, called PyPI.
Now
let's, let's have a little fun with this, Mike. I'm going to play, hopefully, the role of a good cop,

(07:04):
supposedly:
why CRAN,
even with these issues, is still
a valued piece of the R ecosystem,
and maybe this is just a very small blip in the overall
success of it. And then your job is gonna be to talk me down on this. So, you ready? Buckle up. Yeah. We can do that. And I think as a disclaimer, I would say, and I don't wanna speak for you, that both of us have mixed feelings on both sides of the fence. Yes. But I think we'll go through the exercise of good cop, bad cop here. I think that's a great idea. Yep. So

(07:43):
I've been a longtime R user, since 2006.
So I've known the ecosystem
for a while.
I dare say that as somebody new to the language, as I was getting familiar with base R, and then suddenly I would have courses in grad school that talked about

(08:03):
some novel statistical methods and also my dissertation on top of that.
Without the CRAN ecosystem,
there wouldn't have been a way to easily,
once I did my, say, lit review or other research about the methodology I needed, find that package and then be able to install it

(08:24):
right away in my R session.
But that package being authored by leading researchers in those methodologies,
and the fact that this was a curated set, meant I could have full confidence that the package I'm installing
is indeed going to work on my system,
and has been approved by the curator type of group that the CRAN team is,

(08:49):
and to be able to use that in my research.
Other
programming languages
do have a more automated system
where, yeah, you can throw any package on there, but you have no idea about the quality of it. You have no idea if it's going to destroy your system sometimes. You have no idea if it's even doing the right thing,

(09:10):
statistically speaking.
In my opinion, one of the great advantages of R is indeed that CRAN
is this targeted stable set
that's very akin to, say, the
the Debian type philosophy in the Linux world.
It's stable. You can rely on it; it will not break on you,

(09:31):
and they are making a concerted effort to make sure that this is a reliable
network for you to install packages from. And I don't think R is where it is today
without CRAN having that that curated group behind the scenes
to make all this happen.
What do you think?
Yeah.

(09:51):
So I think that maybe
the times have changed a little bit. And I think that
maybe five plus years ago,
people were developing packages
not
via GitHub,
this GitHub package development workflow that really exists now, I think, across the space. You know, sometimes you'll go to an old package

(10:13):
and you'll search for it on Google
or your favorite LLM, I guess, these days. And you'll try to find, you know, the GitHub repository that sits behind that package, so you can take a look through the code, and it doesn't exist. You're just taken to, like, the PDF,
right, of the package itself. And that's very frustrating

(10:34):
nowadays, but I guess that's probably reflective of the workflow that used to exist, you know, five or ten years ago, where we didn't really have this GitHub package
development driven workflow. And this is something that was raised on Bluesky by Yoni Sidi, who said, back in 2016,
that
they thought the R community was sort of setting itself up for problems by CRAN not building this infrastructure and software development life cycle

(11:03):
geared towards a focused GitHub package development workflow, where most
packages are developed right now, and CI/CD can be set up and all sorts of things like that. So I think, unfortunately,
in a lot of ways, CRAN has not
caught up with the times.
And then just some of the rigor and inflexibility

(11:24):
that they have,
I think can be construed as
over the top,
for lack of a better word. I think archiving a package because of a note,
seems
absurd. And I think that the time windows for folks to fix these things
are unnecessarily
short. I remember
ggplot2, it was a year or two ago,

(11:46):
was almost archived due to a dependency,
you know, that had been archived.
And ggplot2 has, like, a whole entire company behind it that can work on trying to rescue that. They have relationships
with the CRAN maintainers,
you know, the volunteers that work on CRAN,
that the rest of us do not have. Right? We're limited to

(12:10):
emailing back and forth, and I'm sure anybody that's done that before,
you know, has
struggled with that, for lack of a better word, in some sense.
So I think we're we have this dichotomy where if you are an R user, yes, CRAN can be very beneficial. But if you are an R developer

(12:30):
of packages,
it can potentially cause, you know, more headaches than it solves.
I definitely
resonate with that. So let's put away our caps and whatever uniforms here. Let's be real here.
I definitely think that this was a heavy handed

(12:51):
instance.
I think that
it's one thing to ensure broad compatibility
across different architectures
that R supports.
But for something like a /bin/bash note that
admittedly was not affecting users up to this point anyway,
Ari has always been very responsive to user feedback on his packages.

(13:13):
He has not heard one iota, in both his testing and others',
about choroplethr
being affected by this.
I think this is an artifact
of
a system that
has gotten the R community, you know, at large to where it is. I mean, without CRAN in the very beginning, I still stand by it: R would not be as successful

(13:37):
as it is now. But as you said, Mike, this is a different time now. This is a different time where there are multiple personas, so to speak,
leveraging R, going from more of an academic
type of, you know, statistical
environment for research. Now it's being used across industries. My industry in particular is really doubling down on it,

(13:57):
and it is being relied on in production way more than maybe someone might have thought five or six years ago. So
I do think that there needs to be,
a healthy assessment on where things can be improved upon.
I think transparency
is one of them. I think leveraging newer technology for automation is in another

(14:21):
because let's face it. Another key project that we covered on this show many, many times now
is the growth of R-universe,
where, no, it's not a human curated effort per se,
but it is also leveraging a lot of automation
to give,
you know, confident, reliable
package binary

(14:42):
installations
across different operating systems as well as new technology,
hint hint, WebAssembly,
to make this even more future proof.
I think there is somewhere in the middle
that maybe eventually
either CRAN
or another effort that is slowly coming up

(15:05):
into the discussion, the R Multiverse project, which we'll be hearing, I think, more about this year,
where maybe there is a best of both worlds. They still have a human in the loop, but yet take advantage of modern technology
that, say, R-universe has pioneered
to make this hopefully
a better experience for everybody involved, for the consumer,

(15:26):
i.e., the user of the package,
but also the maintainer and developers of the package.
And in Ari's post, he does throw out some statistics about how, you know, Python itself has a lot more, you know, metrics of their packages being downloaded and whatnot.
Well, admittedly,
Ari, that can be a loaded statistic, so to speak, because

(15:48):
I think, and this is actually captured in a LinkedIn post where Ari put up his blog post, with some great comments as well,
I agree with people like Joe Cheng who commented on this.
PyPI, yeah, maybe it's an automated thing, but it is a wild west.
And dependency management in Python,
I'll stand on this soapbox.

(16:10):
It still leaves a lot to be desired, and hence, there are people that are
destroying and rebuilding their environments left and right for a single project.
Joe in particular had a comment about how he must have installed
pandas, like, over 50 or 60 times when he was testing various things with his efforts in Python. So

(16:31):
yeah, it's great that PyPI has this lower-friction approach to get something online for a Python package,
but you can go extreme in another direction, and it can cause
havoc in that regard. So
I still think there's a middle ground to be had here.
But in any event, I still think that this was a very heavy handed,

(16:54):
heavy handed approach here, taken with archiving choroplethr
and acs as a result of this kind of note.
Because in the end, did it affect the users?
No.
Maybe it affected one esoteric build target that CRAN uses.
I won't say Solaris even though I do wanna say it because that's usually the butt of many jokes for

(17:16):
antiquated architecture.
But
I think it's good to at least have these discussions, and hopefully,
with the efforts that are happening with R-universe and, in the not so distant future,
the R Multiverse
project, that we'll
get to a middle ground somewhere.
Yeah. Yeah. I have to say, you know, the experience of installing an R package

(17:40):
is, I think, one of the big benefits of R over Python: you can install packages from within R, right, without a second piece of software like pip, which
is, you know, incredibly frustrating
for Python
newbies, especially.
And
I guess my only other point would be, and I'm not saying that this is the case with acs, I don't know much about the package.

(18:02):
But if you are an R package developer, one thing that you can do to try to mitigate risk, in my opinion, is to take a look and see how actively maintained the dependencies
of your package are. And if you're going to add a new package, make sure you take a look and see if there is somebody
who's, you know, contributed to that package recently, so it looks like they're maintaining it,

(18:25):
such that they would be responsive if an issue
like this did happen. There are a lot of R packages out there that haven't been touched in, you know, four plus years. If you go back and take a look at the code base,
it's probably only a matter of time until something happens. And
if nobody's there to respond to it, that's

(18:45):
going to be an issue for all of the packages that depend on it. So
in my opinion, that's that's an additional way that you can try to mitigate some risk.
Yeah. Very valid point, Mike. And my other key, you know, thought I had is that
a maintainer
should not be penalized
for having a stable package

(19:06):
where maybe a check does arise, but it has nothing to do with, like, how R itself works
or anything like that. If it's having to do with this kind of arbitrary
Linux shell issue that the package has nothing to do with, that should be treated differently than, say, oh, wait a minute, your use of, like, an object class system just completely broke, or a test completely broke, you know, whatever. That's a different story. I don't think these notes are all created equal here.

(19:35):
And that nuance is not lost on me. That was a note, not a warning,
not an error,
a note, which
yes. I mean, you you obviously strive to minimize those, but, again,
those are not all created equal.
Likewise. And I guess maybe the last thing I would say is if you are somebody who's
developed R packages and you're looking to dabble into potentially developing Python packages, Ari's also

(20:02):
drafted a blog post on his blog called A Guide to Contributing to Open Source Python Packages, which
you might find very interesting. Very good, Mike. We will link to that in the show notes.

(20:23):
So up next in the highlights, well, it is 2025.
You're usually gonna hear something about large language models and how they're, you know, helping productivity or helping a data science type of pipeline.
And so in our next highlight, we do have an interesting use case that, admittedly,
as I even mentioned, when I was using R for my dissertation,
it probably would have been really helpful in my lit review, because we are literally gonna talk about how

(20:46):
large language models could help you in a more systematic approach of literature review.
And this post is coming to us from the seascape models group at the University
of Tasmania.
Shout out to Tasmania. That's a first on the highlights
in the duration of this podcast. That is awesome. Yes.

(21:08):
If you ever had any doubt about how international R is in its footprint, you know, this is proof right there. Nonetheless, this is the first time I've heard about their research group, and
I don't know who exactly authored it, but I'm gonna say it's from their research group;
they talk about, you know, a very practical approach
that in their research or how they've looked at,

(21:31):
you know, it's one thing to assemble all the resources or manuscripts or papers that comprise a lit review, but
actually extracting the key information from them is another thing entirely.
So in particular, what they walk through in this post
is how you can leverage

(21:52):
an off the shelf, you know, large language model, you know, a ChatGPT-like service.
And then to be able to take a set of PDFs
and extract the information from them.
And to take that text,
try to clean that up as well.
And then once you get that going for, like, a single type of manuscript, how you batch all this together.

(22:15):
So first, like I said, the approach here is using one of the off the shelf
providers for generative, you know, AI, large language models.
They are using Anthropic in this post, and I've heard good things about that. So, of course,
there's no such thing as a free lunch here. You're gonna need an API key

(22:35):
because that will be leveraged as you interact with the service
to grab the information back from their chat like bot interface.
And so they got some nice call outs of packages that help you with managing environment variables.
You know, typically, I'm an old school guy. I've always done the .Renviron file in my working directory.

(22:57):
There's this great package that I admittedly need to use more often called dotenv
that will help you do this in a slightly more agnostic way.
You just set up a .env file in your home directory,
put in your Anthropic key, and you're off to the races, because dotenv has a little load_dot_env() function

(23:18):
to import that in. And then you can use that as your environment variable, and you're good to go there. So a
great little package right off the bat for that side of it.
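To make that concrete, here's a minimal sketch of that setup, assuming a .env file containing a line like ANTHROPIC_API_KEY=... (the variable name is our assumption):

```r
# Minimal sketch: load a .env file and read the key back out.
library(dotenv)

load_dot_env(file = ".env")                  # sets the entries as env vars
api_key <- Sys.getenv("ANTHROPIC_API_KEY")   # retrieve it like any env var
```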
And then the package that they're using to interact with the Anthropic models
is called the tidychatmodels package,
which I was not familiar with either.

(23:39):
I need to do some
research on
where this package comes from, but it looks pretty
straightforward here. You create
a chat object defining the name of your service, your API key,
and the version of the API.
But you could use other APIs as well. So I have

(24:00):
to look at this package in more detail later on. Looks pretty nifty here.
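For a rough idea, here's a minimal sketch of that setup; the model name and parameter values are assumptions, so check the tidychatmodels documentation and the post itself for the exact details:

```r
# Minimal sketch: build a chat object for the Anthropic service.
library(tidychatmodels)

chat <- create_chat("anthropic", Sys.getenv("ANTHROPIC_API_KEY")) |>
  add_model("claude-3-5-sonnet-20240620") |>       # assumed model name
  add_params(temperature = 0.2, max_tokens = 1000) # tuning knobs (see below)
```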
Once you get all that set up, now
just like with any of these services, you gotta figure out
what you want to use for your prompt and how to perform that. So they have a little, you know, basic example for adding a message based on the role that you

(24:21):
supply
and the prompt text.
Role is a key concept here because there's typically two roles here. There's a user role
and a system role
where you may get more precise control over, say, the system role,
but they give you some links
to determine which is best suited for you. In this case, they're gonna lean on the system role here

(24:47):
to give a little more granular control over the type of model that they're gonna use
in the LLM, you know, interrogation.
And you can add certain parameters such as temperature
or max tokens, which they have lots of links of documentation
on where you can
find more information here. But this kinda checks out with my
explorations of the ellmer package, where I learned very quickly the prompt is the key here, along with some other interesting things you can augment

(25:16):
with that chat interface to
that chat service.
So once you have all that, they've got it ready to go
to look at, you know, getting the text summarized. But you gotta get the text in there itself.
So in this case,

(25:36):
they've leveraged the pdftools package
for a convenient way to grab the text from a PDF that you've downloaded on your computer
with the pdf_text()
function.
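As a minimal sketch of that step (the file name and cleanup pattern here are ours, not the post's exact code):

```r
# Minimal sketch: pull the text out of a downloaded PDF and tidy it up.
library(pdftools)

pages <- pdf_text("turtle-paper.pdf")        # one character string per page
paper_text <- paste(pages, collapse = "\n")

# Strip odd non-printable symbols that often ride along with PDF extraction
paper_text <- gsub("[^\x20-\x7E\n]", " ", paper_text)
```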
But you also have to make sure
that you are able to authenticate
directly
to your API service because

(25:58):
on top of the authentication,
you may have to format the text effectively. So he or she notes that the first time they ran this,
they got a 400 error from the API service because the formatting wasn't correct. Because when you extract text from PDF,
there could be some strange symbols in there, some strange
artifacts, and you gotta clean that up a little bit. So that's a good walkthrough on the practical ways of leveraging this workflow. And for the rest of the post, Mike, why don't you talk us through some of the challenges and learnings that the authors had with this workflow here?

(26:34):
Yeah. You know, it's really nice that we have these APIs that allow us to do things
programmatically.
I think, like you mentioned, it wasn't quite as straightforward as,
extracting the text from the PDF using that pdftools package. There was some cleanup for those characters. But once that
took place,
it was pretty straightforward to be able to interact with this LLM, and I believe that's

(26:59):
much thanks to this tidychatmodels
package
that exists.
It has functions in it like add_params(), where you can set what's called, like, the temperature,
and the max number of tokens
that you expect to interact with, with respect to the LLM. You can add a particular message, in this case, a system prompt. And,

(27:24):
the system prompt here was: you are a research assistant who has been asked to summarize the methods section of a paper on turtle mortality.
You will extract key statistics on sample size and year of study, and do not extract any more information beyond this point. So those system prompts are
intended to sort of fix the response or the way that the LLM

(27:49):
will respond to you,
prior to even
providing your prompt, your user prompt.
And then finally, there's a function called add_message(), which allows you to submit that user prompt. And in this case, it'll be the text that was extracted from the PDF.
In order to
send this to the LLM, there's a final function called perform_chat(),

(28:13):
that will send that. You can
save the output of that to an object; in this case, the author used an object named new_chat.
And then to take a look at what the LLM gave you back, you can use the
extract_chat() function
from the same package against that object

(28:34):
that you saved. And it's pretty cool here to take a look at the results.
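Putting those pieces together, a minimal sketch of the round trip might look like this; the prompt wording is paraphrased from the post, the argument names are assumptions, and `paper_text` is the PDF text from the earlier step:

```r
# Minimal sketch: system prompt + user prompt, then send and inspect.
new_chat <- chat |>
  add_message(
    role = "system",
    message = paste(
      "You are a research assistant who has been asked to summarize the",
      "methods section of a paper on turtle mortality. Extract key",
      "statistics on sample size and year of study, and nothing more."
    )
  ) |>
  add_message(paper_text) |>   # defaults to the user role
  perform_chat()

extract_chat(new_chat)         # the conversation, including the LLM's reply
```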
I will give one hot tip. So the results come back,
as text
where the LLM says, you know, based on the methods section, here are the key statistics: a sample size of 357
sets
for a large-scale longline fishery, and the year of the study

(28:55):
was 2018. So it is text that you would then have to parse, and there's a little bit of code that leverages a lot of functions from stringr and I think dplyr
as well in order to actually extract, you know, just the numeric values for the sample size and the year.
I will give a hot tip here, just based upon our experience in the past. We have done things like

(29:21):
use the system prompt to tell the LLM to only return
data in JSON format
with the following elements, like sample size and year of study. And that will actually spit back out JSON,
that you can, you know, then consume much more easily without having to do any
string
data wrangling, if you will. So that may be helpful to some of those folks out there. But if you can ensure that the output that you're getting is fairly standardized,

(29:51):
like it seems to be the case here, then, you know, hopefully, we can program the whole solution. Right? And we don't have to
do
different parsing logic based upon, you know, the different
prompts that we are
essentially querying.
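As a minimal sketch of that tip (the prompt wording and field names are ours):

```r
# Minimal sketch: ask for JSON only, then parse it instead of string-wrangling.
library(jsonlite)

json_instruction <- paste(
  "Return ONLY valid JSON with two fields: sample_size (integer) and",
  "year_of_study (integer). No prose, no code fences."
)

reply <- '{"sample_size": 357, "year_of_study": 2018}'  # an example LLM reply
stats <- fromJSON(reply)
stats$sample_size     # 357
stats$year_of_study   # 2018
```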
And
so the code here is a fantastic walkthrough, pretty lightweight in terms of the number of packages that are being used, which is awesome. There are some considerations here and some great discussion at the end around cost uncertainty. If you're doing these things programmatically,

(30:21):
right, you need to make sure that you're
calculating,
estimating, your cost. You know, most of these third party providers have the ability to
set budgets, I think, so that you don't go over, you know, $10 or a hundred dollars, whatever it is that you,
you know, have linked in your account. I know at least the Claude models have that, which is really nice.

(30:47):
And
maybe the last thing that I will say, just on a related topic while we're talking about LLMs, is a
totally
separate shout out, but there's this project out there called Continue. I don't know if you've heard of it, Eric. No, I haven't. It's a VS Code extension, and I heard about it on Hugo Bowne-Anderson's recent podcast, I think the Vanishing Gradients

(31:11):
podcast, which interviewed
one of the, I believe, chief developers of this Continue project.
And it's an open source
coding assistant.
And you know me and open source, you know: go away, Microsoft.
It does have the ability so you can select sort of your back end model that you want to use, and that could be OpenAI, it could be Claude, but it could also be a local model, like one of those from the Ollama project,

(31:41):
which is what I've been using,
one of the smaller ones.
And it has the ability to just chat with your code.
It has the ability to
bring in particular pieces of context. Like, you can use a GitHub repository, just specifying the URL, and it will crawl that GitHub repository,
you know, using your GitHub PAT, if it's a private repository

(32:05):
that you can supply it with. And, it will essentially do all the text embedding work for you to,
you know, put that in a vector database or whatever it's called,
to be able to
utilize that as context that you can chat with. You can put multiple sources of documentation like that if you want.
And,

(32:26):
you can also highlight pieces of code and, you know,
chat with the context being that particular highlighted piece of code. It does autocomplete,
all those really good things. And it's a really, really cool user experience. And I would encourage
anyone out there who is looking for, you know, the Copilot experience without

(32:47):
necessarily
wanting to,
interact with a third party, and continue to leverage open source software, to take a look at this Continue project. I think it's maybe continue.dev,
but we can link to it in the show notes. Yeah. We're gonna link to it. And while you were talking, I wanted to see: hey, wait a minute, I wonder if I could plug this into Positron

(33:08):
in my
bleeding edge evaluation of Positron. The good news is this extension
is on the Open VSX registry, meaning it's not locked into VS Code. You could use it in the variants. So I think I'm doing that later today, Mike. I'm gonna give this a shot, because I have been hesitant
to get on the Copilot train, even for my open source stuff. I've been leveraging

(33:30):
some other alternatives, and this may be one of them. So
just goes to show you there's a lot of advancement here. And by the way, yeah, plus 100 to your tip about
giving the prompt some detailed information on how to get results back. Getting results back in a structured way like JSON
just opens up so many possibilities,
and I leverage that technique extensively

(33:52):
when I was making this fun little
haunted places Shiny app that leveraged an LLM for R/Pharma last year. I made sure that when I made it randomly generate the quiz questions,
it gave them back to me in JSON so that I could present them in Shiny very easily with dynamic inputs. So there's a lot
a lot at your fingertips here, as I think this post does highlight.

(34:15):
Do some quick tests first in a specific use case, and you will have a lot that you'll learn along the way. But we're only just scratching the tip of the iceberg with this, so to speak. There's a lot more to come in this space.

(34:43):
And last, but certainly not least: Mike, I will admit, on a day like this when I'm not feeling like myself, I do feel a little lazy about certain tasks. But we're not gonna talk about lazy in a negative connotation here, because our last highlight today
is literally giving us a very practical take

(35:04):
with the many different ways that laziness is
actually available to you as an R user depending on your context, depending on your workflow and the packages
that you're utilizing.
And so this last highlight comes to us from the R-hub blog and from a very brilliant group of authors, if I do say so myself: we got Maëlle Salmon,

(35:28):
Athanasia
Mowinckel, and Hannah Frick. So that's quite a trio right there
to start off with. And so I'm gonna
introduce certain pieces of this, and I'll turn over the mic for the rest. But
if you've been an R user the last few years,
there's probably one interpretation
of lazy that you've heard throughout
your journey with R,

(35:49):
and that is the concept of lazy evaluation.
It is arguably one of R's biggest selling points, the idea
of lazy evaluation.
And what this really means
is that if you have a function with arguments,
those arguments are only going to be evaluated at runtime, so to speak,

(36:10):
when they are accessed.
So you may be able to pass like a huge
value for an argument. Maybe that's like a data frame or some other large vector.
If it's not used,
it's not gonna really matter. It's just gonna be there in the definition.
Only when you actually call something
that leverages it

(36:31):
will it
actually be evaluated.
And,
there is a counter
concept to that called eager evaluation.
But in typical
default
behavior for R, it is lazy evaluation, and they have a link to a great chapter
from the Advanced R book, authored by Hadley Wickham, with another, more thorough introduction to lazy evaluation.
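A tiny illustration of that idea (our own example, not from the post):

```r
# Minimal sketch: the expensive default is never computed because the
# function body never touches it.
f <- function(x, unused = Sys.sleep(60)) {
  x * 2          # `unused` stays an unevaluated promise
}
f(21)            # returns 42 immediately; no 60-second wait
```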

(36:57):
And in fact, in base R itself, another concept that you may be familiar with is the idea of a promise,
where it may be something that is
on tap to be evaluated,
but it's in essence more of a recipe to get to that value.
They call that an expression most of the time

(37:17):
or it could be from an environment as well. And again, only when you need it
will it be evaluated into memory. So that can be very
important depending on your workflow.
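For a hands-on feel, here's a minimal sketch using base R's delayedAssign(), which creates exactly this kind of promise:

```r
# Minimal sketch: the expression is stored as a recipe, evaluated on first use.
delayedAssign("big", {
  message("evaluating now")
  rnorm(1e6)
})

length(big)   # prints "evaluating now", then forces the promise
length(big)   # already evaluated; no message the second time
```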
I mentioned promise. Right? Well, there is a very important part of the R ecosystem
that leverages
a different take on promises

(37:37):
with respect to, high performance computing and async processing.
And that comes from the future package.
The future package is authored by Henrik Bengtsson, another really brilliant
researcher in the R community that I had the pleasure to meet at posit::conf a few years ago.
And in the future package,

(37:58):
this promise is more thought of as a placeholder
of a value
And, again,
there are different ways to configure this with the future package.
You can have what's called a lazy future
or an eager future.
And in essence, when you define this future, the default behavior is actually to be eager, meaning that when you define the function that has that future

(38:24):
encapsulated, the moment you define that and it runs something
right off the spot, it's gonna wait until that task is done.
However,
you can also feed in an argument of lazy = TRUE,
meaning that that future will just be set to run later. It will not hog your R session, and you can do other things in the console and do whatever you want, but only when you want that future value

(38:50):
will it actually do it.
So that could be important if you're new to the future package, to figure out, well, wait a minute, I thought the whole point was that it could run in the background. You gotta be careful
in how you define
how that future
is spelled out or initialized.
So, again, their version of lazy is not quite the same as the definition we heard in base R and whatnot, but that is important if you're into that space.
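Here's a minimal sketch of that eager-versus-lazy distinction (slow_task() is a stand-in of ours):

```r
# Minimal sketch: eager futures launch immediately; lazy futures wait.
library(future)
plan(multisession)

slow_task <- function() {
  Sys.sleep(2)
  "done"
}

f_eager <- future(slow_task())               # dispatched to a worker right away
f_lazy  <- future(slow_task(), lazy = TRUE)  # just a recipe; nothing runs yet

value(f_lazy)    # the lazy future is only launched (and waited on) here
value(f_eager)   # the eager one has likely already finished in the background
```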

(39:17):
And now we're gonna shift context to data itself because
a very key concept in database operations within the R language is the idea of lazy operations.
What does this mean? Well, in a nutshell,
with these database back ends, such as, say, MySQL
or SQLite

(39:38):
or others,
these queries
that you might define
with the help of, say, the dbplyr package,
which accommodates a lot of dplyr-like syntax, but for databases.
You may have
an analytical pipeline where you're gonna take data, maybe mutate a few things, summarize with group processing,

(40:01):
but only when you, quote unquote, collect that result
will it actually be evaluated. So it's a way to efficiently run
SQL queries, instead of, as with a typical data frame, running all those steps one by one in memory
right away.
So that is a really important concept that

(40:24):
the dbplyr package surfaces.
And also more recently, the dtplyr package
does a similar thing with data.table as the back end for managing that data.
And again, much like the database, you know, paradigm we mentioned earlier, the lazy way of doing it is gonna capture the intent of those data processing steps,

(40:46):
but not actually do anything until that result is requested,
i.e., collected with the collect() function.
Again,
really important selling point.
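To see that shape in code, here's a minimal sketch against an in-memory SQLite database (the table and column choices are ours):

```r
# Minimal sketch: nothing touches the database until collect().
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

query <- tbl(con, "mtcars") |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

show_query(query)   # so far this is just SQL; no rows have moved
collect(query)      # now the query actually runs and returns a tibble

dbDisconnect(con)
```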
But there is, in the realm of databases,
another newer contender
that has even more of a nuanced take on this, and that is the duckplyr package, because Mike and I are big cheerleaders for the DuckDB back end. I absolutely love it.

(41:16):
So with duckplyr, again, this is using DuckDB on the back end, so to speak.
Now there can be a little bit of a problem here with respect to traditional dplyr-type usage.
Usually, things are eager by default in dplyr, like I mentioned, with a typical data frame.

(41:36):
But with the concept of DuckDB, one of the reasons we wanna use it as a whole is that we can optimize the queries,
optimize the computations,
before they're actually run.
So duckplyr
does need this same concept of laziness that those traditional
packages like dbplyr
actually need.
Now this is what's interesting here. The way duckplyr is pulling this off, and we're getting a little in the weeds here,

(42:03):
is that it is leveraging
ALTREP, which is one of the more fantastic
contributions
to base R
over the last several years,
where there's more power
behind vectorized
operations, and
it supports what's called deferred
evaluation.
More specifically, and I quote from the post here:

(42:24):
ALTREP allows R objects to have different in-memory representations,
and for custom code to be executed whenever those
objects are accessed.
So
that means for duckplyr
that they can have a special version of these callbacks and other functions to interrogate
whatever is the root of that operation, say the query,

(42:48):
an analytical
summarize, or whatever have you.
So then duckplyr, by proxy,
is actually lazy in terms of how it itself
runs its operations,
but it seems eager to you as the end user when you're running, like, a duckplyr-based pipeline.
So they got examples here where there could be cases where this is very important to utilize this functionality,

(43:16):
and cases where it might be
more
applicable to add a little more control to it, or add a safeguard to it.
I've never played with this before, but there's a concept called prudence
to control just how automatic
this evaluation
of this ALTREP laziness is done here. There's stingy,

(43:39):
and then there's thrifty. I love these names, by the way. Those are really
creative, but they got examples in the post with the mtcars set
of the differences
between how these are approached here.
So this is something that you probably wanna look at with the recent version of duckplyr. It had an upgrade, I think, within the last few weeks or the last year. There's a lot of rapid development on it, and I think it's got

(44:06):
tons of potential
for leveraging high performance workflows
with a database at the back end. And, again, a clever use of laziness
with respect to ALTREP. So I am eager to try that out.
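Based on the post's description, a rough sketch of the prudence idea might look like the following; treat the argument names and values as assumptions to verify against the current duckplyr release and the post itself:

```r
# Rough sketch: a "stingy" duckplyr frame refuses to materialize implicitly.
library(duckplyr)

cars <- as_duckdb_tibble(mtcars, prudence = "stingy")

res <- cars |>
  summarise(avg_mpg = mean(mpg), .by = cyl)

# With "stingy" prudence, operations that would silently pull the result into
# memory (printing all rows, nrow(), etc.) should error instead of computing.
collect(res)   # explicit materialization runs the optimized DuckDB plan
```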
But, of course, there are way more ways that laziness and, you know, lazy evaluation
play a role in the rest of the kinda typical R workflows that you might have. So, Mike, why don't you take us through those?

(44:32):
Yes. A few more quick hitters for us in this blog post.
When we talk about lazy loading of data in packages, I think a lot of us have experienced this before. When you
you're in R. Right? You can quickly
access, like, the iris and the mtcars
datasets which are built into your installation
of R. I'm not sure, Eric, you probably have to help me with this a little bit, if they are loaded into memory prior to calling

(45:00):
them, prior to actually evaluating them.
But that's sort of this concept here where if you have an R package that does have a package dataset
in it and and sets the lazy data field in the description file to true,
then the exported datasets are are lazily loaded and they're they're available without having to call the data function,

(45:21):
right, for for those particular datasets.
But they're not actually taking up memory until they are accessed.
So that's
something interesting there. It's something that we've run into a few times, actually. We have some functions in some of our packages
that programmatically,
sort of, you know, with the use of, like, regular expressions and stringr, try to decide which

(45:45):
internal package dataset
you want to leverage in that function
and, unfortunately, you have to call
library() on the package first in order
for that function to work. You can't just namespace it,
or else it will fail. And I'm not sure if we've solved
that
yet. It's a bit of a workaround.

(46:06):
Is that something you've run into before, Eric? Yeah, the hard way, quite a bit, even with my golem-powered Shiny apps, where I include an internal dataset
as, like, a way to have, like, me or my colleague test an example
set that the app would use.
I've had
to do some, you know, very weird hacks of, like, just doing an arbitrary command on that data frame to trick it into loading in memory before the function completes.

(46:35):
I don't really have a great solution for that. So, hey, Colin, if you're listening, maybe you could help me out with that, by the way. But, nonetheless,
that's where I've encountered that bugaboo the most.
Yes. Yes. No, that's a great point. And there's a couple of links here, I think, that may help discuss this concept of lazy data further.

(46:56):
There's the R Packages book by Hadley Wickham and Jenny Bryan, and then there's also the Writing R Extensions
manual, which I think is more sort
of authored by some of the core R developers,
or,
you know, from that perspective.
So those might be two good resources if you're interested in learning a little bit more about lazily

(47:17):
loading data in packages.
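As a minimal sketch of the mechanics (the package and dataset names here are hypothetical):

```r
# In the hypothetical package's DESCRIPTION:
#   LazyData: true

library(turtledata)     # attaching does not read the dataset into memory
head(turtle_catches)    # the promise is forced here, on first access

# One explicit alternative, handy when you only namespace the package:
e <- new.env()
data("turtle_catches", package = "turtledata", envir = e)
head(e$turtle_catches)
```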
I love lazy logic that checks to see if something ever needs to be rerun, and that's sort of the concept of caching, right, in a broad sense.
And the authors here give the example of the lazy argument in the pkgdown build_site() function,
which if that argument is set to true,

(47:38):
it will only rebuild articles and reference pages if the source is newer than the destination, which makes a whole lot of sense and can save a whole lot of time, depending on how big your project is. And that's something that I have to talk to a client about today, because we have a GitHub Action that is taking way more time than it needs to take.
I feel seen about that. Absolutely.

(47:59):
Yep.
I digress.
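For reference, the call in question is tiny; a minimal sketch:

```r
# Only rebuild articles and reference pages whose source is newer than the
# already-built output.
pkgdown::build_site(lazy = TRUE)
```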
Similar concept with the lazytest package,
that helps you only rerun tests that failed during the last run.
And the last example here is regarding regular expressions. I had never heard of the terminology lazy
being applied to regular expressions,
but if your regular expression is finding

(48:22):
all matches
of whatever pattern you're looking for,
that's considered eager. And if it's only finding the first match, or the fewest number of repetitions possible, as the authors define it here,
then it's considered to be lazy.
And in the example that they provide, the question mark character in the regular expression is what adds this laziness.
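Here's a minimal sketch of that difference in base R (the example string is ours):

```r
# Greedy (eager) vs lazy matching; perl = TRUE enables the `?` modifier.
x <- "<b>bold</b> and <i>italic</i>"

regmatches(x, regexpr("<.+>", x, perl = TRUE))   # greedy: the whole span
regmatches(x, regexpr("<.+?>", x, perl = TRUE))  # lazy: just "<b>"
```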

(48:45):
So, a ton of examples here, really, really interesting
blog post. I think it's always interesting, you know, whatever these authors put out. It's some neat perspectives that maybe we don't think about or have on a day to day basis. And,
I would say that, if you didn't get it already, there are a lot of different definitions around laziness

(49:05):
when it comes to programming, and R programming especially.
They did omit one definition of laziness, which is the one that takes place when people just copy and paste code from ChatGPT
and don't even look at it before incorporating it into their project or repository
or even worse, pushing it to production.
That's bad laziness, as opposed to a lot of the good laziness that we were talking about today. But context is king, as I say. And, yes,

(49:31):
we both have had experiences where that's happened, and we're like, oh, boy, is this what we're in for now?
Just my 2¢. Yeah. Yeah. But I think it's a viewpoint that's shared with a lot of people. But, yeah, lots of great additional, you know, links in this post to dive into each of these in greater detail. As I said, I'm really intrigued by the duckplyr approach to this, because I've never seen something kinda straddle both lines, showing both eagerness and laziness depending on the way you're interrogating

(50:01):
that. So I'm gonna do some homework after the show about that, because I'm trying to up my DuckDB
power here, so to speak, after that great workshop I took back at posit::conf last year. I'm all in on that train. And, yeah, in this case, lazy is definitely not a bad thing in many of the approaches here.

(50:21):
And what else is not bad is R Weekly itself. I would dare say we're not lazy in terms of how we curate the issue. That is very much an eager evaluation,
in a good way.
Normally, we do our additional finds. We are running a bit low on time, so we're gonna
close up shop here and, again, invite you, if you wanna help contribute to the project.
The best way to do that is with a pull request to R Weekly itself and the upcoming issue.

(50:47):
If you found that great blog post that maybe spurs a lot of discussion in the community, like we had with Ari's post, or a great technical deep dive, or a great way to use a new R package out there, we're just a pull request away. All Markdown, all the time. The template's already there. Head to rweekly.org
for complete details on that. And we love hearing from you on the social medias. A great shout out to those that have gotten in touch and sent us some good things on social media.

(51:15):
But you can find me. I'm now on Blue Sky, where I'm at rpodcast@bsky.social.
I'm also on Mastodon where I'm at rpodcast@podcastindex.social.
And I'm on LinkedIn. You can search my name, and you'll find me there. And, Mike, where can the listeners find you?
Sure. You can find me on Bluesky at mike

(51:38):
-thomas
.bsky.social, or on LinkedIn;
if you search Ketchbrook Analytics, K-E-T-C-H-B-R-O-O-K, you can see what I'm up to lately. Very good stuff. And, thank you again. We made it. In our 50%
workflow, we somehow made it. So that's why having a co host is a really good idea in these times. So nonetheless,

(52:01):
we will close up shop here for R Weekly Highlights, episode 196.
Yeah, we're not far away from 200, folks. It's coming up soon. And we'll be back with episode 197
of R Weekly Highlights
next week.