
March 12, 2025 • 41 mins

There are over three dozen US lawsuits between rights holders and developers of large language models, with the main point of contention being whether it was permissible “fair use” to train LLMs with copyrighted content. Newspaper organizations, authors, music publishers and visual artists have filed suits against OpenAI, Microsoft, Meta, Google and Anthropic, among other leading developers. Damages could run into the hundreds of billions of dollars, with the potential for milestone decisions this year. In this episode of Votes and Verdicts, Frankfurt Kurnit Klein & Selz partner Jeremy S. Goldman joins Bloomberg Intelligence analyst Tamlin Bason to discuss AI copyright issues and analyze how a recent rejection of fair use in an LLM dispute may influence pending cases.



Episode Transcript

Speaker 1 (00:12):
Hello, and welcome to the Votes and Verdicts podcast hosted by the litigation and policy team at Bloomberg Intelligence, an investment research platform of Bloomberg LP. Bloomberg Intelligence has five hundred analysts and strategists working across the globe, focused on all major markets. Our coverage includes over two thousand equities and credits, and we have outlooks on more than

(00:34):
ninety industries and one hundred market indices, currencies and commodities. Now, this podcast series examines the intersection of business, policy and law. I'm Tamlin Bason. I'm an analyst with Bloomberg Intelligence covering intellectual property litigation impacting the tech sector. Now, often that means a focus on patent litigation, but right now we're

(00:55):
in a period of a rising tide of copyright litigation that could have profound impacts on the singular technology issue of the past few years, that being artificial intelligence, more specifically generative AI, which is made possible by large language models that are trained on massive data sets. Now, the sticking point is that much of that training data was copyrighted,

(01:18):
and we've seen dozens of lawsuits pop up in the US, brought by the rights holders to some of those data sets against the developers of those large language models. Today we're going to explore some of those lawsuits, and we're going to spend quite a bit of time talking about something called fair use. And we're going to be doing all of this with the very capable help of Jeremy Goldman, partner and co-chair of the Emerging Technology

(01:40):
Group at the law firm Frankfurt Kurnit Klein & Selz. Jeremy, thanks so much for joining us today. Now let's start with fair use. This seems to be the primary defense that's going to be put forward by developers of large language models. Now, fair use seems like a bit of a get-out-of-jail-free card, in that it doesn't mean that the rights holders' copyrights weren't infringed, but it does seem to mean that the infringement was justified

(02:03):
by the end use. Jeremy, what does someone accused of copyright infringement have to show in order to get that fair use protection?

Speaker 2 (02:12):
Yeah, well, thanks Tamlin for having me, and happy to talk about these cases and fair use. First, fair use is actually codified in the United States Copyright Act, in a section called seventeen USC one oh seven, and that section says that even though you might exercise one of the exclusive rights of copyright, even though you might make a copy,

(02:34):
even though you might make a distribution of somebody's copyright-protected work, section one oh seven says that even if you exercise one of those exclusive rights that make up the bundle of sticks that is a copyright, it's not an infringement. It's actually not an infringement if it's a fair use. And then it says, well, what is a fair use? And it says, well, it's a case-specific

(02:57):
question about whether it's fair use. And Congress tells courts that we're going to look at four factors to decide in each instance, and they're non-exclusive, but there are four factors that Congress says courts should look at to decide whether it's fair use, and they're all codified in the statute. We can go through the four of them, but basically, in all the cases involving fair use, courts will look

(03:17):
and examine each of the four factors, weigh them, and
then decide whether on balance those factors tilt in favor
of it being a fair use or not being a
fair use.
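
To make that balancing structure concrete, here is a toy sketch in Python. The four factor names come straight from seventeen USC one oh seven; the lean scores, weights and thresholding are invented purely for illustration, since real courts weigh facts and arguments, not numbers.

```python
# Toy illustration of the section 107 weighing exercise described above.
# Factor names are statutory; the lean/weight numbers are made up here.
from dataclasses import dataclass

@dataclass
class FairUseFactor:
    name: str
    lean: float    # -1.0 (favors the rights holder) .. +1.0 (favors fair use)
    weight: float  # courts tend to weigh factors one and four most heavily

def balance(factors: list[FairUseFactor]) -> str:
    score = sum(f.lean * f.weight for f in factors)
    return "tilts toward fair use" if score > 0 else "tilts against fair use"

factors = [
    FairUseFactor("purpose and character of the use", lean=0.6, weight=2.0),
    FairUseFactor("nature of the copyrighted work", lean=-0.3, weight=1.0),
    FairUseFactor("amount and substantiality used", lean=-0.5, weight=1.0),
    FairUseFactor("effect of the use on the market", lean=-0.4, weight=2.0),
]
print(balance(factors))  # different facts, different tilt: case by case
```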

Speaker 1 (03:27):
Yeah, and I think it's important that this is codified. But at the same time, it very much is a doctrine that has evolved through case law, I think it's fair to say. I mean, there's been a number of high-profile cases on fair use, both in the circuit courts and of course at the Supreme Court, but probably relevant to what we're talking about today, to the development of large language models, there's probably a few that

(03:48):
stand out, and I think one of those was the Google Books litigation. Can you maybe tell us what that litigation was about and why there might be some parallels to these large language models?

Speaker 2 (04:00):
Sure, and you know, I'm happy to talk about the Google Books litigation. I'll also mention that I was a litigator that worked on those cases, and in those cases I represented the Authors Guild in lawsuits that were brought against both Google and against a consortium of libraries that worked with Google in connection with the Google Books program. My views today are just talking as a

(04:21):
lawyer to opine on sort of the decisions that came out in those cases, and certainly don't represent the views of the Authors Guild, or the views of any parties in those cases, or even the views of my law firm. But I'm happy to, and I'm very familiar with those cases and those decisions. So with that disclaimer, let me just talk about those cases. Google made it its mission to make the entire world searchable, right? That was the goal

(04:41):
of Google, to make the entire world searchable. They started with the World Wide Web, of course, and you know, you go to Google, put in a search, and get results back from the Web. And to do that they had to copy, you know, essentially the entirety of the World Wide Web in order for that to happen. Well, Google decided that, you know, maybe that's not enough, and we also want to make all of the world's knowledge,

(05:02):
including all of the books of the world, searchable. So in order to do that, Google partnered with major university libraries around the country, and they entered into agreements with them to go to the libraries and take all of their books, millions upon millions of books, many, many of which were still protected by copyright, and they set up

(05:24):
scanning stations, and they scanned and reproduced and made copies of millions upon millions of books in order to digitize them, and then ran them through an optical character recognition process, an OCR process, ultimately for the purpose of creating a search index, so that when you do searches, either on books dot Google dot com or even on the main Google search engine, if you find keywords that

(05:46):
hit on words within the books, you would be able to find those books. And what Google was doing was displaying a snippet of where the keywords were located within a book, and so you'd see that, oh, it was in this book on this page or whatever. And on the other side, the university libraries, as part of the deal, also got access to this search index, and there was a website called HathiTrust where you could put

(06:08):
in keywords, and they didn't show snippets, but they did show that in this book on this page, you'd be able to find that information. The university libraries also decided that as part of this process, they were going to start making certain works that were, quote, orphan works available online to members of the university community. Well, authors around the country, indeed authors around the

(06:30):
world, were not happy that Google was digitizing and scanning, and making use, potentially for commercial purposes through Google, of millions and millions of books without permission. And you know, books being at sort of the core of what copyright protects, the rights of authors, the exclusive rights of authors, they felt that this was a

(06:50):
violation of their copyright, and so they filed lawsuits against Google.
They filed lawsuits against the university libraries alleging that the
digitization and scanning of the books was copyright infringement. And
in those cases, Google and the university libraries defended themselves
primarily on the ground that, yes, we made copies, but
those copies were a fair use, and they argued that

(07:13):
it was transformative: that books were written and published for one purpose, and we were using them for quite a different purpose, which was to make them searchable. Also, they said, making them available for people that were blind was another example of what they were doing. And ultimately the case went up to the Second Circuit Court of Appeals, and the court found that indeed

(07:35):
it was a fair use and held that this digitization
for the purpose of making it searchable was a fair use.
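
For a feel for what "digitize to make searchable" means in practice, here is a minimal Python sketch of the idea: OCR'd book text goes into an inverted index, and searches return short snippets around keyword hits rather than the books themselves. Google's actual pipeline is proprietary and vastly larger; the titles, texts and snippet logic below are invented for illustration.

```python
# Minimal sketch of a book search index (illustrative only; not Google's
# actual system). `books` stands in for OCR'd scans; we build an inverted
# index and return small snippets around hits, never the full text.
import re
from collections import defaultdict

books = {  # hypothetical OCR output: title -> text
    "Moby-Dick": "Call me Ishmael. Some years ago, never mind how long...",
    "Walden": "I went to the woods because I wished to live deliberately...",
}

index = defaultdict(set)  # word -> titles containing that word
for title, text in books.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(title)

def search(keyword: str, width: int = 30) -> list[tuple[str, str]]:
    """Return (title, snippet) pairs: where the word lives, not the book."""
    results = []
    for title in sorted(index.get(keyword.lower(), ())):
        text = books[title]
        i = text.lower().find(keyword.lower())
        results.append((title, text[max(0, i - width): i + width + len(keyword)]))
    return results

print(search("woods"))  # points a reader into 'Walden' via a short snippet
```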

Speaker 1 (07:43):
Yeah, and I think you sort of alluded to it there, and ultimately it did go up to the Second Circuit, but this was by no means a swift resolution of a case. I recall it going on for years and years. There were potential settlements that I believe were ultimately rejected by the court. But to some extent, what they showed is that these can be very difficult issues

(08:04):
to sort of resolve in court. Fair use is kind of a squishy concept. There's very few bright-line rules.

Speaker 2 (08:13):
Totally. And I'd also add, at a high level, you know, that case, not just on the legal points, has analogies to the current cases that authors and creators are filing against the AI companies, but on a feeling level, on an icky level. There's a very similar sentiment, where there was a sentiment that

(08:33):
you know, authors, who are really... The Copyright Act itself, and indeed the Constitution, talks about giving authors exclusive rights over their works for limited periods in order to promote the progress of science and the arts. That's the constitutional mandate of the Copyright Act. And authors, you know, felt, here's this technology company coming in and scanning and digitizing

(08:55):
and making use and profiting off of our work, and they're claiming that it's fair use, and that just seems wrong.
And I feel like there's the same kind of sentiment
now with the you know, the AI cases that we'll
get into.

Speaker 1 (09:06):
Yeah, I think that's absolutely right. And I guess just to drill down a little bit further on, you mentioned it right there, the transformative use. I believe that's ultimately how Google was decided. But that's only one component of the fair use factors. It's sort of, I think, kind of baked into the first fair use factor. Do you want to really quickly run us through those fair use factors?

Speaker 2 (09:28):
Let me run through the fair use factors, and also in doing so, we'll talk about transformative use, because indeed that word, transformativeness, doesn't appear in the fair use factors. That's a judicial creation. It's a creation really of a judge, Judge Leval, in a Harvard Law Review article that he wrote about fair use, and then the Supreme Court picked it up in a case involving a 2 Live

(09:49):
Crew song. I won't go too far down that road, but just to tick through the factors, okay, I don't have it in front of me, but I think I know it by heart, right. The first factor is the purpose and character of the use. The second factor is the nature of the work. And we'll go back to the first factor because it's kind of the most important, so I'll go through the four and then we'll go

(10:11):
back to the first. So purpose and character of the use.
The second one is the nature of the copyrighted work,
which is really like how close is it to the
core of copyright protection? And then you could think of
like a fictional work versus a nonfictional work. Copyright doesn't
protect ideas, so copyright will be more steadfast in protecting a fictional work or a highly creative illustration, for example, versus something like an encyclopedia. The third factor is

(10:33):
the amount and substantiality of the use, right, how much of it was used in relation to the whole, sort of a quantitative question. And then the fourth is the harm of the use on the market, and so that's really like a damages thing, like how harmful is this? And courts tend to weigh very strongly the first factor, which is the purpose and character, and the fourth factor. Now,

(10:53):
just going back, like I said, to the first factor, when we talk about the purpose of the use, the way courts have dealt with this over the years, they've articulated the first factor and broken it into two pieces. First, is it a commercial use or a non-commercial use? Until recently, that sub-factor of factor one got

(11:14):
very little, actually very little kind of deference, right. People often think, oh, if it's a nonprofit organization, they're going to get a lot of deference, if it's a school, if it's an educational institution. Indeed, that does provide some weight, but ultimately it's very rarely been dispositive in a court's analysis, whether it's commercial or non-commercial. What is the most important

(11:36):
factor, and what courts have paid the most attention to, and what starts getting to the core of modern fair use jurisprudence, is the purpose of the use. And there, the Supreme Court weighed in originally in the case involving 2 Live Crew, what was held to be a parody of the song Pretty Woman by Roy Orbison, a dirty version by 2 Live Crew. The Court held

(11:59):
that something like a parody is transformative because it's taking the original work and it's commenting on the original and shedding new light on the original, sort of adding something new, and the new work has a new purpose from the original. It has transformed the work from what it was before into what it was later. And the

(12:21):
Supreme Court weighed in on that in the mid-nineties in that 2 Live Crew case, and hadn't really weighed in on it again until, what, last year or two years ago, when they decided a case involving Andy Warhol, and I imagine we'll get to that. An Andy Warhol work was not held to be fair use. Though I should be careful the way I say that: the particular

(12:44):
use of an Andy Warhol work was held not to be fair use. But the transformative use test kind of controls frequently. And what you see in the cases involving fair use is the defendant trying to argue that their use is new and transformed from the original one, and the plaintiffs saying, no, we did it for this purpose

(13:04):
and you're using it for the same purpose.

Speaker 1 (13:07):
That's very well spoken, and also you crushed it on those various factors without having them in front of you. Good job, law school training coming back. So you drew a lot of parallels between Google Books and the litigation that's going on now with the LLMs. But I'd like to point out a distinction that I've written about, and it's that Google Books was not necessarily Google's core business model.

(13:30):
You mentioned, yes, they wanted to digitize everything, but books are really only a small portion of this. I don't think Google was ever expecting to make massive amounts of revenue necessarily from Google Books. It was kind of a side project. Obviously it had implications for authors, but for Google, maybe not core to the business model. I think that's different than what we see now with OpenAI or

(13:50):
Anthropic or some of these other LLMs. The LLM is the business model, and in the US, copyright infringement damages can be massive. For willful infringement, I believe it's up to one hundred and fifty thousand dollars per work, and we're talking about millions and millions of potential works being used.
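
The statutory arithmetic behind that is easy to run. Under seventeen USC five oh four (c), statutory damages generally range from seven hundred fifty dollars to thirty thousand dollars per work infringed, up to one hundred fifty thousand dollars per work for willful infringement. The work counts in this sketch are hypothetical, meant only to show orders of magnitude, not to estimate any actual case.

```python
# Illustrative statutory-damages arithmetic only; not an estimate of any
# real case. Per-work figures come from 17 U.S.C. section 504(c); the
# number of works is hypothetical.
def exposure(num_works: int, per_work: int) -> str:
    return f"${num_works * per_work:,}"

print(exposure(1_000_000, 750))      # $750,000,000 at the statutory minimum
print(exposure(1_000_000, 150_000))  # $150,000,000,000 at the willful ceiling
```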
So I think this really does potentially pose sort of an existential threat to those business models if the fair

(14:14):
use defense doesn't hold up. I think this is why it's so important, as this field is still kind of in its infancy. But let's turn back, I think, to the Warhol case, and also the way Warhol was applied more recently in probably the first AI case that we've had, that being Ross Intelligence.
Now you wrote about that, and I think in your

(14:35):
post you said this case helps delineate the boundaries of acceptable fair use of copyrighted material for AI model training. Can you tell us why what Ross did was outside of those accepted boundaries?

Speaker 2 (14:49):
Well, we don't know what the accepted boundaries are, right? We don't know the extent of the accepted boundaries. What we have now, and just to give a little context for listeners here: you have now upwards of forty cases for copyright infringement or copyright-adjacent type claims that have been asserted against AI models for using copyright-protected materials

(15:10):
without permission and without payment. The big ones and the most famous ones being, sort of, the New York Times filing a lawsuit against OpenAI, or Sarah Silverman filing similar types of lawsuits. And it's important for your question to understand the differences between the models and why this case is important.

(15:33):
Let's just take New York Times versus OpenAI. That's a case by publishers and authors against a general-purpose generative AI large language model. Underlying it is this, you know, GPT model that's being trained on billions and billions of points of training data and has kind of limitless purposes. And people can put

(15:55):
those models to use in all sorts of industries for all sorts of purposes, one application being ChatGPT, for example. The case by Thomson Reuters against Ross Intelligence was brought way back in twenty twenty, actually, so it really predates a lot of this. That's a case where Thomson Reuters, which provides, among other things, the legal research tool Westlaw,

(16:17):
which is, you know, my go-to legal research tool of choice, filed a lawsuit against Ross Intelligence, which was an AI-powered legal research tool, and they claimed that Ross Intelligence used Westlaw's headnotes, which are, you know, summaries of legal points and holdings, and which Thomson Reuters says

(16:39):
are copyright protected. They claimed that Ross Intelligence used them to train its legal AI model, and claimed that this was copyright infringement. Ross Intelligence argued that, no, we don't ever output the Westlaw headnotes, and our use is transformative and it's fair use. And in that case the court held that it was not fair use. And

(17:00):
that decision came out just February eleventh, very recently, and it is the first meaty, substantive fair use decision on this existential question that you talked about in the beginning, which is: is the use of copyright-protected materials to train artificial intelligence models copyright infringement, or is it fair use? And in this particular case, the court

(17:22):
said that it was not fair use. Very important to understand, though, are the particular facts of this case. As we said, fair use is a case-by-case analysis. It's very fact-specific, and there are important distinctions between the facts of the Ross Intelligence case and the facts of cases like New York Times versus OpenAI. Whether those, and

(17:45):
we can talk about what those differences are and why they might matter, whether those will be enough to change the outcome in those cases, is, as I've called it, the trillion-dollar question. I say trillion-dollar because of what you said, right? I mean, you're talking about massive copyright damages, potential copyright damages, in an industry that is now valued at, probably, you're at Bloomberg, you're in a better

(18:05):
position to tell me, but I imagine it's a trillion-dollar industry now. And, you know, the outcomes of those rulings could really have an impact on that industry and the amount of damages that could be at stake there.

Speaker 1 (18:18):
Yeah, I mean, AI has certainly been the key theme in the investment community for years and years, and the valuations seem to sort of double every few months. So it's a massive amount. But yeah, let's sort of dive into why. I think you mentioned Ross was not generative AI; I think the judge was pretty clear to draw that distinction. But will that necessarily make it different in how

(18:41):
Ross applied, I guess, Warhol, when we do start dealing with the generative AI defendants?

Speaker 2 (18:47):
I don't know. That's one where I'm not going to give you... I don't have a crystal ball, and I'm not even going to give you my personal take on that. That's too dicey. Here's what I'll tell you. Let's just talk about why it might matter, why it might be a relevant distinction and not what we call in the legal world a

(19:08):
distinction without a difference, right. Sometimes you have factual differences that don't have any legal bearing, and sometimes they have legal bearing, so let's talk about why it might matter. Here's an example, and this just goes to the fact that the Ross Intelligence model was not generative AI. What that model allowed you to do was put in an input with a legal research question, asking the AI model, hey,

(19:30):
what's the law on X, Y and Z, and then it would come back with relevant judicial opinions, which are public domain, and say, here's a judicial opinion that answers the relevant question. The point of the Westlaw headnotes, why they used them, was to help train it to understand the way that people talk, because not everyone talks like a judge in a judicial opinion. People talk more like Westlaw headnotes than they talk like judges in opinions.
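
To make the retrieval-versus-generation distinction concrete, here is a toy Python sketch. It is not Ross's actual system, whose internals are not described here; the opinions and the matching logic are invented. The point is simply that a retrieval tool returns existing public-domain text verbatim, while a generative model composes new text.

```python
# Toy retrieval tool (illustrative only; not Ross Intelligence's actual
# architecture). It maps a plain-English question to an EXISTING judicial
# opinion and returns it as-is; it never writes new text.
from difflib import SequenceMatcher

opinions = {  # hypothetical public-domain opinions
    "Smith v. Jones": "A landlord must keep rented premises habitable.",
    "Doe v. Roe": "Punitive damages require clear and convincing proof.",
}

def retrieve(question: str) -> str:
    """Return the most similar existing opinion, nothing newly generated."""
    def score(title: str) -> float:
        return SequenceMatcher(None, question.lower(), opinions[title].lower()).ratio()
    best = max(opinions, key=score)
    return f"{best}: {opinions[best]}"

print(retrieve("does my landlord have to keep my apartment habitable?"))
```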

(19:51):
So they used those Westlaw headnotes to train the AI to talk like a human and not just like a lawyer or a judge. But it never generated new text. Contrast that with something like ChatGPT, which generates original text that presumably, in the outputs, is not going to... You know, it could, but that's a different question. The outputs likely don't infringe the underlying materials

(20:15):
that were input into the system. Courts have, in cases involving fair use, looked at things like, is the new use going to create more content that's creative and new and useful, ultimately driving further the constitutional mandate that I mentioned before. In the Constitution, Article One,

(20:38):
Section Eight, Clause Eight says, to promote the progress of science and the useful arts, we're going to give authors exclusive rights over their works for limited periods of time. That's what copyright is. It's giving exclusive rights to authors to promote the progress of science and the arts. One thing that a court might look at is, does generative AI

(20:59):
help promote the progress of science and useful arts by creating all of this output that's useful for society. I think, you know, another sort of angle on this, which isn't really about the generative AI, which is more like the ultimate use of it, but another factor that plays into that and is very important to understand, less about the generative AI and more about the application of the large

(21:20):
language model, the GPT, for example, that underlies OpenAI's model, or an Anthropic equivalent, is what data is used to train those models and to make them able to do the extraordinary work that they do. And what the court found relevant in the Ross Intelligence case is that they were using these

(21:43):
Westlaw headnotes in order to train it to provide something that ultimately was found to be sort of competitive with Westlaw. But the court said, you know, that's material that you could have gotten a license for, right? You could have paid. It's not like you're dealing with that large a body of work; you could have created your own headnotes. You know,

(22:05):
like, Thomson Reuters, and the court said this, Thomson Reuters paid people, authors, which is the way it's supposed to work under copyright, paid them to go and create headnotes, and they get to protect those and you have to pay for them. And there really was nothing preventing getting a license from a LexisNexis or a Westlaw, or just hiring

(22:26):
authors to create their own headnotes. Now, it's difficult to see, and I think an argument that the OpenAIs and Anthropics of the world will try to make in those cases is to say, we couldn't license all of this stuff that is needed to train these large language models. Let's emphasize large. We

(22:47):
need it to be really large. An argument that I think is going to be made, and whether this is going to prevail, again, I'm not gonna opine on it, but I think the argument will be something like: we need all of the language of the world, we need all of the culture of the world, we need everything that's out there, and we couldn't possibly, under any circumstances, license everything that we need in order to teach these

(23:11):
models to speak language, the second L of LLM, and to understand culture and understand who we are. And so there was a need, and courts look to see, under the fair use factors, was there really a need for what you were taking here without sort of a license or without permission? And the argument, again, is a novel type of argument, and there's also

(23:33):
a good rejoinder that I see the authors here make, which is, well, maybe just because you can't license it all means you don't get it. It doesn't mean you get to just take it, right? But I think the argument would be: we need everything, we need everything, and there's no world and there's no possibility of ever having a licensing model in order to get everything in the world.

Speaker 1 (23:52):
Yeah, that's a really interesting point, and I think it again goes to Ross Intelligence, where they were in some ways sort of vertically focused. They were sort of designed to compete with Thomson Reuters in this area of headnotes and case law. And to the extent that it might have some read-through to the generative AI companies, it's in some of these cases where I also see, in some of the complaints,

(24:14):
the potential for that to be similar. I think some of the visual artists' cases against some of these AI models that also output images that could be licensed. Also, I think there's a new one by Dow Jones that also raises in the complaint at least the argument that you're actually doing this so that you can license sort of news-flow data. So I think that's where we

(24:35):
might see potential read-through. But I think also, as you mentioned, anybody who says they know how this is going to go, there's no possible way to know. One, you have upwards of forty cases; they could go very different ways. And two, the question is, who is going to decide fair use? Now again, in Ross Intelligence, it was decided at the summary judgment stage, but at

(24:56):
first the judge didn't want to decide it at that stage. So again, we've talked about how these are very hard concepts, and judges struggle with them, and juries are certainly going to struggle with them if they are given at least some of these fair use factors. Do you think we're going to see sort of maybe potentially diverging precedent here, some judges deciding, some juries deciding,

(25:19):
some circuits taking different stances on this?

Speaker 2 (25:21):
I think that is more likely than not. I also think that in general with jurisprudence, when you have multiple cases that deal with, you know, similar but not identical issues, what you typically have is judges try not to reject the reasoning

(25:43):
of their brethren and sisters in the other courts. Rather,
what they try to do is to distinguish their cases
on the facts. And there's a lot of ground for
doing so in these cases. So if, for example, a
judge in one of the cases, the New York Times case against OpenAI or any of the other cases, even if they don't adopt the exact reasoning

(26:04):
of Judge Bibas, who issued this decision in Ross Intelligence, or if it goes up to the Third Circuit Court of Appeals, even if they don't agree with the reasoning, what a judge will try to do first is say, our facts are different and so we come out differently. It's more likely that courts will do that than to, you know, head-on disagree with the underlying reasoning. You know, that

(26:24):
said, right now we just have a district-level case. I'd be surprised if it doesn't end up on appeal. There's every reason to believe that a circuit court could disagree with Judge Bibas. Judge Bibas is sitting by designation; he's a circuit judge who's sitting by designation in the district court. And like you said, just a few months ago, he came out saying that there

(26:47):
was a good ground for fair use, and he was going to allow Ross Intelligence to make its fair use case
to the jury. And then you know, something changed and
he decided that it wasn't fair use as a matter
of law. So these are really close calls. I think
it's really likely that it goes up to a circuit.
I think it's very likely that judges in the Ninth Circuit,
for example, which tends to be fairly liberal around fair use,

(27:09):
and the Second Circuit, the judges tend to be fairly
liberal around technology and fair use. It was the Second
Circuit out of New York that held that Google Books
was fair use, and it's the Ninth Circuit that covers, you know, San Francisco and all of the tech companies.
That doesn't mean that they're going to hold that this
is fair use. It does mean that you do have
a situation where you could end up with circuit splits,

(27:30):
and also, no one would be surprised if this does end up in the Supreme Court. And whether
the Supreme Court has the copyright expertise to handle a
case like this is a real question, and you know,
something that's creating a lot of uncertainty around what's going
to happen.

Speaker 1 (27:49):
Yeah, absolutely. And I think with Warhol, I don't know if people were necessarily stunned by the outcome, but I believe it was seven to two, so a fairly sizable majority. And we should point out that IP in general, and copyright as well, doesn't necessarily have the partisan bent that a lot of other issues do. But still, seven to two was a fairly strong opinion

(28:09):
and again against fair use in that case. So it's
definitely an evolving landscape. I would say one thing that
we haven't necessarily talked about is sort of the input
versus output distinction in this debate. So you can either
have infringement based on the input, that is, when the LLM digests, scans, whatever it does to get that information

(28:32):
into the system to train on that data, that can potentially be an infringement of an author's rights. Also, potentially, the AI output could infringe. There's also an entirely separate matter, and that's whether an LLM can actually produce copyrightable content, but we're going to leave that aside for now. That's an entirely separate conversation, where we don't necessarily have litigation

(28:53):
to dive into. But the input versus output question, what are your views on that and how that might play out?

Speaker 2 (28:59):
Yeah, I mean it can play out in a few
different ways. So one, let's just talk about how the
output can play into the questions that really involve the input.
And you know, the easiest way to think about that
is in the cases that have been filed against the
against the LM's like open AI. There is an effort

(29:19):
by authors in some of those cases, including The New
York Times, to argue that our articles are being reproduced
almost verbatim when users ask certain queries. Right, so a user puts in an input, tell me what happened in this article, and according to the New York Times complaint, ChatGPT will, with certain prompting, output material that the New York Times

(29:42):
argues is substantially similar to the articles. Now, if that's true, then it becomes a very much easier case for the New York Times. Right, then it's like, well, what's the difference between that versus just a pirate website that copies New York Times articles and makes them available for free? If you're able to just go into this thing and reproduce articles in full, there's sort of little

(30:03):
doubt that that would be sort of an infringement. OpenAI is really contesting that that's the way the tool is supposed to be used, and arguing that, you know, the only way that you were able to get ChatGPT to display those outputs was by basically taking the model and beating it with a stick until it came out with the articles that were verbatim copies. And so

(30:24):
it's much easier under copyright, much easier, to argue that the technology you've created is supplanting the original if the outputs are generally substantially similar to the inputs, right? That's pretty intuitive. And so the authors are frequently trying to focus on the outputs, and they've tried different theories.
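
As a rough illustration of the verbatim-regurgitation claim described above, here is a toy Python check for shared word n-grams between a source article and a model output. It is purely illustrative: substantial similarity is a legal test, not a string metric, and this is not how any party in these cases actually measures it. The texts are invented.

```python
# Toy regurgitation check (illustrative only): what fraction of a source
# article's word 8-grams reappear verbatim in a model's output?
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(article: str, output: str, n: int = 8) -> float:
    """Fraction of the article's n-grams reproduced verbatim in the output."""
    a = ngrams(article, n)
    return len(a & ngrams(output, n)) / len(a) if a else 0.0

article = ("the city council voted on tuesday to approve "
           "the new budget after months of debate")
output = ("reports say the city council voted on tuesday to approve "
          "the new budget after months of debate")
print(f"{overlap(article, output):.0%} of 8-grams reproduced")  # 100% here
```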

(30:46):
Another theory that was tried and rejected was in one of those cases involving images. The argument was: because all of the images that you output from your model are trained on our copyright-protected images, everything that is output from your model is a derivative of our inputs. Even if the output is not substantially similar

(31:06):
to any particular input, all of the outputs are derivative
works of the original. A novel argument, but it was rejected already on a motion to dismiss by the court, because the law of copyright is that in order for
something to be a derivative work, it has to be
substantially similar in some way to the original. And so
the court said, no, that's not the way derivative works go.

(31:27):
So in many of the cases, the plaintiffs and the
authors are having to focus primarily on the inputs and
to say that there's still infringement even if the outputs are not substantially similar; the inputs themselves are infringing. Let me just say
one other category of why the outputs matter, which is
sort of a connected question, and one that the defendants

(31:50):
the models, are happier to argue because it's better ground for them, which is this: listen, if you're the user and you're controlling this platform, and you create an output that, because of your prompting and because of your directions, turns out to be infringing, and then, for example, you make some commercial use of that image. Right, say you create an image, you put it into an advertisement, and

(32:12):
you start selling shoes with this, you know, infringing image.
The models would like to say that's kind of on you,
and that you were the one who engaged in that,
and that the model had no volitional conduct. And part
of copyright is, you kind of have to... It's not willfulness, it's not like a malevolent thing, but you at least have to have some volitional intent. Like, you know, if

(32:34):
you sneeze and you have an accident, often you can
get off the hook, like it's an involuntary movement. Right.
They kind of want to argue that if somebody is
using the tool in a way that it's not supposed to be used, or creates some kind of infringing content, then who should be responsible? Should it be the user of the platform, or should it be the model? And
that's the other question that sort of hasn't been answered

(32:56):
yet by courts.

Speaker 1 (32:57):
So it sounds like potentially there's some contributory infringement issue.

Speaker 2 (33:00):
That's right. Well, that's what the plaintiffs would argue.

Speaker 1 (33:02):
Yeah, so that could get even more tricky to wade through. So you're talking about how the New York Times, I think that's right, in their complaint they say these are regurgitations of our articles. And some of the authors' complaints, potentially the Authors Guild's or maybe the original Silverman complaint, also had that. And it strikes me as also interesting because

(33:23):
we may also have divergence there, because a news article that is largely a regurgitation of a historical event might have less copyright protection under sort of the second fair use factor than, potentially, a novel. So there's so many different directions that this could evolve in that, yeah, it's going to be really interesting to see how this plays out, probably over the next two years

(33:43):
or so. Just really quickly, I want to touch on licensing.
You sort of mentioned before that a potential argument will
be that we can't license the world, but they certainly
are licensing a lot. We've seen so many large licensing agreements.
I think the largest ones I can remember is Google
with Reddit for I think something like sixty million dollars

(34:05):
a year. The New York Times, of course, has sued, and a lot of other newspaper publishers have sued, but a lot of them have also just agreed to licensing agreements, whether it's with OpenAI, whether it's Google, whether it's some of these other LLMs. So how does that potentially cut against an LLM on fair use, if they can go out

(34:25):
and license a lot of this content, or does it
maybe not impact it? Or again who knows?

Speaker 2 (34:31):
Yeah, I mean, so just going back to the Google Books case that we talked about in the beginning, one of the challenges in that case for the authors was proving that there was a licensing market for the uses that were being made by Google Books. That was a challenge, and the court

(34:53):
picked up on that challenge. On that fourth factor, which is the harm to the market for the copyrighted work, the court found in that case that there was no market for licensing books for the purpose of creating a search index. No one had ever paid any author, I want to use your book so that I can make it searchable on the Internet; without displaying the book,

(35:14):
people won't be able to read the book, you'll just
be able to find where the information lives inside of
a book. No one had ever paid for that, and
so the court had difficulties seeing how that was harming
the economic interests of the author's copyright. Okay, contrast that
with what you just talked about, where you have a
fledgling licensing market for using copyright protected works to help

(35:38):
train AIs. And when you have the AI companies that are trying to defend themselves with fair use, are they cutting off their nose to spite their face by entering into these agreements and creating a market that didn't really exist before? You know, that's a novel market that now exists.
And now the plaintiffs in those cases have fodder to

(35:58):
argue that there's a market harm, that there's a cognizable, developing and, you know, evidenced licensing market. The question is whether that will really undercut the argument I was talking about that you just raised, which is, we need to digitize the entire world. Because, yeah, so that's number one: just because we can license

(36:20):
some of the materials from certain players, that's a far cry from the entirety of the world's knowledge, right? And so that's not it; we still couldn't license nearly enough data to train all these models to the point where they are. And you have new players in the market like DeepSeek, which had originally said something like,
we don't need to use that much training data, but

(36:42):
it was kind of an artificial argument, because they were building on the models that had already been developed using, you know, these extraordinarily massive amounts of training data. So there's that. The other important argument that I imagine will come out from the AI platforms is, you're comparing apples and oranges,

(37:03):
that this is not getting the entire corpus of the
Internet and sucking it into our model. When we enter
into these license agreements, in many cases, what we're paying
for is not just the underlying raw data.
We're paying for other things that are more valuable to us.
For example, you're giving us access to archived materials that

(37:25):
are not openly available online. You're giving us access to
materials that are behind a paywall. You're also allowing us
in many instances to display the outputs. Like, you know, some of these agreements that are out there, and I've worked on many of them, I see many of them, they say things like, you can display, you know, one hundred or tens or whatever it is, X number

(37:47):
of words from the article or from the original book
or whatever it is, and display that to users in
response to searches, provided that there's a link back, et cetera.
Sometimes a link, sometimes not. And those are rights and
privileges that are not part of what the AIs are doing.
So they would say, you know, it's not really fair
to look to those license markets and look to those

(38:07):
you know, agreements as evidence that there's a licensing market,
because this is a totally different sort of arrangement and
that's what we're paying for. Again, I'm not calling balls
and strikes here. I'm just saying that these are some
of the arguments that I think will be made and
some of the distinctions that they will attempt to draw, and that could sway the court one way or the other.

Speaker 1 (38:28):
Yeah, absolutely. And I think it also sort of shows the appetite to train these models on things like news content, on things like novels. I think I was reading some executive at one of these companies saying there's no better way to train an LLM than on novels, seeing how people speak. You know, it's just sort of some

(38:50):
of the best data that they can get to train these models on. And it'll be interesting to see how these cases shape up.

Speaker 2 (38:58):
What's encouraging is... because look, I'm somebody who, certainly in this context, can see both sides, but I certainly feel an affinity towards authors and artists and creators, right? I think that copyright does exist to support their work and their effort, and they should be paid for their work and their effort. I think that there's a lot of doom and gloom

(39:19):
and that's also appropriate and understandable, and fear around what's going to happen with these models, and I totally get it. What's also nice to see is, I've seen authors get paid, like, you know, all of a sudden get a check for like twenty-five hundred bucks for a book that hasn't sold in many, many years, and all of a sudden some publishers are entering into deals

(39:40):
and like authors are getting a check for a few
thousand bucks for a work because they've entered into a
deal to, you know, use it to train an AI or, you know, make some kind of use of it.
So that to me is also encouraging and so sort
of from a business standpoint, you really see the development
of these new markets, and that to me is exciting
and encouraging. Although again I know that there's a lot
of anger and fear and you know, litigation over the

(40:04):
broader issues, but it is nice to see authors getting paid, including in connection with these new markets.

Speaker 1 (40:10):
Yeah, sure. And I think overall, I look at the markets, and my feeling has been that the markets are to some extent underestimating just how much of an overhang this could potentially be for large language models. And I think, you know, our discussion today sort of touched on just how tricky this situation is, how fair use can go one way or another: one court,

(40:31):
it may certainly be fair use; another court, maybe not, a distinction is drawn. I think we're going to see some more summary judgment decisions ripen in twenty twenty five. I believe one of the Meta cases, I believe they're briefing that in the earlier part of this year, and I think there's a case with Judge Alsup out in California with Anthropic, where I think they're scheduled to do both summary judgment on

(40:51):
fair use and sort of class certification later in the year. So I think it's possible we start to get some clarity on this in twenty twenty five, maybe moving into twenty-six, but it's going to be an absolutely fascinating area to watch, I think, over the next few years. So I think we're going to leave it there. We've covered a lot of ground. Jeremy, I really appreciate

(41:11):
you coming on today and all of your really invaluable insights on this. It's good to hear from a practitioner's perspective that I'm not wrong that these are challenging issues. It's going to be very interesting to see how this plays out. Jeremy, thanks so much.

Speaker 2 (41:28):
Thanks, Tamlin. Anytime.