
September 29, 2025 • 37 mins
A comprehensive guide for Natural Language Processing (NLP) professionals, focusing on advanced techniques and applications using the spaCy library. The text covers various topics, including linguistic feature extraction, rule-based matching, utilizing transformer models like BERT and RoBERTa within spaCy, and integrating Large Language Models (LLMs) with spacy-llm for tasks like summarization and custom applications. Additionally, it details how to train custom Named Entity Recognition (NER) components with user-provided data and how to build end-to-end NLP workflows using Weasel and DVC, highlighting best practices for data annotation and corpus creation. The book aims to equip readers with the skills to develop robust and efficient NLP solutions, offering practical examples and code snippets for hands-on learning.

You can listen and download our episodes for free on more than 10 different platforms:
https://linktr.ee/cyber_security_summary

Get the Book now from Amazon:
https://www.amazon.com/Mastering-spaCy-structured-solutions-components/dp/B0DVCBRFZX?&linkCode=ll1&tag=cvthunderx-20&linkId=6a4aefa23f5b3e6319ac1ace9668278c&language=en_US&ref_=as_li_ss_tl

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Welcome to the deep dive, where we slice through the
information clutter to bring you the clearest, most important insights. Today,
we're taking a bit of a shortcut to becoming well
informed about a really powerful tool in natural language processing NLP.

Speaker 2 (00:14):
It's called spaCy, that's right, and it's an interesting one.
If you think of those huge language models, you know,
like ChatGPT, as maybe a big powerful food processor. Okay,
then spaCy is more like your practical, really well optimized
kitchen knife. It's a library that's specifically designed to help
you get actual work done.

Speaker 1 (00:34):
So moving beyond just theory.

Speaker 2 (00:36):
Exactly, beyond just academic concepts and toward efficient, practical application.
And we're going to uncover some surprising depth today. I
think from you know, basic text processing right up to
integrating with the latest AI stuff.

Speaker 1 (00:48):
Sounds good, And our mission for this deep dive is
basically to give you a comprehensive but still really accessible
understanding of what spaCy can do. We're drawing from quite
a few sources, including the excellent book Mastering spaCy. Okay,
let's unpack this. So to kick us off, what's the

(01:08):
absolute core thing our listeners should get about spaCy?

Speaker 2 (01:12):
Well, at its heart, spaCy is this incredibly fast, open
source Python library, and it's really built for production ready
NLP applications.

Speaker 1 (01:22):
Production ready. That sounds important.

Speaker 2 (01:24):
It is. A lot of its speed comes from using
Cython for the really performance-critical bits, so it's highly
optimized but still easy to use from Python.

Speaker 1 (01:33):
Aha. So it's not just another like academic tool set.
It's built for real world stuff from the.

Speaker 2 (01:37):
Get go precisely. That's a key difference compared to maybe
something like NLTK, the Natural Language Toolkit, which historically at
least was often more focused on students researchers. Spacey, you're
hitting the ground running for deployment.

Speaker 1 (01:48):
You mentioned it's built to get work done? Is that
like the official philosophy pretty much?

Speaker 2 (01:53):
Ines Montani, one of the core creators, often talks about this.
The goal is genuinely to help people do their work efficiently.
They're not trying to build some massive do everything system.
Oh okay, it's more about providing these sharp, reliable tools
like that knife, to fit nicely into whatever you're already doing.

Speaker 1 (02:10):
Got it and getting started? Is it complex?

Speaker 2 (02:15):
Not? Really? It works with modern Python runs on you know,
the usual operating systems, Windows, Mac, Linux.

Speaker 1 (02:21):
And best practice is probably virtual environments. Right.

Speaker 2 (02:24):
Keep things clean. Oh, absolutely, always a good idea for
any Python project. Keeps your dependencies sorted.

Speaker 1 (02:29):
Now you mentioned something important. The language models aren't built
in, correct?

Speaker 2 (02:33):
That's a key point. spaCy itself is the framework, the tools,
but for the statistical smarts, things like tagging parts of
speech or finding named entities, you need to download a
language model separately.

Speaker 1 (02:45):
Like en_core_web_sm, that kind of thing? Exactly, like en_core_web_sm
for English.

Speaker 2 (02:49):
There's a quick command line thing, python -m spacy download
en_core_web_sm. That downloads the small English model, gets you
the core pipeline components.
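
A minimal sketch of the setup being described (assuming spaCy 3.x is already installed with pip):

    # from the command line: download the small English pipeline
    python -m spacy download en_core_web_sm

    # then, in Python, load it
    import spacy
    nlp = spacy.load("en_core_web_sm")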

Speaker 1 (02:58):
Okay, and once you've got that, how do you sort
of see what it's doing?

Speaker 2 (03:01):
Oh well, that's where displaCy comes in. It's spaCy's built
in visualization tool, and it's fantastic. How so? It just
makes really complex linguistic concepts much easier to grasp visually.
You can see dependency parses, how words connect, or see
named entities highlighted right in the text. It helps you
spot patterns almost

Speaker 1 (03:22):
Instantly, so you can actually see the analysis.

Speaker 2 (03:24):
Yeah, you can try it online. There's a demo, or
you can run it locally from your code, even in
Jupyter notebooks. It's super helpful for understanding what's going on
under the hood.
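
A quick sketch of what that looks like (assuming en_core_web_sm is installed; jupyter=True renders inline in a notebook, and displacy.serve would start a small local web server otherwise):

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is opening a new office in Berlin.")
    displacy.render(doc, style="dep", jupyter=True)   # dependency arcs
    displacy.render(doc, style="ent", jupyter=True)   # highlighted named entities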

Speaker 1 (03:32):
Okay, so setup done, model downloaded, visualization ready. Let's
talk about the core processing. You mentioned a pipeline.

Speaker 2 (03:39):
Yeah, I think of it like an NLP assembly line.
When you load a model, say using spacy.load, you
get back this nlp object, right? And when you
feed text into that object, like doc = nlp("This is
some text"), it runs the text through a sequence of
processing steps.

Speaker 1 (03:55):
The pipeline components exactly.

Speaker 2 (03:57):
The default pipeline usually include. It's a tokenizer, a tagger
for part of speech, a dependency parser for sentence structure,
and an entity recognizer or any R component. Each does
its specific.

Speaker 1 (04:10):
Job, and the output is this doc object.

Speaker 2 (04:13):
Right. The doc object holds the result. It's not just
the text, it's the text broken down into tokens, and
each token is enriched with all the linguistic features found
by the pipeline.
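
A rough sketch of that assembly line (component names are those of the standard English pipelines):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    print(nlp.pipe_names)
    # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

    doc = nlp("This is some text.")
    for token in doc:
        print(token.text, token.pos_, token.dep_)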

Speaker 1 (04:23):
Let's break down that pipeline. First up, Tokenization and sentence
segmentation sounds simple, just splitting words, Ah.

Speaker 2 (04:29):
Well, it's a bit more nuanced than just splitting on spaces.
Tokenization is breaking the text into its smallest meaningful parts,
the tokens, words, numbers, punctuation. They all become tokens. Okay,
But here's a surprising detail. Unlike most other pipeline components,
the default tokenizer doesn't rely on a statistical model.

Speaker 1 (04:47):
Oh, what does it use?

Speaker 2 (04:49):
It uses really carefully crafted language specific rules, which makes
it very fast and predictable. And you can even customize it.
You can add special cases like telling it how to
handle slang or specific abbreviations.

Speaker 1 (04:59):
Let's teach it that lemme should be let me, exactly.

Speaker 2 (05:02):
That kind of thing gives you fine-grained control.
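
A hedged sketch of a tokenizer special case along those lines (splitting the slang form into two tokens whose texts must add up to the original string):

    import spacy
    from spacy.symbols import ORTH, NORM

    nlp = spacy.load("en_core_web_sm")
    nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem", NORM: "let"}, {ORTH: "me"}])
    print([t.text for t in nlp("lemme know")])   # ['lem', 'me', 'know']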

Speaker 1 (05:04):
And sentence segmentation, finding sentence boundaries that's.

Speaker 2 (05:08):
Actually often more complex than tokenization. Think about abbreviations like
Mr. or complex punctuation. spaCy has a unique approach here.
What's that? It often uses the dependency parser, which understands
sentence structure, to help figure out sentence boundaries really accurately.
It's quite a sophisticated design choice.

Speaker 1 (05:28):
Interesting. Okay, next step, lemmatization, getting the root word. Yep,

Speaker 2 (05:32):
The lemma is the base or dictionary form. So like
you said, eating, eats, ate, they all boil down
to the lemma eat.

Speaker 1 (05:39):
How useful is that in practice?

Speaker 2 (05:41):
Oh, incredibly useful. Think about a chatbot for booking flights.
A user might say I want to fly, or show
me flights or I flew yesterday.

Speaker 1 (05:49):
Right, different forms of the same core idea exactly.

Speaker 2 (05:52):
Lemmatization reduces fly, flew, flying all down to fly, so
your system only needs to look for that one base
form to understand the core intent. It simplifies things massively.
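
A minimal sketch of pulling lemmas out of those flight utterances:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    for text in ["I want to fly to Denver", "show me flights", "I flew yesterday"]:
        print([(token.text, token.lemma_) for token in nlp(text)])
    # "fly" and "flew" both come back with the lemma "fly"; "flights" comes back as "flight"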

Speaker 1 (06:02):
Makes sense, and you could use it for other things too,
like place names.

Speaker 2 (06:05):
Definitely. People sometimes say Angeltown when they mean Los Angeles.
You can actually add custom rules using something called an
attribute ruler to map Angeltown to the canonical Los
Angeles lemma during processing. Ensures consistency.

Speaker 1 (06:18):
So spaCy processes the text, applies these steps and stores
the results in those container objects you mentioned: Doc, Token, Span.

Speaker 2 (06:26):
Right, these are your main ways of accessing the processed information.
The doc object represents the whole processed text. Okay, if
you loop over a doc, like for token in doc,
you get individual token objects.

Speaker 1 (06:37):
And each token knows things about itself.

Speaker 2 (06:39):
Loads of things. A Token object holds the original word,
its lemma, its part of speech tag, its dependency relation.
It also has boolean flags like token.is_punct,
token.is_currency, token.like_url, token.like_num. Wow.

Speaker 1 (06:52):
Okay, So you can check if a token looks like
a URL or a number easily yep.

Speaker 2 (06:56):
And it knows its entity type if it's part of
one, like token.ent_type_ might be PERSON or ORG.
It even has a token.shape_ attribute that gives
you a kind of abstract representation of the word's orthography,
like is it capitalized, is it all digits, et cetera.
Really useful for rule-based matching.
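
A short sketch of those token-level attributes (the outputs in the comment are indicative):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Visit https://spacy.io, it costs $0 and Google likes it.")
    for token in doc:
        print(token.text, token.like_url, token.is_punct, token.is_currency,
              token.like_num, token.shape_, token.ent_type_)
    # e.g. the URL token has like_url True, "$" has is_currency True,
    # "Google" has shape_ "Xxxxx" and ent_type_ "ORG"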

Speaker 1 (07:13):
And Span? Where does that fit in?

Speaker 2 (07:15):
A Span is just a slice of the Doc representing
multiple tokens. Sentences are Span objects. You can get them
via doc.sents. Named entities are also Span objects,
accessible via doc.ents. So Doc, Token, Span are
how you navigate and use the processed

Speaker 1 (07:31):
Text. Got it. Let's move into some of those linguistic
features, part of speech tagging, POS tagging. That's identifying nouns, verbs, adjectives.

Speaker 2 (07:39):
Exactly, categorizing words by their grammatical role in the sentence.

Speaker 1 (07:43):
And how does spaCy figure that out? Is it
just a dictionary lookup?

Speaker 2 (07:46):
Oh no, it's much smarter than that. It looks at
the word in context. The surrounding words heavily influence the tag.
It uses sequential statistical models trained on large amounts of texts.

Speaker 1 (07:56):
So the same word could get different tags.

Speaker 2 (07:58):
Absolutely. Think of the word book: I read a book,
noun, versus I want to book a flight, verb. The
context tells the tagger which role it's playing.
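
A quick sketch of the book example (exact tags come from the statistical model, so treat the comment as typical output):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    for text in ["I read a book.", "I want to book a flight."]:
        print([(t.text, t.pos_) for t in nlp(text) if t.text == "book"])
    # typically [('book', 'NOUN')] for the first sentence and [('book', 'VERB')] for the second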

Speaker 1 (08:08):
And why is this useful beyond just grammar?

Speaker 2 (08:11):
Well, it's really important for understanding meaning, especially for word
sense disambiguation, figuring out which meaning of a word is intended.

Speaker 1 (08:18):
Can you give an example?

Speaker 2 (08:19):
Sure, take the word beat. It can mean many things,
But if the POS tagger confidently tags it as an
adjective, ADJ, as in I'm totally beat, you know, it
almost certainly means exhausted. Ah,

Speaker 1 (08:31):
I see. The tag helps narrow down the meaning.

Speaker 2 (08:33):
Precisely, even if the verb or noun tags might still
be ambiguous. Beat the drum versus follow the beat. The
adjective tag is often quite specific. It adds a layer
of understanding, even if lemmatization kind of flattens out things
like verb tense.

Speaker 1 (08:47):
Okay, that makes sense. Next up, dependency parsing. This sounds
a bit more complex. Mapping sentence relationships.

Speaker 2 (08:53):
It is complex but incredibly powerful. Dependency parsing represents the
grammatical structure of a sentence not just as a flat sequence,
but as a tree of relationships. It shows how words
depend on each.

Speaker 1 (09:04):
Other. Head and dependent? Exactly, each

Speaker 2 (09:06):
word, except usually the main verb, the root, has a
head word it modifies or relates to, and a specific
dependency label describes that relationship, like nsubj for nominal subject,
or dobj for direct object. Why go to all this trouble? Well,
what's fascinating here is that sentences aren't just sequences of tokens.
They have this deep, inherent structure, and understanding that structure

(09:27):
is absolutely crucial for many real world NLP tasks. Like
what? Think about chatbots or machine translation. You need
to know who did what to whom. Consider I forwarded
you the email versus you forwarded me the email.

Speaker 1 (09:42):
Same words, totally different meaning exactly.

Speaker 2 (09:44):
Dependency parsing helps the system figure out that I is
the subject, the one doing the forwarding, in the first sentence,
and you is the subject in the second. It disambiguates
the roles based on the grammatical structure, the nsubj, dobj, iobj relationships.
Without that, understanding user intent would be much, much harder.
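
A sketch of inspecting those relationships directly (the labels in the comment are the usual output of the English models):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I forwarded you the email.")
    for token in doc:
        print(token.text, token.dep_, "<--", token.head.text)
    # roughly: I nsubj <-- forwarded, you dative <-- forwarded,
    #          email dobj <-- forwarded, forwarded ROOT <-- forwarded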

Speaker 1 (10:02):
Right, that makes the importance clear. Okay, what about
named entity recognition, NER, spotting real world objects?

Speaker 2 (10:09):
Yep. A named entity is basically anything that can be
referred to with a proper name or a quantity. So
people's names, company names, locations, dates, monetary values, percentages.
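
A minimal sketch of reading entities off a processed doc (the labels in the comment are typical, not guaranteed):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple paid $1 billion for a startup in San Francisco in 2020.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # typically: Apple ORG, $1 billion MONEY, San Francisco GPE, 2020 DATE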

Speaker 1 (10:21):
The categories seem pretty standard: PERSON, ORG, or GPE, geopolitical entity.

Speaker 2 (10:27):
Those are common ones, yes, but the specific set of
entity types is actually quite flexible and often depends on
the data the model was trained on or the
specific task you have in mind. How so? Well, if
you're analyzing financial news, entities like MONEY and PERCENT might
be way more important and frequent than, say, WORK_OF_ART.
The model needs to be tailored or chosen based on

(10:49):
the domain.

Speaker 1 (10:50):
And how good is NER these days?

Speaker 2 (10:51):
It's gotten incredibly good. The state of the art methods
often use those transformer architectures we mentioned earlier. They're very
effective at understanding context to identify entities accurately.

Speaker 1 (11:01):
Okay, And sometimes the default tokenization or entity spans might
not be quite right. Can you fix them?

Speaker 2 (11:08):
Yes, absolutely. spaCy provides a really neat mechanism called
doc.retokenize. It lets you merge multiple tokens into one,
or split a single token into several.

Speaker 1 (11:16):
Why would you need to do that?

Speaker 2 (11:17):
Well, maybe an entity like New York City got split
into three tokens, but you want to treat it as
a single unit for analysis, you can merge them. Or
maybe a typo resulted in San Francisco being one token
and you want to split it.

Speaker 1 (11:29):
Ah okay, So for cleanup and normalization.

Speaker 2 (11:33):
Exactly, merging is usually simpler. Splitting can be a bit
more involved because spaCy then needs to figure out the
linguistic features and dependencies for the new tokens you've created.
But it's a very powerful tool for practical adjustments.
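
A hedged sketch of merging a multi-token entity with the retokenizer:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("She moved to New York City last year.")
    span = doc[3:6]                      # the tokens "New", "York", "City"
    with doc.retokenize() as retokenizer:
        retokenizer.merge(span, attrs={"LEMMA": "New York City"})
    print([t.text for t in doc])         # "New York City" is now a single token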

Speaker 1 (11:46):
Let's shift gears slightly to rule-based matching. You mentioned
regular expressions can be tricky. What's spaCy's alternative?

Speaker 2 (11:54):
spaCy offers the Matcher class, and it's designed to be, well,
a much cleaner, more readable, and definitely more maintainable alternative
for finding patterns in text compared to regex.

Speaker 1 (12:05):
Why is regex problematic?

Speaker 2 (12:06):
Regular expressions can just become incredibly dense and hard to read,
especially for complex patterns. They're also easy to get subtly wrong,
which can lead to bugs that are hard to track down,
and they operate purely on.

Speaker 1 (12:17):
Strings. And the Matcher is different how?

Speaker 2 (12:19):
The Matcher works with Token objects and their attributes. You
define patterns not as strings, but as lists of dictionaries,
where each dictionary specifies the attributes

Speaker 1 (12:29):
a token must have, like LOWER to match the word
hello regardless

Speaker 2 (12:32):
of case? Precisely. Or IS_PUNCT true to match any
punctuation mark, or LIKE_NUM true for number-like tokens.
You're matching based on linguistic features, not just character sequences.

Speaker 1 (12:44):
That sounds much more robust.

Speaker 2 (12:45):
It is, and you can use extended syntax too. You
can match based on token LENGTH, check if a
token is IN a list, or use boolean
flags like IS_DIGIT, IS_ALPHA, IS_UPPER. Great for finding, say,
emphasized words in all caps.

Speaker 1 (13:00):
Does it have regex-like operators, like optional parts?

Speaker 2 (13:03):
Yes, you can use operators like the question mark to make a
token pattern optional. Think about matching names like Barack Obama
but also Barack Hussein Obama. The middle name token can
be marked as optional, and you have operators like plus for
one or more and asterisk for zero or more for specifying occurrences,
similar to regex. There's even a really useful online demo

(13:25):
on the spaCy website where you can build and test
Matcher patterns interactively.
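
A sketch of a Matcher pattern with an optional middle-name token, as just described:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    pattern = [
        {"LOWER": "barack"},
        {"IS_ALPHA": True, "OP": "?"},   # optional middle name
        {"LOWER": "obama"},
    ]
    matcher.add("OBAMA", [pattern])
    doc = nlp("Barack Obama and Barack Hussein Obama were both mentioned.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)       # Barack Obama, Barack Hussein Obama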

Speaker 1 (13:29):
Okay, that covers matching specific patterns. What if you have
like a huge list of things to find, say thousands
of product names, right.

Speaker 2 (13:37):
Creating individual matcher patterns for thousands of specific phrases would
be well, not very efficient or practical.

Speaker 1 (13:44):
So what's the solution for that?

Speaker 2 (13:46):
spaCy provides the PhraseMatcher. It's optimized specifically for efficiently
scanning text against large lists of multi-word phrases or dictionaries.

Speaker 1 (13:53):
How does that work?

Speaker 2 (13:54):
You give it a list of Doc objects representing the
phrases you want to find, like Angela Merkel, Donald Trump,
Alexis Tsipras. It then uses a really efficient algorithm to
find all occurrences of those exact phrases in your target text,
much faster than running thousands of individual rules.

Speaker 1 (14:10):
Very useful for terminology lists or gazetteers exactly.

Speaker 2 (14:14):
And it can even match based on token attributes, not
just the exact words. For instance, you could match based
on the shape attribute, which is handy for finding structured
data like IP addresses or specific code patterns in log files,
even if the exact digits change.
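
A sketch of the PhraseMatcher with a small terminology list:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab)
    names = ["Angela Merkel", "Donald Trump", "Alexis Tsipras"]
    matcher.add("POLITICIANS", [nlp.make_doc(name) for name in names])
    # PhraseMatcher(nlp.vocab, attr="SHAPE") would match on token shape instead,
    # which is the trick mentioned above for things like IP addresses in logs
    doc = nlp("Angela Merkel met Alexis Tsipras in Brussels.")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # ['Angela Merkel', 'Alexis Tsipras']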

Speaker 3 (14:28):
So you have the Matcher for flexible patterns and the
PhraseMatcher for large lists. How do you integrate these findings
back into the main spaCy doc? That's where the SpanRuler
comes in. It's a pipeline component that lets you
use rules, very similar to Matcher patterns, to
directly add Span objects to your doc. You can configure it to add them

(14:49):
to doc.ents, so effectively adding rule-based named entities.
Or you can have it add them to a custom
span group like doc.spans['my_custom_patterns'], so

Speaker 1 (14:57):
you add it to the pipeline like other components?

Speaker 2 (14:58):
Yep, nlp.add_pipe with span_ruler. Then you
provide it with your patterns. For example, you could define
a pattern to find every instance of the word Chime
and label it as an ORG entity.

Speaker 1 (15:09):
What if the regular NER model also finds entities? Do
they clash?

Speaker 2 (15:14):
Good question. You can configure the SpanRuler. You can
tell it whether your rule-based entities should overwrite entities
found by the statistical NER model, overwrite true, or not,
overwrite false. You can also set it up so that
statistical entities don't overwrite your rule-based ones. Gives you

(15:35):
control over which source of entities takes precedence.
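
A hedged sketch of a SpanRuler that writes rule-based entities into doc.ents (the Chime/ORG rule from above; exact precedence between rule-based and statistical entities depends on the component's filter settings):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    ruler = nlp.add_pipe("span_ruler", config={"annotate_ents": True})
    ruler.add_patterns([{"label": "ORG", "pattern": "Chime"}])
    doc = nlp("I opened an account with Chime last week.")
    print([(ent.text, ent.label_) for ent in doc.ents])   # Chime should appear as ORG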

Speaker 1 (15:37):
Okay, this rule based stuff seems really practical. Can we
talk about some specific recipes like real world extraction examples?

Speaker 2 (15:45):
Absolutely, here's where it gets really interesting, showing spaCy's power.
So you can easily build patterns to extract things like IBANs,
international bank account numbers, or phone numbers, these highly structured
numeric things.

Speaker 1 (15:57):
Okay, what else?

Speaker 2 (15:58):
Think about social media. You could create patterns to find mentions
expressing opinions, like matching the sequence business name plus is/was/will
be plus maybe an adverb plus an adjective.

Speaker 1 (16:08):
Like finding Cafe X was really great?

Speaker 2 (16:10):
Exactly. That pattern structure, Cafe X was plus adverb plus adjective, could
pick up Cafe X is good, Cafe Y was very slow,
Restaurant Z will be amazing. Helps you gauge sentiment. Clever.

Speaker 1 (16:20):
Other examples.

Speaker 2 (16:21):
Hashtags are easy. You can match the hashtag symbol followed
by tokens that meet certain criteria, like IS_ASCII or IS_ALPHA,
to reliably pull out things like #deeplearning or
#WeekendFun.
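
A sketch of such a hashtag pattern (the attribute choice here is one reasonable option, not the only one):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])
    doc = nlp("Loving the new release #deeplearning #WeekendFun")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    # ['#deeplearning', '#WeekendFun']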

Speaker 1 (16:33):
And what about slightly more complex entities?

Speaker 2 (16:36):
You can even use patterns to refine entities. For example,
maybe the NER just picks up Smith as a person.
You could use a Matcher pattern to look for
a preceding title like Mr., Ms., or Dr., and then
retokenize to merge the title and the name into a single,
more complete entity span, Ms. Smith.

Speaker 1 (16:54):
Wow. Okay, that's quite granular.

Speaker 2 (16:56):
Control, it really is. These rule based tools, combined with
the linguistic features, give you a lot of power for
precise information extraction.

Speaker 1 (17:04):
Let's push deeper now into understanding meaning and intent. How
does spaCy help with semantic parsing, figuring out what a
user actually wants? A great

Speaker 2 (17:11):
Way to explore this is with data sets like eighty
zis the airline travel information system. It contains thousands of
real user requests about.

Speaker 1 (17:19):
Flights like show me flights from Boston to Denver exactly?

Speaker 2 (17:22):
Or what's the cheapest flight? What meals are served on
flight x? Analyzing these requires understanding not just the words,
but the underlying goal.

Speaker 1 (17:33):
Where do you even start with something like that?

Speaker 2 (17:35):
Well, a really crucial first step, honestly, is just looking
at the data yourself. Read through a sample of the utterances,
get a feel for the common patterns, the types of
entities involved, the grammar people use.

Speaker 1 (17:47):
What kind of things would you look for in the
ATIS data?

Speaker 2 (17:51):
You'd quickly notice people specifying origins and destinations. But it's
not enough just to spot Boston and Denver. You need
to capture the relationship from Boston to Denver. You'd see
the importance of prepositions like from, to, in. Those little
words carry a lot of semantic

Speaker 1 (18:06):
Weight, So you need more than just finding keywords.

Speaker 2 (18:09):
Definitely, you need to understand the relationships between the words.
And that's where spaCy's DependencyMatcher

Speaker 1 (18:15):
comes in. Another matcher? How's this one different?

Speaker 2 (18:17):
Well, the Matcher looks for sequences of tokens based on
their attributes. The DependencyMatcher looks for patterns based on
the syntactic dependency relationships between tokens.

Speaker 1 (18:26):
Ah, using that dependency parse tree we talked about earlier.

Speaker 2 (18:29):
Precisely, it lets you find patterns like a verb connected
to a noun with a direct object relationship, dobj.
This is key for identifying intent.

Speaker 1 (18:40):
Can you give a quick linguistic primer on objects?

Speaker 2 (18:43):
Sure. So, very basically, you have transitive verbs, which need
an object to act upon, like I bought flowers, flowers
is a direct object, and intransitive verbs, which don't, like
I slept. Okay. And sometimes there's an indirect object too,
like I gave him the book, book is direct, him
is indirect. The DependencyMatcher lets you specify these relationships
in your patterns.

Speaker 1 (19:04):
How does that help find intent in the flight examples?

Speaker 2 (19:08):
Well, you could define a pattern looking for a verb
like show or find that has a direct object, dobj,
like flights. That pattern, defined using dependency relations, would match
show me flights, find flights, I need you to show flights, etc.,
capturing the core intent regardless of the exact phrasing.
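
A hedged sketch of a DependencyMatcher pattern for that show-flights intent (anchor and relation syntax as in the DependencyMatcher API):

    import spacy
    from spacy.matcher import DependencyMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = DependencyMatcher(nlp.vocab)
    pattern = [
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": {"IN": ["show", "find"]}}},
        {"LEFT_ID": "verb", "REL_OP": ">",
         "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj", "LEMMA": "flight"}},
    ]
    matcher.add("SHOW_FLIGHTS", [pattern])
    doc = nlp("Show me flights from Boston to Denver.")
    for match_id, token_ids in matcher(doc):
        print([doc[i].text for i in token_ids])   # e.g. ['Show', 'flights']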

Speaker 1 (19:25):
That seems much more robust than just keyword spotting.

Speaker 2 (19:28):
It is, and you can build more complex patterns. What
if someone says show all flights and fares? The DependencyMatcher
can use the conjunct dependency link, conj, between flights and
fares to recognize that the user has two related intents
connected by and.

Speaker 1 (19:44):
Okay, that's powerful, But this raises a question. Once you've
used these matchers to figure out the intent, say book flight,
how do you store that information with the doc?

Speaker 2 (19:54):
Great question. You don't want that information just floating around.
spaCy has a mechanism for this: extension attributes. Yeah,
you can define your own custom attributes on Doc, Token,
or Span objects. So you could create an attribute called,
say, doc._.intent. The underscore indicates it's a custom
extension. And how

Speaker 1 (20:11):
Do you set that attribute?

Speaker 2 (20:13):
You typically create a custom spaCy pipeline component, using a
special decorator, @Language.factory, to define it. Inside
this component's call method, which processes the doc, you'd run
your Matcher or DependencyMatcher, figure out the intent, and
then set doc._.intent.

Speaker 1 (20:29):
To bookFlight, say. So you can tailor the pipeline to extract
and store exactly what you need. Exactly.

Speaker 2 (20:34):
It makes spaCy incredibly flexible and extensible for specific tasks.
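
A hedged sketch of such a component (the component name and the toy intent rule are made up for illustration; a real version would run a Matcher or DependencyMatcher inside it):

    import spacy
    from spacy.language import Language
    from spacy.tokens import Doc

    Doc.set_extension("intent", default=None, force=True)

    @Language.component("intent_detector")
    def intent_detector(doc):
        # toy rule: a show/find/book verb plus a flight object means a flight intent
        if any(t.lemma_ in ("show", "find", "book") for t in doc) and \
           any(t.lemma_ == "flight" for t in doc):
            doc._.intent = "bookFlight"
        return doc

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("intent_detector", last=True)
    print(nlp("Show me flights to Denver")._.intent)   # bookFlight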

Speaker 1 (20:38):
Now we touched on performance earlier. What about processing large
datasets, like the full ATIS corpus with thousands
of utterances? Doing them one by one sounds slow.

Speaker 2 (20:49):
It would be. Processing doc = nlp(text) for each of
the 4,978 utterances individually would take
quite a while.

Speaker 1 (20:56):
So what's the efficient way?

Speaker 2 (20:57):
The key is the nlp.pipe method, or Language.pipe
if you're using the base class.

Speaker 1 (21:03):
How does pipe help?

Speaker 2 (21:04):
It processes the texts as a stream, and crucially it
buffers them internally and processes them in batches. This allows
spaCy to leverage optimizations and parallel processing much more effectively.

Speaker 1 (21:15):
And the speed difference is noticeable.

Speaker 2 (21:17):
Oh, absolutely, dramatic. The sources mentioned going from something like
twenty seven seconds for processing the ATIS dataset one
by one down to under six seconds using nlp.pipe.
It's the standard way to process large volumes of text efficiently.
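
A sketch of the batched pattern (the speedup figures above come from the source; actual timings vary by machine):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    utterances = ["show me flights from boston to denver"] * 1000   # stand-in for the ATIS corpus

    # slow: one call per utterance
    docs_one_by_one = [nlp(text) for text in utterances]

    # fast: stream the texts and let spaCy batch them internally
    docs_batched = list(nlp.pipe(utterances, batch_size=256))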

Speaker 1 (21:29):
Okay, essential for any real world application. Let's pivot now
to the really cutting edge stuff, transformers and large language
models, LLMs. The transformer architecture kind of kicked things off, right?
The Attention Is All You Need paper.

Speaker 2 (21:43):
Yes, that twenty seventeen paper was a landmark. Transformers really
revolutionized NLP.

Speaker 1 (21:49):
What problem were they trying to solve?

Speaker 2 (21:50):
Well, previous models like LSTMs processed text sequentially. This meant
they could struggle with long range dependencies, forgetting information
from the beginning of a long text, and they weren't
easily parallelizable, which limited training speed.

Speaker 1 (22:04):
And transformers fix this how? With attention, exactly.

Speaker 2 (22:09):
The core innovation is the self-attention mechanism, often implemented
in a multi-head attention block. Instead of just looking
at the immediately preceding words, self-attention allows the model
to weigh the importance of all words in the input
sequence when calculating the representation for a single word.

Speaker 1 (22:24):
So it looks at the whole context at once.

Speaker 2 (22:27):
Sort of, yeah. It calculates a word's embedding, its representation,
by taking a weighted average of the embeddings of all
other words in the sequence, where the weights, the attention
scores, indicate relevance. This lets it understand language much more
deeply in context.

Speaker 1 (22:44):
What was the big aha moment with this?

Speaker 2 (22:47):
A major one was that transformers could generate dynamic word vectors.
Older methods like word2vec gave the same vector
for bank every time, but a transformer can understand the
context and give a different vector for bank in river bank
versus bank in investment bank.

Speaker 1 (23:02):
That's a huge leap in understanding nuance.

Speaker 2 (23:04):
It really was. And libraries like Hugging Face's Transformers library
now provide access to literally thousands of these pre-trained
transformer models.

Speaker 1 (23:12):
How does spaCy integrate with these? Can you use transformers
within a spaCy pipeline?

Speaker 2 (23:16):
Yes, absolutely. A great example is text classification. Let's say
you want to classify Amazon product reviews as positive or negative.
You can use spaCy's text categorizer component.

Speaker 1 (23:26):
Which is trainable.

Speaker 2 (23:27):
Right, it's a trainable pipeline component. You'd prepare your training
data, the reviews labeled as positive or negative, using spaCy's
Example object, and then serialize it efficiently using DocBin.

Speaker 1 (23:38):
How do you manage the training process itself?

Speaker 2 (23:42):
spaCy has a really nice configuration system. Instead of hard
coding parameters, you define everything, the pipeline components, model settings,
hyperparameters, data paths, in a single configuration file,
config.cfg.

Speaker 1 (23:56):
Why is that better?

Speaker 2 (23:57):
It makes your experiments incredibly reproducible. Yeah, there are no
hidden defaults. Everything is explicit in the config file.
You can then train your pipeline directly from the command
line using spacy train.
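
A hedged sketch of that command-line workflow (paths are placeholders; the .spacy files are DocBin data as described above):

    # generate a starter config for an English text classifier
    python -m spacy init config config.cfg --lang en --pipeline textcat

    # train with your serialized training and dev data
    python -m spacy train config.cfg --output ./output \
        --paths.train ./train.spacy --paths.dev ./dev.spacy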

Speaker 1 (24:07):
And can you include a transformer in that pipeline for
text classification?

Speaker 2 (24:12):
Yes, you can configure the pipeline to include a transformer component.
This component generates those context-aware embeddings we talked about,
which are then fed into the text categorizer. Often, adding
a transformer significantly boosts the accuracy of the classifier because
it has a richer understanding of the text's meaning and sentiment.

Speaker 1 (24:30):
Okay, so let's name some names. What about famous transformer
models like BERT and RoBERTa? What makes them special?

Speaker 2 (24:37):
Right. BERT, Bidirectional Encoder Representations from Transformers, was a
huge step. Its key innovation was being bidirectional during
pre-training.

Speaker 1 (24:46):
Meaning it looked forwards and backwards in the text simultaneously.

Speaker 2 (24:49):
Yeah. Previous models were often unidirectional or combined separate left
to right and right to left models. BERT used a
technique called masked language modeling, predicting hidden words, to learn
context from both directions at the same time. This gave
it a deeper understanding.

Speaker 1 (25:05):
And it produced those dynamic word vectors.

Speaker 2 (25:08):
Yes. It also used some special tokens, like CLS at
the beginning of sequences and SEP to separate sentences,
and it used WordPiece tokenization, breaking words into common
subword units, like playing might become play and ##ing.
This helps it handle large vocabularies and even words it
hasn't explicitly seen before.

Speaker 1 (25:25):
What about RoBERTa? How did that improve on BERT?

Speaker 2 (25:28):
RoBERTa, developed by Facebook AI, basically took the BERT architecture
and optimized the training procedure. They used things like dynamic masking,
changing the masked words during training, trained on much more
data for longer, and removed one of BERT's training objectives,
next sentence prediction, finding it didn't always help. These changes

(25:49):
generally led to better performance on downstream tasks compared to
the original BERT models.

Speaker 1 (25:53):
Okay, so transformers are powerful pre-trained models. What about the
even bigger ones, the large language models, or LLMs? How
do they fit in?

Speaker 2 (26:01):
LLMs are essentially an evolution, or maybe a scaling up,
of those pre-trained language models like BERT. We're talking
models with vastly more parameters, GPT-3's one hundred
and seventy five billion, for instance, trained on absolutely enormous
amounts of text.

Speaker 1 (26:14):
And code, and they can do well almost anything text related.

Speaker 2 (26:17):
They're incredibly versatile. Yeah, translation, summarization, question answering, code generation,
creative writing. They've shown promise in specialized fields like medicine,
law, education too.

Speaker 1 (26:27):
But they're not perfect, right, What are the downsides?

Speaker 2 (26:30):
Definitely not perfect. There are key limitations. One is the
sheer computational cost. Training and even running them requires massive resources.
They can also be slower to generate responses compared to
smaller models, and crucially, they have this tendency to hallucinate.

Speaker 1 (26:46):
Hallucinate, meaning they make stuff up?

Speaker 2 (26:49):
Essentially, Yes, they can generate responses that sound perfectly plausible
and grammatically correct, but are factually incorrect or nonsensical. They
don't inherently know things. They're predicting probable sequences.

Speaker 1 (27:00):
Of words, so you need to be careful how you
use them.

Speaker 2 (27:02):
Very careful. A lot of work goes into prompt engineering,
carefully crafting the input prompt to guide the LLM towards
the desired accurate output.

Speaker 1 (27:11):
How does spaCy help manage interactions with LLMs?

Speaker 2 (27:14):
There's a package called spacy-llm. It provides a structured way
to integrate LLMs into spaCy workflows. It treats interactions with
an LLM as defined tasks, like summarization or entity

Speaker 1 (27:25):
Extraction, and it uses prompts.

Speaker 2 (27:26):
Yes, it uses Jinja templates to define the prompts for
these tasks. You can use built-in tasks, or you
can define your own custom tasks.

Speaker 1 (27:32):
Custom tasks like what.

Speaker 2 (27:34):
For example, the sources mentioned creating a custom task to
extract specific quotes from a text and the surrounding context sentences.
You define the prompt template to ask the LLM for
this specific output, and you also define how to parse
the LLM's potentially messy response back into a structured format
that spaCy can use.
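
A hedged sketch of the simpler built-in-task route with spacy-llm (the registered task and model names vary across spacy-llm versions and hosted models need an API key, so treat these strings as assumptions to check against the docs):

    import spacy

    nlp = spacy.blank("en")
    nlp.add_pipe("llm", config={
        "task": {"@llm_tasks": "spacy.Summarization.v1"},
        "model": {"@llm_models": "spacy.GPT-3-5.v1"},
    })
    long_text = "..."            # whatever you want summarized
    doc = nlp(long_text)
    print(doc._.summary)         # the task writes its result to a custom extension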

Speaker 1 (27:54):
So spacy-llm provides a bridge and some structure for using
LLMs within a more controlled spaCy environment.

Speaker 2 (28:00):
Exactly, it helps make using LLMs more systematic and reproducible.

Speaker 1 (28:04):
Let's circle back to training your own models. We talked
about NER. When would you actually need to train a
custom NER model instead of using a pre-trained one?

Speaker 2 (28:12):
That's a common question. The rule of thumb is, if
a pre-trained spaCy model like the en_core_web pipelines performs reasonably
well on your data, maybe gets, say, seventy five percent
accuracy or higher on the entities you care about, you
might not need full custom training.

Speaker 1 (28:27):
What would you do then?

Speaker 2 (28:28):
You could potentially fine-tune the existing model, or more often,
you'd use other spaCy components like the Matcher or SpanRuler
we discussed to add rules that catch the specific cases
the pre-trained model misses or gets wrong, kind of
like augmenting it.

Speaker 1 (28:42):
But when is custom training unavoidable?

Speaker 2 (28:45):
It's usually necessary when your domain has many important entity
types that are just completely absent from the pre-trained models.
Think about highly specialized fields, specific financial instruments, unique biological
gene names, custom product codes, very niche legal terms. If
the pre-trained model doesn't even know these categories exist,
(29:06):
rules alone won't cut it. You need to teach a
model from scratch or significantly fine-tune one, and that

Speaker 1 (29:12):
Involves getting data and labeling it exactly.

Speaker 2 (29:15):
Data collection is the first step. Then comes annotation, manually
labeling examples of your text with the entities, parts of speech, dependencies,
whatever your model needs to learn.

Speaker 1 (29:24):
Are there tools for that? Annotation sounds tedious?

Speaker 2 (29:27):
It can be, but there are great tools. Prodigy, also
from the makers of spaCy, is a very modern annotation
tool that often uses active learning to be more efficient.
It suggests labels you can confirm or correct. There are
also open source options like Nertwig, which integrates with Jupyter notebooks.

Speaker 1 (29:42):
Okay, so you annotate your data using one of these tools,
then what?

Speaker 2 (29:45):
You convert that annotated data into Space's efficient binary format
doc ben. You typically split your data into training and
evaluation sets. Then you use Space's training system Spacey Trained
with a config file to train your custom and eer component,
and finally use spacey Evaluate to see how well your
trained model performs on the unseen valuation data.
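
A hedged sketch of turning annotated examples into DocBin files before training (the FASHION_BRAND label and character offsets are purely illustrative):

    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")
    annotations = [("I love my new Acne Studios jacket", [(14, 26, "FASHION_BRAND")])]

    db = DocBin()
    for text, spans in annotations:
        doc = nlp(text)
        doc.ents = [doc.char_span(start, end, label=label) for start, end, label in spans]
        db.add(doc)
    db.to_disk("./train.spacy")
    # build a dev set the same way, then run spacy train and spacy evaluate on the results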

Speaker 1 (30:07):
What's fascinating here is the possibility of combining models. Can
you use your custom-trained NER model alongside one of
spaCy's pre-trained ones?

Speaker 2 (30:18):
Yes, and that's often a very powerful approach. You get
the best of both worlds.

Speaker 1 (30:21):
How does that work technically?

Speaker 2 (30:22):
First, you'd package your custom-trained pipeline component, maybe the
one that recognizes fashion brand entities, into an installable Python
package using the spacy package command.

Speaker 1 (30:31):
Okay, so it's like distributing your own mini model exactly.

Speaker 2 (30:35):
Then you use another command, spacy assemble, with a special configuration file.
This config file tells spaCy how to build a new
pipeline by sourcing components from different places.

Speaker 1 (30:44):
So you could say, take my custom fashion brand component
and also take the GPE, location, and money recognition from
the standard en_core_web_sm model.

Speaker 2 (30:51):
Precisely. spacy assemble pulls these components together into a single,
unified pipeline that can recognize entities from both your
custom training and the general purpose pre-trained model. It's
a very neat way to create highly specialized, yet broadly
capable NLP systems.
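
A hedged sketch of that packaging and assembly flow (package names, paths, and the sourced components are placeholders):

    # package the custom-trained pipeline as an installable Python package
    python -m spacy package ./training/model-best ./packages --name fashion_brands

    # assemble.cfg (excerpt): source components from different pipelines, e.g.
    #   [components.fashion_ner]
    #   source = "en_fashion_brands"
    #   [components.ner]
    #   source = "en_core_web_sm"

    # build the combined pipeline from that config
    python -m spacy assemble assemble.cfg ./combined_pipeline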

Speaker 1 (31:10):
Very cool. Let's touch on entity linking. That's about connecting
mentions in text to actual entries in a knowledge base,
right? Disambiguating Washington.

Speaker 2 (31:18):
Exactly. Is Washington referring to George Washington the person, Washington, DC
the city, or Washington state? Entity linking aims to resolve
that ambiguity by linking the mention to a unique identifier,
often in a knowledge base like Wikidata or a custom
company database.

Speaker 1 (31:33):
How does space handle this?

Speaker 2 (31:34):
spaCy has an EntityLinker component. Its architecture basically involves
three main parts. First, you need a knowledge base, KB.

Speaker 1 (31:40):
What's in the KB?

Speaker 3 (31:41):
It stores information about the entities.

Speaker 2 (31:43):
you want to link to: their unique IDs like Wikidata QIDs, names, descriptions,
and aliases. spaCy provides tools to create this, for example
an in-memory lookup KB. You'd add entries for, say,
Taylor Swift the singer, Taylor Lautner the actor, Taylor Fritz
the tennis player, each with a unique ID and maybe
a short description.

Speaker 1 (32:04):
What else is needed besides the KB?

Speaker 2 (32:07):
Second, you need a way to generate candidate entities from
the KB for a given mention in the text. If
the text says Taylor, the system needs to know that Swift, Lautner,
and Fritz are all potential candidates. You also add aliases
with prior probabilities. Maybe Taylor Swift the alias has a
one hundred percent probability of linking to the singer's ID,
while just Taylor has an equal chance for all three
(32:28):
initially. And the third part? The third part is a
machine learning model, which is trained to look at the mention,
its context in the sentence, and the information about the
candidate entities from the KB, and then predict the most
likely correct link, or predict NIL if none of the
candidates seem right.
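
A hedged sketch of building that in-memory knowledge base (the IDs, vectors, and probabilities are illustrative; the class is InMemoryLookupKB in recent spaCy versions):

    import spacy
    from spacy.kb import InMemoryLookupKB

    nlp = spacy.load("en_core_web_sm")
    kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=64)

    # entities with unique, Wikidata-style IDs and placeholder vectors
    for qid in ("Q_SWIFT", "Q_LAUTNER", "Q_FRITZ"):
        kb.add_entity(entity=qid, freq=100, entity_vector=[0.0] * 64)

    # aliases with prior probabilities
    kb.add_alias(alias="Taylor Swift", entities=["Q_SWIFT"], probabilities=[1.0])
    kb.add_alias(alias="Taylor", entities=["Q_SWIFT", "Q_LAUTNER", "Q_FRITZ"],
                 probabilities=[0.34, 0.33, 0.33])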

Speaker 1 (32:44):
Does training this require special data?

Speaker 2 (32:46):
Yes. When you train the entity linker component, your training
data needs to clearly specify which mentions should link to
which KB IDs. You often need a custom corpus reader
to handle this specific data format during training.

Speaker 1 (33:00):
Okay, we've built all these amazing models and pipelines. How
do we actually put them into the hands of users
or other systems? Let's talk deployment, building apps and APIs. Right, moving

Speaker 2 (33:10):
from the lab to the real world. Two popular Python
frameworks work great with spaCy for this. For building interactive
web applications quickly, especially if you're not a front end expert,
Streamlit is fantastic.

Speaker 1 (33:21):
Streamlit? How does it work?

Speaker 2 (33:23):
It lets you build web apps purely in Python. You
can create widgets like text boxes, that's st.text_area, and buttons very
easily. There's even a specific package, spacy-streamlit, that provides ready
made components to visualize spaCy's analysis, like NER results, directly
in your Streamlit app, so you

Speaker 1 (33:40):
Could build a quick demo tool for your Spacey pipeline exactly.

Speaker 2 (33:43):
And a key feature is Streamlit's caching, @st.cache_resource or
@st.cache_data. This prevents your spaCy models from having
to reload every single time a user interacts with the
app, which makes it much faster and more responsive.

Speaker 1 (33:56):
What if you need something more robust, like a back
end API that other services can call.

Speaker 2 (34:01):
Then FastAPI is an excellent choice. It's a modern, high
performance Python framework specifically designed for building APIs.

Speaker 1 (34:09):
What makes FastAPI good?

Speaker 2 (34:10):
It's known for its speed. It also leverages Python type
hints heavily. You define the expected data types for your
API inputs and outputs, which FastAPI uses for automatic
data validation, catching errors early, and also for automatically generating
interactive API documentation using Swagger UI.

Speaker 1 (34:27):
So it makes development faster and more reliable.

Speaker 2 (34:29):
Yes, significantly. You use Pydantic models to define your data
structures and FastAPI handles the validation and serialization. You
could easily build an API endpoint that takes some text,
runs it through your spaCy NER pipeline, and returns the
extracted entities as structured JSON data. The autogenerated documentation makes
it super easy for others or yourself to understand and
(34:52):
test the API.
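
A hedged sketch of such an endpoint (route and model names are placeholders):

    import spacy
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    nlp = spacy.load("en_core_web_sm")

    class TextIn(BaseModel):
        text: str

    @app.post("/ner")
    def extract_entities(payload: TextIn):
        doc = nlp(payload.text)
        return {"entities": [{"text": ent.text, "label": ent.label_} for ent in doc.ents]}

    # run with: uvicorn main:app --reload   (interactive docs appear at /docs)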

Speaker 1 (34:53):
Okay, building models, deploying apps, the whole process can get complex.
How do you manage the entire end to end workflow,
especially for reproducibility and collaboration.

Speaker 2 (35:03):
That's where workflow management tools come in. spaCy has a
companion tool called Weasel. Weasel? Yeah, Weasel helps you structure
your entire NLP project. You define your workflow steps, like
downloading data, preprocessing, training, evaluating, along with any data assets
and custom commands, all within a configuration file, project.yml.
It makes your project reproducible and easier for others to

(35:24):
understand and run.
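
A hedged sketch of what such a project.yml might contain (titles, paths, URLs, and scripts are placeholders following the Weasel/spaCy projects layout):

    title: "ATIS intent pipeline"
    assets:
      - dest: "assets/atis.csv"
        url: "https://example.com/atis.csv"
    commands:
      - name: "preprocess"
        script:
          - "python scripts/preprocess.py assets/atis.csv corpus/"
      - name: "train"
        script:
          - "python -m spacy train configs/config.cfg --output training/"
    workflows:
      all:
        - preprocess
        - train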

Speaker 1 (35:25):
What about managing the data itself and the models? They
can get large and change often. For that,

Speaker 2 (35:31):
Data Version Control, DVC, is an increasingly popular tool, especially
in the ML world. It works alongside Git. How does
DVC help? It tackles several common problems. First, sharing large
datasets and models is hard with Git alone. DVC
lets you version your data and models, storing them in
remote storage like S3 or Google Cloud Storage, while

(35:52):
keeping small metafiles in Git. This makes collaboration much easier.
What else? It helps make your data processing and model
training pipelines reliable and reproducible. You define the steps and
their dependencies, and DVC can track everything. It also crucially
helps with tracking model metrics over time, so you can
see how performance changes as you modify data or code.

(36:12):
It really embraces GitOps principles for MLOps, making your ML
workflows versioned, automated and continuously reconciled.
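
A hedged sketch of the DVC commands involved (remote names and paths are placeholders):

    # initialise DVC inside an existing Git repo
    dvc init

    # version a large dataset: DVC stores the data, Git stores a small .dvc metafile
    dvc add corpus/train.spacy
    git add corpus/train.spacy.dvc .gitignore
    git commit -m "Track training corpus with DVC"

    # configure remote storage (e.g. S3) and push the data there
    dvc remote add -d storage s3://my-bucket/dvc-store
    dvc push

    # reproduce the pipeline defined in dvc.yaml and inspect metrics
    dvc repro
    dvc metrics show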

Speaker 1 (36:20):
So it brings better engineering practices to data science exactly.

Speaker 2 (36:23):
It helps manage the whole life cycle, and tools like
DVC Studio even provide features like a model registry for
managing and sharing your trained models effectively across a team.
Weasel and DVC together provide a really solid foundation for
managing serious NLP projects.

Speaker 1 (36:36):
Wow. From the core concepts and pipeline through advanced analysis
like dependency parsing and NER, rule-based matching, transformers, LLMs,
training custom models, and finally, deployment and workflow management. That
was an incredibly thorough deep dive into the spaCy ecosystem.

Speaker 2 (36:55):
It really covers a lot of ground, doesn't it.

Speaker 1 (36:56):
It absolutely does. It really reinforces that idea of spaCy
being that practical, well-optimized kitchen knife, precise, powerful
and adaptable for so many different NLP tasks.

Speaker 2 (37:08):
Yeah. And you know, if we connect this to the
bigger picture, really understanding tools like spaCy empowers you
just to build things, but also to critically evaluate how
these language AI systems work, how they understand or misunderstand
the world.

Speaker 1 (37:20):
That's a great point.

Speaker 2 (37:21):
It kind of raises an important question for anyone listening.
I think, now that you have this deeper insight, how
will you use this knowledge, maybe to refine your own analysis,
or perhaps even to build something new and impactful.

Speaker 1 (37:31):
A fantastic thought to end on. We definitely encourage you,
our listeners, to keep exploring these fascinating topics and think
about how you might apply this knowledge, whether it's in
your work, your studies, or just your own curiosity about
language and AI.