Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Unknown (00:13):
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode.
With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform. Go
(00:35):
to pythonpodcast.com/linode,
that's l-i-n-o-d-e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Nick Thapen and Brendan Maginnis about Sourcery, an advanced refactoring engine that cleans up your code as you work. So, Nick, can you start by introducing yourself? I'm Nick. So me and Brendan are both sort of technical, so we both have a background in software engineering.
(01:13):
We actually met in our first software job back in 2007 at the university.
So I had a background there and working first on
this language called IBM RPG, which is actually
green screen programming for IBM mainframes, then got into Delphi,
then got into Java doing some sort of more
enterprise web stuff.
And then I kind of left there in 2013 to do a master's in AI. I got interested in machine learning. And after that, worked at Imperial College in London
(01:41):
on quite an interesting project. It was, like,
doing
Twitter mining and then trying to analyze people's tweets to see if they were talking about symptoms of disease. It was, like, a kind of biosurveillance program,
kind of before all this started.
And that was sort of the main machine learning there, kind of telling between things like Bieber fever and actual fever. It's like trying to determine whether people are actually talking about a real illness or not.
(02:08):
And then that kind of
got canned.
There's been a lot more interest in it in the past year, certainly. But at that point, there was not so much interest.
And then went to work with Brendan on Sourcery. And Brendan, how about yourself? Thank you for having me on the show, and Nick as well. Yeah. So, as Nick said, at my first
programming position, I met him back in 2007.
(02:30):
I did pretty much the same things,
RPG,
Delphi,
Java, and
I stuck around for a little bit longer and I introduced Scala
to the
company.
But after a couple of years of programming, I got really, really into
code quality
and really obsessed with writing high quality code and in particular taking code that already exists and refactoring it to make it much better and easier to work with. So this all led into this Scala project where I convinced the management that we could rewrite a lot of the system using Scala
(03:04):
and start them from scratch
and spent about 3 years on that
and finally managed to deliver that with a team of people behind me, which was
a great success,
especially on a personal level.
And then I was coming up to my 10 year anniversary and wanted out. Didn't want to reach 10 years, so I left. Ended up joining Nick for 6 months on his Twitter analytics
(03:28):
project
before
getting
really excited about
machine learning.
And that's when I got into Python and doing
deep learning.
And after a while, I was like, okay.
Let's apply
machine learning to refactoring.
And so Sourcery was born.
And going back to you, Nick, do you remember how you first got introduced to Python? During the masters, we used it a bit for machine learning projects, and I thought it was really cool.
(03:54):
And then when it came to the Twitter analytics, it was kind of a natural language to use with, like, PyTorch, NLP stuff. So that was where I got my first kind of intro to it back in, I guess, it was 2014 ish.
And then coming to Sourcery, we decided to do everything in Python and focus on Python. And, Brendan, do you remember how you first got introduced to Python? It was in the
(04:15):
projects at Imperial College London with Nick,
but I was only there very briefly, so it was more of a dabble at that point. It was only at the end of that where
I took some time off just to do my own thing, but I started reading all of these research papers about using reinforcement learning to play Atari
(04:36):
and using supervised learning to
classify pictures
and text.
And so I spent ages just going through all of these research papers and reimplementing them in TensorFlow. And that was how I learned Python, really, just actually as a side product of
reimplementing these research papers to try and understand machine learning. And it was during that period that I realized, well actually Python's really nice, simple, easy language.
(05:04):
I really like it. All of the code's very clean. It's easy to read,
and the libraries are really well written.
And so
decided to use it for Sourcery, ultimately.
Yeah. Now I have to go back to Java. I'm like
And so before we get too much into Sourcery itself, the term refactoring has come up a few times, and there are a few different ways to think about it and some different framings for it. So for the purposes of this conversation and what you're doing at Sourcery, can you just give a bit of a 30,000 foot view of refactoring and some of the types of refactoring that we're talking about? Refactoring
(05:40):
is restructuring and changing
the code without changing what it does. It's kind of the the basics of it.
So it might be
changing how the logic works. It might be moving things around to be in different classes. It might be
renaming variables and functions,
keeping all
the functionality exactly the same while improving the quality
(06:02):
is how we kinda think about it. So, you know, if you have a test beforehand, it should still pass afterwards.
And the user will have no idea that the code has been refactored. Everything will be exactly the same. But the benefit of it is that the code becomes easier to read. And
the side effect of that is it becomes easier to maintain that code base going forwards.
(06:24):
So it's easier to add new features because
everything is simpler. It's easier to track down bugs.
And the byproduct of both those things is
we can develop code more rapidly. It's always exciting to develop code fast.
So morale improves in a team. It's more fun to develop in a nice clean code base.
(06:45):
So refactoring actually leads to this, like, happy flow state of programming where things are a bit easier. Things are in the right place. There's no duplication of code across the code base.
Refactoring has been a practice in software for quite a while now. There are a number of different tools across
different
refactoring. I'm wondering if you can give a bit of an overview about what it is that you're building at Sourcery
(07:10):
and some of the motivation for building a new system for being able to perform these types of refactorings and sort of what was lacking in the ecosystem
that Sourcery brings to the table? We like to think of Sourcery as an automated pair programmer
that sits there reading the code that you're writing.
And as you're writing it, it understands it and suggests these refactorings to you in real time. So you've got your code open in your IDE. PyCharm or VS Code are the 2 main ones that we support,
(07:40):
and you're working on a file of code.
And it's understanding
those functions in your code. And for each of them,
if it finds an improvement, it'll just
offer a little highlight
that you can hover over and see a description of that refactoring,
an English description and also a code diff.
And
(08:01):
you can apply that refactoring, and your code will be
updated in place. And then you can carry on working with your new improved code. So as well as PyCharm and VS Code, we also
have a language server protocol
implementation,
which allows
Sourcery to be used in Vim and Sublime.
(08:22):
And Sourcery can also be used as a code review tool. So we have an integration for GitHub,
where it scans pull requests
and offers suggestions to improve those pull requests.
Or we have a command line interface
which allows
Sourcery to be used as a pre commit hook
or in a standard CI pipeline
(08:45):
doing the code review before
a human needs to get involved,
making those simple
improvements to the code base automatically.
The contrast with what's out there: there are various classes of things.
So, say, in PyCharm, it helps you do refactorings. You can kind of select a bit of code and say extract method on this. If I know what I wanna do, there are these IDEs that can kind of help me with that and streamline that process.
(09:12):
If you don't know what to do, it's not sort of suggesting those to you. So Sourcery is kind of suggesting fixes to you. You don't have to go in there and be like, I wanna do some refactoring now. It's more of a natural flow. It's suggesting little improvements.
Another class of tools out there is things like linting tools
or formatting.
And formatting is just sort of improving the formatting. We do that as well.
(09:34):
Linting is often sort of picking out little errors, but not telling you how to fix them.
So either that becomes just this blizzard of things you ignore, or you kind of keep on top of it, but you still have to manually go in and fix all these things.
So the idea for Sourcery was, you know, it's automatic. Whenever
we notice something, we suggest an automatic fix for it. It's kind of speeding you up.
(09:55):
If I think back to when I was learning to program,
I didn't know the term refactoring. I didn't have the concept of
rewriting code to improve it initially.
Only after a year or 2 did I start to understand that this was a possibility.
And then I had to go out and learn how to do it. So I read
(10:16):
Kent Beck books, Martin Fowler books,
went out and manually learned how to do all these things.
And
I found it fascinating,
but it took a lot of time.
And with a tool like Sorcery,
it can not only give you these refactoring suggestions, but teach you how to improve your code as you're going as well.
(10:37):
So you don't need to know, oh, actually, I can do this, because it will tell you. You mentioned things like linting and, you know, things like extracting a method.
And as you said, there are things like linters or there are tools like PyCharm or Rope that allow you to manually say, I want to do this
when you know that that's something that you want to do. So Sourcery fits in this category of automated code review, automated assistance kind of tools. And there are a number of other projects that have come out recently
(11:07):
in the past few years to offer
similar kinds of approaches. So I'm thinking of things like Kite. I know that there are some other sort of automated review tools
and automated sort of code completion where they will
scan multiple open source repositories and use some of the code patterns that they find there to suggest structures
where they'll hook on different sorts of keywords or logical structures and say, you know, here's a snippet that you might want to use to be able to perform this action that you're trying to achieve. I'm wondering if you can give a bit of a comparison of how Sourcery fits within this broader ecosystem of services that people might use to act as
(11:47):
sort of automatic assistance while they're programming and some of the differences in terms of the goals and priorities of what you're building with Sourcery versus what are available with these other systems? There's a few code completion tools. Like you mentioned, there's Kite.
We're aware of another couple for Python, Codota and TabNine.
And all of those
(12:07):
help as you're writing the code in the first place. So you're
writing a line of code, and they'll complete that line for you. So it's like you're in Gmail, you're writing a sentence, and Gmail suggests the rest of the sentence to you. These tools do exactly the same except with source code.
The key difference between Sourcery and
(12:28):
these code completion tools is they help you write the line of code that you're working on. Sourcery understands the code that you've already written
and rewrites it to improve the structure and improve the code quality.
So as opposed to being during the code writing is after you've written the code,
and it gives you the high quality code that lets you write
(12:51):
code faster in future or detect bugs
quicker in future. It improves the readability of the code that you've already written. The code completion tools actually don't care about the quality of the code that you're writing.
They just care about what is the most likely
rest of the line of code that you're working on. Yeah. Because I think there's a study that shows you spend maybe 5% of your time actually typing when you're programming and, like, 70% of your time reading and trying to understand it.
(13:17):
So we're coming from the point of view of not trying to speed up your typing, but try and improve the overall readability of your code so you can go and make changes more quickly later. That readability is where we're really focusing. There are also classes of tools that, particularly in Python, but also in other languages
that serve to add things like constraints to what the function is trying to do. So, like, contract driven programming,
(13:41):
and then there's type inference and things like that. And I'm curious how those types of information are able to feed into the refactoring decisions that you're making with Sourcery? The 1 that we actually use within the Sourcery code base is Mypy, which
I personally really love.
I became quite fond of strong typing when I was doing Scala back in the day. I wanted to find something similar for Python, and MyPy is really, really exceptional.
(14:08):
So in terms of the code
out there, the majority of Python code doesn't actually use
type hints and type information at the moment.
So we tend to make the assumption that most of the code that we analyze
won't have that information available to us.
So we try to analyze it expecting that not to be there. However, we do do some type inference within Sourcery that helps us do certain types of
(14:35):
refactorings.
Yeah. So if your code has the type hints in it, we'll be able to suggest more things than if it doesn't, for example. Adding that typing to our code base internally has definitely caught loads of issues.
Another interesting aspect of what you're building at Sourcery is that you both admitted that Python is 1 of the languages you came to later in your career, having a much broader background in a number of other languages. And I'm curious why you chose to focus on Python as both the tool that you're using to implement Sourcery, but also the primary language that you're focusing on as far as the refactorings that you're providing, and what your motivation is for investing in Python as a language and for building your business on it. So interestingly enough, it wasn't actually Python which was the first target for Sourcery.
(15:20):
So
it was always written in Python, but the first target was actually Clojure,
which is a Lisp for the JVM.
I was quite keen on Clojure before
coming to Sourcery.
And the first version of Sourcery was actually a pure deep learning
solution. So it took in Clojure code and it spat out Clojure code at the end,
(15:43):
and the goal was to
spit out improved
source code in Clojure.
What I expected was that it would actually be quite easy to do this, and I was completely wrong. So with the source code that it spat out, it was very difficult to make sure that it was actually syntactically correct.
And then after a while, I managed to get that working, and then I realized actually
(16:05):
it might not be semantically correct. It might not actually run. It might not do anything sensible. And then the third thing with refactoring is it actually has to have exactly the same meaning as the source code that is initially passed in. And that turned out to be just completely
impossible.
So during this whole period of trying to do it with a pure machine learning approach, I was talking all the time to Nick,
(16:29):
and we came up with an entirely different approach that actually doesn't use machine learning at the moment. Because we're gonna be writing this all in Python,
we wanted to be able to refactor our own code and dogfood the whole thing. And so that was how we ended up using Python in v2 of Sourcery.
And, actually, it's turned out to be a really excellent choice
(16:53):
because
of a number of reasons. Firstly,
Python
is a nice simple language to work with. There's less constructs to deal with than in other programming languages.
So that makes it easier to do all of our analysis and refactoring.
And then
it's a very, very popular language. I think it's now the 2nd most popular,
(17:15):
and it's the fastest growing language out there.
And
another thing that's really great is a lot of
people starting out on their programming experience
choose Python.
And probably that's where we give the most value
to people who are learning how to program.
They don't know about refactoring. They don't know how to write high quality code. And so when they use a tool like Sourcery,
(17:39):
it really
levels them up more rapidly, turns them into a good programmer much more quickly. Taking a step back a little bit in terms of
your journey of starting Sourcery, what was it that convinced you that this was a viable business opportunity and that it was something that you wanted to invest your time and money and energy into pursuing?
The first genesis was, you know, the company we worked at had, like, a legacy code base stretching back. It started in the eighties and had been going for 20, 25 years by the time we got there. We were constantly dealing with bugs. The issue list was enormous.
(18:12):
It was really hard to get anything done. It was just sort of not a great quality code base. Everyone was a bit sad about it.
So that was the first genesis, you know: there has to be a better way of improving
code quality, maybe, than kind of doing it completely manually, so it doesn't take so long.
And then
it was kind of that wave of interest in deep learning and machine learning
starting in the 2010s. We got really into that. And then
(18:36):
as I was doing my masters and Brendan got more into code quality, there was sort of this spark of an idea. Maybe there's a way to turn this machine learning back on the process of writing quality code. Because while code is underlying everything, the process of writing code is still entirely manual. You know, it's still sort of this craft. So it'd be awesome if there were
really great tools to help you. On the business side,
(18:57):
it's only in the last year that we've really
started to try and think about
building a user base and commercializing
the product.
And I was honestly
incredibly naive about the whole thing.
I expected it to be very easy. I thought it'd be like, okay. We'll build a tool
(19:19):
and then we'll have a business.
Yeah. We'd build our first prototype and then, yeah, it'd be magically
Yeah. It'd be a successful business overnight. And
how wrong we were at that time.
Yeah. It's been a real learning experience in the last year
of just how little we knew about building a business.
(19:41):
Yeah. Just as with any project you undertake, at the surface level, you think, oh, this is simple. You know, you do this, this, and this, and then it's done. And then you actually start to dig into it, you know, whether it's business or programming, and then you realize, oh, there's a whole can of worms in here that I didn't realize was lurking under the surface.
Yeah. Absolutely.
So many worms. Yeah.
Yeah. I have a tendency to be just extremely optimistic about the outcome of things, which is probably the reason that Sourcery exists in the first place.
(20:09):
If I had any understanding of how difficult it would be, then I probably would never have started in the first place. So
it's a good thing, actually.
Digging a bit more into the refactorings
themselves, you mentioned a couple of things like extracting a method or renaming variables. But I'm wondering if you can just spend a bit more time talking about the types of refactoring that you're able to automate and some of the
(20:34):
structures and inputs that you use for identifying opportunities for those refactorings?
The first thing to say is that Sourcery is currently limited
to reading an individual function or method,
analyzing that, and suggesting
refactorings within that. So the scope is very narrow at the moment.
(20:54):
In the future, we're gonna scale up to understanding
classes and modules and refactoring,
but everything that we suggest at the moment is
at the method level. The code is converted to an abstract syntax tree
using kind of some stuff built on top of Python's ast module.
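As a rough illustration of that parsing step, here is what it looks like with only the standard library's built-in ast module, not Sourcery's own formatting-aware tree:

```python
import ast

source = """
def total(items):
    result = 0
    for item in items:
        result = result + item
    return result
"""

# Parse the source into an abstract syntax tree.
tree = ast.parse(source)

# Walk the tree and print the node types, e.g. FunctionDef, For, Assign.
for node in ast.walk(tree):
    print(type(node).__name__)

# ast.unparse (Python 3.9+) turns the tree back into source, but it discards
# the original formatting and comments, which is one reason a refactoring
# tool needs its own formatting-aware representation on top.
print(ast.unparse(tree))
```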
And then
we've
got a load of patterns we're looking for, basically small atomic changes we can make to the code, each of which will not
(21:19):
change the
meaning. So things like moving a statement or combining 2 if statements,
and dropping
an else that just has a pass in it, or changing a for loop into a list comprehension.
So we do a load of static analysis on that syntax tree to see
which lines of code depend on each other, and so we know kind of which changes are safe to make.
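To make a couple of those atomic changes concrete, here is a hand-written before and after in the spirit of what's being described; it is illustrative only, not actual Sourcery output:

```python
# Before: nested conditionals and an accumulating for loop.
def active_names(users, include_admins):
    names = []
    for user in users:
        if user.is_active:
            if include_admins or not user.is_admin:
                names.append(user.name)
    return names

# After: the two if statements are combined into one condition, and the
# loop is rewritten as a list comprehension. The behaviour is unchanged.
def active_names(users, include_admins):
    return [
        user.name
        for user in users
        if user.is_active and (include_admins or not user.is_admin)
    ]
```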
(21:40):
And so we chain together individual little safe changes to make a bigger change that we can suggest. This is kind of guided by a load of code
metrics, so there's lots of code metrics out there. There's
cyclomatic complexity, cognitive complexity,
number of lines,
and, sort of elements of duplication.
So we kind of have an idea of how good the code is according to those metrics when you start, and then we try to chain these little changes together to kinda get to a better place, to put out a nicer bit of code based on the structure.
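As a loose sketch only: a crude composite penalty over a couple of those metrics could be computed with the standard library like this. The branch counting and the weights are my own stand-ins, not Sourcery's actual metrics.

```python
import ast

def rough_quality_penalty(source):
    """Lower is better: a toy stand-in for metrics like length and complexity."""
    tree = ast.parse(source)
    lines = len(source.strip().splitlines())
    # Count branch points as a crude proxy for cyclomatic complexity.
    branches = sum(
        isinstance(node, (ast.If, ast.For, ast.While, ast.BoolOp, ast.Try))
        for node in ast.walk(tree)
    )
    return lines + 3 * branches

loop_version = (
    "def keep_truthy(xs):\n"
    "    out = []\n"
    "    for x in xs:\n"
    "        if x:\n"
    "            out.append(x)\n"
    "    return out\n"
)
comprehension_version = "def keep_truthy(xs):\n    return [x for x in xs if x]\n"

# The comprehension scores better, so a chain of safe changes that reaches
# it would be preferred over the original loop.
print(rough_quality_penalty(loop_version), rough_quality_penalty(comprehension_version))
```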
(22:12):
The first big example was this refactoring exercise called the Gilded Rose kata,
which is like this horrible nested
spaghetti code of conditionals.
And our very first prototype with Sourcery was trying to
sort of straighten that out by
making these small changes, because we did it manually first. We kind
(22:33):
of manually refactored it, looked at the tiniest changes we could make that would be refactorings, and tried to incorporate those.
Some of the main things we do are
untangling complex conditionals,
suggesting comprehensions,
suggesting using built in functions like any, all, enumerate,
min, max.
We find duplicate code across conditionals.
(22:56):
So if there's the same
line of code on both sides of an if and else statement, we would kind of hoist it out of
that if else statement.
We can move code closer to where it's used.
Say, if variables are declared quite far away from where they're used, then we can move them closer to the place that they're used.
(23:16):
And
most recently,
Nick's been working on method extraction. So if there's the same
few lines of code happening in a couple of places within
another method, then
we'll suggest extracting that out into a new method.
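An invented before and after in that vein, covering both hoisting a duplicated line out of a conditional and extracting repeated lines into a helper; the names and code are made up for illustration:

```python
# Before: the label assignment is repeated in both branches of the if,
# and the summary formatting is nearly duplicated too.
def list_orders(orders):
    lines = []
    for order in orders:
        if order["express"]:
            label = order["id"].upper()
            lines.append(f"{label}: {len(order['items'])} items (express)")
        else:
            label = order["id"].upper()
            lines.append(f"{label}: {len(order['items'])} items")
    return lines

# After: the duplicated assignment is hoisted out of the conditional, and
# the repeated formatting is extracted into a helper with a generated name.
def _describe(order, express):
    label = order["id"].upper()
    suffix = " (express)" if express else ""
    return f"{label}: {len(order['items'])} items{suffix}"

def list_orders(orders):
    return [_describe(order, order["express"]) for order in orders]
```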
That's really the direction that we're going. So if we can identify duplicate code within
(23:39):
a single method, we can start to do it at higher levels of abstraction. So within a class, and then we can do extract method within a class.
And
that's really where we're going. The class level refactorings are where we're
very excited.
I guess the reason that I'm really excited about the class level stuff is from a personal level,
(24:00):
when I'm writing code,
I tend to write
code that
is already well written from the point of view of Sourcery,
partly because Nick and I have implemented the whole thing. We understand how to write exactly the sort of code that Sourcery likes.
So on a personal level, we don't get that many suggestions.
But when I'm writing at a class level, there's always more to take into account. Is there duplicate code
(24:24):
that can be extracted out? Sometimes it's not easy to identify that as you're writing the code,
or
would this code be better pulled out of this method into another 1 or moved into a different place?
So
as we go higher up the scope of what Sourcery can do,
we start to,
I think, appeal to more advanced
(24:45):
Python developers.
So a lot of the people who really like Sourcery
are those
junior to intermediate developers who are still learning about code quality and how to write high quality code.
Obviously, Nick and I consider ourselves pretty good at Python by now, so we don't get as many of those suggestions. And so we find that with other
(25:07):
advanced Python developers, they don't get so many suggestions.
But if we can start suggesting refactorings at a class level, I think it'd be really powerful.
So as somebody who has been using Python for a while, I like to consider myself as advanced as well, but, you know, there's always a little bit of hubris involved there.
Yeah.
You were mentioning that you're looking largely at sort of the function level.
(25:29):
And I'm curious how you handle things like very imperative code where everything is just procedural
within a module. There aren't any functions available.
What types of refactorings or suggestions you're able to make in that kind of context? Yeah. So I guess if there's no functions, we kind of almost analyze it as if the entire thing was 1 function. You know? It's just 1 big script.
(25:51):
It's definitely more difficult to analyze. You can still sort of see, oh, there's a for loop I can change to a comprehension or whatever.
But, say, not as easy as when the code is already nicely
split into functions for you. The sort of things that we can
do around there
are around
moving code closer to where it's used.
And ultimately, as you start to do that, then the code starts to get grouped together.
(26:15):
And
another goal that we have in the medium term is
to make code self-similar.
And what I mean by that is you want the same pattern appearing
across the function that you're writing. So
if you've got 3 lines of code and then a similar 3 lines of code later on, you want them ordered in the same ordering so that as you read it,
(26:37):
it's easier to understand each time.
So as we move towards that level of refactoring,
then once you've got that, then you can start doing things like, okay. This code is similar here and here. Well, then we can actually do an extract method again.
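A small made-up example of what self-similar means here, with the second version reordering the blocks to match:

```python
# Before: the two similar blocks do the same kind of work, but in a
# different shape each time, so the shared pattern is harder to spot.
def report(users, admins):
    user_names = sorted(u["name"] for u in users)
    print(f"{len(users)} users: {user_names}")

    print(f"{len(admins)} admins: {sorted(a['name'] for a in admins)}")

# After: both blocks follow the same shape; once they are self-similar,
# an extract-method step could pull the shared pattern into a helper.
def report(users, admins):
    user_names = sorted(u["name"] for u in users)
    print(f"{len(users)} users: {user_names}")

    admin_names = sorted(a["name"] for a in admins)
    print(f"{len(admins)} admins: {admin_names}")
```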
So
the whole thesis of the way that we've designed Sourcery is
(26:58):
we're building this
library of very small refactorings
that compose together
to build much larger refactorings.
So even though a lot of the refactorings we talk about are quite small, when you combine them together, you can get really powerful things.
And really, if you think about how to do refactoring
as a developer, that's the way you do it. You don't just go, like, wholesale rewrite it from scratch.
(27:23):
You make a small change, check the tests still pass, make another small change, and keep going in that routine.
So we're taking that intuition and trying to do it programmatically.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email?
(27:48):
Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all the other details that eat up your time?
Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census
(28:10):
today to get a free 14 day trial and make your life a lot easier.
Going back to the types of refactorings
that I'd be interested in, just looking at some of the code that I'm working on recently, I'm thinking about, okay, my linter is complaining that I've got too many imports from this module, so I want to just rename all of the uses of this class to be namespaced. So
(28:35):
rather than saying,
you know, bar, I wanna say foo.bar and change my from foo import bar to just import foo, and be able to do that all throughout the entire module rather than having to go through and do that manually, because it's tedious and time consuming.
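Concretely, with a standard library module standing in for the foo and bar of the question, the change being asked about looks like this:

```python
# Before: several names imported directly from the module.
from os.path import join, dirname, basename

def sibling(path, name):
    return join(dirname(path), basename(name))

# After: a single namespaced import, with every use site updated to match.
import os.path

def sibling(path, name):
    return os.path.join(os.path.dirname(path), os.path.basename(name))
```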
Or
I've decided that I don't like the naming that I've used for a particular module space, so I want to rename the directory and automatically have all of the imports
(29:00):
changed to match across my project? I'm curious sort of what the challenges or complexities are in terms of your ability to be able to
create those types of changes within a code base? So tackling
the second 1 first,
we would find it very hard to suggest something like that because
(29:20):
naming is not something that we really want to
take control of. It's unlikely that we're going to look at your code base and say,
actually, I think this module should be named something else.
That's something that's very hard for
a machine to do better than a human, I think.
In terms of the first 1, if you've got too many
(29:40):
imports from the same module,
we could easily do something like that. Yeah. We haven't up to now because we've been focused on the functions, but, yeah, that is part of the information in the file that we're reading, and we could suggest things like that. Yeah. The challenge with all of these things is
suggesting things that people want
to make to their code base
most of the time.
(30:03):
So we don't want to be suggesting things that people are like, okay.
I can see that, but, actually, I disagree with it. And they're rejecting the suggestion, like, 50% of the time. That's no good. We want it to be that 95%
of the time, whatever we suggest to you, you look at it and go, that's definitely an improvement to my code base. I'm gonna accept it.
(30:24):
So in the example of the imports, it's like,
how many imports is the point where you go, right,
10 is too many. I'm gonna start suggesting that
I namespace all of these.
Okay. Is it 7? Is it 8?
So, obviously, we just have to make a call on many of these things. And there's constants throughout our code base where it's like, okay, if there's over
(30:48):
3 lines of code, then we'll
suggest to extract a method. But, yeah, it's very much our intuition of what good quality code is,
trying to match that up with what we believe other people think is good quality code as well,
or most of the time believe it's good quality code. There are a couple of directions I wanna go here. So first, another type of refactoring
(31:09):
that I'd be interested in understanding sort of the scope of the problem and how you might try to address it is
as you expand to analyze
more of the project. So beyond just the individual file level, start to look at the module level or at the project level,
being able to suggest, okay, I see that you're using this pattern in 3 different files. How about we extract that out to a helper module that we put in the lib directory?
(31:36):
Or you're passing a lot of data around into functions, maybe you should convert this into a class and make this an attribute on the class object.
Curious sort of
what is involved in being able to perform those types of suggestions or if that's something that you feel is in scope for what you're trying to build with Sourcery. So the first one's definitely in scope because we have
a partial solution to it.
(31:58):
So if you're in PyCharm or VS Code, you can right click a
directory
and say scan for duplicates.
And we'll look for
any code like that, 3 or more lines of code that is very similar. And then you'll be able to see, okay, in file a, b, and c, I've got this duplicate code.
(32:19):
But at that point, we won't suggest
actually
the extract
function or extract method.
And the challenge there is where should that go? Where does that code go?
And
I think that's an extremely difficult problem.
Even as a professional programmer, I spend a lot of time thinking where is the best place to put this code? Should it go in a utility class?
(32:42):
Should it be part of this
class that I'm working on? Do I need an inheritance tree?
There's various different possible options that you could choose.
And so at the moment, we're just choosing to
identify that problem for you. You can scan for that duplicate code and then you take the next step of
deciding where it goes.
(33:03):
So 1 additional thing that we do need to do is
suggest to the user, actually, this is what the extracted
function looks like. You choose where it goes.
This is what you need to replace
the code that you're extracting with: this function call with these parameters.
Yeah. It's interesting.
We hadn't really thought about, like, integrating with other linters, but there are
(33:27):
problems around that. So
for example, there's this linting error that comes up if you have a return and then an else, because you could drop the else.
And so we don't particularly agree with that in our code base, so we've disabled that linting error. But then it turned out that 1 of our refactorings can lead to a situation in which this linting error will be triggered once you accept the refactoring.
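For reference, the rule being described sounds like pylint's no-else-return check; a minimal example of the shape it flags (my own example, not taken from the Sourcery code base):

```python
def sign(x):
    if x < 0:
        return "negative"
    else:  # flagged: the else is unnecessary because the if branch returns
        return "non-negative"

# The version the linter wants instead: drop the else and dedent.
def sign(x):
    if x < 0:
        return "negative"
    return "non-negative"
```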
(33:48):
And we really don't wanna be introducing
linting errors into someone's code base. So we wanna be
basically taking what's broadly accepted as linting errors and making sure we don't introduce them, making sure we fix them,
and also making sure we have configurations so that, you know, it can play nicely with what your linting setup is.
But we hadn't sort of considered, and I'm not sure we would, taking in the linting errors as kind of an input to Sourcery. I think this is a good opportunity to discuss a bit more about how Sourcery itself is implemented
(34:17):
and just some of the
design considerations
that you take in and maybe
who you
use as sort of your
archetypal user to understand
how to design the interface, how to design the interactions
of the overall product?
Yeah. So the way we implemented it was
inside out, really.
(34:39):
We didn't care about the user interface at all in the beginning. It was all about can we take some source code and refactor it. So that was the goal all along.
And
as Nick mentioned earlier, we
take
the code and turn it into an abstract syntax tree.
And
we keep a whole load of extra information on there around code formatting.
(35:04):
Because when we output the code at the end, we want it to have the same formatting
that
came in at the beginning.
And then it goes through a whole range of analysis.
So it's things like
which
variables
does this statement of code depend on from earlier in the function.
(35:24):
And so things like what's the control flow, what's the next statement for each existing statement,
How many substatements are within each statement?
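A toy version of that variable analysis, written against the standard ast module rather than Sourcery's own tree, might look something like this:

```python
import ast

def reads_and_writes(stmt):
    """Return the variable names a statement reads and the names it writes."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            elif isinstance(node.ctx, (ast.Store, ast.Del)):
                writes.add(node.id)
    # Note: a fuller analysis would also treat augmented assignments like
    # `total += item` as reads of the target, which this toy glosses over.
    return reads, writes

source = """
total = 0
for item in items:
    total += item
print(total)
"""

module = ast.parse(source)
for stmt in module.body:
    reads, writes = reads_and_writes(stmt)
    print(ast.dump(stmt)[:40], "reads:", reads, "writes:", writes)
```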
And I guess a lot of the analysis that compilers do is stuff that we've kind of brought in. They sort of build a dependency graph as Brendan was saying. So once the analysis
has happened,
we then
do the refactoring matching. So we look for specific patterns in the code.
(35:49):
And
if we find those patterns, we then use the analysis to see if we can perform the refactoring.
So you may have a pattern in the code,
but there's a call that changes some state elsewhere in the program. That means that you can't do that refactoring pattern.
So that's where the analysis comes in useful,
(36:10):
ruling out code changes that wouldn't be refactorings.
So
so we build up a list of possible refactorings
at each step,
and
these go into a search engine
that searches for the best possible overall refactoring.
So
each step has a list of refactorings that can be done. The search engine chooses between them based on the code quality that will be output.
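In outline, that search could look something like this greedy sketch, reusing a lower-is-better penalty like the one sketched earlier; the function names, the apply method, and the hooks are invented for illustration and are not Sourcery's real API:

```python
def improve(code, propose_refactorings, quality_penalty):
    """Greedily apply whichever candidate refactoring lowers the penalty most."""
    best = code
    best_penalty = quality_penalty(code)
    while True:
        # Each pass re-proposes refactorings against the latest version.
        candidates = [r.apply(best) for r in propose_refactorings(best)]
        if not candidates:
            return best
        challenger = min(candidates, key=quality_penalty)
        challenger_penalty = quality_penalty(challenger)
        if challenger_penalty >= best_penalty:
            return best  # nothing improves the metrics any further
        best, best_penalty = challenger, challenger_penalty
```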
(36:35):
Then the new code
is run through the whole process again. Step by step, it looks for more refactorings that can improve the code until it comes up with a final output, which gives the best possible code quality score. And then that is suggested as the final refactoring to the user. I guess in terms of user interface, we kind of went with what the IDEs naturally do. They have this concept of diagnostics.
(37:00):
If you take LSP, it has this concept of diagnostics where it finds a problem and underlines it, and there's this concept of quick fixes.
So that's how we've implemented it. So
if we find a suggestion, it kind of underlines it, and then we provide a quick fix. And it's the same in VS Code and PyCharm. It's very similar, kind of fits in with
what other tools already kinda give you. So it's kind of as expected to a developer. And digging a bit more into the editor integration,
(37:26):
I'm interested in understanding
sort of how much variation there is in terms of capabilities
with the editors that you're targeting
and how you manage sort of
maintaining
sort of feature parity across the different environments and experiences
and just some of the
challenges or complexities that are involved in trying to work within these editors.
(37:49):
And you mentioned that you have a language server protocol implementation and just what your views are on sort of the overall benefits of of that development to the overall development ecosystem.
So we started with PyCharm, which has its own sort of API, and you write the plugin in Java or Kotlin. So that plugin's kind of a bit thicker. It has to have more in it. And then we wrote an
(38:10):
LSP implementation for VS Code.
And, ideally, we'd like to have everything in LSP because it's sort of very nice. It means we've actually been able to roll out a bit of support: we've got Vim and Sublime at the moment, just because of plugging in the LSP.
So at the moment, we kind of think of LSP as our main
ideal implementation, and then we kind of have to write the extra bit in PyCharm to maintain parity, as you were saying. And, usually, that's fairly possible. Like, it's quite a rich
(38:37):
API
in PyCharm, but it's a bit of a hassle.
But, unfortunately, there isn't really a
good way of using the LSP implementation with PyCharm at the moment, it seems. There are
a couple of things that live outside
the LSP
spec as well. So
in both PyCharm and VS Code, if you want to scan a whole folder for refactorings or scan for these duplicate code blocks, then that's outside of LSP. You can't do it. So we had to manually put that in on top of the LSP implementation.
(39:11):
Another thing that's probably gonna be quite painful for us going forward is that
LSP doesn't allow you to ask the user for input.
So
say
a good example of this is with our extract methods.
When we extract the method, we just have to generate a name for it. The best thing to do would be, at the point where the user says, yes, I want to do this, to immediately pop up a box asking what you want to call it, and then apply that. There's no way of doing that in the LSP protocol at the moment.
(39:43):
Obviously, the great thing about LSP is we've got free Sublime and Vim implementations.
Sadly, not all of the
LSP clients
are
as good as
the VS Code one.
So for instance, Jupyter Notebook has an LSP implementation,
but it doesn't support code actions, which is what we use to actually
(40:07):
apply the changes to your code.
So
I got the Jupyter Notebook stuff almost working.
You could see the suggested code change, you could hover over it,
But then when you actually went to apply it, there was no way of doing it. You couldn't change the code. So we don't actually have a Jupyter Notebook implementation
because of that,
(40:28):
which is a shame. I've been excited to see the development of LSP because as an Emacs user, anytime things like Sourcery or Kite or any of these other new services come about, they say, hey, we support your editor. Except that one, because it's very old and weird and hard to figure out.
But Emacs actually has 2 different really good LSP clients. So I haven't done it yet, but I'm interested in trying out Sourcery and trying to integrate it with Emacs using the LSP protocol. So I was excited when I saw that as an option. Yeah. It's probably really easy. So the reason that we've got Vim is because I use Vim. So I obviously had to find an LSP implementation,
(41:05):
and it was literally
putting 10 lines of configuration
into a file, and then Sourcery started working.
So you can probably just go and look at the Vim implementation
in our docs,
copy most of that configuration, and put it in some emacs file somewhere, and you'll be good to go.
Going back a bit to the core implementation of Sourcery, I'm interested too in
(41:28):
some of the
specific patterns
or libraries or tools that you found useful in building it and some of the ways that the overall system has
changed or evolved in terms of the scope and goals that you have for the project?
Yeah. So I guess to start with the last one, it has been this evolution. I guess the first iteration was this ML solution for Clojure,
(41:49):
and version 2 was
Python working on Python.
We actually initially had it as like a cloud service,
so it'd send your code to
our cloud, and we'd analyze it and send it back.
We got very, very quick feedback on it that, no, we can't do that. We can't have our codes leave. We can't possibly use it for lots of people, which is absolutely fair.
So we had to
(42:11):
completely change it and run everything locally on the user's machine, which is how we currently do it. So the binary runs on your machine, and the plugin just talks to that.
That was the biggest change
in kind of scope as we went on. Obviously, adding new editors and adding a GitHub integration was also part of it.
Don't know if you wanna talk about patterns and libraries, Brendan. In terms of how we initially implemented it, we used
(42:33):
so we talked about using an AST. So we used the astroid library
for
the core
of Sourcery, and
that's
the library that's used
inside Pylint.
So it's very well used,
and it's got loads of functionality.
And we made use of it for probably a year, year and a half. I guess we did use rope as well, didn't we? We did make use of rope as well. Both of them turned out to be limited in different ways. So the key problem with
(43:05):
astroid was that it didn't record any of the formatting of the code.
So
when we output code, then it would look different to the way it came in. And, additionally, it didn't record any of the comments in the code either. So, for all refactorings, all of the comments in your code would just suddenly disappear, which is obviously not a good user interface at all.
(43:28):
So
I ended up writing a new AST library from scratch with those specific requirements in mind.
And to be honest, it was only from using astroid that we could get Sourcery off the ground. And then you learn, oh, there are all of these additional requirements that we have.
In terms of rope,
we found that
(43:48):
some of the refactorings
that we wanted to use it for were not actually correct.
So it would misunderstand
certain things around keyword arguments and things like this that
led to it breaking the code base. And really, that's one of our number one requirements for Sourcery,
even more than people accepting the refactoring: that it's actually a refactoring.
(44:11):
The suggestion that we make does not change the functionality of your code
because we want people to be able to trust it. When you first start using Sourcery, you're obviously gonna review every code change and understand what the suggestion is. But ultimately, you wanna be able to say, I know that Sourcery is correct. I trust it. I'm going to accept this refactoring.
(44:31):
So we had to throw out rope. And
so even though we didn't go with the not invented here mentality,
almost everything now is invented here.
In terms of the actual
design
of the refactoring engine, as we call it,
it's very much
(44:52):
built on
the intuition that I gained from working in Scala.
So it's very, very functional.
There's almost no state throughout the whole thing.
It's very, very easy to test.
Everything is nicely separated. We have a whole module that's full of analysis. We have a whole module that's full of refactorings. We have a whole module that's full
(45:14):
of understanding the code,
turning it into an AST, and printing it out again.
Everything's very, very nice and clean.
And in terms
of the refactorings themselves,
we use dependency injection
to get the analysis that you want. So for refactoring, we just have
this is a a refactoring
(45:35):
proposer.
And then below that, you just list out
as class level
variables. I want these 3 pieces of analysis,
and then they'll be available for you when you run
that refactoring,
which makes it very, very easy
to write new refactorings. You don't have to think about
imperatively
(45:56):
getting this piece of analysis from somewhere else, or what's the ordering in which I do these analyses.
We build a tree of all the analysis that needs to be run, and we run it in the optimal order and then inject it into
the refactoring proposers, which then output the proposals that are fed into the search engine.
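Heavily simplified, and with invented class names rather than Sourcery's real API, the shape of that pattern is roughly:

```python
class ControlFlow: ...      # placeholder analyses: each knows how to
class VariableUsage: ...    # compute itself from the syntax tree

class RefactoringProposer:
    # Subclasses declare the analyses they need as a class-level variable.
    required_analyses = ()

    def propose(self, node, analyses):
        raise NotImplementedError

class MergeNestedIfs(RefactoringProposer):
    # The framework reads this, builds the tree of analyses to run, runs
    # them once in a sensible order, and injects the results into propose().
    required_analyses = (ControlFlow, VariableUsage)

    def propose(self, node, analyses):
        control_flow = analyses[ControlFlow]        # injected results, keyed
        variable_usage = analyses[VariableUsage]    # by the declared types
        # ...inspect `node` using the injected analyses and return proposals...
        return []
```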
That was probably extremely complicated and needs a diagram. It's difficult to explain these things in words. Yeah. The joys of doing a podcast about software.
(46:25):
Yeah. Exactly.
And so
I'm definitely interested
in sort of what the opportunity is for being able to integrate some of the rest of the ecosystem of sort of code quality and developer tools in ways that can
feed into Sourcery to inform or augment the types of refactorings
(46:47):
that you're offering or sort of ways to offer customizations
to people to be able to
specify their preferred code styles. So, you know, I prefer methods named using these patterns, or I prefer to
refactor
using this particular
structure of conditionals or whatever that might be, and sort of how that factors into your
(47:10):
opinionated approach of these are the refactorings that everybody's going to want versus, you know, these are refactorings that are useful, but not everybody wants to have and just being able to set up those toggles of, I want these types of suggestions. I don't want these types of suggestions and maybe use these linting or type inference tools to be able to trigger or inform the types of refactorings that I want to perform.
(47:33):
At the moment, we have gone with this kind of opinionated approach. We have talked about could we integrate black, could we integrate other formatting and, like,
get a preferred formatting?
A lot of it is, like, the interfaces for these things are not super clear, like, it's
like, we briefly looked into, could we integrate MyPy
to use the power of their
type inference, and we thought that seemed too difficult at the time.
(47:54):
So maybe that is something we will explore in future.
We have, at the moment, got the ability to switch refactorings on and off in a kind of configuration file. So, yeah, I think that's probably something we need to make more powerful in future. Perhaps when we come to look at things like more of a team offering, and so maybe a team can set up how they want their code base to be structured in a way. Yeah. I think that, as you mentioned, the interfaces
(48:17):
to all these different tools and the ways that they approach their analysis
is still too
bespoke across them. So there are some sort of common patterns, but there's no 1 interface to say, I want you to be able to feed me all of the information that you're feeding to the end user.
And, you know, maybe that's an opportunity for something like the LSP to act as the focal point that everybody can coalesce on. This is the interface. This is how we all interoperate
(48:41):
together. Yeah. If there was something like that, that'd be awesome. Yeah. Definitely. Seems like as an industry, there's a lot of
movement happening where interfaces and APIs and patterns are coalescing in sort of different subcommunities, so things like NumPy being standardized as an API for other libraries to use,
you know, data streaming
ecosystem. They're standardizing around sort of the open streaming format,
(49:05):
you know, in the data lineage ecosystem.
There's the open lineage spec that's happening. And so a lot of these different tools and communities are saying, okay. You know, we have all these different ways of doing things, but we don't have any way for being able to interoperate without a lot of manual coding and intervention. So I'd be interested to see
what happens in the code quality and developer tooling space to see if there's an opportunity to coalesce on. This is the way that we all interoperate.
(49:31):
You can all do your own things, but this is the API so that we can all be able to build on top of each other rather than sort of working in our own little corners. That would be really fantastic if you could just
go to Mypy,
pass in a node in the abstract syntax tree and say, tell me the type information
about this.
Or another tool could
(49:51):
hook into Sourcery and ask,
tell me what other nodes in the abstract syntax tree this depends on. It would be really fantastic.
Yeah. I can see how that would be excellent. It's just
At the moment, it's sort of like parsing the output file and, like, yeah.
There there's a long road to that point, but perhaps we can get there as an industry.
(50:12):
Yeah. I mean, part of the issue is everyone's got their own abstract syntax tree. And for this particular example,
in a lot of cases, it's very much tightly coupled all the way from
the user
interface all down to the deep workings. And so actually splitting that up is a choice that you have to make.
In our case, it's relatively loosely coupled,
(50:35):
but only because
I had to rip out
astroid and put in our own one.
When you've gotta do that, you've gotta simplify the interfaces all across the board. Yeah. It definitely seems like there's an opportunity for a similar movement to what happened with the Sans-IO approach to network protocols, maybe being able to
(50:56):
sort of abstract out the AST parsing and then adding a way for being able to hook in and add your own additional metadata that you can pass around and just sort of removing
the sort of AST protocol implementation from the business logic that you actually care about.
Yeah. That would be brilliant.
(51:16):
Well, in terms of your overall experience of building sorcery
and creating this technology and turning it into a business
and trying to gain users and grow adoption, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? I think from an implementation point of view, the code formatting has
(51:36):
been remarkably hard. It's not something that I ever expected to be an issue at all. I expected
just to be able
to pass in code, refactor it, and pass it out. But as soon as you start to think about it, people are not gonna want that. If I suggest a refactoring to your code base that you then have to go and
manually reformat to match the rest of the code around it, that's just not gonna work.
(52:00):
And it's
very, very hard to solve that problem. There's
some other libraries out there that have come out recently that try and do a similar sort of thing,
and there's always new bugs that come up with it. Like, a very, very simple example of it is
do I indent with 2 spaces or do I indent with 4 spaces? That's, like, the absolute basics,
(52:23):
but you have to get that right. You have to implement the same.
I guess one of our big sort of pivots we had to do was, we thought, oh, we'll do it in the cloud. That'll be easy. That makes sense.
Then people were, like, really concerned about code privacy.
So we had
to pivot there, do everything locally on users' machines. And I guess an interesting thing is, like, the work we've had to do to make sure
(52:46):
that our refactorings are actual refactorings.
It's sort of more than half of our tests are, like, don't do the refactoring in this case, in this case, in this case, in all these edge cases.
And we also actually
run Sourcery over a load of open source libraries and check their tests all pass afterwards. That's kind of our
backstop, which works pretty well. Our analysis section of our code base has sort of had to grow and grow and grow and grow and grow. 1 thing that we haven't really talked about yet is sort of what's involved in somebody actually
(53:13):
getting started with Sourcery,
you know, setting it up, and then also
what is involved in using it in a team format and some of the benefits that it might provide in that context?
Yeah. So you can just go to our website
and you can sign up.
Then you go to your editor of choice, and the 2 easiest ones are PyCharm and VS Code, and you go into their marketplace,
(53:35):
search for Sourcery
and install
the plug in. And then you go back to the website,
copy
a token, and paste it into
the Sourcery configuration screen in your plugin,
and then it's going. It'll start analyzing your code immediately and suggesting refactorings to you. And actually, just before that step, it shows a little
(53:57):
demonstration
install file that teaches you how Sourcery actually works.
In a team environment
and also if you're just using GitHub on your own, you can install the GitHub bot and
that's even easier. You just
find Sourcery in the GitHub marketplace,
you click install,
(54:18):
and then you choose which of your repos you want to add it to. And you just add it to whichever Python repos that you like. And then you do a pull request, and Sourcery will analyze it for you
and give you the feedback on that. Also, if you just star our repo, it'll find your most popular Python repo and refactor the entirety of it,
if you just wanna give it a super, super quick test in about 10 seconds. The other option for Teams is if you don't have GitHub
(54:44):
and you have another CI tool, use our command line interface, which you can install through PyPI.
You just run 2 commands in the command line interface,
which are all fully documented,
and it will scan the files that have changed in the commit
and output the suggestions
and also can fail the build if you choose to do it like that. And you can do that as a pre commit hook as well. There's a pre commit hook using the pre commit
(55:11):
library
that you can install in 3 lines of configuration.
And for people who are interested
in being able to
simplify some of their refactorings
and streamline some of their development,
what are the cases where Sourcery is the wrong choice?
It won't fix bugs.
That's the first thing, you know. We try to leave the code doing exactly the same as it did when it started,
(55:35):
so that includes any bugs. And I guess we've
gone really hard on readability, so
all our refactorings will try and improve the readability.
Some of them may improve performance, but it's just kind of a side effect. So if you're really trying to optimize for performance,
Sourcery is probably the wrong choice. And, certainly, we only cover
the IDEs that we mentioned.
(55:55):
I guess Jupyter Notebooks is the big one we want to cover in the future but don't cover now, for the data scientists.
And, also, I guess,
this kind of big thing of if you're moving code between classes
and restructuring your code base on that higher level, we're definitely not there yet. At the moment, it's
this sort of lower level structure, readability of the code. And as you continue to build the product and grow the business,
(56:21):
what are some of the plans that you have for the near to medium term future of the product? The big 1 is kind of analysis at the class level, which I think we've alluded to, as in
being able
to scan your whole class, your whole module, find
duplicated
code, and automatically extract that for you. And because a lot of what you're doing when you're manually refactoring is this process of sort of splitting methods up, finding and removing duplication.
(56:46):
That's the next really exciting thing we're planning on doing, as well as incorporating some machine learning for doing things like automatic function naming or automatic variable naming when
extracting variables.
I think more longer term,
I'm like a complete
convert to the functional
programming style.
So, ultimately,
(57:06):
I would like Sourcery
to push you towards writing functional style code.
And,
really, that's actually a design decision up front. But it's possible
to write a class so that the majority of the code within a class is functional and just
the interface code, the code in the
external interface contains
(57:28):
side effecting code.
So ultimately, I want to push
Sourcery to be able to do that. And 1 of the
benefits of that for us is that functional code is much more easy to analyze. It's much more easy to refactor
and change the ordering of
statements and things like that. So if we can help people write functional codes, then we can actually do more refactorings as well. So it's a virtuous cycle.
(57:56):
It's actually gonna take us a while to get to that point, but that's, like, the goal. So we wanna bring functional programming to Python.
For anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the movie The Croods: A New Age. Watched that with my family a couple weeks ago, and it was absolutely hilarious. It was probably the most I've laughed at a movie in recent memory. So definitely recommend watching that regardless of whether you have kids. So with that, I'll pass it to you, Nick. Do you have any picks this week? Me and my wife have been kind of binging on a lot of series recently, and 1 we really, really enjoyed was The Magicians.
(58:33):
So I think it is on Amazon Prime. It's kind of
a combination of Buffy the Vampire Slayer and Narnia. It was really funny and absorbing, and it kept us busy for quite some time. And, Brendan, do you have any picks this week? Yeah. So I've just finished reading David Copperfield by Charles Dickens,
and
I've never
read any books more than a 100 years old before.
(58:57):
The reason I got into it is because I'm a massive fan of Armando Iannucci
who has made some hilarious programs. Particularly if you're British, you'll know of the Alan Partridge shows, The Day Today, Brass Eye.
If you're American, you probably have heard of Veep.
And he made a version of David Copperfield a couple of years ago. And I decided, before I was gonna watch it, I was gonna read the book.
(59:22):
And it's an absolutely amazing book. So well written,
so emotionally involving that I actually had to stop reading at times because I couldn't cope with it. The really great thing about it is
it talks about London from a 150 years ago. So I know what London is like now, but you get to imagine London in the past and
(59:42):
just the way things have changed.
But the way things are still the same is really amazing.
I really loved it. Well, thank you both for taking the time today to join me and share the work that you're doing at Sourcery. It's definitely a very interesting project and 1 that I'll have to check out for myself,
see what kinds of fix ups I can offer for my code. So I appreciate all the time and energy you're putting into helping make developers more productive and code cleaner and easier to maintain. So thank you for that, and I hope you enjoy the rest of your day. Thanks very much. It's been great to be on. Yeah. Thank you for having us. Been brilliant.
(01:00:16):
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com
to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email host at podcastinit.com
(01:00:37):
with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.