
April 17, 2024 51 mins
Today's guest Kir Shatrov is a production engineer at Shopify, based in London, UK. Today, he and the panel are discussing capacity planning. Kir believes that capacity planning becomes a priority when your company starts losing money and your customers are suffering. When someone does get to the point of scaling their app, it's important to look at the limitations of the hosting service. It is also important to remember that scaling is not a job that is ever completed.
Kir talks about his experience and time at Shopify and what types of changes have happened in the four years he's been with the company. Kir explains that when Shopify was founded about 12 years ago, its founder Tobi Lütke was one of the first contributors to Rails, back when Rails was just a zip file shared over email. This matters because Shopify's monolith has never been rewritten, so they put a lot of care into keeping it working well. He talks about some of the techniques Shopify uses to avoid splitting into microservices while scaling the organization, and how the multiple instances of the database are structured and managed from an ops point of view. He talks about which aspects of Shopify are open source and the approach to the architecture of the background job system.
The panel discusses what should be done if you want to scale your project and move beyond background jobs. Kir talks about what criteria his company uses to determine what moves to a background job and when it is too much to background something. The show finishes with Kir sharing some of his favorite tips, tricks, and approaches he's used at Shopify.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:05):
Hey everybody, and welcome to another episode of Ruby Rogues. This week on our panel we have Nate Hopkins. Hello everybody. Andrew Mason. Hello. I'm Charles Max Wood from Devchat.tv. And this week we have a special guest, and that's Kir Shatrov here. Do you want to say hi? Let us know who you are. Hi, my name is Kir. I'm a production engineer at Shopify, where I work on the scalability of the

(00:28):
platform, and I'm based in London, UK. Nice. Now, Shopify doesn't have to deal with any scalability, right? I mean, they only run like half the shopping carts on the web and things like that, right? Oh yeah. So I'm curious as we dive into this. You know, you gave us a couple of articles. One was on the state of background jobs. The other one was on, like, capacity planning for web apps. I

(00:51):
kind of want to start with this and dive mostly into: when should I start caring about this? Right? Because if I have a small app, it matters a lot less for a while, and then eventually I'll get enough users or enough people using the capacity to actually go, all right, now I really need to start thinking about this. So, yeah, where do you find

(01:12):
that the cutoff point is for this kind of thing? Definitely. There is a lot of talk and technologies, and it's natural for engineers to be super interested in it. But the price of over-engineering things and choosing solutions that are maybe too complex for the stage where your project is right now, that price

(01:36):
can be too high, and often the most resourceful thing you can do is just deploy it on Heroku and let it run, and it will cost a few hundred dollars on your Heroku bill. For me, I think the cutoff point is around the time when you start losing control of maybe your hosting costs, or

(01:59):
you notice that whatever scalability problems you have start hurting your customers and you start losing money, either as a result of your customers being unhappy or as a result of the thing costing a lot more to run than what the company can afford to run the business in a reliable way. Yeah, that makes sense.

(02:22):
It's interesting too that you've kind of tied it to those two practical breakpoints, right? A lot of people, they try and tie it to, well, I have a certain number of users, or I have a certain size of an app, or I have, you know, a certain amount of server capacity, or stuff like that. And it's interesting to me that a lot of this, you know, you've tied it back to, oh, it's impacting the customers, or, oh, you know, it's impacting my bottom line,

(02:46):
and then it's like, oh, okay, how do I deal with this? I also think it's interesting that you mentioned that, you know, it's easy to do if you just hand it off to Heroku and let them handle it. And I know that I haven't heard it as much from Nate, but I've definitely heard it from Eric over at CodeFund that that's kind of his approach. He doesn't want to deal with DevOps. He just wants to push it to the cloud and then, you know, let them handle

(03:07):
it, and he's willing to pay for Heroku to do it. Yeah, that's our philosophy right now. But I mean, we're also short-staffed, right? Yeah. So we've got two, well, really, we're just one and a half developers on the project. Other than that, we've got plenty of contributors that help us fix bugs and things like that, but there's only two of us that

(03:27):
are full-time, you know, looking at code, and Eric's really only about half-time looking at code, if that. Right. So we don't have the time or the bandwidth to really delve deep into, you know, the ops story. That makes a lot of sense. So I'm curious, Nate, at what point would you guys consider moving off of Heroku? I mean,

(03:49):
would it be a cost thing, or would it be something else? You know, we've found product-market fit and we are trying to scale it now. We're trying to scale on the sales side. So as soon as we have enough customers and enough consistent revenue flowing in to allow us to kind of back off and look at our operations story, that's probably the time.

(04:11):
So I would say we're probably maybe six months away from, you know, having the luxury of being able to look at that. Yeah, that makes sense. So, Kir, as somebody gets to that point, and I think this might be a relevant conversation then for Nate, but, you know, when they get to that point and they're thinking, okay, we're going to scale this, maybe they move it off of Heroku and onto, you know, a

(04:32):
Kubernetes cluster, or they move it onto, you know, a virtual private server, something like DigitalOcean or something. What things should they be looking at then to scale their stuff up? For any hosted services, like, for instance, it's common to use a hosted database as a service, I think it's important to look at whatever limitations that service has, because any hosted service

(04:57):
would have some kind of those. I remember reading a blog post where an app had a very specific requirement for some Postgres extension that they'd been using, and they switched, I think, three providers that gave them Postgres as a service, and they'd been unhappy with each, and they obviously spent a lot of effort, and finally

(05:19):
they got to run Postgres on their own, because having that very extension was a huge requirement for them when choosing a provider. Like that, it's important to understand any limitations. And from another angle, I think there are so many scalability-related problems that you can run into that usually

(05:44):
you start looking at the one that's most critical right now. Like, I've been part of projects where they've run into scalability issues with the database layer, with MySQL or with Postgres, and as they fixed it and iterated on it and their database could accept a lot more load, they came to another bottleneck,

(06:09):
and that bottleneck is different every time, depending on the business, depending on the patterns of usage that's coming from your customers. So it's fixing one thing at a time, one by one, and sometimes that's a never-ending story, especially if the company grows large and there is a team that works

(06:30):
just on scalability, which is currently the case for my team at Shopify. Yeah, that's a terrific point, in terms of, really, this is not a job that ever completes, right? It's something that you're always having to stay on top of, especially if the company is enjoying any level of success. One cool thing about CodeFund is, even though we're on Heroku, we're able to leverage some of the more advanced Postgres features like

(06:54):
table partitioning and things like that, which has enabled us to continue to scale on that platform. We're hosted on one hundred and sixty plus sites right now, and so we're seeing between two and a half million and three million requests a day pipe through the server. Now, we are paying a premium for Heroku, but we're still, I think we're under eight hundred a month on our

(07:17):
production setup, and we're probably a little over-provisioned in anticipation of spikes and things like that, and so we don't quite have the fine-tuned control that we would like to have. Your point on Postgres, if you want to customize that and install your own plugins and things like that into the database layer, that would be something that would be fantastic, because since we are

(07:40):
using table partitioning, I know there are some plugins that just are not broadly available on the Heroku platform that would be kind of a luxury for us to use; we've kind of had to work our way around some of those things. I'm curious about your experience and time with Shopify. How long have you been with the team, and what types of changes have happened since you've been at the

(08:03):
company? I've been at Shopify for almost four years, and I've always been part of the production engineering department, which deals with the infrastructure and is less exposed to the product. And that department grew so much while

(08:24):
I've been here, from maybe thirty people to now more than one hundred, and all of those people are working on the infrastructure and reliability, with the motto that our job is to keep the site up. There's another aspect of scaling here, going from forty to one hundred people. Like, how has the team scaled? What's the dynamic been like? Yeah,

(08:48):
it's interesting to follow the dynamics of team scaling, and I imagine in every organization it's a different story. It affected so many things. Like, for instance, at the time when I joined, Shopify is based in Canada, and most of the infrastructure engineers were in just one office. Now people who work

(09:13):
on the infrastructure are based in three offices, and there are also a lot of remote people like me. And then, as you grow, you end up investing in some of the things that you would never invest in before, and you have teams who work just on one part of the development environment, for instance, or just on

(09:35):
background jobs infrastructure, something that I wouldn't have imagined three years ago. So what is the technical portfolio for Shopify, and how has it changed since you joined? Obviously that's a great question. There's been a lot of new tools and techniques and stuff that have come out, you know, just over the last four years, and so I'm curious what the evolution of tooling

(09:58):
has looked like. Yeah, that's a great point of discussion. So first there is some context I wanted to give to our listeners, which is that when Shopify was founded about twelve years ago by Tobi Lütke, Tobi was one of the first contributors to Rails, and he knew David, DHH, and

(10:22):
they exchanged some emails, and around the time when he started the company, when he started Shopify on Rails, Rails was just a zip file that they exchanged over email. It wasn't even some specific version published on a gem server, because I'm not even sure there were any gem servers at that point.

(10:43):
So from that day when he started on Rails, that app still exists. It was never rewritten. It's a monolith that has been around for more than a decade. We tend to put a lot of love into it to make sure that the developer experience stays great. Unlike what often happens, where a monolith

(11:05):
is just too slow and too hard to work with, and developers get so much friction that they decide to go splitting it up or calling the monolith legacy, that never happened for us. I've got to interject and just ask a question on your monolith. In terms of, like, I know Shopify is a very large company, how many developers have their hands in the monolithic code base? My rough guess would

(11:31):
be from one hundred to two hundred people, given that R&D in total is a lot more, because there would always be people working on other parts of the stack, also mobile developers and so on, as you can imagine. So, back to your point about how the stack has changed in terms of tools that

(11:52):
are familiar to listeners of our podcast, it's still pretty much a classic Rails app with all the things that come with it. In terms of the infrastructure, I think the biggest shift that I have observed at the company was the move from physical data centers to the cloud, to Kubernetes. And that's another whole interesting

(12:15):
story, because we were able to move to Kubernetes in the cloud one shop at a time. Given that we have millions of them, we wanted to make this process as continuous and fine-controlled as possible, so we just took one shop, moved it to the cloud, and progressed, and we were able to control that.

(12:35):
It's fascinating to me that you have upwards of two hundred developers working on a monolithic Rails code base. Like, some conventional wisdom that I've heard in other circles and certainly bumped into in my career has been that if you're going to scale your organization, you apply Conway's Law and break out into microservices. And the

(12:56):
conventional wisdom seems to be that that's really the only way to do it, and you guys are a terrific counterpoint to that. What are some techniques you've used to facilitate it? I think one of the biggest has been adopting domain-driven design and splitting that monolith into, I would not call them

(13:18):
namespaces, but kind of components, at least that's what we call them. There is nothing very secret or special about it. It's basically just a way to structure your app directory so that each team, each component, gets their part. Therefore, it helps a lot to establish ownership, because, for instance,

(13:39):
as soon as you see an exception in production in whatever exception tracking service you use, you see that the exception is coming from components/support/app/models/something. You immediately know that's the support component, and you have all the metadata to find people who can help with that, even an on-call escalation or a Slack channel where you can

(14:05):
chat and point it out. And we started leveraging that to automate some other things. Like, for instance, if an exception happened within that component, we'll send a notification to their Slack channel, not to some generic Slack channel with tons of exceptions from all over the company.
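A minimal sketch of the kind of ownership routing described here, assuming a hypothetical COMPONENT_OWNERS map and SlackNotifier client; Shopify's real tooling is internal:

```ruby
# Hypothetical sketch: route a production exception to the owning team's
# Slack channel based on the component path in its backtrace.
COMPONENT_OWNERS = {
  "support"  => "#team-support-alerts",
  "checkout" => "#team-checkout-alerts",
}.freeze

def component_for(exception)
  frame = exception.backtrace&.find { |line| line.include?("components/") }
  return nil unless frame
  frame[%r{components/([^/]+)/}, 1] # "components/support/app/models/x.rb:42" => "support"
end

def notify_owners(exception)
  channel = COMPONENT_OWNERS.fetch(component_for(exception), "#generic-exceptions")
  SlackNotifier.post(channel: channel, text: exception.message) # hypothetical client
end
```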

(14:26):
Establishing that ownership is, I would say, the main technique. Okay, so kind of a domain-driven design, and then you give a team, like, full-stack responsibility, or at least all the areas of the stack that that particular domain piece may touch, right? So that could slice all the way through the front end, all the way down into the model layer. Yeah. It's not

(14:48):
as strict as you can imagine, and there would always be cases of reaching out directly from one Active Record model to another across components, across different domains, and that's not great. We try to build tools to discourage people from doing that and for them to know what the right patterns are. Like, for

(15:09):
us, it's mostly entry points that are typed and declared and documented. So, this is kind of shifting gears a little bit. I'm really curious about the database infrastructure, because I know at Shopify, essentially you've sharded the database, or maybe not sharded, but there are multiple instances of the database, right,

(15:31):
that all back this? How is that structured, and how do you manage that from an ops perspective? Oh yeah, that's also a great discussion point. So, also to give some context to the listeners: all the well-known Rails companies like Shopify, GitHub, Basecamp, name a

(15:52):
few, that were founded around ten years ago, at that time MySQL was the best-known database that everyone knew how to run and operate. People were the most familiar with it, and some others like Postgres were maybe not as good or as established at that point. So that's one huge reason why this subset

(16:18):
of companies, including us, are all based on MySQL. And yeah, I think it was around 2014, 2015 when we realized we could no longer fit everything into one DB. We figured out we had to find a way to scale horizontally, and for a multi-tenant SaaS application

(16:41):
there is a great way to do that. Since your tenants are always isolated, you don't have any joins between multiple tenants, so you can put tenants on different shards, on different partitions, and manage those independently, which also reduces the blast radius. If you have a hundred shards and

(17:06):
one is down for whatever reason, only one percent of your customers are getting some negative experience, and you go and fix that as soon as possible. But it's not all of the platform. So we invested a lot into sharding. In terms of application logic, it's mostly done at the Rails layer.

(17:29):
We have a Rails team at Shopify that helps to steer that in the best direction possible, at least from the Rails point of view. And from the ops point of view, it's just a lot of shards that can be located even in different regions, which also allows us to isolate

(17:53):
some tenants geographically. So let me just recap to see if I've got the picture in my mind correct. So we've got a Rails monolith that's kind of structured with these domain areas of responsibility, that's how you structure your teams, and the way you've scaled this, at least up to this point in the

(18:15):
conversation, is, you're just dealing with, like, mountains and mountains of data, so you've sharded your multi-tenancy across different database nodes. For the developer, it can just look like a typical Rails application? Correct. And something to add is that our goal is to make all that sharding complexity hidden

(18:37):
away from developers who write product features. For them, it may feel like there is just a database with a lot of tables that represent the business model, but underneath there is some smart shard selection that happens at the beginning of the request, for instance, that selects the right database.
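As a rough illustration of per-request shard selection (not Shopify's internal implementation, which is custom and predates this API), here is a sketch using the horizontal sharding support in Rails 6.1+, with a hypothetical Shop lookup table:

```ruby
# Assumes shards named :shard_one and :shard_two exist in database.yml.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to shards: {
    shard_one: { writing: :shard_one },
    shard_two: { writing: :shard_two },
  }
end

# Rack middleware: pick the tenant's shard at the top of the request so that
# application code below just sees "the database".
class ShardSelector
  def initialize(app)
    @app = app
  end

  def call(env)
    host  = Rack::Request.new(env).host
    shop  = Shop.lookup(host)                      # hypothetical global lookup table
    shard = shop ? shop.shard_name.to_sym : :shard_one
    ActiveRecord::Base.connected_to(shard: shard, role: :writing) do
      @app.call(env)
    end
  end
end
```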

(19:00):
And I mentioned this just for MySQL, for the relational database, but we've realized that it makes no sense to have sharded MySQL but just one global Redis, because regardless of how well you shard, that one global Redis or that one global Memcached would still be a single point of failure. And as you can

(19:23):
imagine, we learned that lesson by experiencing those single points of failure. So our philosophy is that every resource should be sharded, so there would be a smaller instance of Shopify that has its own MySQL, its own Redis, its own Memcached, and that helps with this isolation. So with each web

(19:48):
server, essentially, or maybe partition of web servers that scale horizontally, all of those would not necessarily have a local copy of Memcached and Redis, but maybe just a shared one for that cluster of web servers? One thing I should note is that stuff like web servers is still all shared capacity, and it's only resources that are isolated. So any web server can talk

(20:15):
to any partition or any, like, smaller instance of Shopify. It's mostly a matter of selecting the right path depending on who the customer is. So now I'm a little curious in terms of, because there's obviously a pretty significant coordination piece

(20:37):
there. You know, when the request initially comes in, and then you assign the correct Memcached server, the correct Redis server, and the correct MySQL server, how much of that infrastructure did you guys have to build at Shopify, and how much are you leaning on the database providers for those things? Honestly, I think it's mostly all built in-house. And to give a bit

(21:00):
of context about that, it's mainly a component called Sorting Hat, I like the name. The Sorting Hat is using a global lookup table to find which domain, which shop, is on which partition. It gets the

(21:21):
partition, and then it goes to the location of that partition, it can be US West, US Central, US East, somewhere else, and then it just hits the right database located in that region, and the routing goes all through Rails and mostly through HTTP headers. And what I find very interesting is that we

(21:45):
were able to build all of that on top of Nginx, since Nginx allows you to write scriptable Lua modules where you can implement any kind of logic. In those Lua modules in Nginx, you can query your database to look up where that tenant lives, and then you just proxy that through

(22:07):
Nginx and you manipulate the headers and just make this work. So it's quite a lot of infrastructure that we had to write. But at the same time, as I talk to different companies, it's all custom-tailored and there is rarely the same stack, the same use case. So

(22:27):
that would be a bit hard, maybe a bit hard, to share and abstract. So yeah, how much of that infrastructure tooling is open source? Is that all secret-sauce internal stuff, or have you open sourced some of it? We try to open source quite a few things. There are also a lot

(22:48):
of conference talks, which we'll link in the show notes, that give a way better overview of the architecture than I just explained. The routing layer itself, I wouldn't say it's open sourced, but there is lots of information out there for someone who would want to build with and use the same techniques. So that's probably a good

(23:11):
segue into, you know, additional scaling aspects. So you've addressed a lot of the persistence layer, pretty much the entire persistence layer's horizontal scalability, but you still have response times to deal with, right? And so one way to make response times fast is through background jobs. And I know you've got quite

(23:33):
a bit of expertise there. What is the approach and architecture of Shopify's background job system? Well, and just to pile on here real quick, it seems like when people start talking about scaling Ruby or Rails apps or Sinatra apps or whatever, this is one of the first things people reach for, right? Because any long-running task, they just, you know, shunt it off to a background

(23:56):
job, and, you know, report errors back to the user if they have to, and it shortens the response time, because then it's, hey, go do this job, instead of, I'm going to grind through the work of doing this job. Yeah, and before you jump in with an answer too, I mean, one thing to bear in mind is, like, some of this stuff is just baked

(24:17):
into Rails with Active Job. You don't even have to set up Redis or anything like that to support it, right? It'll run it on a background thread out of the box. So what is the path for a developer, kind of Chuck's lead-in question: you start on a small project that's maybe a little hobby thing, and it starts to get some traction, and then maybe it turns into a business. What does the evolution of that

(24:40):
background job handling look like over time? Oh yeah. And to note that, like, myself, for some of my side projects, I run background jobs exactly in a background thread in those Puma processes, just because it makes no sense to pay extra for, for instance, Sidekiq dynos on Heroku for those

(25:02):
pet projects.
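For reference, the "just run it on a background thread" starting point Kir mentions can be as simple as Active Job's built-in :async adapter, which executes jobs on a thread pool inside the web process; the mailer below is hypothetical:

```ruby
# config/application.rb: in-process jobs, no Redis or extra dynos needed.
# Trade-off: queued jobs are lost if the process restarts.
config.active_job.queue_adapter = :async

# app/jobs/welcome_email_job.rb
class WelcomeEmailJob < ApplicationJob
  queue_as :default

  def perform(user_id)
    user = User.find(user_id)
    UserMailer.welcome(user).deliver_now # hypothetical mailer
  end
end

# Enqueued from a controller; runs on a background thread in the same process.
WelcomeEmailJob.perform_later(user.id)
```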
And exactly as you pointed out, it makes sense to start with something as brutal as a background thread. And then, I'm really happy that the Ruby community has a project like Sidekiq, and Mike Perham, who is behind that project,

(25:22):
who has pushed the community to adopt some best practices around background jobs, and also offers ninety-nine percent of what the community needs as an open source project. And for the remaining one percent, when you get to that point, you can buy a Pro or an Enterprise edition, and I'm pretty sure that

(25:47):
when anyone is at that point, that's actually quite affordable software to buy as a company. And just like most of the community who is using Sidekiq, Shopify is very similar in terms of setup. Because we've been around for such a long time, we started with Resque, if

(26:12):
anyone remembers, that was a pre-Sidekiq-era library to basically achieve the same thing. So we still run Resque, we run Redis. We got to rewrite most of Resque's internals because we're multi-tenant and we want to share some of

(26:34):
the capacity and reuse that between tenants, which we can dive into later if you like. I guess the first question from you and from some of the listeners could be why we're not on Sidekiq, and the answer, I would say, is mostly the legacy part, and also how much we know the stack and

(26:56):
how much we've customized it for ourselves at this point. But we're also starting some smaller apps at the company, some smaller Rails apps. In fact, in addition to the monolith, we probably have a couple hundred other smaller Rails services for something very specific, or maybe something just employee-facing, and all of those would use the recommended set of libraries, which includes Sidekiq. Yeah,

(27:22):
that makes sense. I'm also working on a software as a service. I'm sponsoring one of the bigger conferences that serves that niche, podcasting, in August, and so I anticipate that things are going to grow. And yeah, I have a lot of things that I am pushing into background jobs right now just because, you know, I want to get the response times down. But one thing that I'm wondering about, and I'm kind of tempted to go with Heroku,

(27:45):
but part of me, I don't know, I have this mental block about paying for something that I could probably figure out the scaling on myself, or at least do, you know, a couple of minor things to help with the performance and scaling that way. So what should I be looking at next? It seems like you all have kind of gone toward the cloud, and I'm wondering if that's the right answer. Or, you know, beyond background jobs,

(28:07):
what's the next step? A step to reduce response time? No, it's more a step to just get it to scale, you know, be able to handle more traffic without having the site slow down. Right. There would always be some kind of bottleneck, which, depending on if you have a good set of tools, should be possible

(28:32):
to find. And for us, that bottleneck has changed over time. And I would guess there is no single answer, because maybe there is something in a web server, in a controller, still spending quite a lot of time, which slows down the response time. Or maybe it's the database that's the bottleneck,

(28:55):
or maybe it's Redis, or maybe Rails reaches out to some external service that is not located too close to it, which increases latency and also impacts response time. Yeah, that makes sense. I'm curious what criteria you use to determine what should move into a background job. Obviously you may hit

(29:18):
some latency on a particular request and see something that is kind of low-hanging fruit to move to a background job. But just because you moved it to a background job doesn't mean you've actually addressed the root of the problem. You've just moved it out of the request flow, right? Oh yeah. And a very common pattern that I see people do with jobs is, for

(29:42):
instance, you want to iterate over all the users in your app and do something for each of them, maybe remind them that they need to add a credit card, or maybe something expired, or you want to send them an engagement email. When you start, you have just one hundred users, so that job

(30:02):
works off pretty quickly, under a minute maybe, depending on what kind of work that is. You grow to thousands, hundreds of thousands, to millions, and a job to iterate over a million users and to check the balance of each of them, that job starts taking days or weeks. And how do you

(30:26):
solve that? And it's just so easy to introduce that problem: you just do User.find_each in a job and it works, but only until the point when it stops working. So the way we solved it, and that's actually all open source, we'll also link it in the show notes, we've solved that by making every job interruptible and preserving a cursor, so that a job can progress

(30:55):
for a bit and then maybe get restarted for some reason. Basically, this allows us to iterate over really long collections, do some work with them, and never lose the work that has been done.
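The interruptible, cursor-based approach Kir describes is what Shopify's open-source job-iteration gem provides; a minimal sketch along the lines of its documented API, with a hypothetical scope and mailer:

```ruby
# An interruptible job built on https://github.com/Shopify/job-iteration.
# The gem checkpoints a cursor so the job can be paused (deploy, worker
# restart) and resume where it left off instead of starting over.
class CreditCardReminderJob < ActiveJob::Base
  include JobIteration::Iteration

  # Builds an enumerator over the collection; `cursor` is the last checkpoint.
  def build_enumerator(cursor:)
    enumerator_builder.active_record_on_records(
      User.where(credit_card: nil), # hypothetical scope
      cursor: cursor
    )
  end

  # Called once per record; progress is persisted between iterations.
  def each_iteration(user)
    ReminderMailer.add_credit_card(user).deliver_now # hypothetical mailer
  end
end
```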

(31:15):
Nice. Yeah, that's really cool. I'm gonna check out the Shopify job-iteration gem; that sounds really, really interesting. One of the things that we've done at CodeFund is, when we're iterating across a collection, of course, we'll do, like, a find_in_batches, and then we will just enqueue the smaller work, so when the large job fails, it's essentially idempotent and can just be rerun again without impacting things that may have been half processed or halfway chunked through.
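A sketch of the batching pattern Nate describes: a parent job fans out small child jobs, and the children are written to be idempotent so a failed run can be retried without redoing work. The column and mailer names below are hypothetical:

```ruby
class EnqueueReminderBatchesJob < ApplicationJob
  def perform
    # Walk the collection in batches and enqueue a small job per batch.
    User.where(credit_card: nil).in_batches(of: 100) do |batch|
      ReminderBatchJob.perform_later(batch.ids)
    end
  end
end

class ReminderBatchJob < ApplicationJob
  def perform(user_ids)
    User.where(id: user_ids).find_each do |user|
      next if user.reminded_at.present? # hypothetical column keeps retries idempotent
      ReminderMailer.add_credit_card(user).deliver_now
      user.update!(reminded_at: Time.current)
    end
  end
end
```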

(31:37):
Yeah, that's the approach that I take as well. An interesting side effect of that could be that, again, this leads to a fan-out of a million jobs, because if you have ten million users and each batch is a size of ten, for instance, like, the numbers don't really matter, but the point is that with a fan-out of so many jobs, we need to remember that

(32:01):
something like Redis is always limited by memory, and there have been so many times, across, I would say, every organization where I've worked, that people would push Redis into an out-of-memory state, and unfortunately there is no, I would

(32:21):
love to have a great solution for that. But every time we want to do something like you describe, iterate in batches, enqueue something, we have to be mindful about what's behind that. And yeah, I've been there as well. You start dropping jobs because there's no memory left. That certainly happens

(32:44):
at times when jobs might be failing, right? Sidekiq gives you some pretty nice failsafe capability where it will reattempt those jobs. But if you've got a bug and not a lot of memory dedicated to your Redis instance, then of course you may start losing work that may be critical to the business. Yeah, I could see that. I haven't run into that myself,

(33:06):
but I could definitely see that happening. This is a great reminder about all sorts of databases that exist out there, and maybe it will push someone to learn about them, because in the end, Redis is an in-memory database which is bound by whatever RAM you give it. It can be a gigabyte, can be four, can be sixteen, and that backlog of jobs would

(33:29):
not be backed by something that can be written to storage that's bigger than RAM, which would be disk if it's, for instance, MySQL or Postgres. So something that we would really like to find is a store that could persist those things on disk with performance not too far off and features not

(33:54):
too far from Redis. Redis does have the capability to write to disk, right, to flush itself out to disk? Yeah. So that only helps to have a snapshot in case the computer where Redis is running reboots, but it still doesn't allow you to store more than the RAM that you

(34:16):
have. Yeah. I mean, that's probably a great argument to move to the cloud, right? Because on Heroku it's just one button click, when you see the memory filling up, to scale out or scale up your Redis storage capacity. Yeah. And a lot of cloud databases or cloud instances have methods for compensating for that, and so they will just migrate you to a bigger instance

(34:40):
or, you know, basically allocate new memory without you even having to click anything, as long as that workflow is validated and people are certain that it will work. That's a great feature of cloud providers. One of the thoughts that I've had architecturally, which would be kind of neat on the

(35:00):
background processing side, is that some jobs obviously are a bit more ephemeral and less critical, and they could be handled in a little bit more localized fashion. So it'd be neat to build a routing layer that was intelligent, where you maybe had three tiers of Redis, or just of background job storage, right? One could be:

(35:22):
this is very ephemeral and not very important, so we'll just let it be handled in-process on a separate thread, so we'll route that job over there. Or it may be that the job is still kind of ephemeral, but a little bit more important, so we could have a dedicated Redis instance sitting on the web server that has just a small set of dedicated

(35:43):
memory for that, and you could push those jobs there to handle some of that back pressure. And then for the really important stuff, you could hand those off to, like, your appliance tier of Redis storage that gives you the full capacity across the entire application. Oh yeah, we haven't done something like this for jobs, though I think it could help a lot. But in

(36:04):
general, like, in terms of building systems, I think this is a common case of defining priorities for different workloads, which also allows you to shed some of the load. So, for instance, it doesn't have to be jobs, it could be something as basic as web requests. And

(36:24):
there are requests that go to something that's very important to the business, maybe checkouts, which has the highest priority. Then you have something medium priority, which may be just browsing the admin, and then you have something low priority, like

(36:45):
requesting robots.txt or fetching the sitemap or hitting an API. And by declaring priorities for those requests, when you're under load you can shed some of those that you don't need. And this idea comes mostly from the largest companies in the industry. Like, Google has lots of papers and books on how

(37:07):
they do it, and as you can imagine, every request to a Google service would have some kind of priority, and they actually shed those. Like, I'm pretty sure that mail is a higher priority than watching videos on YouTube.
It's really interesting, and one of the neat things about Sidekiq is, in terms of, if

(37:28):
you couch that in terms of background jobs, Sidekiq provides some of that facility just out of the box, even for a simple deploy, right? Because you can prioritize. You can say this is in the critical queue, this is in the default queue, this is in the low-priority queue, and Sidekiq will drain the higher-priority queues first. Now, you could start there and then eventually expand out and say, well, I'm going to give

(37:51):
a set of dedicated worker virtual machines or dynos or whatever to process a particular queue. And I may even give it a dedicated Redis instance or tier for that particular queue. But you can start with just a simple Redis instance and the default Sidekiq configuration.
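For anyone starting out, the out-of-the-box Sidekiq facility mentioned here looks roughly like this: queues listed in config/sidekiq.yml are drained in the order (or by the weights) you give them, and each worker declares its queue. The service object is hypothetical:

```ruby
# config/sidekiq.yml (shown as a comment for context):
#   :queues:
#     - critical
#     - default
#     - low
class ChargeCustomerJob
  include Sidekiq::Worker
  sidekiq_options queue: "critical", retry: 5

  def perform(order_id)
    # Business-critical work goes on the critical queue so it drains first.
    PaymentProcessor.charge(order_id) # hypothetical service object
  end
end
```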

(38:14):
And just for anyone listening, because we're talking here about scaling large systems, right, like Shopify: if you're starting a Rails app, for me the go-to is pretty much, I always reach for Redis, Postgres, and Sidekiq, along with everything else that comes out of the box with Rails. That's pretty much what I always go for when I start a new project. Yeah, I mean, I've used Resque

(38:35):
in the past for a lot of projects, and then, yeah, I've moved to Sidekiq for my newer stuff. But yeah, when is it too much to background something? Right? So I wrote a gem that allows me to essentially background any method that hangs off of an Active Record model, which is really convenient. But what I've found is it makes it almost too convenient, where if something seems to be slowing down a request, you can

(38:59):
just call .defer and the method name and it would stick it into the background, which is great, but it got abused, and we ended up with far too much running in the background, hitting those problems you're talking about, like exhausting memory and stuff. So how do you determine what should be backgrounded? That's a good question, and frankly, as someone

(39:20):
who's spent quite a lot of time on that part of the stack, I'm not sure there is a single answer. And I think it's somewhat related to, for instance, if it's Active Record and SQL queries, how heavy those queries are. If your request timeout is thirty seconds and just one SQL query

(39:45):
that's for some reason heavy, some kind of aggregation, takes ten, and you maybe need to run a few of those, there is no way to fit that into a web request. And of course it might not make a lot of sense to do premature optimization, and it can be fine to just start with everything in a web request, in a controller, and then you find out

(40:07):
that's the thing where your app spends most of its time in a web request, and you just move that to a job. Because for simple apps, maybe it will never be a job and it will scale fine for the next few years. Yeah, I wonder if a good approach would be to first, and this probably very much depends on if you've got paying

(40:29):
customers that are being impacted, right? So, if paying customers are being impacted and you've got just some inefficiency in a query or some aspect of a web request, maybe you background that, but you also put it in some type of planning process where you revisit that job and try to actually optimize the real root of the problem. Yeah. I tend to use

(40:52):
background jobs when I have a performance issue in the request pipeline, like we've talked about before, and then if there's a problem with running it in a background job, you know, it's timing out, or something's breaking, or something like that, then I revisit it from there. I don't know if there's a silver bullet. I think a lot of times it's context-specific, and you just have to go, okay, I'm moving this out of the request pipeline.

(41:15):
Okay, now it's having a problem here, so now I've got to address the issue there. And, yeah, you know, eventually it kind of bubbles itself up to the top of your tech debt queue and you address it. So, one thing before we wrap up: do you have some favorite tips or tricks or approaches that you use at Shopify or have done at other

(41:36):
employers that make this easier, or, you know, something that you did that you're proud of? Yes. For someone who is curious about performance and fixing those kinds of bottlenecks, my best advice would be to study the whole set and variety of tools that you can use.

(42:00):
These tools can be as high-level and web-based and simple as New Relic and some of the similar services that you can connect to your app and see insights, to more system-level tools like, for instance, strace. The amount

(42:22):
of times that strace saved me or some of my colleagues in the middle of a service disruption, it's so hard to count those. And my advice is not necessarily about strace, but about knowing the wide variety of tools that you can use. Some of those tools are very Linux-specific and

(42:45):
system-level. Some of them are Ruby-level, like rbspy, a great tool by Julia Evans, or rbtrace, and then there are some services that offer those kinds of things. So if you know that range of tools and you know which one is the best for something that you're looking for,

(43:08):
you pick it up and fix the thing. Anyway, we've got to wrap up soon. I've got just a couple of questions to put you on the spot here. One is, do you know what request volume Shopify does per second? The public number that I can say is about eighty thousand requests per minute. And what about background jobs?

(43:30):
How many background jobs are being processed per minute? That's a great question, and to be honest, I don't remember those numbers off the top of my head. Yeah, yeah. It's a lot, right? Yeah, it's a lot, and it can be very spiky.

(43:50):
And there is a huge difference between steady state and spiky state, because Shopify is also hosting some of the world's largest sales, sometimes for celebrities, sometimes it's World Cups and some special sales, where millions of people try to crash

(44:14):
those stores. Yeah, I can imagine. CodeFund is tiny in comparison; since January we've done over three hundred million. Wow, that still feels like a lot to me. Yeah. We keep changing what's in the background and what's not in the background, so we've had that number kind of artificially inflated at times. But still, yeah, that's a lot of background work.

(44:37):
Yeah, makes sense. All right. Well, I'm going to push us to picks. Nate, do you want to start us off with the picks? Sure. So I guess one pick for me today is open source, how fantastic open source is. I've got a thing on the side that I'm doing for my brother-in-law, and it's basically a CRM. So I went

(44:57):
kind of diving around for open source tools that I might be able to use to set up for him, and I found Fat Free CRM, which is a Rails-based CRM. It's a bit antiquated in the, uh, you know, the way it looks in terms of the UI and UX, but it's pretty fantastic; the data model's solid and it meets all of his needs, which is

(45:21):
terrific. The other pick I've got is cats. So we've got a Maine Coon and a Russian Blue, and they just provide so much joy for my girls and for the family in general. So I highly recommend getting a pet, and especially a cat. Nice. I'm gonna step in here with a couple of

(45:44):
picks. The first one that I have is a challenge that I've been doing. This is a challenge that has been less fun with a broken arm, but, you know, I started it because I just really want to prove to myself that I can do this. And yeah, doing it with a broken arm, I just wasn't gonna wait to heal, because it's several

(46:05):
weeks to heal a broken arm. Anyway, the challenge is called 75 Hard. It comes off of The MFCEO Project podcast with Andy Frisella, and I've picked his podcast on the show before. But anyway, it's basically a challenge that he made up. It essentially is a challenge to prove that you can, you know, do what you've got to do for

(46:27):
seventy-five days. So there are five rules, and if you violate any of the rules, then you have to start the seventy-five days over. And the first rule is you have to work out twice a day for at least forty-five minutes each time, and one of the workouts has to be outside. So if it's raining, if it's cold, if it's hot, if there's a hurricane, you know, whatever, you're going to work out outside.

(46:49):
And basically, he says that that just pushes you through the, you know what, sometimes you have to do stuff when the conditions aren't ideal. The other rules: you have to drink a gallon of water every day. You have to read ten pages of a book every day. You have to choose a diet and stick to it, no cheating, every day for seventy-five

(47:10):
days. So, a lot of diets, you know, people are like, well, I take a cheat day every week. No cheat days, no cheat days on 75 Hard. And then the last one is you have to post a status photo to social media. And so, yeah, I've restarted twice so far. The first time I forgot to read the ten pages, which was dumb. It was the one thing I kind of took for granted
(47:31):
that I do and I didn't doit. The other one, I got
a salad from Coasta Vida and Ididn't realize that I hadn't told them to
take the rice out of it.And I've been doing a Kido diet,
so yeah, so I started over. I felt really dumb about that.
I was like, I know theyput rice in it. I don't know

(47:51):
why I didn't ask them to takeit out. So yeah, So it's
just kind of learning to adapt tosome of this stuff. But I'm definitely
enjoying the pros. And incidentally,just to throw it out there, so
I've I've been doing the the challengefor about a week and a half and
you know, and I'm currently onday two. Just to throw that in

(48:12):
the right because I had to restart. The flip side is is that I've
lost ten pounds and know we canthat's a serious program, like you're gonna
be committed. Yeah, but hesays it's a mental toughness challenge. Right,
You're going to go and some daysyou're just gonna have to push through
do some stuff but you really don'tfeel like doing. Yeah, like the

(48:35):
run to that I have scheduled today, it'll probably beat both of my forty
five minute workouts together. Because it'sit's one of my longer training runs for
the marathon I'm gonna run in October. And yeah, I'm really feeling it
today, especially with my arm andeverything else. I do not want to
go out there and do it,but you know, I've got to suck
it up and go do it,so anyway, but yeah, you know,

(48:58):
I've got to go do two workoutstomorrow, and tomorrow's a holiday,
so yeah. Anyway, so thatthat's my pick. If you want to
go follow me on Instagram, Ithink my handle is Charles max Wood.
Then I've been posting my uh mysocial media posts there. I tend to
try and post them to Twitter andFacebook as well, but I'm not always
great about that. I'm pretty consistenton Instagram. So anyway, here,

(49:22):
do you have some picks for us? To be honest, I don't know, like, the format very well. If you can, just, are there one or two things that you think everybody in the world should know about? Right. This one, I think, would be interesting for the

(49:42):
main audience, like Ruby developers. A couple of weeks ago, I followed a hacking guide from MRI committers that shows you how to build Ruby, how to change some simple source code and rebuild it again and see how it works, which also allows you to try all the new features that are coming

(50:05):
with Ruby 2.7, because you build it from the master branch, so you can go and try stuff like pattern matching, if that's something that you're excited about. And the reason why it can be interesting for any Ruby developer to try is because you get to see all the magic behind it, just

(50:27):
all the C code, and it becomes no longer just a thing that some Ruby committers that I have no idea about build, and it becomes something that you can understand a little bit better, maybe. And I think that hacking guide was also made to reduce the barrier to start doing that open source work.

(50:53):
So I think this pick falls back to the pick that Nate brought up about open source being awesome. We'll link that too. Very cool. Yeah, cool. One more question: if people want to find you online and see what you're working on these days, how do they find you? Yeah, it's kirshatrov on Twitter or kirs on GitHub. Awesome. All right, well, thank

(51:17):
you for coming. This was really interesting. I want to ask, like, a dozen more questions, but we just don't have time, so maybe we'll have you come back. Thanks for inviting me. I'll be happy to come back. All right. Well, let's go ahead and wrap this one up, folks, and we'll come back next week with another episode. Thanks a lot. Bye-bye.