Well, folks, I'm not gonna sugarcoat it. Here, I'll pad with some new stats to make it look slightly better. From Tuesday, February 10 through Monday, February 16:
- Official Solder Uptime: 100%
- Official Solder Estimated Satisfaction (T=0.1 seconds): 92%
- Official Solder Usage: Over 62 million requests (Note: Last week we reported 18 million Solder requests, but that figure was off by a factor of 3)
- Platform Uptime: 99.72%
- Modpack Runs: Over 5 million
- Fresh modpack installs: Over 1 million
- Launcher API Requests: Over 94 million
Well that doesn't seem so bad.
- Launcher API Estimated Satisfaction (T=0.2 seconds): 85%
Oh, well you can
- Launcher API Uptime: 92.9%
So, this week we worked on some side projects and packs we'll be able to announce later, but mainly we were working on this. As a result, I don't have anything that I can discuss this week except the Launcher API. I know that you want to hear that we'll fix it, so here's the good news: we fixed it three hours ago. Here's what happened.
Basically, sct Rewrote the API
We deployed the rewrite a few hours ago. It's working great, better than we imagined. It did break some stuff for about 20 minutes after we deployed it. Your launcher might have cached some bad data during that period, so if your launcher is acting weird, select the pack that's giving you guff and give it a quick moment to update from the API. This week I'm going to investigate what I can do to make the launcher finish at least trying to refresh a pack before letting you play it. But otherwise things are good, really good. Here are all the ways the new API is better:
- Previously we cached pack data for between 1 and 10 minutes (depending on the type of pack data) in the API after which we'd pull replacement data from the Platform. Pulling replacement data is pretty costly, so we wanted to not do it very often. But we also wanted to not annoy pack authors by having long cache times which would make it harder to see and test their changes. It turns out, different cache times for different data failed at both goals.
- Instead, we now have long cache times, 60 minutes, but changing your pack on the Platform kicks off a job that updates the API cache after just a few moments. Now you'll see your changes in the API almost instantly, and with huge cache times, our database can rest easy.
- Another problem is that doing a cross-country data pull in the middle of a web request turned out to be a really great way to generate slow responses. For packs with a lot of traffic, a ton of DB queries could be generated between the time the cache expired and the time the first request re-filled it. Now if your pack receives a request in the 10 minutes before the cache expires, a job is kicked off to update the cache before it expires. This means that our heaviest-trafficked packs are also our fastest to serve: there is no time when useful data isn't in the cache.
- This is the important part: the jobs that I mentioned are running on separate hardware from the API, meaning that when you're trying to use the launcher, you don't have to share time with the work being done to maintain the cache. With the exception of rarely-used packs, the web half of the API is basically a web service pulling text from cache and spitting it at you. If your pack is in the cache, your response will be basically instant.
- We were trying to do a lot with the old version: different parts of the pack data had different cache times. We were trying to give a short cache time to the stuff that had to be updated often, while the less-updated stuff was cached for longer. The result was that we had to check several different caches when building a response. Even a fully-cached pack required several cache gets. Since the Platform is now proactive in rebuilding the cache, we can give everything the same cache length and cache a single, pre-constructed response. This means that even fully-cached pack data is much faster.
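To make the caching changes above a bit more concrete, here's a minimal Python sketch of the general idea. It's an illustration under assumptions, not the actual API code: the names `PackCache`, `on_pack_changed`, `run_jobs`, and `build_response` are hypothetical, the in-memory job list stands in for a real job queue running on separate worker hardware, and building a missing pack inline during the request is only a guess at how rarely-used packs get handled. The 60-minute cache time and the 10-minute refresh window are the only numbers taken from this post.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

TTL_SECONDS = 60 * 60             # long cache time from the post: 60 minutes
REFRESH_WINDOW_SECONDS = 10 * 60  # refresh early if hit in the last 10 minutes


@dataclass
class Entry:
    body: str                     # one pre-constructed response per pack
    expires_at: float
    refresh_queued: bool = False  # avoids queueing the same refresh twice


class PackCache:
    """Hypothetical single-response cache with push and refresh-ahead updates."""

    def __init__(self, build_response: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self._build = build_response  # the expensive pull of pack data
        self._clock = clock
        self._entries: Dict[str, Entry] = {}
        self.jobs: List[str] = []     # stand-in for a real job queue

    def rebuild(self, slug: str) -> None:
        """Build the full response and store it with a fresh TTL."""
        self._entries[slug] = Entry(self._build(slug),
                                    self._clock() + TTL_SECONDS)

    def on_pack_changed(self, slug: str) -> None:
        """Called when a pack is edited: queue a cache update right away."""
        self.jobs.append(slug)

    def get(self, slug: str) -> str:
        """Web half: serve from cache, queueing a refresh near expiry."""
        now = self._clock()
        entry = self._entries.get(slug)
        if entry is None or entry.expires_at <= now:
            # Assumption: a cold or expired pack is built inline (slow path).
            self.rebuild(slug)
            return self._entries[slug].body
        near_expiry = entry.expires_at - now <= REFRESH_WINDOW_SECONDS
        if near_expiry and not entry.refresh_queued:
            entry.refresh_queued = True
            self.jobs.append(slug)    # a worker refreshes it before expiry
        return entry.body

    def run_jobs(self) -> None:
        """Worker half: drain the queue (separate hardware in real life)."""
        while self.jobs:
            self.rebuild(self.jobs.pop(0))
```

The point of the split is that `get` never does the expensive rebuild for a pack that's getting regular traffic. It only reads one cached value and, at most, appends a job for the worker to pick up.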
So what effect has this had on our service?
- Well, the servers stopped crashing. So, there's that.
- The response speed is so fast now, it's hard to explain. I'm looking at a graph of our response times over the last half hour: our slowest 1% of traffic peaked at an average response time of 0.35 seconds. Yeah. It was usually faster than that.
- Our Estimated Satisfaction was so high that I had to reduce the expected response time from 0.2 seconds to 0.1 seconds just so that I can see a difference between good and "bad" traffic.
- Previously, our traffic numbers were so spiky that we couldn't use autoscalers reliably and had to manually scale our hardware. That is no longer the case. We now have robots watching the shop for us, which means that we don't have to have someone wake up at 7 AM on weekends to manually restart in case of an outage or scale in case of a traffic surge.
- Usage on the DB shared by the Platform and API is a third of what it was.
- Oh, also our hosting bills are down. Adding separate worker hardware saved us money. Weird, right?
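If you're wondering why lowering the expected response time makes the satisfaction number more useful, here's a small Python sketch. It assumes Estimated Satisfaction is an Apdex-style score, which this post doesn't actually confirm: requests at or under T count as satisfied, requests between T and 4T count half, and anything slower counts for nothing. The function name `apdex` and the sample response times are made up for illustration.

```python
from typing import Iterable


def apdex(response_times: Iterable[float], t: float) -> float:
    """Apdex-style score: (satisfied + tolerating / 2) / total."""
    satisfied = tolerating = total = 0
    for rt in response_times:
        total += 1
        if rt <= t:
            satisfied += 1       # fast enough: counts fully
        elif rt <= 4 * t:
            tolerating += 1      # slow but bearable: counts half
        # anything slower than 4T is "frustrated" and counts for nothing
    if total == 0:
        raise ValueError("need at least one response time")
    return (satisfied + tolerating / 2) / total


# Made-up response times in seconds, purely for illustration.
samples = [0.05, 0.08, 0.15, 0.35, 0.9]
score_old_threshold = apdex(samples, t=0.2)  # 3 satisfied, 1 tolerating
score_new_threshold = apdex(samples, t=0.1)  # 2 satisfied, 2 tolerating
```

With the same traffic, the tighter threshold gives a lower score, so when nearly every request is already quick, a smaller T is what lets the slow ones show up again.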
So obviously it's early days yet: it went live on a school night at the end of a 3-day weekend. But this early data is incredibly positive. It's night and day, even compared to this time last week. I want to thank all of you for sticking with it. I promise that these past few weeks haven't been in vain; they've been vital for us gathering data and building a plan. I understand that it doesn't make much sense for it to take this long, but it's tough to gather data on what's going wrong when you're handling traffic during the week without breaking a sweat and getting murdered on the weekends. It's hard to see what works and what doesn't. The good news is we feel like this solution will end major problems with the API. I'm sure we'll find other problems to correct, in the same way we still find problems with the Platform. But we're now at the point where you can expect the Launcher API to be working on weekends. We look forward to serving you launcher data this Saturday! And I look forward to sleeping in on Sunday.