---
title: "Scaling openings.moe"
date: 2016-03-18T17:08:57+02:00
slug: "scaling-openings-moe"
aliases: ["/scaling-openings-moe/"]
image: /content/images/2016/03/Blog-Header-1.png
---
**This is a blog post that was suggested by a puny little [tweet](https://twitter.com/justgalym/status/710470434882510848): the infrastructure behind openings.moe. I could make a nerdy post about how it works, discussing configs, cache directives, traffic shaping, web server optimizations and so forth. But that wouldn't be fun. So I thought I'd rather write about the *history* of openings.moe's infrastructure. (Brace yourselves, this will be a long read)**
It's now been roughly one year since I first made openings.moe. It started as a quick project during Easter 2015, and the video shuffle code was (literally) only about 15 lines - it has grown a lot since then. In February 2016 alone we handled roughly 18 TB of bandwidth, and the openings.moe domain was resolved exactly 1 314 021 times. That's a lot of weebshit.
openings.moe started out as most projects do these days: a cute little $5 droplet from DigitalOcean, sitting behind CloudFlare for DDoS protection. For a while, this worked fine. But fast-forward a month or two and we were easily blowing past the 1 TB of included bandwidth, so we needed to look for other solutions.
# Servers
At the time of writing, openings.moe consists of four main servers:
* eu1, our Netherlands edge server.
* Mio, our US edge server.
* Neko, our France edge server.
* And the origin server, also in the Netherlands.
In addition, eu2 was just shut down today as Neko took its place. That's a fair number of servers.
But it's not surprising if you think about it. 18 TB per month turns out to be a constant load of about 55 Mbit/s if you do the math. And that's not even considering the spikes. The worst spike we've experienced so far was December 23rd 2015, when we sustained 1.3 Gbit/s of bandwidth for 6 hours straight (I guess that also shows how weebs prefer to celebrate Christmas?) - Now I'm not planning to stay ahead of this all the time, that's just wishful thinking and frankly it's also dumb sysadmin work. But it can't happen too often.
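(For those who want the math spelled out: 18 TB is about 144,000,000 megabits, a 30-day month is 2,592,000 seconds, and 144,000,000 / 2,592,000 ≈ 55 Mbit/s, around the clock.)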
So when the first server wasn't enough, we turned to CloudFlare. Sadly though, that didn't really work out. Users were reporting lag all over the world and I couldn't find the cause. After a lot of global benchmarking and testing, I suspected that CloudFlare was capping our bandwidth (mainly to the range of about 2-3 Mbit/s per connection). And it appears that I was correct:
![CloudFlare plans](/content/images/2016/03/firefox_2016-03-18_08-50-17.png)
Note that this is the only place where a speed difference is mentioned. I have no issue with the limit itself, but I still wish it was clearly stated; hidden conditions like this are what I like to refer to as "a bullshit move". However, that didn't change our situation. We had to get off CloudFlare's back, which led to the birth of eu1.
At this point it was simple. eu1 was a reverse proxy running nginx (more on this choice later) that pointed to our origin server. It was nothing but simple DDoS protection. You see, people from IRC still think it's fun to fire off cheap DDoS scripts from the net and "hack" people. (Jesus, grow up already)
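For the curious, a setup like eu1's doesn't take much. A minimal nginx reverse proxy sketch looks roughly like this - note that this isn't our actual config, and the origin hostname is a placeholder:
```nginx
# Minimal reverse proxy: accept requests here, fetch from the origin.
# "origin.example.com" stands in for the real origin server.
server {
    listen 80;
    server_name openings.moe;

    location / {
        proxy_pass http://origin.example.com;
        # Pass the original host and client address along to the origin.
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```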
I was fine on DigitalOcean since they currently don't charge for extra bandwidth, but I'm not a fan of exploiting stuff like this. If we overdid it and DigitalOcean had to start enforcing limits, even for those who only use 1.1 TB instead of 1, it'd cause problems for a lot of people other than us. (I'm sure some random guy named Carl wouldn't be thrilled to pay an extra $1 because one of his blog's cat pictures got popular)
So we decided to split the load, and us1 was born - which was also a $5 droplet at the time.
For the next half year or so nothing really changed, except for the servers getting scaled up to $10 droplets. There were a couple spikes here and there causing slowdowns, but nothing major. Until Christmas 2015 hit.
We saw the largest increase in traffic ever. We're talking a daily average of 80 Mbit/s, which was almost 10x our regular traffic at that time. That led to us making a pretty dumb move: we moved us1 over to HostHatch, since they provided a couple more TB per dollar. It worked well, for about a week.
December 27th, right in the middle of most students' Christmas vacation, us1 got suspended for "Abusive bandwidth usage". So effectively, we got our available bandwidth halved right in the middle of our yearly peak. Not too good. During the 4 hours it took all the Twitter mentions to wake me up at around 5 AM, the site was pretty much unusable. It was unable to handle streaming, and for a streaming site, that's pretty rough. This led to what is probably the fastest server setup I have ever done: it took me approximately 12 minutes to set up an entirely new server on DigitalOcean and configure it to work as it should. This server is still named us1 today.
Once again eu1, us1 and main were all hosted on DigitalOcean, but this taught us an important lesson: we need some headroom. When you're hosting a site that's both bandwidth intensive and not perfectly copyright compliant (though we are non-profit, which thankfully resolves a lot of arguments), servers *will* get suspended. So eu2 was born.
This was a server hosted at [Scaleway](https://www.scaleway.com/), with a 200 Mbit/s unmetered connection. To make it a bit more useful, it was configured to handle most of south-west Europe by default, such as France and Spain. This system ran from January 2016 until today, the only change being that us1 was renamed to "Mio" in the middle of all this.
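How visitors actually get steered to a regional edge is beyond this post, and in practice it's usually done at the DNS level, but just to illustrate the idea, here's a crude nginx-level version using the geoip module. Hostnames and the database path are placeholders, not how we do it:
```nginx
# Illustrative country-based steering, not our actual setup; these
# directives go in the http block. Requires nginx built with the
# geoip module and a MaxMind country database.
geoip_country /usr/share/GeoIP/GeoIP.dat;

map $geoip_country_code $edge {
    default eu1.example.com;   # placeholder edge hostnames
    FR      neko.example.com;
    ES      neko.example.com;
}

server {
    listen 80;
    server_name openings.moe;
    # Bounce the visitor to the edge closest to them.
    return 302 $scheme://$edge$request_uri;
}
```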
This week, as openings.moe turns a full year (or more, honestly I have no idea exactly when we started), eu2 was shut down and Neko took its place. Neko is a dedicated server with 4 CPU cores, 8 GB of RAM and 50 GB of pure-SSD cache space. It's currently on its test period, and if it stacks up, eu1 might also get shut down. This would both lower costs and give us more bandwidth to work with.
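I promised not to turn this into a config post, but for the curious: an SSD cache like Neko's can be built with nginx's proxy_cache, along these lines (all paths, sizes and names below are made up for illustration):
```nginx
# Cache fetched videos on the local SSD so repeat views don't
# have to hit the origin. Sizes and paths are illustrative.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=videos:64m
                 max_size=45g inactive=7d use_temp_path=off;

server {
    listen 80;
    server_name openings.moe;

    location /video/ {
        proxy_pass http://origin.example.com;  # placeholder origin
        proxy_cache videos;
        proxy_cache_valid 200 7d;
        # Serve a stale copy if the origin is down or slow.
        proxy_cache_use_stale error timeout updating;
        # Handy for debugging: HIT/MISS/STALE per response.
        add_header X-Cache-Status $upstream_cache_status;
    }
}
```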
We're hoping to keep on scaling to accommodate more visitors in the future. Weebs are life.
# Web server
I mentioned [nginx](https://www.nginx.com/products/feature-matrix/) earlier - for those who don't know, this is our web server. It's the very core of openings.moe on the software side, and of course we're using the open source version. There are many web server options available, the most popular being Apache. But there were plenty of reasons that tilted me towards nginx, beyond the obvious ones at least:
![Hah](/content/images/2016/03/nginx-apache-memory.png)
nginx is a fast web server, darn fast. The only ones that get close would be [lighttpd](https://www.lighttpd.net/) or [Caddy](https://caddyserver.com/). It serves requests at record speeds and is one of the fastest web servers available, if not *the* fastest.
This helps a lot with loading; users are here to watch a video, and if they have to wait for the player and all of its resources to load, most people will be annoyed. That's like walking into a store to buy a lollipop, but they're sold out so you have to wait for a shipment from China before you can buy one. Though it's slightly less annoying than that.
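To give a taste of what that means in practice, here's the kind of static-file tuning that helps a video site. Again, this is illustrative rather than our actual config; the path is a placeholder, and the mp4 directive assumes nginx was built with --with-http_mp4_module:
```nginx
# Serve video files straight from disk as efficiently as possible.
location ~ \.mp4$ {
    root /srv/videos;   # placeholder path
    mp4;                # pseudo-streaming: lets players seek via ?start=
    sendfile on;        # kernel-level file-to-socket copies
    tcp_nopush on;      # fill packets before sending them
    expires 30d;        # let browsers cache the files aggressively
}
```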
Some may argue that a solution like HAProxy or Varnish is better for the reverse proxy side of things, and that might also be true for all I know. But nginx is a well-supported piece of software, and everyone who has ever used it treats it like it's their waifu. Using nginx through the entire stack allows for both consistent configuration and a massive amount of online support.
nginx is also pretty much always the first web server to get new features. Why live like a snail when you can just cut yourself on the HTTP/2 edge as soon as possible? (Also, while we're on the subject: I'm hoping to get HTTP/2 working on openings.moe, but I'm waiting for Ubuntu 16.04 to appear first, plus a couple of reviews of it as a web server platform)
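When that day comes, turning it on should be about as simple as this (nginx 1.9.5 or newer; the certificate paths are placeholders):
```nginx
# HTTP/2 in nginx effectively requires TLS, so a certificate is needed.
server {
    listen 443 ssl http2;
    server_name openings.moe;
    ssl_certificate     /etc/ssl/openings.pem;  # placeholder
    ssl_certificate_key /etc/ssl/openings.key;  # placeholder
}
```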
And since I already use nginx for my personal servers, sticking with it was easier than relearning an entire stack as well.
# Conclusion
In summary, we've met surprisingly few bumps in the road along this run - maybe it's luck, or god forbid I actually know what I'm doing here. Either way, I want to sum up how openings.moe runs by showing you an image of DigitalOcean's "optimal" application life cycle, edited to reflect how we **actually** run this site:
![openings.moe workflow](/content/images/2016/03/aniopfunction.png)
When all is said and done, the site is still alive and kicking better than ever. So happy Easter and try not to watch too much weebshit, you'll end up brainwashed by Japan. (Unless you already are, then just go ahead. Nothing left to lose)