As reported in our previous post, we moved the vast majority of our Platform infrastructure to DigitalOcean. We mentioned that this wouldn't appear as a direct change for users, but the Platform should now respond more quickly than before. We have also improved stability by adding redundancy and failover to our critical systems. This means that when infrastructure issues happen, you will barely notice them compared to previous incidents! Since the changes are not very visible, I'll go into more detail about what exactly we've been doing.
Moving to DigitalOcean from our previous host, Linode, enabled us to consider new and better solutions for the Technic Platform. The move itself was complex: it required us to redo our entire code deployment setup and automated node configuration system, and to rebuild our network from the ground up. This took a lot of time to plan, stage, and execute, with some unavoidable downtime, but we made it happen!
Previously, our database ran on a single massive MySQL server. While it did its job, issues would arise whenever the database service needed to be brought down for maintenance, or due to an underlying issue with our provider's physical node. It has been replaced with a MariaDB Galera Cluster running in a master-master replication setup. This allows us to bring down nodes at will without impacting the availability of data to the Platform and Launcher. Rolling security updates and optimizations are now possible on a routine basis, rather than the headache of scheduling maintenance for the entire Platform.
You can read more about Galera Cluster here: http://galeracluster.com/
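To give a sense of how simple the moving parts are, a minimal Galera node configuration looks something like the fragment below. This is a hedged sketch, not our production config: the cluster name, node names, addresses, and library path are hypothetical placeholders.

```ini
# galera.cnf -- minimal sketch; all values below are illustrative placeholders
[mysqld]
# Galera requires row-based replication and InnoDB
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2

wsrep_on = ON
wsrep_provider = /usr/lib/galera/libgalera_smm.so

# All current cluster members; an empty list (gcomm://) bootstraps a new cluster
wsrep_cluster_name = "example_cluster"
wsrep_cluster_address = "gcomm://10.0.0.1,10.0.0.2,10.0.0.3"

# This node's identity within the cluster
wsrep_node_name = "db-node-1"
wsrep_node_address = "10.0.0.1"

# State snapshot transfer method used when a node (re)joins
wsrep_sst_method = rsync
```

With every node carrying the full dataset, taking one node out for patching is just a matter of draining its connections and stopping the service; the rest of the cluster keeps serving reads and writes.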
Error Reporting & Network Metrics
One of the biggest problems was identifying and isolating problems with the Platform. Users report problems daily, ranging from the Platform to Solder to the Launcher. We didn't have a great way of cross-referencing reports from users to reproduce and isolate the problems. To help with this, we implemented systems targeting two key things.
We utilize a system called Sentry (https://sentry.io) that allows our systems to remotely report and log errors as users experience them. We use this data to cross-reference reports from users about the issues they are experiencing. We've already fixed multiple issues that were plaguing the Platform unknown to us. These include:
- Complete overhaul of the email backend (registrations, password resets, etc.)
- Complete overhaul of our entire web asset system
- Rewrote Platform Solder integration
- More reliable Mod sub-page listings
- Multiple profile page tweaks
- and numerous bug fixes...
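The core idea behind this kind of error reporting is wrapping handlers so that unhandled exceptions are captured with context and shipped off for aggregation. The real integration uses Sentry's SDK; the sketch below illustrates the concept with a stdlib-only decorator, and the handler name and event fields are made up for illustration, not our actual code.

```python
import functools
import logging
import traceback

logging.basicConfig(level=logging.ERROR)

def report_errors(handler):
    """Capture unhandled exceptions with context before re-raising.

    A Sentry-style client would serialize this event and POST it to a
    collection endpoint; here we simply log it locally.
    """
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        except Exception as exc:
            event = {
                "handler": handler.__name__,
                "error": repr(exc),
                "traceback": traceback.format_exc(),
            }
            logging.error("captured event: %s", event)
            raise  # re-raise so normal error handling still applies
    return wrapper

# Hypothetical handler, standing in for something like a password-reset endpoint
@report_errors
def reset_password(email):
    if "@" not in email:
        raise ValueError("invalid email address")
    return f"reset link sent to {email}"
```

Because every captured event carries the handler name and a full traceback, identical failures aggregate together, which is how problems we had never seen reported became visible.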
The other problem was knowing where traffic was going, monitoring the health of each individual system, and identifying bottlenecks in real time. For this, we've set up Prometheus (https://prometheus.io/). Prometheus is an open-source monitoring system and time-series database. It allows us to export custom statistics from each of our individual core systems and reference them over time. To better track and analyze this data, we set up Grafana (http://grafana.org/), a metric dashboard system that plugs into the Prometheus backend. Some examples of our dashboards can be seen below:
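The way this works is that each service exposes its statistics over HTTP in Prometheus's plain-text exposition format, and the Prometheus server scrapes those endpoints on an interval. A minimal stdlib-only sketch of such an exporter is below; the metric name and counter are made up for illustration, and a real service would normally use an official Prometheus client library instead.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Counter incremented by application code; a real exporter tracks many such stats.
REQUESTS_SERVED = {"count": 0}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        # Prometheus plain-text exposition format: HELP/TYPE comments, then samples.
        body = (
            "# HELP app_requests_total Requests served since startup.\n"
            "# TYPE app_requests_total counter\n"
            f"app_requests_total {REQUESTS_SERVED['count']}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def start_exporter(port=0):
    """Serve /metrics in a background thread; returns the bound port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1]
```

Prometheus then records each scraped sample against a timestamp, which is what lets Grafana graph trends and spot bottlenecks across systems.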
We have a new status page to keep you guys informed about our maintenance windows and the current status of our Platform and its components. http://status.technicpack.net
More to come!
We've made great strides over the last couple of months, but this is only the beginning. We have some cool new features that we are excited to share with you soon, and with the Platform in better shape, we are confident in moving forward with them. Also, I hear that a band of adventurers is returning for another grand adventure!