MySocket on Stackpath, exploring a more flexible and heterogeneous multi-cloud infrastructure
In this blog post, we'll look at the mysocket journey of moving from a single cloud provider to a more flexible and heterogeneous multi-cloud infrastructure. We moved from using just AWS to now also include Stackpath. We'll see how mysocket now relies on two independent anycast networks and what that means for the build and deploy processes as well as traffic load balancing options.
For the last few months, the mysocket infrastructure has been running on AWS without any issues. I chose AWS for no specific reason other than it has a wide range of supporting services that make building a cloud-delivered service easy. However, I’ve always felt there were other compute providers that Mysocket could be running on. I figured it’s time to put this to the test and see what it would take to run mysocket on a multi-cloud infrastructure.
By actually going through the process of building this, we’ll gain a better insight into what it would take to move to a multi-cloud infrastructure. It will also put us in a better place, as not being tied to one cloud provider gives us much more flexibility and eliminates provider lock-in. You can follow along with my journey in this blog post.
High-Level architecture overview
Before we dive in, it's important to have a basic understanding of the architecture. At a high level, Mysocket has three main components. The section below provides a brief overview of each.
The control plane is responsible for making sure that all changes made through the API by users, or changes in the number of available compute nodes (edge proxies and tunnel servers), are correctly and swiftly signaled to all edge servers. For example, when a user creates a new socket or a new tunnel connects, the control plane makes sure all edge proxies are made aware of this change and configure themselves accordingly within a few seconds.
The management plane has everything to do with logging and monitoring. Its role is to provide the operator of the mysocket service with all the graphs, logs, etc., needed to see what’s happening. This is where we go to make sure everything runs fine and, when there’s a hiccup, to find out what happened, when, and why.
The data plane layer consists of the servers that carry all the users' traffic; these are the real workhorses of the service. As a result, these servers make up the vast majority of our fleet. It’s a collection of compute nodes running both the tunnel servers and the proxy servers, i.e. this is where the tunnels terminate and where all requests for *.edge.mysocket.io are served.
Any hiccup in the data plane will affect the user's experience. As a result, we’ve taken great care in making sure that it’s highly available and as close to the users as possible. We’ve achieved this using anycast, implemented using the AWS Global Accelerator service. Anycast allows us to have many globally distributed compute nodes behind the same IP addresses.
We’ll focus on the data plane layer for our multi-cloud experiment since the vast majority of the cost and potential lock-in is with these servers.
Build and Deploy architecture
Before we continue, I should share a bit more about how we’re currently building and deploying images. At a high level, the process is pretty simple and follows industry best practices.
Each time there is an update to any of the software components on the data nodes, and we’re ready to deploy them, we build a fresh VM image. To do this, we spin up a fresh VM and kick off our build scripts. These scripts install all the software components and configure the base VM according to our needs. We then take a snapshot of this VM, the image is stored, and we finish the build process by destroying the build VM.
Next up, we bring up new compute nodes with the freshly built VM image. As part of this deploy process, we specify the latest VM image, the size of the VM (i.e. the number of CPU cores and the amount of memory), and finally how many instances we want and in which data center locations to deploy them.
Since we want to deploy often, it’s important to automate and streamline this process as much as possible. This makes sure we can deploy multiple times a day and have a high level of confidence that it will all work as expected. To achieve this, we rely heavily on automation, i.e. scripts that automatically build the images, orchestrate the infrastructure, and ensure they get deployed gracefully.
This automation uses the API provided by cloud providers. In our case, we use Terraform pretty heavily to manage almost all components of the build and deploy pipeline.
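To make the workflow above concrete, here is a minimal Python sketch of the build and deploy steps. It is not the actual mysocket pipeline (which is driven by Terraform); the FakeProvider class, its method names, and the size and location values are all hypothetical stand-ins used to show the order of operations.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeProvider:
    """Hypothetical stand-in for a cloud API; it only records the calls made."""
    name: str
    calls: List[str] = field(default_factory=list)

    def create_build_vm(self) -> str:
        self.calls.append("create_build_vm")
        return f"{self.name}-build-vm"

    def run_build_scripts(self, vm_id: str) -> None:
        # A real implementation would install and configure the software here.
        self.calls.append("run_build_scripts")

    def snapshot(self, vm_id: str) -> str:
        self.calls.append("snapshot")
        return f"{vm_id}-image"

    def destroy_vm(self, vm_id: str) -> None:
        self.calls.append("destroy_vm")

    def deploy(self, image_id: str, size: str, counts: Dict[str, int]) -> List[str]:
        # counts maps a data center location to the number of instances wanted.
        self.calls.append("deploy")
        return [f"{self.name}-{loc}-{i}"
                for loc, n in sorted(counts.items())
                for i in range(n)]


def build_image(provider: FakeProvider) -> str:
    """Build a fresh image: boot a VM, provision it, snapshot it, destroy it."""
    vm_id = provider.create_build_vm()
    try:
        provider.run_build_scripts(vm_id)
        return provider.snapshot(vm_id)
    finally:
        # The build VM is always destroyed, even if a build step fails.
        provider.destroy_vm(vm_id)


if __name__ == "__main__":
    provider = FakeProvider("stackpath")
    image = build_image(provider)
    print(provider.deploy(image, "2cpu-4gb", {"ams": 2, "sjc": 1}))
```

Keeping the provider-specific calls behind one small interface is what lets the same build and deploy steps be reused when another provider is added.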
Adding Stackpath as a second cloud provider
I chose Stackpath as my second cloud provider, mostly because of my positive past experiences with them and, more importantly, their support for Terraform and anycast. We’ll use their anycast feature to implement high availability and load balancing.
Since both Stackpath and AWS (our current cloud provider) support Terraform, integrating additional data plane providers should be easy.
I used this blog and notes as an example to get started with the new Terraform file(s) for Stackpath. All in all, the addition of Stackpath was pretty simple. I did need to make a few changes, but luckily most of them were confined to a separate Terraform template for each provider.
I started with a copy of the build and deploy scripts for AWS and modified them for Stackpath. As the workflow is mostly the same except for a few implementation details, this was all fairly easy.
I did need to make some changes around secrets management. For example, I’m using CloudWatch for logging. In the AWS world, we’re using a feature AWS calls “IAM roles”, which allows the VMs to get the required credentials automatically. This obviously doesn’t work with machines outside of EC2, so I ended up having to statically pass in the proper CloudWatch credentials for the Stackpath VMs.
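The credential difference between the two providers can be sketched as follows. This is an illustration only: the post does not say how the static credentials are actually delivered to the Stackpath VMs, so reading them from environment variables is an assumption, and the function name cloudwatch_credentials is hypothetical. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard variable names read by the AWS SDKs.

```python
import os
from typing import Dict, Mapping, Optional


def cloudwatch_credentials(
    provider: str, env: Mapping[str, str] = os.environ
) -> Optional[Dict[str, str]]:
    """Return explicit CloudWatch credentials, or None to use the IAM role.

    On EC2 the instance's IAM role supplies credentials automatically, so
    nothing has to be passed in. Anywhere else, static keys are required.
    """
    if provider == "aws":
        return None
    try:
        return {
            "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
            "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        # Fail early at boot rather than silently losing logs later.
        raise RuntimeError(
            f"static CloudWatch credentials required on {provider}: "
            f"missing {missing}"
        ) from None
```

Failing loudly when the keys are absent is a deliberate choice in this sketch: a node that cannot ship logs is better caught at deploy time than discovered during an incident.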
The multi-cloud setup is already live and serving traffic. I'm using round-robin DNS to load balance traffic between the AWS and Stackpath anycast IP addresses. This means each DNS answer contains both anycast IP addresses, and the client picks one of the two more or less at random.
$ dig +noall +answer wispy-flower-5537.edge.mysocket.io
wispy-flower-5537.edge.mysocket.io. 300 IN A 126.96.36.199
wispy-flower-5537.edge.mysocket.io. 300 IN A 188.8.131.52
After the client selects either the AWS or Stackpath IP addresses, it will route the traffic to the selected cloud provider. From there on we rely on anycast to get the user's traffic to the closest mysocket edge node.
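From the client's point of view, the selection step described above can be sketched in a few lines of Python. This is a simplified model rather than what any particular resolver or browser does; the function names are hypothetical, and the addresses in the usage example are documentation placeholders, not the real anycast IPs.

```python
import random
import socket
from typing import List, Sequence


def resolve_a_records(hostname: str) -> List[str]:
    """Return the distinct IPv4 addresses the system resolver gives us."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def pick_address(addresses: Sequence[str], rng=random) -> str:
    """Pick one address uniformly at random, as round-robin DNS assumes."""
    return rng.choice(list(addresses))


if __name__ == "__main__":
    # Placeholder addresses standing in for the AWS and Stackpath anycast IPs.
    anycast_ips = ["192.0.2.10", "198.51.100.20"]
    print(pick_address(anycast_ips))
```

Once an address is chosen, everything after that is handled by the network: anycast routing within the chosen provider delivers the traffic to the nearest edge node.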
I will continue running in this hybrid AWS - Stackpath mode for a while, just to make sure it keeps working as expected and to gain more operational experience.
For now, round-robin DNS between the two anycast networks is fine. But since I'm already using NS1 as the DNS provider for these records, it may be interesting to add some of their smart routing capabilities to route traffic based on capacity or cost. This way, we could send, say, 75% of the traffic to Stackpath and 25% to AWS.
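The effect of such a 75/25 split can be illustrated with a small weighted-selection sketch in Python. It does not use or model NS1's actual configuration; the WEIGHTS table and the choose_provider function are hypothetical and only show how a weighted answer would shift traffic.

```python
import random
from collections import Counter
from typing import Dict

# Hypothetical weights: the share of DNS answers pointing at each provider.
WEIGHTS: Dict[str, int] = {"stackpath": 75, "aws": 25}


def choose_provider(weights: Dict[str, int], rng=random) -> str:
    """Pick a provider with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs give the same tally
    tally = Counter(choose_provider(WEIGHTS, rng) for _ in range(10_000))
    for name, count in tally.most_common():
        print(f"{name}: {count / 10_000:.1%}")
```

Because the weights are just numbers, they could in principle be derived from capacity or cost per provider and adjusted without touching the edge nodes themselves.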
Wrapping up; lessons learned and looking forward
I’m pretty happy with the result! Mysocket edge services (proxies and tunnel servers) are now multi-cloud and are serving traffic in both AWS and Stackpath. The service is anycasted in both cloud providers, providing us with needed global load balancing and high availability.
Good automation around build and deploy processes made introducing an additional cloud provider easy. This example was with Stackpath, but I’m pretty confident it will be just as easy with Equinix Metal or Vultr, or any other provider with Terraform and BGP or anycast support. All in all, it took me about two days to add Stackpath as an alternate provider; most of that work was implementing the automation and testing.
Looking forward, I think I’ll let mysocket run on both cloud providers for a bit to continue getting operational feedback. I may start treating the two providers as separate deployment cycles, where the initial deployment goes to, say, AWS, and a few hours later to Stackpath. That way, we always have some redundancy in case of a bad deployment.
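A staggered rollout like this could be sketched as below. This is only one possible shape for the idea: the staged_rollout function and its deploy and healthy callbacks are hypothetical, and the multi-hour wait between providers is left out to keep the example short.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    providers: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one provider at a time and stop at the first unhealthy one.

    Returns the providers that were deployed and passed the health check.
    Providers after a failed one are never touched, so they keep serving
    traffic with the previous, known-good image.
    """
    completed: List[str] = []
    for provider in providers:
        deploy(provider)
        if not healthy(provider):
            break
        completed.append(provider)
    return completed


if __name__ == "__main__":
    log: List[str] = []
    # Simulate a bad deployment on the first provider.
    done = staged_rollout(["aws", "stackpath"], log.append, lambda p: p != "aws")
    print("deployed to:", log, "healthy:", done)
```

In the simulated failure above, only the first provider receives the bad image, which is exactly the redundancy the staggered schedule is meant to provide.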
Similarly, instead of using DNS round-robin for tunnel connections, we could further improve the reliability of the tunnel service by having the client initiate multiple connections. For example, one tunnel to cloud provider A and one tunnel to cloud provider B.
In summary, I think introducing more cloud providers is pretty easy as long as there is support for automation, preferably with Terraform as that fits the current workflow, and some form of anycast or BGP support. There are a few cloud providers out there that should support this out of the box; perhaps we’ll look at those next :)