by Kirill, Animesh - Dec 12, 2023
Greetings from the team behind Chips of Fury! We've been on an exciting journey, crafting an online poker experience that is built with passion for the game and care for our players. In our quest to provide a pleasant gaming experience, we’ve conducted a detailed study of hosting providers out there. Fly.io was one of them - it is an infrastructure provider with a bunch of cool features, easy global accessibility, and more than reasonable hosting costs. Also, Chips and Sandwiches go great together, so we obviously went all in.
The experience has been great so far - Fly.io is a delight to use in most regards. There are a few caveats that we’ve discovered, however, and we would like to share them.
Now, let us deal the cards…
When Chips of Fury players initiate a game, our orchestrator checks for available servers. Existing servers seamlessly host players, but due to obvious limitations, new servers must be started sometimes.
These newly spawned game servers, despite receiving the internal green light with regards to their health, exhibit a hesitation to promptly respond to external requests. This delay, as revealed by our benchmark data, probably surpasses the attention span of avid TikTok users (but do not quote us on that). This is a somewhat painful situation we have found ourselves in as a growing company - the long game server connection establishment might turn away new potential players, while always provisioning machines in all regions is a cost we would like to avoid.
Using Dart, the language Chips of Fury game servers speak, we created a straightforward script to collect some data on how long it takes for started servers to become responsive. It was kept running for 24 hours in regions crucial for us (
The application that the machines hosted is a basic HTTP server that has a few endpoints, including the essential /health. As to the numbers, we had two key metrics in mind:
For some reason, the
lhr region was misbehaving that day. Since it would not be very fair to judge the machines’ quickness to become responsive, some outliers have been removed from the overall picture (but note that this data highlights the occasional instability of Fly.io which we are also unhappy with).
Even with the extreme values removed, the startup time varies a lot from region to region, but that is not the main concern here.
Instead, it is the response times observed across different regions that raise red flags: 20-30 seconds for the machine to become available to the public internet is very long for our specific needs. Remember: the time to become responsive to traffic outside of the Fly’s internal network is recorded after the machines enter a started state. It seems that the Fly.io networking takes time to catch up and start routing traffic to the instance, and as stated above, for the server to be unresponsive for upwards of 20 seconds after startup is too long for a game server.
In addition to this problem, there are a number of smaller issues that Fly has, such as an inconsistent dashboard, machines getting stuck in some states (even during this benchmark one machine stuck in the "starting" state and remains totally unresponsive until today), and very little failure recovery paths. The last one can be especially painful - our main admin application has completely stopped working because both instances dedicated to it became unhealthy, and not due to application error.
Something else that perhaps not only us would benefit from is some sort of real time notifications about machine status updates.
While this analysis has shed light on the peculiar response times, it leaves us with lingering questions about the root cause. Our goal here is simple: share what we found, spark a conversation, and hopefully make the Fly.io experience better for Chips of Fury and everyone else.
Are there optimizations we've overlooked in our current infrastructure setup? Is there a possibility of unexpected behavior from the Fly.io proxy service? As we dive deeper into these mysteries, we invite the Fly.io community to join the exploration.