I believe I've located the source of the performance issues and have addressed it, given that everything hasn't ground to a halt (or even come close, according to my monitoring) over the past two nights, despite traffic being heavier than Monday and about equal to Sunday.
If you're interested in the details, read on.
There's a backend scheduler that runs tasks (loading battles, refreshing clan membership, running queued tasks). The scheduler automatically creates threads to do this, which is good because it means these things can run concurrently (not truly in parallel, but that doesn't really matter here).
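As a rough illustration only (not the site's actual code, and every name here is made up), the pattern described above looks something like this in Python:

```python
# Illustrative sketch only: a scheduler that starts a thread per recurring job,
# so tasks like battle loading and clan refreshes can overlap in time.
import threading
import time

def run_periodically(name, interval_seconds, job):
    """Run `job` every `interval_seconds` seconds on its own daemon thread."""
    def loop():
        while True:
            job()
            time.sleep(interval_seconds)
    thread = threading.Thread(target=loop, name=name, daemon=True)
    thread.start()
    return thread

# Hypothetical jobs standing in for the real scheduler entries.
run_periodically("battle-loader", 60, lambda: print("loading battles"))
run_periodically("clan-refresh", 3600, lambda: print("refreshing clan membership"))
```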
The way this all ends up interacting, at most the following could, in theory, be running at once:
- 2 Battle Loaders
- Hourly refresh for all clans
- Daily general maintenance tasks (for the ASIA server, around primetime for NA)
None of which, I suspect, is the issue on its own. However, there's another factor: those queued tasks.
There are basically three ways a task can be added to the queue: a manually triggered request to reload clan members, payout calculations, and user stats loading.
That final one, I think, was the crux of the issue. It's triggered whenever a user signs in, if their data hasn't been loaded recently. For players in a participating clan, this isn't triggered unless they joined the clan and then signed in between two automatic updates. For players who aren't in a participating clan, it's almost always triggered, given the likely gap between sign-in events.
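To make the trigger concrete, here's a hedged sketch of the kind of check involved; the field names and the 24-hour window are assumptions for illustration, not the real values:

```python
# Hypothetical sign-in hook: queue a stats refresh only if the user's data
# hasn't been loaded recently. Names and threshold are illustrative.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)  # assumed "recently" window

def on_sign_in(user, task_queue):
    last_loaded = user.stats_loaded_at  # assumed attribute
    if last_loaded is None or datetime.utcnow() - last_loaded > STALE_AFTER:
        task_queue.append(("load_user_stats", user.id))
```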
Prior to Tuesday, there was no (sensible) upper limit on concurrent queue tasks, so if a number of people signed in at about the same time and each triggered a player data refresh, then every 5 seconds a new thread would be created to run one of those updates. This usually works fine because the refresh is quite quick, but it still takes some time since it needs to send two requests to the WG API. Add in some server load and the tasks can start backing up.
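In Python-ish pseudocode, the old behaviour was roughly this shape (a sketch of the pattern, not the real implementation):

```python
# Sketch of an unbounded queue poller: every 5 seconds a fresh thread is
# started for the next task, with nothing limiting how many run at once.
import threading
import time

def poll_queue_unbounded(task_queue, run_task):
    while True:
        if task_queue:
            task = task_queue.pop(0)
            threading.Thread(target=run_task, args=(task,), daemon=True).start()
        time.sleep(5)  # a new worker thread can appear every 5 seconds
```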
Worse still, they don't go away when the (web) server is restarted, so they start clobbering the (entire) server again as soon as the web server comes back up, and the more they back up, the more they clobber. This eventually causes a massive amount of thrashing, with very high I/O wait times and I/O usage, and grinds everything to a halt.
After adding a sensible limit to keep this in check, along with some other limits as detailed over in this blog post, the massive performance issues have completely stopped as far as I can tell. That said, I suspect response times are a bit slower at peak load, since I reduced the number of concurrent requests on Sunday as a (largely ineffective) stopgap measure.
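For the curious, a cap like this can be as simple as a semaphore around the worker threads. This is a minimal sketch assuming a limit of 2, not the actual code or the actual value:

```python
# Sketch of a bounded queue poller: a semaphore caps how many queued tasks
# can run at the same time. MAX_CONCURRENT_TASKS is an assumed value.
import threading
import time

MAX_CONCURRENT_TASKS = 2
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_TASKS)

def poll_queue_bounded(task_queue, run_task):
    while True:
        if task_queue and _slots.acquire(blocking=False):
            task = task_queue.pop(0)

            def worker(current_task=task):
                try:
                    run_task(current_task)
                finally:
                    _slots.release()  # free the slot even if the task fails

            threading.Thread(target=worker, daemon=True).start()
        time.sleep(5)
```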
Barring any unexpected complications, I'll switch to the new server for serving web requests tonight, and along with that I'll increase the concurrency limit.