Fixes
- Fixed an updated library causing internal server errors in certain cases.
Also, Friday morning I finished setting up the second server. However, there was a small but noticeable increase in latency (the actual delay appears to vary with the size of the data returned). This is the expected result: the database connection is now remote, which adds network latency, plus the time cost of encrypting that connection.
As such, given that the current server (once I fixed the backend scheduler clobbering it) appears to be more than capable of supporting the current load, I see no reason to degrade performance for no real gain.
Basically, I believe I've located the source of the performance issues and have addressed it, given that nothing ground to a halt (or even came close, according to my monitoring) the past two nights, despite traffic higher than Monday's and about equal to Sunday's.
If you're interested in the details, read on.
There's a backend scheduler that runs tasks (load battles, refresh clan membership, run queued tasks). The scheduler automatically creates threads to do this, which is good because it means these things can run concurrently (not actually, but that doesn't really matter for this).
Given the way this all interacts, quite a few of these could, in theory, be running at once. That alone, I suspect, isn't the issue. However, there's another factor: those queued tasks.
There are basically three ways a task might be added to the queue: a manually triggered request to reload clan members, Payout Calculations, and User Stats loading.
That final one, I think, was the crux of the issue. It's triggered whenever a user signs in, if their data hasn't been loaded recently. For players in a participating clan, it isn't triggered unless they joined the clan and then signed in between two automatic updates. For players who aren't in a participating clan, it's basically always going to be triggered, due to the likely gap between sign-in events.
Prior to Tuesday, there was no (sensible) upper limit on concurrent queue tasks, so if a number of people signed in at the same time and each triggered a player data refresh, then every 5 seconds a new thread would be created to run one of those updates. This often works fine because the refresh is quite quick, but it still takes some time due to needing to send two requests to the WG API. Add in some server load and they can start backing up.
Worse still, they don't go away when the (web) server is restarted, so they start clobbering the (entire) server as soon as the web server comes back up, and the more they back up, the more they clobber. This eventually leads to a massive amount of thrashing, with very high I/O wait times and I/O usage, grinding everything to a halt.
After adding a sensible limit to keep this in check, along with some other limits as detailed over in this blog post, the massive performance issues have completely stopped from what I can tell. Though, I suspect response times at peak load are a bit slower, as I reduced the number of concurrent requests on Sunday as an ineffective stopgap measure.
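The shape of the fix is a cap on how many queued tasks may run at once. Here's a minimal Python sketch of that idea; it is hypothetical, not the actual Clan Tools code, and `MAX_QUEUE_WORKERS` and `spawn_queue_task` are made-up names:

```python
import threading

# Hypothetical cap on concurrent queue tasks (illustrative value).
MAX_QUEUE_WORKERS = 4
_slots = threading.BoundedSemaphore(MAX_QUEUE_WORKERS)

def spawn_queue_task(task):
    """Run a queued task in its own thread, never exceeding the cap."""
    def worker():
        try:
            task()
        finally:
            _slots.release()  # free the slot even if the task fails
    # Instead of unconditionally spawning a new thread every 5 seconds,
    # block here until one of the MAX_QUEUE_WORKERS slots is free.
    _slots.acquire()
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```

With a cap like this, a burst of sign-ins queues up behind the limit rather than piling fresh threads onto an already struggling server.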
Barring any unexpected complications, I'll switch to the new server for serving web requests tonight, and will increase the concurrency limit along with it.
I've changed the way the backend scheduler works to prevent an excessive number of tasks from running at the same time. For the moment it's extremely limited, with battle loading having one channel of work (a mutex) and everything else sharing another channel of work. Queued operations, specifically Payout Calculations, may take a while to run because of this. However, I'm hopeful I'll be able to relax the restrictions soon, as I'm in the process of setting up a second server; for the time being, this should prevent the server from falling over as it has in the past two days.
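A rough sketch of that channel idea, assuming nothing about the real implementation (the channel names here are made up): each channel is a single worker thread draining its own queue, so tasks on the same channel run strictly one at a time, while the two channels run independently of each other.

```python
import queue
import threading

def make_channel():
    """One worker thread draining one queue: at most one task at a time."""
    q = queue.Queue()
    def drain():
        while True:
            task = q.get()
            try:
                task()
            finally:
                q.task_done()
    threading.Thread(target=drain, daemon=True).start()
    return q

battle_channel = make_channel()  # battle loading only
shared_channel = make_channel()  # clan refresh, payouts, user stats, ...
```

Because Payout Calculations share a channel with everything else, they wait their turn behind other work, which is why they can take a while under this scheme.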
Presently, I'm working on finishing the setup process for the new server. After it's set up and handling requests, I'll reassess the performance situation and go from there.
Due to reports of Imgur being compromised (Imgur hosted the images on the site, for the features/help sections), I've temporarily disabled images. I plan to host them on the Clan Tools server to avoid future issues like this.
First, I want to apologize for the issues the site was having last night. I know how frustrating it is when a service you rely on is slow or completely unusable, and I'm sorry that's something you had to deal with.
I am looking for the cause of the problem and, more importantly, a solution. However, there doesn't seem to be a clear answer to the former, which makes the latter a shot in the dark.
The issues don't make sense to me, given my understanding of performance. There's clearly a bottleneck somewhere, but I can't figure out where.
Last night, the peak 1-minute load was 6.28, and that was at one specific point; beyond that, load stayed between 4.0 and 5.5. The server has 6 cores, and to my understanding a load of 1.0 represents 100% CPU usage for a single core, so 6.0 is 100% usage across all 6 cores.
Thus, outside of one minute, CPU load was below 100% across all cores.
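The arithmetic, as a tiny sketch (with one caveat worth noting: on Linux the load average also counts threads stuck in uninterruptible I/O wait, so high load doesn't necessarily mean high CPU usage):

```python
import os

def load_per_core(load1, cores=None):
    """Approximate overall saturation: 1-minute load / core count."""
    cores = cores or os.cpu_count()
    return load1 / cores

print(round(load_per_core(6.28, 6), 3))  # → 1.047 (the one-minute peak)
print(round(load_per_core(5.5, 6), 3))   # → 0.917 (rest of the night was under 1.0)
```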
Memory usage never went above 2GB (out of 3), and thus was well within reason as well.
Inspecting the queue, which shouldn't be causing timeouts anymore regardless, didn't show any backed up requests.
The only oddity I found was that sendmail was apparently stuck in a recursion loop, but I wouldn't think the amount of I/O it was causing would have been significant enough to introduce slowdowns. However, I didn't have any log information regarding I/O, which I have since addressed.
Of course, none of the above is a solution, so at present I'm working to make a few key changes to reduce load on the server. I hope to have these changes implemented by tonight. Though, again, it really is just an educated guess at this point if this will have any impact.
There are several new code type options; two of the most interesting are designed to address the same request from clans in two different ways.
This, as the name suggests, makes all valid codes of that Code Type visible to all clan members via a specific page on Clan Tools. You can get to this page from the Attendance page (Clan Home > Attendance > Valid Public Codes) or Code page (Clan Home > Codes > Valid Public Codes).
Available on the Code Settings tab.
This setting, on the other hand, allows codes to be auto-created beyond the current day, up to 7 days out. This makes it easy to copy a week's worth of codes for listing somewhere else.
Available on the Code Settings tab.
The above three settings are available on the Advanced Settings tab.
As you may already know, WoTManager is sadly shutting down. Further, WoTManager has graciously decided to direct their users to Clan Tools as a replacement.
This has already led to an influx of new and interested clans, which has the potential to cause performance issues.
As such, I just want to be clear that I am monitoring performance, and if there is a degradation in performance, I will address it by increasing the available resources.
Also, if you or your clan are noticing such performance drops, please contact me (https://clantools.us/contact) as the monitoring tools I have only show so much. Thanks.
Because WG doesn't remember a user's IR data when they leave their clan, and Clan Tools does, a negative value would be reported whenever a player left a clan then rejoined it. Clan Tools now checks the player's join date to detect when this has occurred and records the correct delta.
Existing negative IR has also been corrected. If you notice any issues, please contact me: https://clantools.us/contact
Another issue was also fixed: Clan Tools would ignore any IR a new member gained during their first day with the clan.
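Both fixes come down to comparing the join date against the last recorded update. A rough, hypothetical sketch of that logic (the function name, signature, and dates are all illustrative, not the real code):

```python
from datetime import datetime

def ir_delta(current_ir, previous_ir, joined_at, last_update):
    """Compute IR gained since the last recorded update.

    When a player left and rejoined, WG's clan IR counter resets while
    Clan Tools still holds the old total, so a naive subtraction goes
    negative. A join date newer than our last update signals the reset.
    """
    if joined_at > last_update:
        # Player (re)joined since the last update: WG's counter started
        # over, so everything they have now counts as gained. This also
        # covers a brand-new member's first day.
        return current_ir
    return current_ir - previous_ir

# Rejoin example: WG reset to 150, old record was 900 → delta is 150, not -750.
print(ir_delta(150, 900,
               joined_at=datetime(2017, 11, 20),
               last_update=datetime(2017, 11, 15)))  # → 150
```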
To be clear, this has no real impact. The interface is a bit cleaner in a few places (Known Available, which was often zero, is gone, as is any display of In Garage). Tank Locking data will still work as usual.
As to why? It was unused. In large part because Wargaming requires an API token from the individual player to access in-garage data, every player in a clan would need to provide an API token to Clan Tools. Possible, but it hadn't happened yet.
I've also optimized the clan refresh method to require one fewer API request per clan and to reduce the amount of data transmitted.
Intermittent timeout errors were an issue that would spring up every once in a while, and they didn't make much sense to me when I had looked into them previously. However, I recently had a realization, and I believe I've fixed the issue, which appears to have been caused by requests getting stuck waiting due to high server load.
On the topic of server load, I've also reduced the maximum number of simultaneous requests that can be served. This may seem like a negative; however, I suspect that the site grinding to a halt a few Sundays prior was caused by very high server load pushing memory usage beyond the available RAM and into the swap file (which slowed everything down). My monitoring indicates that most of the instances were rarely used anyway, so I don't expect this to be noticeable the vast majority of the time either way.
Feedback on this is welcome though; do you notice a change during times of high load?