Saturday, September 26, 2015

Changelog for 2015-09-26 (Hotfix)

Fixes

  • Fixed updated library causing internal server errors in certain cases.

Changelog for 2015-09-26

Changes

  • Added section to WoTManager translation guide about Import functionality.
  • Removed ability for clans to manually reload members. Clan member changes are automatically loaded each hour, and whenever a clan member signs into the site, their clan data is automatically updated. There's no need to have this, and it can lead to race conditions (for join/leave events).
  • Added a minimum combination check to Code Types, to prevent a combination of random choices and random length that yields a very low number of possible codes (which can cause code creation to fail because there aren't enough combinations).
  • Added Note of when automatic code creation occurs to the Code Settings tab.
  • Added Delete link on Codes lists.
  • Added slight fade to Code Action links on Attendance and Valid Public Codes pages.
  • When creating a new Match Type, One Battle per Match now defaults to true as this setting produces more logical results, especially when paired with setting (or having it auto-determined via a replay) the Result for a Match.
  • Added better error page for timeouts.
  • Added better error page for Security Violations typically caused by having more than one tab open editing an item.
  • Added better error page for File Not Found (for replay downloading).
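
To illustrate the minimum combination check noted in the Code Types item above: the number of possible codes is the number of distinct characters raised to the code length. The sketch below is only an illustration, not the site's actual code; the function names and the threshold of 10,000 are assumptions made for the example.

```python
def combination_count(choices: str, length: int) -> int:
    """Number of distinct codes of `length` characters drawn from `choices`."""
    # Duplicate characters in `choices` add no new codes, so count distinct ones.
    return len(set(choices)) ** length


def has_enough_combinations(choices: str, length: int, minimum: int = 10_000) -> bool:
    """True if the choices/length pair allows at least `minimum` codes.

    `minimum` is a hypothetical threshold chosen for this example.
    """
    return combination_count(choices, length) >= minimum
```

For example, three characters at length 2 give only 9 possible codes and would be rejected, while 32 characters at length 6 give over a billion.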

Fixes

  • Added check to hopefully prevent issues with duplicate join events and a single (invalid) leave event being created, seemingly due to short-term mismatches in data returned by different WG API methods.
  • Fixed Twitter widget not loading due to CSP conflicts with changes to the way Twitter loads the widget.
  • Fixed view code checking the show (details) permission instead of the edit permission for the codes listing. Note this was purely an error in the client-side display code and had no impact on security.
  • Fixed attempting to use a lowercase Prefix that has already been taken causing an internal server error.
  • Moved Features/Help images to Clan Tools (instead of using Imgur).

Backend Changes

  • Removed unused libraries.
  • Updated various libraries.
  • Increased maximum concurrent request handlers.

Also, Friday morning I finished setting up the second server. However, there was a small but noticeable increase in latency (the actual delay appears to vary based on the size of the data returned). This is the expected result, since the database connection is now remote, which adds network latency plus the (temporal) cost of encrypting that connection.

As such, given that the current server--once I fixed the backend scheduler clobbering it--appears to be more than capable of supporting the current load, I see no reason to degrade performance just because.

Thursday, September 24, 2015

Performance Issues [Update 2]

Basically, I believe I've located the source of the performance issues and addressed it, given that nothing ground to a halt (or even came close, according to my monitoring) the past two nights despite traffic being higher than Monday's and about equal to Sunday's.

If you're interested in the details, read on.

There's a backend scheduler that runs tasks (load battles, refresh clan membership, run queued tasks). The scheduler automatically creates threads to do this, which is good because it means these things can run concurrently (not actually, but that doesn't really matter for this).

The way this all ends up interacting, at most the following could be running in theory:

  • 2 Battle Loaders
  • Hourly refresh for all clans.
  • Daily general maintenance tasks (for the ASIA server around primetime for NA).

None of which, I suspect, is the issue on its own. However, there's another factor: those queued tasks.

There are basically three possible ways a task might be added to the queue: Manually triggered request to reload clan members, Payout Calculations, and User Stats loading.

That final one I think was the crux of the issue. It's triggered whenever a user signs in if their data hasn't been loaded recently. For players in a participating clan, this isn't triggered unless they joined the clan then signed in between two automatic updates. For players who aren't in a participating clan, this is basically always going to be triggered due to the likely distance between sign-in events.
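
The trigger described above amounts to a staleness check at sign-in. Here is a minimal sketch of that idea; it is not the site's actual code, and the function name and the one-hour `max_age` (chosen to mirror the hourly clan refresh) are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_stats_refresh(
    last_loaded: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # assumed value, not from the site
) -> bool:
    """Return True if a user's stats should be queued for loading at sign-in."""
    if last_loaded is None:
        # Data has never been loaded (e.g. a first sign-in).
        return True
    return now - last_loaded > max_age
```

Under this sketch, a member of a participating clan refreshed a few minutes ago is skipped, while a player last seen days ago always queues a task, which matches the pattern described.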

Prior to Tuesday, there was no (sensible) upper limit for concurrent queue tasks, so if a number of people signed in at the same time and triggered player data refreshes for each of them, then every 5 seconds a new thread would be created to run one of those updates. This often will work fine because the refresh is quite quick, but it still does take some time due to needing to send two requests to the WG API. Add in some server load and they can start backing up.

Worse still, they don't go away when the (web) server is restarted, so they basically start clobbering the (entire) server as soon as the web server is started, and the more they back up, the more they clobber. This eventually leads to a massive amount of thrashing, with very high I/O wait times and I/O usage, and grinds everything to a halt.

After adding a sensible limit to keep this in check, along with some other limits as detailed over in this blog post, the massive performance issues have completely stopped from what I can tell. Though I suspect peak-load response times are a bit slower, as I reduced the number of concurrent requests on Sunday in an ineffective stopgap measure.
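
To make the idea of an upper limit on concurrent queue tasks concrete, here is a minimal sketch using a fixed-size thread pool. It is not the site's scheduler; `run_queue`, the limit of 2, and the peak-tracking are assumptions added purely to show that excess tasks wait instead of each getting a new thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_QUEUE_TASKS = 2  # hypothetical limit for the example


def run_queue(tasks, limit=MAX_CONCURRENT_QUEUE_TASKS):
    """Run every task, never more than `limit` at once.

    Returns the peak number of tasks observed running simultaneously.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def tracked(task):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            task()
        finally:
            with lock:
                running -= 1

    # The pool owns at most `limit` worker threads; extra tasks wait in its
    # internal queue rather than spawning additional threads.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # list() waits for completion and re-raises any task exception.
        list(pool.map(tracked, tasks))
    return peak
```

With a cap like this, a burst of sign-ins lengthens the wait for each refresh but cannot multiply the number of threads competing for the server.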

Barring any unexpected complications, I'll switch to the new server for serving web requests tonight, and along with it will increase the concurrent limit.

Tuesday, September 22, 2015

Performance Issues [Update 1]

Update 2

I've changed the way the backend scheduler works to prevent an excessive number of tasks from running at the same time. For the moment it's extremely limited with battle loading having one channel of work (mutex) and everything else sharing another channel of work. Queued operations, specifically Payout Calculation, may take a while to run due to this. However, I'm hopeful I'll be able to relax the restrictions soon as I'm in the process of setting up a second server, but for the time being this will hopefully prevent the server from falling over as has happened in the past two days.
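
The "channel of work" arrangement above can be pictured as one mutex per channel: tasks on the same channel run one at a time, while the two channels stay independent of each other. The sketch below is an assumption-laden illustration, not the actual scheduler; the `Channel` class and both channel names are made up for the example.

```python
import threading


class Channel:
    """A channel of work: at most one task on it runs at any moment."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def run(self, task):
        # Callers block here until the channel is free, so tasks on the
        # same channel are serialized.
        with self._lock:
            return task()


# Hypothetical channels mirroring the description above.
BATTLE_LOADING = Channel("battle_loading")
EVERYTHING_ELSE = Channel("everything_else")
```

In this picture a Payout Calculation submitted to `EVERYTHING_ELSE` waits behind whatever is already running there, which is why queued operations may take a while under these restrictions.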

Presently, I'm working on finishing the setup process for the new server. After that's set up and handling requests, I'll reassess the performance situation and go from there.

Images

Due to reports of Imgur being compromised, I've temporarily disabled the images on the site (I used Imgur to host them for the features/help sections). I think I will host them on the Clan Tools server to avoid future issues such as this.

Monday, September 21, 2015

Performance Issues

Update 2

Update 1

First, I want to apologize for the issues the site was having last night. I know how frustrating it is when a service you rely on is slow or completely unusable, and I'm sorry that's something you had to deal with.

I am looking for what is causing the problem, and more importantly solution(s). However, there doesn't seem to be a clear answer to the former question, which makes the latter a shot in the dark.



The issues, to me, don't make sense given my understanding of performance. There's clearly a bottleneck somewhere, but I can't figure out where.


Last night, the peak 1-minute load was 6.28, reached at one specific point; beyond that, load was between 4.0 and 5.5. The server has 6 cores, and to my understanding 1.0 load represents 100% CPU usage for a single core, thus 6.0 is 100% usage for 6 cores.

Thus, outside of one minute, CPU load was below 100% across all cores.
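
The arithmetic behind that conclusion is simply load divided by core count. A small sketch (the function name is an assumption for the example) using the figures quoted above:

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Load average normalized by core count; 1.0 means every core is busy."""
    return load_average / cores


CORES = 6
for load in (6.28, 5.5, 4.0):
    print(f"load {load:>4} -> {load_per_core(load, CORES):.2f} per core")
```

This gives roughly 1.05, 0.92 and 0.67 per core. One caveat worth keeping in mind: if the server runs Linux, the load average also counts tasks blocked in uninterruptible sleep (typically waiting on disk I/O), so a load near the core count does not by itself show that the CPU is the bottleneck.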


Memory usage never went above 2GB (out of 3), and thus was well within reason as well.


Inspecting the queue, which shouldn't be causing timeouts anymore regardless, didn't show any backed up requests.


The only oddity I found was that sendmail was apparently stuck in a recursion loop, but I wouldn't think the amount of I/O it was causing would have been significant enough to introduce slowdowns. However, I don't have any log information regarding I/O, which I have since addressed.



Of course, none of the above is a solution, so at present I'm working to make a few key changes to reduce load on the server. I hope to have these changes implemented by tonight. Though, again, it really is just an educated guess at this point if this will have any impact.

Wednesday, September 16, 2015

Changelog for 2015-09-16

Changes

  • Re-enabled WoTcs.com clan history views since they appear to be working again.
  • Added exception for replay uploads to (hopefully) work around an issue related to Wine.

Fixes

  • Fixed WoTLabs signature on Player Lookup using incorrect server name, resulting in the signatures displayed being for NA players of equal names.
  • On Player Lookup, fixed no Tier Header being visible in Clan Wars Tanks if a tier section only had close to tanks.