Hydaelyn Role-Players
'Server guy' gives his insight on ARR server problems - Printable Version




'Server guy' gives his insight on ARR server problems - Dameon - 08-27-2013

Quote: Poached from Reddit


I wrote this comment in reply to a post asking 'They refuse to add new NA servers, why? Can anyone explain this?'. I felt as though I could explain this, at least in part, and so I did. And here it is.

I only had about ten minutes to spare to write this up before I went back to scaling our own systems, so I'd love to hear some other people who work in the industry chime in with their experiences, examples, and opinions. Maybe we can get rid of this 'MOAR SURVURS DUH' attitude that some people seem to have.

Edit: Oh, and this post is intentionally dumbed-down, not because I doubt my fellow redditors but because I had to rush through writing it. Obviously the issues are significantly more complex than the examples I lay out here.

As someone who works in infrastructure/servers/networking/IT/etc. for a company that does large-scale multiplayer games, I might actually be able to.

First, not everything scales linearly. Within a given 'world', in the servers that handle people online, you may have one 'server' that can handle 5000 people online at once, but adding a second server may not get you to 10000; two servers together may only be able to handle 9000 people; three servers might only get 12000, four might allow for 14000, and five might support 16000.

This can be due to the overhead of managing multiple characters and multiple interactions. If you have two people in one zone, and you're updating their positions every second, you have to send 2 updates every second per person (updating each person with their official location and everyone else's location), so 4 updates total. If you have 4 people (twice the number of people) you have to send 16 updates per second. If you have 8 people, you're sending 64 updates per second (each of the 8 people getting 8 updates). This is a really simplified example, but it shows how 2+2 can be a lot more than 4.
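
To make that arithmetic concrete, here is a minimal Python sketch of the toy model in the paragraph above, where every player in a zone receives one position update per player (themselves included) each second. The function name `updates_per_second` is made up for illustration, and a real game would likely cull or batch updates, so this is only the simplified case.

```python
def updates_per_second(players: int) -> int:
    """Updates sent each second in the toy model: every player
    receives one update for every player in the zone."""
    # `players` recipients, each getting `players` updates.
    return players * players


for n in (2, 4, 8):
    print(f"{n} players -> {updates_per_second(n)} updates per second")
```

Doubling the player count quadruples the updates, which is the "2+2 can be a lot more than 4" effect.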

So if you have a case like that, you have the option of spending those 'five servers' on one world to handle 16,000 people, or five worlds to handle 25,000 people. This is why the solution to capacity problems in MMOs is usually to open new worlds, and not just to grow existing ones.
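
The trade-off can be worked through with the example figures from this post. The sketch below is only an illustration: the capacity table copies the made-up numbers above, and the names `POOLED_CAPACITY`, `marginal_gain`, and `separate_worlds_capacity` are hypothetical.

```python
# Example figures from the post: total players supported when
# 1 to 5 servers are pooled into a single world.
POOLED_CAPACITY = {1: 5000, 2: 9000, 3: 12000, 4: 14000, 5: 16000}


def marginal_gain(servers: int) -> int:
    """Extra players gained by adding the Nth server to one world."""
    previous = POOLED_CAPACITY.get(servers - 1, 0)
    return POOLED_CAPACITY[servers] - previous


def separate_worlds_capacity(servers: int) -> int:
    """Total capacity if each server runs its own world instead."""
    return servers * POOLED_CAPACITY[1]


for n in range(1, 6):
    print(n, "servers: +", marginal_gain(n), "players in one world;",
          separate_worlds_capacity(n), "players as separate worlds")
```

With these numbers the fifth pooled server adds only 2,000 players, while five separate worlds would hold 25,000 in total.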

Obviously these systems are more complex. A single 'world' is made up of dozens of components (character servers, combat servers, chat servers, instance servers, dispatch servers, login servers, etc.), and each one of these systems could have the same issues, and completely different load profiles and scaling issues. Because 'instance servers' appear to be shared across all worlds, they have to handle the capacity of every single dungeon, instanced fight, etc. for every single NA/EU world concurrently, which means that they scale completely differently than the rest of the worlds.

Another problem those servers had was that because no one could create an instance, a lot of people got backed up at the same points: before Sastasha, before Ifrit, before your level 20 class quest, etc. So now instead of having players spreading out across the level curve, you have huge clusters of people catching up to each other like something out of Amazing Race.

Then they bring the instance servers back up, and everyone rushes to do their instances. All the Ifrit fights, all the lv5 quests, all the Haukke Manor runs. Now suddenly instead of having instances spread out because levels are spread out, you have a huge proportion of players all trying to get into instances at once, and your load spikes, and now no one can get in. Now it's a completely different problem; instead of being unable to handle the common case of instance requirements, you can't handle the case of a large proportion of people online trying to run an instance, all happening at once.
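
A rough way to picture that spike is the sketch below. It assumes, purely for illustration, that players reach an instance gate at a steady rate, that nobody can enter while the instance servers are down, and that everyone who was waiting tries to enter in the first hour after the servers return. The function name and all numbers are invented.

```python
def instance_demand(arrivals_per_hour: int, outage_hours: int,
                    total_hours: int) -> list[int]:
    """Instance requests the servers see in each hour.

    Requests made during the outage are deferred and all land
    in the first hour after the servers come back.
    """
    demand = []
    backlog = 0
    for hour in range(total_hours):
        if hour < outage_hours:
            backlog += arrivals_per_hour  # players pile up at the gate
            demand.append(0)
        else:
            demand.append(arrivals_per_hour + backlog)
            backlog = 0  # the backlog is released all at once
    return demand


print(instance_demand(arrivals_per_hour=100, outage_hours=3, total_hours=5))
```

In this toy run the first hour after the outage sees four times the normal load, even though the total number of players has not changed.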

One of the problems with servers is that if your servers are overloaded, it's easy for your monitoring tools to start failing (because the system won't run them because there's too much else going on), and you can have problems logging in. In those cases, you can have servers which hit their capacity in unexpected ways, suddenly, before you have a chance to spot the problem and figure out what's happening. For example, a memory leak that only happens sometimes can take down a server rapidly, and make it extremely difficult to track down because once the server has died you can't log in to debug it.

It wouldn't surprise me if some of their downtime was trying to work around those issues while also adding a lot of debugging information so they could track down what exactly was happening on the server and find the source of the problem (instead of just trying to mitigate it).

So for that instance issue, it doesn't matter how many servers they add to run instances, if they're still going to have them die off too quickly because there's a software bug they need to fix, or because each server adds less and less capacity because of non-linear scaling.

That's just my two cents though. It could be a dozen other reasons.

Edit: holy f-balls, this thread blew up. I've been trying to reply to everyone's questions but I've fallen behind, and my Free Company is having a meetup right now (I'm at the restaurant as I type this), so I'll have to come back to it all tonight and try to catch up.

Keep the questions coming and I'll try to answer them as best I can tonight!



RE: 'Server guy' gives his insight on ARR server problems - Asyria - 08-27-2013

And that's why I'm saying auto-kicking afkers is more important than adding servers. *nodnods*


RE: 'Server guy' gives his insight on ARR server problems - Z'karu - 08-27-2013

(08-27-2013, 11:28 PM)Asyria Wrote: And that's why I'm saying auto-kicking afkers is more important than adding servers. *nodnods*

Pretty much, the entire thing's become a self-fulfilling prophecy

People get on early, hear that people can't get on because of crowding, refuse to log out because they don't want to become one of said people, becoming a part of said problem

Still, coding's a pain, so "flip a switch to kick afk" isn't as simple as that wording puts it, it'd probably be a week long change to the system


RE: 'Server guy' gives his insight on ARR server problems - Ravinous - 08-28-2013

My two cents as an IT specialist (I'm a CCNP).

This is why I was talking about sharding. Basically each server is a collection of different servers with one handling traffic flow. Each sub-server acts as its own server, sending general updates back and forth with the main server(s) to keep track of where all the players are, their messages and chat, dealing with background tasks, etc. This doesn't make it limitless but you can deal with a very large user base (EVE is shard based and has a single server that deals with tens of thousands to hundreds of thousands).

In the perfect setup, players only talk to the login server and the shard they are currently using; the interconnect servers are only used for those background tasks and inter-shard chat, and a secondary transfer server is used for changing shards.

Some claim it breaks immersion, but having to go OOC to decide which shard to play on isn't the end of the world, can remove you from issues with trolls/harassment, and in general is more efficient than just stacking servers into a single instance. It's about good data management and redundancy but requires a high initial investment.

This is where I think SE decided not to utilize this system: cost. They could not predict that we would have this large of a player base at once. I'm sure they are using a classic form of zone servers like in FFXI (which is why we have to load into each sub-area), but if too many people are on at once in one zone, well... there you go.

So yes, more servers can help, a lot. AFK timers too. But more servers would only help in the right style of infrastructure, and with them operating on the Zone method I think it would require a lot of time to utilize.
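
As a very rough sketch of the layout described in this post (a login step, several shards, and a transfer step between them), something like the following Python could stand in. Every class and method name here is hypothetical, and it models only the bookkeeping, not networking or any real game's architecture.

```python
class Shard:
    """One sub-server holding the players currently on it."""

    def __init__(self, name: str):
        self.name = name
        self.players: set[str] = set()


class Cluster:
    """Hypothetical world made of several shards plus shared bookkeeping."""

    def __init__(self, shard_names: list[str]):
        self.shards = {name: Shard(name) for name in shard_names}
        # Shared record of where each player is (the 'interconnect' role).
        self.location: dict[str, str] = {}

    def login(self, player: str, shard_name: str) -> None:
        """Login step: place the player on the shard they chose."""
        self.shards[shard_name].players.add(player)
        self.location[player] = shard_name

    def transfer(self, player: str, target: str) -> None:
        """Transfer step: move a player from their shard to another."""
        source = self.location[player]
        self.shards[source].players.discard(player)
        self.shards[target].players.add(player)
        self.location[player] = target


cluster = Cluster(["shard-a", "shard-b"])
cluster.login("Ravinous", "shard-a")
cluster.transfer("Ravinous", "shard-b")
print(cluster.location["Ravinous"])
```

The point of the split is that a player only ever loads the one shard they are on, while the shared record stays small.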


RE: 'Server guy' gives his insight on ARR server problems - LiadansWhisper - 08-28-2013

(08-28-2013, 12:04 AM)Ravinous Wrote: My two cents as an IT specialist (I'm a CCNP).

This is why I was talking about sharding. Basically each server is a collection of different servers with one handling traffic flow. Each sub-server acts as its own server, sending general updates back and forth with the main server(s) to keep track of where all the players are, their messages and chat, dealing with background tasks, etc. This doesn't make it limitless but you can deal with a very large user base (EVE is shard based and has a single server that deals with tens of thousands to hundreds of thousands).

In the perfect setup, players only talk to the login server and the shard they are currently using; the interconnect servers are only used for those background tasks and inter-shard chat, and a secondary transfer server is used for changing shards.

Some claim it breaks immersion, but having to go OOC to decide which shard to play on isn't the end of the world, can remove you from issues with trolls/harassment, and in general is more efficient than just stacking servers into a single instance. It's about good data management and redundancy but requires a high initial investment.

This is where I think SE decided not to utilize this system: cost. They could not predict that we would have this large of a player base at once. I'm sure they are using a classic form of zone servers like in FFXI (which is why we have to load into each sub-area), but if too many people are on at once in one zone, well... there you go.

So yes, more servers can help, a lot. AFK timers too. But more servers would only help in the right style of infrastructure, and with them operating on the Zone method I think it would require a lot of time to utilize.

This is actually what World of Warcraft is about to implement after 5.4 drops. "Virtual Realms" are going to group together realms of the same type to stabilize populations (since there are some extremely overpopulated servers, a few middling servers, and a lot of really low population servers; I'm talking no more than 5-10 max levels in a faction online during peak hours). You'll be able to join guilds that are based on "other servers" but in your same Virtual Realm, add people to your friends list, send in-game mail between characters spread out over the actual servers that are connected (including sending Heirloom XP boost items to them), and the Auction House will be linked between all of the servers that make up the Virtual Realm. The Virtual Realm will basically function as a giant server.

Don't forget that their explanation for the instance issues, btw, was that the server that "ports" you from the zone server into the instance server kept dying. When you think about all the people taking airships, going from zone to zone, etc., at one time... I really hope it's more than one server handling that. lol


RE: 'Server guy' gives his insight on ARR server problems - Taeh Niumoenwyn - 08-28-2013

I wonder if the information about the latest emergency maintenance gives us a clue, you can read it all here - http://na.finalfantasyxiv.com/lodestone/news/detail/747232b9ca786660b3c3099413ef7831dcde966e

Quote:- Preparations for handling an influx in concurrent user counts as a result of split processing the duty finder

Has a severely overloaded duty finder server been the reason they have had to heavily restrict logins to stop it failing?


RE: 'Server guy' gives his insight on ARR server problems - Dameon - 08-28-2013

Some people on the other forums I go to are still getting 1017, so I suppose not =(


RE: 'Server guy' gives his insight on ARR server problems - lady2beetle - 08-28-2013

Here's my question - what about the queue?

One way to minimize 1017 issues would be to increase the maximum queue size. I would be happy to log in and get at the end of the queue, even if it meant I was #1000 or something. Because then I'd know I could go do something else, not spam "log in" buttons, and maybe go have dinner and then come back. It also means that it's not a race to click 'log in' at the exact right time. I'd know then that I'd get in when it is my turn.

However, I assume that putting people in a queue means that you have to have a server devoted to that, right? I'm not really sure how queues work but I assume there's a devoted "queue server" and that is also very small because they didn't expect to have this many people queued up at once?

(As an aside, I'm not saying that I want to log in to #1000 in the queue every day, but it would be an acceptable bandaid until they have a chance to upgrade their servers next week.)
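
For readers unsure how such a queue might work, here is a minimal first-come, first-served sketch in Python. It is a guess at the general idea only, not how the game's lobby actually works: the class `LoginQueue`, its parameters, and the returned strings are all invented for illustration.

```python
from collections import deque


class LoginQueue:
    """Toy login queue: a world with limited slots and a bounded waiting line."""

    def __init__(self, world_capacity: int, max_queue: int):
        self.world_capacity = world_capacity
        self.max_queue = max_queue
        self.online: set[str] = set()
        self.waiting: deque[str] = deque()

    def request_login(self, player: str) -> str:
        # Let the player straight in only if there is room and nobody is waiting.
        if len(self.online) < self.world_capacity and not self.waiting:
            self.online.add(player)
            return "online"
        # Otherwise give them a place in line, if the line has room.
        if len(self.waiting) < self.max_queue:
            self.waiting.append(player)
            return f"queued at #{len(self.waiting)}"
        # Line is full: the player gets an error instead of a position.
        return "rejected"

    def logout(self, player: str) -> None:
        self.online.discard(player)
        # Admit the longest-waiting player into the freed slot.
        if self.waiting and len(self.online) < self.world_capacity:
            self.online.add(self.waiting.popleft())


queue = LoginQueue(world_capacity=1, max_queue=1)
print(queue.request_login("first"))
print(queue.request_login("second"))
print(queue.request_login("third"))
```

Raising `max_queue` turns more rejections into queue positions, but every waiting player still occupies memory and a connection on whatever machine holds the queue.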


RE: 'Server guy' gives his insight on ARR server problems - Moondoggie - 08-28-2013

(08-28-2013, 09:37 AM)lady2beetle Wrote: Here's my question - what about the queue?

One way to minimize 1017 issues would be to increase the maximum queue size. I would be happy to log in and get at the end of the queue, even if it meant I was #1000 or something. Because then I'd know I could go do something else, not spam "log in" buttons, and maybe go have dinner and then come back. It also means that it's not a race to click 'log in' at the exact right time. I'd know then that I'd get in when it is my turn.

However, I assume that putting people in a queue means that you have to have a server devoted to that, right? I'm not really sure how queues work but I assume there's a devoted "queue server" and that is also very small because they didn't expect to have this many people queued up at once?

(As an aside, I'm not saying that I want to log in to #1000 in the queue every day, but it would be an acceptable bandaid until they have a chance to upgrade their servers next week.)

You more or less answered your own question there. Having a queue will just fill up the lobby server with people trying to get in, and it will crash and have errors, so you are really just moving the same problem from one server to the other. Unfortunately they don't have a great solution right now, but any solution that lets more players into one server will just cause all the same problems we had on day one when tons of people flooded the servers.


RE: 'Server guy' gives his insight on ARR server problems - Asyria - 08-28-2013

Sharding rocks, as long as it's done right.
I remember being mad at GW2 because it would send people to "overflow servers" regardless of people being in parties.
Same goes for TSW's shards.


RE: 'Server guy' gives his insight on ARR server problems - LiadansWhisper - 08-28-2013

(08-28-2013, 10:58 AM)Moondoggie Wrote:
(08-28-2013, 09:37 AM)lady2beetle Wrote: Here's my question - what about the queue?

One way to minimize 1017 issues would be to increase the maximum queue size. I would be happy to log in and get at the end of the queue, even if it meant I was #1000 or something. Because then I'd know I could go do something else, not spam "log in" buttons, and maybe go have dinner and then come back. It also means that it's not a race to click 'log in' at the exact right time. I'd know then that I'd get in when it is my turn.

However, I assume that putting people in a queue means that you have to have a server devoted to that, right? I'm not really sure how queues work but I assume there's a devoted "queue server" and that is also very small because they didn't expect to have this many people queued up at once?

(As an aside, I'm not saying that I want to log in to #1000 in the queue every day, but it would be an acceptable bandaid until they have a chance to upgrade their servers next week.)

You more or less answered your own question there. Having a queue will just fill up the lobby server with people trying to get in, and it will crash and have errors, so you are really just moving the same problem from one server to the other. Unfortunately they don't have a great solution right now, but any solution that lets more players into one server will just cause all the same problems we had on day one when tons of people flooded the servers.

Well, that's not true at all. During major data patches, and at the start of a new expansion, I've seen 3000+ player queues in WoW on high population servers. Their servers don't crash, and neither do their lobbies. It's all about the strength of their infrastructure.


RE: 'Server guy' gives his insight on ARR server problems - Ravinous - 08-28-2013

(08-28-2013, 10:58 AM)Asyria Wrote: Sharding rocks, as long as it's done right.
I remember being mad at GW2 because it would send people to "overflow servers" regardless of people being in parties.
Same goes for TSW's shards.

Thousand times this. GW2 is a great game and all, but not being able to control which "overflow" instance you end up at is sometimes a game breaker in large groups.


RE: 'Server guy' gives his insight on ARR server problems - Dameon - 08-28-2013

(08-28-2013, 10:58 AM)Asyria Wrote: Sharding rocks, as long as it's done right.
I remember being mad at GW2 because it would send people to "overflow servers" regardless of people being in parties.
Same goes for TSW's shards.

Pretty expensive to completely redesign an infrastructure after launch.


RE: 'Server guy' gives his insight on ARR server problems - Ravinous - 08-29-2013

(08-28-2013, 06:49 PM)Dameon Wrote:
(08-28-2013, 10:58 AM)Asyria Wrote: Sharding rocks, as long as it's done right.
I remember being mad at GW2 because it would send people to "overflow servers" regardless of people being in parties.
Same goes for TSW's shards.

Pretty expensive to completely redesign an infrastructure after launch.

It's just the best solution for a situation like this (namely the unexpectedly high population). But yeah, not the cheapest or easiest implementation, but a lot cheaper in the long run than just server stacking with its faster diminishing returns.