View Issue Details

ID: 0000058
Project: Zandronum
Category: [All Projects] Bug
View Status: public
Date Submitted: 2010-09-25 04:52
Last Update: 2018-09-30 19:47
Reporter: AlexMax
Assigned To: Torr Samaho
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed
Platform: Linux
OS: Ubuntu
OS Version: 10.04 x86
Product Version: 98c
Target Version: (none)
Fixed in Version: 98d
Summary: 0000058: Skulltag doesn't play nice with Linode VPS.
Description: I really wish that I had a better explanation for you than this, but this is what I have to go on.

When I run skulltag-server instances on a Linode VPS, the server will 'hiccup' every so often. By 'hiccup', I mean that all players either experience latency or a few seconds' worth of "waiting for server".

The reason I don't think this is a host problem is that the VPS's performance is not affected at all when Skulltag is pitching a fit. If I very quickly alt-tab over to an open shell and test its responsiveness, it responds fine, the CPU is well within normal limits, as is the RAM, and there is no obvious latency in input, all while the server is still "waiting for server".

This is at the New Jersey location, by the way, in case anyone on the team already has a Linode.
Attached Files: st_weirdlag.pcap (1,161,529 bytes) 2010-09-26 17:19

- Relationships

- Notes
(0000153)
Eruanna (reporter)
2010-09-25 08:50
edited on: 2010-09-25 08:51

I can't fully confirm this, as I don't have this problem often on Obsidian - however, I have occasionally received reports of such problems on there. Obsidian is a London Linode. Since it never seems to happen that often when I am playing, I really can't put my personal experience into this - but anyone who plays on Obsidian regularly might be able to.

Most of the time my issues with Obsidian are network related - the datacenter has been having issues over the past couple of weeks.

(0000155)
AlexMax (developer)
2010-09-25 14:45
edited on: 2010-09-25 16:08

Just out of curiosity, what distribution and kernel do you use? My Linode is located in New Jersey and was using Ubuntu (I'm testing Debian to see if it has the same issues). For what it's worth, a friend of mine hosts a single server at their Texas location and never has any issues.

(0000156)
TIHan (reporter)
2010-09-25 17:43

I have a Linode in Atlanta that I have been using for a month, and I haven't had any problems with "waiting for server". I would like to know what might be causing it so I can help you out. I'm running on Debian 5.0.
(0000176)
Torr Samaho (administrator)
2010-09-25 22:05

Are you sure that this only happens when servers are running on a Linode VPS?
(0000177)
AlexMax (developer)
2010-09-25 23:06

I actually moved away from Linode a year and a half ago because of these sorts of issues. I happened to move back, and what do you know, it persists.

However, it's not consistent. Today I have not had any issues. What sort of information would be useful to you when it starts exhibiting that behavior again? I was planning on doing a tcpdump when I see it happening, is there anything else I should be recording?
(0000187)
AlexMax (developer)
2010-09-26 17:20

I had complaints about this last night. The attached file is a tcpdump of all UDP activity on my server.

I have taken a tcpdump of an offending 'skip' and have pinpointed one of the times where there is a hiccup. By filtering out all packets except those going to and from my personal computer (ip.addr == 174.108.7.186), I pinpointed a particular incident where I remember having a conversation with players on my server: I say "like now?" and the words took forever to get there. Using Wireshark on the log with the filter given in parentheses above, you can clearly see that where the server usually sends a bunch of UDP packets per second to my IP, there are none from second 27.944 to 29.072, in the space of which my client attempts two packets. Clearing the filter at that point, you can see that the server sends out no packets at all, except for Skulltag master heartbeat queries, from 27.944 to 29.070.

Seeing as tcpdump was run locally on my server's connection, it appears at first glance that this is not a network issue, but rather an issue with the server getting 'stuck' for a second or so.
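As an aside, the same view of the attached st_weirdlag.pcap can be reproduced from the command line, assuming a tshark recent enough to accept -Y for display filters (older builds use -R instead); the IP is the client address quoted above:

$ tshark -r st_weirdlag.pcap -t r -Y "ip.addr == 174.108.7.186 && udp"

With relative timestamps (-t r) you can scroll straight to the 27.944-29.072 gap described above, and dropping the -Y filter shows only the master heartbeat queries going out during that window.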
(0000215)
AlexMax (developer)
2010-10-02 15:53

Any ideas here? I've got the Linode for another month, so I'd really like to get this resolved soonish.
(0000220)
Torr Samaho (administrator)
2010-10-02 20:53

Does the server make any DNS lookup during the time it's not responding? Or does it reparse any of your IP lists (banlist/whitelist/adminlist)?
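As an aside, a minimal way to check for stray DNS activity from outside the game code, assuming strace is installed on the VPS and the binary is named skulltag-server as in the launch steps later in this report, is to trace the server's network syscalls while it runs; a blocking lookup would show up as traffic to port 53 sitting in the trace for the whole stall:

# strace -f -tt -e trace=network -p $(pidof skulltag-server)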
(0000255)
AlexMax (developer)
2010-10-03 17:42

I'm in the process of getting a tcpdump that will show this, as the old tcpdump only shows UDP connections.

In the meantime, this is how I will recreate my testing environment, step by step, so you know exactly what I'm working with:

1. Make sure you have a Linode 512 hosted at Newark, New Jersey. It should have nothing else on it at all.
2. From the main screen, click "Deploy a Linux distribution". On the next screen, select Ubuntu 10.04 LTS as your distro, swap disk is 512 MB and your deployment disk drive is 15872 MB. Put in a root password and click okay.
3. Once your distro is set up, boot into it. SSH into it with either putty or command line SSH.
4. At the command prompt:
# aptitude install libsdl1.2-dev screen tcpdump
That will install everything having to do with SDL in one fell swoop.
5. At the command prompt:
# adduser yournickname
# su yournickname
We're going to create our own user for this thing.
6. At the command prompt:
$ mkdir ~/skulltag
$ cd ~/skulltag
$ wget 'http://skulltag.net/download/files/testing/98d/SkullDev98d-2954linux-x86.tar.bz2'
$ wget 'http://www.skulltag.com/download/files/release/st-v098c_linux-base.tar.bz2'
$ tar jxvf SkullDev98d-2954linux-x86.tar.bz2
$ tar jxvf st-v098c_linux-base.tar.bz2
This installs Skulltag. Nothing to see here. Make sure to grab doom2.wad somehow too; I can't give exact steps for that since it depends on where you get doom2.wad from.
7. Close your SSH session and log back in as your user. From there, at the command prompt:
$ screen
$ cd skulltag
$ ./skulltag-server -port 10666 +map d2ctf1
8. Start playing. If you don't see any "skipping", try again a few hours later. Once you do, create a new screen tab (Ctrl-a then c) and then:
$ su
Password: <put in your root password>
$ tcpdump -Xs0 > ~/traffic.pcap
Go back to your server, wait for at least one good "skip", and then Ctrl-C out of tcpdump (a variant that writes a Wireshark-readable capture file is sketched just below).
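A variant of that capture step, assuming the server is listening on UDP 10666 as in step 7 and the VPS's network interface is eth0, writes a file Wireshark can open directly instead of a redirected text dump:

# tcpdump -i eth0 -s0 -w ~/traffic.pcap udp port 10666

The -s0 keeps full packet contents, and the BPF filter limits the capture to the game server's traffic.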

I'll let you know what the results are...
(0000299)
AlexMax (developer)
2010-10-10 00:48

This doesn't seem to happen consistently; in fact, performance has thus far ranged from good to just a hiccup here and there, nothing at all like what I was running into a while back.

I'll let you know when I have another tcpdump ready for you. In the meantime, I have a ton of FFA servers set up on the Linode, and hopefully I can get some people on there to test while I'm observing the hiccuping...
(0000430)
AlexMax (developer)
2010-10-17 19:52

I decided to cancel my Linode VPS subscription and get another VPS. So far, I haven't had the same kind of issues... though AOW2 did have one huge hiccup, but that might have been AOW2 itself, and it's nowhere near as choppy as it was on Linode.
(0000431)
AlexMax (developer)
2010-10-17 20:26

Oh wow, on this new VPS I'm getting a similar symptom. You know that "AOW2 huge hiccup" I mentioned? It just happened again on my Private CTF server. I'll get a tcpdump to you guys as soon as I can...
(0000433)
AlexMax (developer)
2010-10-17 22:22
edited on: 2010-10-17 22:26

Okay, I got a tcpdump. My lists were empty at the time of the hiccup. The log is too huge to post here, but again: at second 1022.33944 the server communicates with a specific IP that I'm following, let's call it #.#.#.#. #.#.#.# sends a UDP packet to the server at 1022.36501, 1022.38774, 1022.41871, and 1022.44244. And... silence from the server until second 1042.02150, over 20 seconds later! Expanding things to look at ALL packets, there seems to be a huge blackout of server communication for about 20 seconds.

So was the entire server frozen? I don't think so. One time when this was happening, a few hours before while I was still connected, I alt-tabbed from the server, moved my cursor around, and didn't hear the server catch up for at least a few more seconds.

So yeah, that's it. Nothing else was going on, no DNS, no nothing.
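A quick way to spot that kind of blackout without eyeballing individual packets, assuming the same tshark as above and a capture file named traffic.pcap (substitute whatever the dump was actually saved as), is to ask for per-second packet counts; the stalled interval shows up as a run of near-zero rows:

$ tshark -r traffic.pcap -q -z io,stat,1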

(0000436)
AlexMax (developer)
2010-10-18 16:18
edited on: 2010-10-18 16:18

Had an interesting conversation with InterServer, my new VPS host:

"Does Skulltag use UDP?

In /proc/user_beancounters I see some failures in dgramrcvbuf

dgramrcvbuf 0 262144 262144 262144 101894

Nothing else has hit a fail count.

dgramrcvbuf = "The total size of receive buffers of UDP and other datagram protocols." "

Their VPS management page also provides a neat little graph for dgramrcvbuf_f (the failure rate of dgramrcvbuf). Lo and behold, I saw a spike that corresponded exactly to the lag I was getting on my server. Reading up on dgramrcvbuf, I found that it's a per-container resource limit; this new VPS uses OpenVZ, but I suspect Xen has something similar:

http://wiki.openvz.org/UBC_secondary_parameters#dgramrcvbuf

At the end of the day, they increased a setting on my VPS called numothersock, which is described here:

http://wiki.openvz.org/UBC_primary_parameters#numothersock

Hopefully, this will solve the problems. However, is there anything abnormal that Skulltag might be doing to run up against this limit?
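To watch those counters directly on the container, assuming the standard OpenVZ /proc/user_beancounters layout quoted above (readable as root), the last column is the failure count, so a spike during a hiccup is easy to catch:

# grep -E 'dgramrcvbuf|numothersock' /proc/user_beancounters
# watch -n 5 "grep -E 'dgramrcvbuf|numothersock' /proc/user_beancounters"

The second line re-checks every five seconds, which is handy while players are reporting lag live.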

(0000453)
Torr Samaho (administrator)
2010-10-23 19:56

A Skulltag server should only allocate one socket when it's started and is supposed to keep using it from then on. Do you know what numothersock was set to before and what it is set to now?
(0000454)
AlexMax (developer)
2010-10-23 22:45
edited on: 2010-10-23 22:45

No idea what it was set to before, but this is what the numbers are set to now:

       uid  resource       held  maxheld  barrier   limit  failcnt
            dgramrcvbuf       0   351424   352144  352144    25644
            numothersock     22       44      460     460        0

This is what it looks like on a day with a 22/24 AOW2 server and an 8/15 Private CTF server. That failcnt is a tally going back to October 19th, since which time both numothersock and dgramrcvbuf itself were increased. Why dgramrcvbuf itself? Well, according to the wiki:

"The dgramrcvbuf limits usually don't need to be high. Only if the Container needs to send and receive very large datagrams, the barriers for both othersockbuf and dgramrcvbuf parameters should be raised."

After reading this, I concluded that instead of overwhelming the number of other sockets, Skulltag was sending very large UDP packets back and forth, and overflowing the buffer with packets that are too damn big. The first time I had them bump the number, the graph that showed failures was still going up under heavy traffic, but it was halved. Since then, they've bumped it twice, and tonight is the first time since then that there have been a significant number of players on my AOW2 and Priv CTF servers. I'll let you know how many failures I get tonight.

(0000455)
AlexMax (developer)
2010-10-24 00:13
edited on: 2010-10-24 00:16

Just so you know, it's still overflowing the buffer every so often with those numbers. The stoppages aren't nearly as bad, but people are still complaining. :(

(0000460)
Torr Samaho (administrator)
2010-10-24 08:43

If it's a problem with the UDP packet size, you can try to lower sv_maxpacketsize. The default is 1400, I'd try lowering it to 1024.
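Assuming the usual +cvar convention of ZDoom-derived ports, that setting could be applied on the launch line used earlier (or placed as sv_maxpacketsize 1024 in the server's config file); the exact value here is just the suggested starting point of 1024:

$ ./skulltag-server -port 10666 +map d2ctf1 +sv_maxpacketsize 1024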
(0000485)
AlexMax (developer)
2010-10-29 21:19
edited on: 2010-10-29 21:20

Aha! That seems to have fixed things. I've been running without any dgramrcvbuf failures for a couple of days now, and my servers have been under an AOW2-sized load a few times since then. This might be a useful thing to add to the wiki, by the way.

Thanks!

(0000499)
Torr Samaho (administrator)
2010-11-01 11:54

I'm glad to hear that! Since the current default value of 1400 seems to cause problems, I'm lowering the default value to 1024 again (like it was in 98b and older versions).


- Issue History
Date Modified Username Field Change
2010-09-25 04:52 AlexMax New Issue
2010-09-25 08:50 Eruanna Note Added: 0000153
2010-09-25 08:51 Eruanna Note Edited: 0000153
2010-09-25 14:45 AlexMax Note Added: 0000155
2010-09-25 14:46 AlexMax Note Edited: 0000155
2010-09-25 16:08 AlexMax Note Edited: 0000155
2010-09-25 17:43 TIHan Note Added: 0000156
2010-09-25 22:05 Torr Samaho Note Added: 0000176
2010-09-25 22:05 Torr Samaho Status new => feedback
2010-09-25 23:06 AlexMax Note Added: 0000177
2010-09-25 23:06 AlexMax Status feedback => new
2010-09-26 17:19 Anonymous File Added: st_weirdlag.pcap
2010-09-26 17:20 Anonymous Note Added: 0000186
2010-09-26 17:20 Anonymous Note Deleted: 0000186
2010-09-26 17:20 AlexMax Note Added: 0000187
2010-10-02 15:53 AlexMax Note Added: 0000215
2010-10-02 20:53 Torr Samaho Note Added: 0000220
2010-10-02 20:53 Torr Samaho Status new => feedback
2010-10-03 17:42 AlexMax Note Added: 0000255
2010-10-03 17:42 AlexMax Status feedback => new
2010-10-09 15:13 Torr Samaho Status new => feedback
2010-10-10 00:45 Anonymous Note Added: 0000298
2010-10-10 00:45 Anonymous Note Deleted: 0000298
2010-10-10 00:48 AlexMax Note Added: 0000299
2010-10-10 00:48 AlexMax Status feedback => new
2010-10-17 19:52 AlexMax Note Added: 0000430
2010-10-17 20:26 AlexMax Note Added: 0000431
2010-10-17 22:22 AlexMax Note Added: 0000433
2010-10-17 22:26 AlexMax Note Edited: 0000433
2010-10-18 16:18 AlexMax Note Added: 0000436
2010-10-18 16:18 AlexMax Note Edited: 0000436
2010-10-23 19:56 Torr Samaho Note Added: 0000453
2010-10-23 19:57 Torr Samaho Status new => feedback
2010-10-23 22:45 AlexMax Note Added: 0000454
2010-10-23 22:45 AlexMax Status feedback => new
2010-10-23 22:45 AlexMax Note Edited: 0000454
2010-10-24 00:13 AlexMax Note Added: 0000455
2010-10-24 00:16 AlexMax Note Edited: 0000455
2010-10-24 08:43 Torr Samaho Note Added: 0000460
2010-10-24 08:43 Torr Samaho Status new => feedback
2010-10-29 21:19 AlexMax Note Added: 0000485
2010-10-29 21:19 AlexMax Status feedback => new
2010-10-29 21:20 AlexMax Note Edited: 0000485
2010-11-01 11:54 Torr Samaho Note Added: 0000499
2010-11-01 11:54 Torr Samaho Status new => resolved
2010-11-01 11:54 Torr Samaho Fixed in Version => 98d
2010-11-01 11:54 Torr Samaho Resolution open => fixed
2010-11-01 11:54 Torr Samaho Assigned To => Torr Samaho
2012-06-09 13:22 Torr Samaho Category General => Bug
2018-09-30 19:47 Blzut3 Status resolved => closed





