MantisBT - Zandronum
View Issue Details
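ID: 0001178
Project: Zandronum
Category: [All Projects] Bug
View Status: public
Date Submitted: 2012-11-11 17:08
Last Update: 2024-03-10 20:40
Reporter: Watermelon
Priority: high
Severity: minor
Reproducibility: random
Status: closed
Resolution: unable to reproduce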
0001178: Memory leak (or something) causes lag spikes on the server
Is there any way to diagnose what could be causing the server to have lag spikes and/or connection issues? While this is a very generic request, this is what I've noticed:

- No matter what host I go to (either VPS, home connection or even a dedicated server like GV), there seem to always be intermittent spikes that occur where everyone in the server sees "Connection interrupted" for a second or so. Sometimes if it's really bad it can last up to 5+ seconds.

- It seems to only happen on one server, which is confusing because if it was the actual host itself you'd think that all the servers would be affected at the same time. So what could cause only one of (let's say) five servers to be affected while the others aren't, when they are all on the same VPS/dedicated server?

- This has happened since 98d as well but we all (the hosts of the servers) thought it was just VPS's possibly handling data incorrectly

- I *only* run 5 servers on the Linux system, which consume around 300 or so megs of RAM. I have 3700 megs left over and literally nothing else running

- On the BEST-EVER servers, Jenova literally turned off every other running thing in the background that could be turned off and it still occurred

- I've had this happen on FOUR VPS's, all of which were different hosts. This is not including Grandvoid which is on a dedicated server

- Happens on any operating system, even Linux



Is there any tool I can run to determine this problem? I have no idea what it is though it's starting to become annoying in game and plague most servers.
Player count does not affect the lag. We've had the worst lag spikes at 6 players, and 20+ minutes of nothing when there were 27+ people in pub CTF.
No tags attached.
Issue History
Date | User | Change
2012-11-11 17:08 | Watermelon | New Issue
2012-11-11 17:10 | Watermelon | Note Added: 0005363
2012-11-11 17:22 | Watermelon | Note Edited: 0005363 (bug_revision_view_page.php?bugnote_id=5363#r2945)
2012-11-11 22:06 | Torr Samaho | Note Added: 0005375
2012-11-11 22:08 | Torr Samaho | Note Edited: 0005375 (bug_revision_view_page.php?bugnote_id=5375#r2956)
2012-11-11 22:08 | Torr Samaho | Note Revision Dropped: 5375: 0002955
2012-11-12 01:03 | Watermelon | Note Added: 0005376
2012-11-12 01:04 | Watermelon | Note Edited: 0005376 (bug_revision_view_page.php?bugnote_id=5376#r2958)
2012-11-12 02:17 | ZzZombo | Note Added: 0005378
2012-11-12 02:18 | ZzZombo | Note Edited: 0005378 (bug_revision_view_page.php?bugnote_id=5378#r2960)
2012-11-12 06:28 | Torr Samaho | Note Added: 0005381
2012-11-15 05:58 | Torr Samaho | Note Added: 0005397
2012-11-15 05:58 | Torr Samaho | Status: new => feedback
2012-11-15 11:50 | Konar6 | Note Added: 0005398
2012-11-18 10:26 | Torr Samaho | Note Added: 0005409
2012-11-19 22:07 | Watermelon | Note Added: 0005422
2012-11-19 22:07 | Watermelon | Status: feedback => new
2012-12-22 05:24 | Watermelon | Note Added: 0005543
2012-12-23 11:38 | Torr Samaho | Note Added: 0005549
2012-12-23 18:12 | Watermelon | Note Added: 0005552
2012-12-23 18:18 | Watermelon | Note Edited: 0005552 (bug_revision_view_page.php?bugnote_id=5552#r3044)
2012-12-23 20:58 | Torr Samaho | Note Added: 0005553
2012-12-24 01:01 | Watermelon | Note Added: 0005555
2014-06-14 03:15 | Watermelon | Status: new => closed
2014-06-14 03:15 | Watermelon | Resolution: open => unable to reproduce
2024-03-10 20:40 | Ru5tK1ng | Relationship added: related to 0003873
2024-03-10 20:40 | Ru5tK1ng | Relationship deleted: related to 0003873

Notes
(0005363)
Watermelon   
2012-11-11 17:10   
(edited on: 2012-11-11 17:22)
I forgot to mention the frequency varies. Sometimes 3 happen in a row. It appears to be random, estimated one every 10 or so minutes (+/- 5 minutes).

It seems the server still processes the shots and stuff on its own end and still seems to accept incoming input, but falters on sending any data... if that helps.

EDIT: Zandronum seems to have a higher threshold before it displays connection interrupted, whereas on ST it displayed it much more because every little transmission error showed 'connection interrupted'. Here however, Zandronum has a buffer zone so some of the micro-lag spikes you don't actually notice unless you fire a stream of plasma and notice some not coming out.

(0005375)
Torr Samaho   
2012-11-11 22:06   
(edited on: 2012-11-11 22:08)
Just to be sure that it's not a hostname lookup issue: Try if setting the CVAR masterhostname manually to the current IP of master.zandronum.com makes any difference.

Quote from Watermelon
It seems the server still processes the shots and stuff on its own end and still seems to accept incoming input, but falters on sending any data... if that helps.
How do you know? Are you looking at the server console output?

Quote from Watermelon
Zandronum seems to have a higher threshold before it displays connection interrupted
FYI, I'm pretty sure that I didn't touch the threshold, but I fixed some bugs that affect this.

Personal comment: Memory leak is likely the second most commonly misused term (right after crash). I see no indication of a memory leak in your description of the problem. If the overall memory usage of the server is not constantly rising, it's not a memory leak.
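
The criterion above (memory usage that keeps rising) can be checked against periodic memory readings. A minimal sketch, assuming the readings are taken by hand or by any monitoring tool and given in MB; the function name and the min_growth_mb threshold are hypothetical and not part of Zandronum:

```python
def looks_like_leak(samples_mb, min_growth_mb=1.0):
    """Return True if memory rises between every pair of consecutive samples.

    samples_mb: memory readings in MB, oldest first (assumed input).
    min_growth_mb: hypothetical threshold; smaller increases count as noise.
    """
    if len(samples_mb) < 3:
        return False  # too few readings to call it a trend
    # Difference between each reading and the one before it.
    steps = [b - a for a, b in zip(samples_mb, samples_mb[1:])]
    return all(step >= min_growth_mb for step in steps)
```

Readings that hover around the same value, such as the roughly 300 MB reported for the five servers, would return False here. This strict check is only an illustration; a real tool would probably tolerate occasional dips.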

(0005376)
Watermelon   
2012-11-12 01:03   
(edited on: 2012-11-12 01:04)
Quote
Just to be sure that it's not a hostname lookup issue: Try if setting the CVAR masterhostname manually to the current IP of master.zandronum.com makes any difference.

I'm going to try that tomorrow and get back to you on it ASAP

Quote
Quote
It seems the server still processes the shots and stuff on its own end and still seems to accept incoming input, but falters on sending any data... if that helps.

How do you know? Are you looking at the server console output?

When the lag spikes happen, if you are holding "+forward" while it happens your character will be much farther ahead of where you are, therefore I assume it has to still be receiving data to process the +forward commands, but for whatever reason the screens on the client freeze. Therefore the only thing I can think of with my limited knowledge is that the update data for the client is not coming through and results in a connection error.

Quote
Personal comment: Memory leak is likely the second most commonly misused term (right after crash). I see no indication of a memory leak in your description of the problem. If the overall memory usage of the server is not constantly rising, it's not a memory leak.

I think I did misuse this term here. I don't know what to call it, some kind of overload somewhere?

I'll also be checking into this tomorrow:
Quote
<Konar6> certain lagspikes appear when the server is advertised and there are nameserver issues
<Konar6> don't ask me why

to either confirm or hopefully rule this out.

I'll be setting up Odamex and ZDaemon servers tomorrow to establish if it is only Zandronum it is happening with. If so, what could possibly be causing it?



EDIT: I further had this problem confirmed by a few more server cluster hosts. Whatever this may be, it seems much more widespread than I thought. Could this just be how the internet is nowadays?

(0005378)
ZzZombo   
2012-11-12 02:17   
(edited on: 2012-11-12 02:18)
I don't know whether it's related, but I once timed out from a server on my localhost while playing cooperative on the stock Doom I maps without any PWADs. The server didn't crash or anything, so after reconnecting I could keep playing without any trouble. The client gave me this error: "CLIENT_CheckForMissingPackets: missing more than 1024 packets. Unable to recover".

(0005381)
Torr Samaho   
2012-11-12 06:28   
Quote from Watermelon
When the lag spikes happen, if you are holding "+forward" while it happens your character will be much farther ahead of where you are, therefore I assume it has to still be receiving data to process the +forward commands, but for whatever reason the screens on the client freeze
I'd say more likely the server completely freezes, your system still receives and buffers the network packets from the clients and as soon as the server unfreezes it parses all the client movement commands buffered by the system at once, making it look as if you jump ahead.
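
The freeze-then-catch-up behaviour described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions (one +forward command per tic, one unit of movement per command); the function and variable names are hypothetical and it does not use any Zandronum code:

```python
from collections import deque

def simulate(freeze_start, freeze_len, total_tics):
    """Return the player's position after each server tic.

    The client sends one +forward command per tic. While the server is
    frozen, the OS keeps queuing those commands; the server processes the
    whole backlog on the first tic after it wakes up.
    """
    os_buffer = deque()
    position = 0
    positions = []
    for tic in range(total_tics):
        os_buffer.append("+forward")      # client input always arrives
        frozen = freeze_start <= tic < freeze_start + freeze_len
        if not frozen:
            while os_buffer:              # drain everything queued so far
                os_buffer.popleft()
                position += 1             # one unit of movement per command
        positions.append(position)
    return positions

print(simulate(freeze_start=3, freeze_len=4, total_tics=10))
# prints [1, 2, 3, 3, 3, 3, 3, 8, 9, 10]
```

The position stays at 3 during the four frozen tics and then jumps to 8 in a single tic, which is the "much farther ahead" effect a player would see.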

Quote
<Konar6> certain lagspikes appear when the server is advertised and there are nameserver issues
<Konar6> don't ask me why

This is the hostname lookup issue I was referring to. The reason is very simple: The server only uses a single thread, calls gethostbyname and has to wait for it to return something.
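
To see how long such a blocking lookup takes on a given box, the call can be timed from outside the server. A minimal sketch using Python's socket.gethostbyname as a stand-in for the C call the server makes; the helper names, the number of attempts and the threshold are assumptions for illustration:

```python
import socket
import time

def timed_lookup(hostname, resolver=socket.gethostbyname):
    """Resolve hostname and return (address, seconds_blocked)."""
    start = time.monotonic()
    address = resolver(hostname)   # blocks the calling thread until it returns
    return address, time.monotonic() - start

def slow_lookups(hostname, attempts, threshold_s, resolver=socket.gethostbyname):
    """Return the durations of lookups that blocked longer than threshold_s."""
    durations = [timed_lookup(hostname, resolver)[1] for _ in range(attempts)]
    return [d for d in durations if d > threshold_s]
```

Running something like slow_lookups("master.zandronum.com", 20, 0.5) on the server machine would list any lookups that took longer than half a second. A single-threaded server making the same call would be stalled for about that long each time.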
(0005397)
Torr Samaho   
2012-11-15 05:58   
Did anybody have a chance yet to check whether it's a hostname lookup issue?
(0005398)
Konar6   
2012-11-15 11:50   
I believe this is bogus. Internet connections aren't stable and it's normal for clients and even servers to experience packet loss. I advised Watermelon to try running a different server on the box (such as ZDaemon) to observe its behavior.
Ever since I noticed the lagspikes from the DNS issues pointed out above (and tracked them to the server resolving the master's hostname), I have run a local DNS server, which I would recommend to all dedicated server hosts.

I've had a different issue happen, and heard about it 2-3 times, though it's probably unrelated, as the issue in this ticket is said to affect all clients. In that case a single client can't send any data to the server, but still receives the server's data fine. So that client can't move or talk, but can see others moving and talking normally. It lasts for a few seconds.
(0005409)
Torr Samaho   
2012-11-18 10:26   
Quote from Watermelon
I'm going to try that tomorrow and get back to you on it ASAP
This was almost a week ago. Did you have a chance to test this yet?

Quote from Konar6
I believe this is bogus.
I don't question that these lag spikes exist, but I'm not convinced yet that it's a Zandronum bug.
(0005422)
Watermelon   
2012-11-19 22:07   
I will test it as soon as I can, sadly I came down with the flu days ago and I'm barely over the worst part yet. When I get out of the woods I'll update my results
(0005543)
Watermelon   
2012-12-22 05:24   
So I've been testing this out just to see if it's actually the VPS or not.

So far I've got the following statistics:
- ZD does not seem to suffer the same problems
- Unable to get Odamex working so no data here
- The lag seems to be really... weird on Zandronum. At times it happens on *all* the servers briefly which led me to believe that it is just the VPS sucking... but other times it only happens on the server itself which is really unusual. We had multiple people on the same server cluster in different port games and most of the time only one of them would get hammered with this random lag spike and the rest would be fine.
Therefore I'm unsure if the one off lag spike that occurred on all of them was actually just the internet being the internet or if it was an indication that the VPS is not good.

Likewise I've tried four VPS's and all of them have this same lag issue. It's hard to believe that four individual service providers would have this problem. This is not including the other servers like NJ Funcrusher which have this problem, and even Grandvoid has this happening when it's a dedicated server.


Is there some kind of diagnostic tool I can run to check for lag spikes? I'd really like to prove without a shadow of a doubt that it is the VPS/box it's hosted on before I continue on with this ticket.
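
One possible approach to the question above, offered as a sketch rather than an existing tool: run a small heartbeat process on the same box as the game servers and look for gaps in its wake-up times. If the heartbeat stalls at the same moments the players see lag, the whole machine was paused; if it keeps ticking, the problem is more likely inside the individual server process or on the network. All names and the tolerance factor are hypothetical:

```python
import time

def record_heartbeats(count, interval_s):
    """Sleep-loop that records when each wake-up actually happened."""
    stamps = []
    for _ in range(count):
        stamps.append(time.monotonic())
        time.sleep(interval_s)
    return stamps

def find_stalls(timestamps, expected_interval_s, tolerance=3.0):
    """Return (start_time, gap_seconds) for every gap that is too long.

    A gap counts as a stall when it exceeds expected_interval_s * tolerance.
    """
    limit = expected_interval_s * tolerance
    stalls = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        if gap > limit:
            stalls.append((earlier, gap))
    return stalls
```

For example, find_stalls(record_heartbeats(6000, 0.1), 0.1) would watch the box for roughly ten minutes and report any pause longer than 0.3 seconds.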
(0005549)
Torr Samaho   
2012-12-23 11:38   
Did you test whether it's a hostname lookup issue?
(0005552)
Watermelon   
2012-12-23 18:12   
(edited on: 2012-12-23 18:18)
What are the steps to rule everything out? I'd like to do as many things as possible in one go to rule them all out.


1) DNS server lookup issue -> how do I fix this?

2) Run a local DNS server as Konar said -> how do I do this? never done this before

3) Anything else -> I will do ASAP

The masterhostname points to master.zandronum.com

(0005553)
Torr Samaho   
2012-12-23 20:58   
This should be a good start:
Quote from Torr Samaho
Try if setting the CVAR masterhostname manually to the current IP of master.zandronum.com makes any difference.
(0005555)
Watermelon   
2012-12-24 01:01   
I've had that set for a while and it made no difference