Apis Networks

  #1  
04-20-2011, 10:42 AM
Matt
Tech Liaison
 
Join Date: Jun 2005
Location: Atlanta, Georgia
Posts: 1,035
Help! Multiple: High Latency, Sporadic Performance

Yesterday, April 19th, beginning at 6:10 PM EDT (GMT-0400), several servers tripped passive MySQL checks indicating that the client count had exceeded reasonable limits (more than 50 concurrent clients). These alerts normally resolve automatically and indicate either a transient spike in load average or a table-level lock that leaks out and blocks all connections to all tables on all databases.
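For readers unfamiliar with this kind of check, here is a minimal sketch of a client-count threshold test. It is not the actual monitoring code. The function name, the status codes (borrowed from the common monitoring-plugin convention of 0 = OK, 2 = CRITICAL), and the idea of feeding it MySQL's `Threads_connected` value are all assumptions for illustration; only the limit of 50 comes from the post.

```python
# Hypothetical sketch: status codes follow the common monitoring-plugin
# convention (0 = OK, 2 = CRITICAL); they are an assumption, not the
# actual check used on these servers.
OK = 0
CRITICAL = 2


def check_client_count(connected_clients, limit=50):
    """Return (status, message) for a given number of connected clients.

    `connected_clients` would typically come from MySQL's
    Threads_connected status variable; fetching it is left out so the
    sketch stays self-contained.
    """
    if connected_clients > limit:
        return CRITICAL, f"CRITICAL - {connected_clients} clients (limit {limit})"
    return OK, f"OK - {connected_clients} clients (limit {limit})"
```

A count of exactly 50 passes; anything above it trips the alert.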

Borel generated the first alarm, and Echelon followed one minute later at 6:11 PM. Borel generated another alert at 6:15 PM, and Aleph became unresponsive at 6:25 PM. Assmule in turn generated an alarm, independent of the other servers, at 6:28 PM. After several minutes of failed attempts to revive Aleph, it was reset and came back online around 6:50 PM. All servers except Assmule stabilized by 7 PM; Assmule eventually stabilized by 8 PM.

---
Interestingly, multiple servers that are usually quiet issued alerts within a small window of time. After taking a look at Assmule, I noticed that the internal DNS resolver had accumulated an excessive amount of CPU time. All servers use either Assmule or Augend to translate hostnames to IP addresses (e.g. stinky.apisnetworks.com -> 64.22.68.1). Momentarily taking down HTTP and e-mail to allow DNS to restart quickly helped diminish server load, which eventually returned to baseline around 8 PM.
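A slow resolver of this kind can be spotted by timing a lookup from one of the affected servers. The following is a rough sketch only: the function names and the 0.5-second threshold are assumptions, and it uses whatever resolver the operating system is configured with rather than querying Assmule or Augend directly.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.gethostbyname, clock=time.monotonic):
    """Resolve `hostname` once and return (address_or_None, elapsed_seconds).

    `resolve` and `clock` are injectable so the timing logic can be
    exercised without touching the network.
    """
    start = clock()
    try:
        address = resolve(hostname)
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        address = None
    return address, clock() - start


def is_resolver_slow(elapsed_seconds, threshold_seconds=0.5):
    """Flag a lookup as slow; the 0.5 s default is an arbitrary assumption."""
    return elapsed_seconds > threshold_seconds
```

Calling `time_lookup("stinky.apisnetworks.com")` on a healthy server should return quickly. Note that a stalled resolver makes the call itself block until the system's resolver timeout expires, so run it from a monitoring job that has its own timeout.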

There are two possible explanations:

(1) DNS lookups on ns1 were blocked for prolonged periods, leading to a build-up of HTTP processes on the servers. Gradually the number of spawned processes outstripped available RAM, leading to paging, which degraded disk performance and finally resulted in a build-up of MySQL clients waiting for a DB task to complete. This is corroborated by a remote DNS status poll (via rndc) stalling indefinitely. Notably, DNS on ns2 was upgraded to 9.7 the day before. There's a possibility, albeit quite weak, of a performance regression when BIND 9.4 and 9.7 communicate (again, bear in mind the abnormal CPU usage).

(2) Alternatively, since the release of the new ticket interface, CP bug reports go through an external SMTP server before coming back to Assmule, which doubles as the primary internal DNS resolver. Echelon began generating bug reports of DB connection problems (too many clients) around 6:06 PM. Since mail is dispatched through another SMTP server, mail throughput is limited by the capacity of that server, which does little besides mirror a few crucial internal databases and relay mail from apisnetworks.com. A server in distress can dispatch more warnings than before, which, in this case, can be a bad thing. It's also possible that the mail relay overwhelmed Assmule with the bug reports, which in turn increased DNS resolver latency, which in turn caused a spike in HTTP clients.

In either case, DNS is likely to have precipitated the disastrous chain of events. First, DNS on ns1 has been upgraded, and the primary internal DNS is now Augend, which is formerly ns2. Assmule continues to split roles between internal DNS and external ns1, but more traffic will now go to ns2. Second, bug report delivery is now limited to one report every 15 seconds to prevent inundation.
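To illustrate the delivery limit, here is a minimal sketch of a throttle that lets through at most one report per 15 seconds. The class name and the choice to drop excess reports (rather than queue them) are assumptions made for this example; it is not the code running in the control panel.

```python
import time


class ReportThrottle:
    """Allow at most one bug report per `min_interval` seconds.

    Hypothetical helper: reports arriving too soon are rejected (the
    caller may drop or queue them; this sketch does not decide that).
    """

    def __init__(self, min_interval=15.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable for testing
        self._last_sent = None      # time of the last allowed report

    def allow(self):
        """Return True if a report may be sent now, else False."""
        now = self.clock()
        if self._last_sent is None or now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False
```

A caller would check `throttle.allow()` before handing a report to the mail relay, so a server in distress cannot flood it with hundreds of identical "too many clients" messages.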
  #2  
07-06-2011, 05:30 AM
walkershane123
Apis Networks User
 
Join Date: Jul 2011
Posts: 1
You have posted an informative article. In fact, at my office we're experiencing sporadic bad performance due to high latency and packet loss. Unfortunately, Videotron has been unable to fix it (so far).
Could you give me any suggestions?



Last edited by walkershane123; 07-11-2011 at 06:50 AM.
  #3  
07-06-2011, 10:38 AM
Matt
Tech Liaison
 
Join Date: Jun 2005
Location: Atlanta, Georgia
Posts: 1,035
Quote:
Originally Posted by walkershane123
You have posted an informative article. In fact, at my office we're experiencing sporadic bad performance due to high latency and packet loss. Unfortunately, Videotron has been unable to fix it (so far).
Could you give me any suggestions?

Open up a ticket within the control panel via Help > Trouble Tickets to have a log of this issue. Provide me with your IP address in the ticket.

Tags
aleph, assmule, borel, echelon