#1
Augend experienced a deadlock earlier this morning at approximately 4:55 AM EST (-0500 GMT), apparently caused by a severe threading crash. named, the DNS server, logged the fatal, esoteric "*** POKED TIMER ***" message immediately before the lockup. From prior experience, this message indicates a severe kernel-level crash.
A mandatory filesystem check was issued at 5:00 AM and lasted approximately 20 minutes. Augend regained responsiveness by 5:30 AM. Lax checks for MySQL, combined with a recent upgrade, left MySQL unresponsive until 10 AM, when support e-mails handled at the beginning of my shift alerted me to the problem.

Comments:

Kernel bug: A kernel upgrade has been scheduled for all of the servers starting at the end of January, which will resolve the instability witnessed twice on two separate servers.

MySQL: Later versions of MySQL 5.0 include a management controller called "mysqlmanager" to manage multiple instances of mysqld on a given server. During package upgrades, the installer detects the "mysql" service and removes it from start-up in favor of mysqlmanager, which is deprecated in 5.1 and removed entirely in 6.0. Why the MySQL team decided to make it a critical component in 5.0 is bewildering. Unless the mysql service is explicitly reinstalled after an upgrade, mysqld will rely on its clunky cousin, the instance manager, for start-up.
* mysql has been turned on again at start-up

MySQL Monitor: MySQL is rigorously monitored for problems while running; however, the monitor in its current embodiment does not check whether MySQL is running at all. A very shortsighted, and embarrassing, feature to not implement. Before individual checks were necessary with MySQL (instability in 4.1 onward), mysqld process checks were handled by the service integrity monitor (sim), which performs routine process polling. sim does not provide the level of detail necessary to detect lockups in a MySQL process, so responsibility was shifted to another collection of scripts, which lacked a pid check. Incidentally, return values were checked at one time. Because query timeouts are unpredictable, the mysql commands must run in the background; otherwise the monitor may be hopelessly stuck waiting for the current query to complete. In the event of a lock-up on MySQL the query will never complete, producing a false negative.
Jobs are sent to the background and checked after a fixed interval to prevent such a situation.
* PID checks have been implemented at this time.

Control panel: Ideally, a user should be able to log in to the control panel either directly on the server, e.g. http://cp.augend.apisnetworks.com, or through the login portal on apisnetworks.com. Neither will succeed if MySQL is down on the server. Users must resort to sending an e-mail to support@apisnetworks.com labeled with high priority, which is dispatched to the pagers as a high priority ticket. I plan to introduce a unified control panel interface that detects and reports connection errors, but I would like to introduce it at the same time as esprit 1.0 in March.

Last edited by Matt; 01-13-2009 at 12:39 PM. Reason: smilies in times
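For readers who want to picture the two monitor checks described under "MySQL Monitor" above (a pid check first, then a query run in the background and abandoned after a fixed interval), here is a minimal sketch in Python. It is not the actual monitor scripts, which are not shown in this post. The function names, the status strings, and the default timeout are all hypothetical, and the pid check assumes a POSIX system.

```python
import os
import subprocess


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if it is missing or invalid."""
    try:
        with open(pid_file) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def pid_alive(pid):
    """POSIX only: signal 0 delivers nothing but reports whether the PID exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_mysql(pid_file, probe_cmd, timeout=10.0):
    """Hypothetical monitor pass: pid check first, then a time-limited probe.

    Returns "not running", "locked up", "query failed", or "ok".
    """
    pid = read_pid(pid_file)
    if pid is None or not pid_alive(pid):
        return "not running"
    try:
        # The probe runs as a child process, so a hung query cannot hang
        # the monitor; subprocess.run kills the child once timeout expires.
        result = subprocess.run(
            probe_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "locked up"
    except OSError:
        return "query failed"  # the probe command itself could not be started
    return "ok" if result.returncode == 0 else "query failed"
```

A call might look like check_mysql("/var/run/mysqld/mysqld.pid", ["mysqladmin", "ping"]), where the pid file path is an assumption that varies by installation. Checking the pid before the probe lets the monitor tell a server that never started apart from one that is running but no longer answering queries.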
#2
Thanks for the post-mortem, Matt. I'm glad to see all the issues recognised and solutions in place/planned.