BTMash

Blob of contradictions

loadwatcher: Making your server sane when it thrashes.

A few years ago, Khalid wrote a wonderful script to help make your Apache server sane again after the opcode cache on the server started throwing segmentation faults (he aptly named it logwatcher). It was great because, at the time, APC would crash for unknown reasons and completely kill a website. It took care of an important issue (the one change I made was to clear the APC cache instead of restarting the server but, all in all, super ^_^).

A few years back, during my time at zinc Roe, we were finding that the server would start to die during spikes in traffic on the Zimmer Twins. In trying to figure out what was happening, we found that the limit on the number of connections the db server would allow was in fact making the Apache server spiral out of control. What happened is:

  1. User does a page request (be it via form submit or clicking a link)
  2. DB connections are all taken up
  3. Apache starts writing the resulting errors to its log
  4. Repeat for thousands of users coming on around the same time

Part of the issue with the db server had to do with some bad queries (search was killing the site as it tried to find relevant results from over a million nodes). Users would get impatient and request the page again, Apache would write yet more errors to disk, and you end up with a server basically thrashing to death. What I did find is that *if* I was able to log into the server and restart Apache, the problem would resolve itself and people could actually start using the site sanely once again. However, when a server is getting out of control, the simple act of trying to ssh in can be very challenging.

As a result, I created a daemon script (well, 2 scripts) called 'loadwatcher' which does something fairly straightforward. It checks the load average on the server every 60 seconds (the version posted below sleeps for 30; use whatever interval you want), and if the load passes a threshold (in my case 10.0), it restarts Apache and then holds off on further restarts for a while. Really silly idea, right? Luckily, the script, along with the assumption I made, took care of nearly all server uptime issues (the server remained up; there might be *some* site downtime, but it was better than having to wholly reboot a server). It also gave us some breathing room to figure out ways to optimize the search functionality and reduce site downtime even further. It's easy enough to add any other things you want done (like restarting mysql if that is a possible cause of the issue) or to change other pieces. I've been able to use the script across various servers over the years with similar results (keeping the server going when crap hits the fan), and I figured it would be something worthwhile to share.
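To make the check-and-cooldown idea concrete, here is a minimal shell sketch of the decision the watcher makes on each pass. The function names `decide_action` and `read_load` are made up for illustration and are not part of the original scripts; the sketch also assumes a Linux-style `/proc/loadavg` and an `awk` on the path.

```shell
#!/bin/sh
# Hypothetical helper (not from the original scripts): decide what the
# watcher should do on one pass.
# Arguments: current 1-minute load, threshold, current epoch time, and the
# earliest epoch time at which another restart is allowed.
decide_action() {
    load=$1; threshold=$2; now=$3; allowed_at=$4
    # sh cannot compare floating-point numbers, so let awk do it.
    if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l + 0 < t + 0) }'; then
        echo ok
    elif [ "$now" -gt "$allowed_at" ]; then
        echo restart
    else
        # Load is high, but we restarted recently: give the server time.
        echo wait
    fi
}

# Hypothetical helper: print the 1-minute load average, i.e. the first
# space-separated field of /proc/loadavg (or of the file given as $1).
read_load() {
    cut -d ' ' -f 1 "${1:-/proc/loadavg}"
}
```

In a loop you might call something like `decide_action "$(read_load)" 10.0 "$(date +%s)" "$allowed_at"` and restart Apache (or mysql, or anything else) only when it prints `restart`, which keeps the policy separate from the commands it triggers.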

The daemon script I had would be continually running on the server checking what needs to be done (also posted on pastebin).

    <?php

    // Restart the web server once the 1-minute load average reaches this value.
    define("LOAD_AVERAGE_THRESHOLD", 10.0);
    // Minimum number of seconds between two restarts.
    define("APACHE_RESTART_THRESHOLD", 60 * 5);
    define("DEFAULT_LOAD_AVERAGE_PATH", "/proc/loadavg");
    define("APACHE_COMMAND_STOP", '/etc/init.d/apache2 stop');
    define("APACHE_COMMAND_START", '/etc/init.d/apache2 start');
    define("APACHE_COMMAND_RESTART", '/etc/init.d/apache2 restart');
    define("APACHE_COMMAND_STATUS", '/etc/init.d/apache2 status');

    // Returns the 1-minute load average (the first field of /proc/loadavg).
    function get_current_load_average() {
      $load_average_file_name = DEFAULT_LOAD_AVERAGE_PATH;
      $fp = @fopen($load_average_file_name, 'r');
      // fopen() returns false on failure.
      if ($fp === false) {
        die("unable to open file at $load_average_file_name\n");
      }
      $load_averages = explode(" ", fread($fp, 4096));
      // Close the handle explicitly; this function runs on every pass of the loop.
      fclose($fp);
      return floatval($load_averages[0]);
    }

    // Stops Apache, waits ten seconds, and starts it again.
    function restart_apache_server() {
      printf("Server might be going through problems...restart the web server now!\n");
      system(APACHE_COMMAND_STOP);
      printf("Stopped web server\n");
      sleep(10);
      system(APACHE_COMMAND_START);
      printf("Started web server\n");
    }

    // Earliest time at which the next restart is allowed.
    $allowed_restart_time = time();

    while (1) {
      $current_load_average = get_current_load_average();
      printf("Current load - %f\n", $current_load_average);
      if ($current_load_average < LOAD_AVERAGE_THRESHOLD) {
        printf("Server is doing pretty well!\n");
      }
      else if (time() > $allowed_restart_time) {
        // Load is at or above the threshold and the cooldown has passed.
        restart_apache_server();
        $allowed_restart_time = time() + APACHE_RESTART_THRESHOLD;
      }
      else {
        // Load is still high, but we restarted recently.
        printf("Wait for some time to ensure server is getting back to normal.\n");
      }
      sleep(30);
    }

I then wrapped this up with a shell script which would keep track of the pid and kill any old loadwatcher instances (also posted on pastebin):

    #!/bin/sh

    BASE_DIR=/path/to/base
    SCRIPT=$BASE_DIR/scripts/logwatcher.php
    PID_FILE=/var/run/loadwatcher.pid
    # Not used by the script as posted
    EMAIL=btmash@gmail.com

    # If there is an old process, kill it
    if [ -f "$PID_FILE" ]; then
      kill "$(cat "$PID_FILE")" 2>/dev/null
    fi
    # Make sure the file is clean
    rm -f "$PID_FILE"

    cd "$BASE_DIR" || exit 1
    # Run in the background, discarding both stdout and stderr
    nohup php "$SCRIPT" > /dev/null 2>&1 &
    PID=$!

    echo "$PID" > "$PID_FILE"
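The "kill the old process" step can be made a little more defensive. The sketch below is one way to do it, not what I originally ran: `stop_old_instance` is a made-up name, and it assumes the PID file holds a single process ID. It only signals the process if it is still alive, and it treats a missing PID file as nothing to do. It does not guard against the PID having been reused by an unrelated process.

```shell
#!/bin/sh
# Hypothetical helper (not from the original wrapper): stop the process
# recorded in a PID file, if there is one, then remove the file.
stop_old_instance() {
    pid_file=$1
    # Nothing to do when no PID file exists (e.g. first run after a reboot).
    [ -f "$pid_file" ] || return 0
    old_pid=$(cat "$pid_file")
    # kill -0 sends no signal; it only checks that the process exists and
    # that we are allowed to signal it.
    if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
        kill "$old_pid"
    fi
    # Make sure the file is clean for the next instance.
    rm -f "$pid_file"
}
```

Calling `stop_old_instance /var/run/loadwatcher.pid` before launching the PHP script would then stand in for the `kill` and `rm` lines in the wrapper.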

Naturally, there are better things that can be done to ensure the server stays sane and that you don't run into issues like these in the first place :) What do you do?

Update: Kevin Kaland told me about monit (http://mmonit.com/monit/), which seems to do a lot more than what I have been doing; I'll be looking into it for future work.