I hope this helps someone out there. For the last 3 days, my server had been crashing on me every 2-3 hours. At first I thought it was a traffic spike, since I couldn’t find any crash reports from Apache and traffic did spike around that time. So I increased the resources on the server. It still crashed, every 2-3 hours.
I set up Apache core dumps. Nothing showed up. I checked the kern log and found no segfaults there. The apport log was empty. It couldn’t be MySQL, since MySQL was hosted on AWS RDS.
Running top showed no CPU load; RAM usage was uncharacteristically high, but swap was not used. The IOWait (wa%) was far too high for normal operation, and the server load was way over 50.0.
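For anyone who wants to watch the same number outside of top, here is a minimal sketch of how an IOWait percentage can be derived from two samples of the aggregate cpu line in /proc/stat on Linux. The function names are my own for illustration, and this only approximates what top reports as wa%.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent waiting on I/O between two samples."""
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Later fields (guest, guest_nice) are already included in user/nice,
    # so only the first eight are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the current counters; the aggregate line is the first line."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())
```

To use it, call read_cpu_counters() twice a second or so apart and pass both results to iowait_percent(). A value that stays high while user CPU stays low matches the pattern described above.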
I was stumped!
I searched Google and scoured stackoverflow.com and serverfault.com for any hint of what might be wrong. I read answers, followed links and learnt a lot, but found nothing that explained what was causing this.
I use mod_pagespeed and Varnish with Apache 2.2 on an Ubuntu server (12.04 LTS), so I assumed my configuration of that setup might not be correct. I started to experiment with max_clients and other Apache configs (I even shifted over to mpm_worker), the CPU threads used by mod_pagespeed, some CPU-intensive filters activated in mod_pagespeed, and the amount of RAM used by Varnish along with Varnish’s launch string. None of the permutations of these configurations fixed the problem. It still crashed.
I read Apache’s error.log again, this time one line at a time. Still nothing. I read the vhost access logs and stumbled on a bot problem: "MJ12bot". There were a lot of hits to Apache from this one. The best course was to install fail2ban, make sure "MJ12bot" was banned and, along with that, update robots.txt to disallow the bot.
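Spotting a noisy bot by eye works, but counting hits per user agent makes it obvious. The sketch below assumes the vhost access log uses Apache's "combined" format, where the User-Agent is the last quoted field on each line; the function names are hypothetical and the parsing is deliberately simple (it does not handle escaped quotes inside the User-Agent).

```python
import re
from collections import Counter

# In the "combined" log format the User-Agent is the last quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def hits_for_bot(lines, bot_name):
    """Count hits whose User-Agent contains bot_name (case-insensitive)."""
    needle = bot_name.lower()
    return sum(
        count
        for agent, count in count_user_agents(lines).items()
        if needle in agent.lower()
    )
```

Passing an open log file to count_user_agents() and calling most_common(10) on the result lists the heaviest clients; hits_for_bot(lines, "MJ12bot") gives the total for one bot across its version strings.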
I had checked everything there was to check. I did one last sweep of the logs and still had nothing to go on. It still crashed.
Here’s where I thought it might be my code, some loop I hadn’t checked or closed completely. But before I started on the effort of reading seven years of newly rewritten code, line by line, I wanted to be sure of this and to exhaust all other possibilities.
One such possibility was the storage. The server was provisioned with SSD IOPS (AWS EC2 IOPS Optimised Instance) instead of an HDD at ~7200 RPM, and the SSD storage has about 20 to 3000 more IOPS than the normal ~100 IOPS. My theory was that the CPU was, in effect, slower than the SSD IOPS, and so the IOWait time could be high. I downgraded from SSD to Standard storage (SSD EBS). It crashed again.
Sigh, I was at my wits’ end. My last effort, before moving on to a different server instance with fresh installs of Apache 2.4, etc., was to make this server crash on purpose, to find out what load of HTTP requests would bring it down and whether that number of requests was low or high.
To test the server under HTTP load, I used Apache JMeter to send about 500 requests with a 10s ramp-up and, sure enough, top showed wa% was up, with no load on us% or on the RAM. The IOWait caused the server load to go up to around 50.0. At this point I remembered the /var/crash folder, went in to investigate, and finally found a crash dump of the kernel!
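If JMeter isn't at hand, the same kind of test can be roughed out in a few lines. This is only a sketch of the idea (start N requests spread evenly over a ramp-up period), not a reimplementation of JMeter; ramp_up_load and fetch_status are names I made up, and any URL passed in is a placeholder for your own server.

```python
import threading
import time
import urllib.request


def fetch_status(url, timeout=10):
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def ramp_up_load(fetch, total_requests, ramp_seconds):
    """Start total_requests calls to fetch(), spread evenly over ramp_seconds.

    Returns one entry per request: the value fetch() returned, or the
    exception it raised.
    """
    results = [None] * total_requests
    interval = ramp_seconds / total_requests if total_requests else 0.0

    def worker(index):
        try:
            results[index] = fetch()
        except Exception as error:  # record failures instead of stopping
            results[index] = error

    threads = []
    start = time.monotonic()
    for index in range(total_requests):
        # Wait until this request's slot in the ramp-up schedule.
        delay = start + index * interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        thread = threading.Thread(target=worker, args=(index,))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return results
```

Something like ramp_up_load(lambda: fetch_status("http://localhost/"), 500, 10) would approximate the run described above. Only point it at a server you own, and keep top open in another terminal while it runs.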
Apache had made the kernel crash, and the server had restarted all by itself. The logs didn’t tell me why, nor did they give any indication that Apache was the problem, but whenever I turned Apache off, the system ran fine.
It was Apache, but where, what and how?
I started with sites-enabled, moved on to mods-enabled, and found the culprit.
I toggled every Apache mod I had enabled and found that mod_status was what caused the crash.
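The toggling itself was manual, but the procedure is simple enough to write down. Below is a minimal sketch, assuming a hypothetical crashes_with callback that applies a given set of enabled modules (on Ubuntu that could mean a2enmod / a2dismod plus an Apache restart), reruns the load test, and reports whether the server went down. It is an illustration of the elimination logic, not a script I actually ran.

```python
def find_culprits(modules, crashes_with):
    """Disable one module at a time and report those whose removal stops the crash.

    modules: names of the currently enabled modules.
    crashes_with: hypothetical callback taking a set of enabled modules and
        returning True if the server crashed under test with that set.
    """
    enabled = set(modules)
    if not crashes_with(enabled):
        return []  # the full set is already stable, nothing to find

    culprits = []
    for module in modules:
        # Re-test with everything enabled except this one module.
        if not crashes_with(enabled - {module}):
            culprits.append(module)
    return culprits
```

This one-at-a-time approach only finds a single module that is responsible on its own, which is what happened here; a crash that needs two modules together would require testing combinations.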
The server is now back to normal.
No more mod_status ’til I upgrade Apache from 2.2 to 2.4.
Again, I hope this helps someone out there.