One of the most common mistakes when setting up service monitoring (besides defining lots of unnecessary probes with low thresholds that constantly give false positives) is checking whether a process exists. Let’s say we serve Fubar for our customers. It’s served by two daemons, /usr/sbin/{fubard,fubard-spool}, and perhaps requires crond.
Do the customers care if an instance of /usr/sbin/fubard is running on your system? No. They just want Fubar ready and available, and they aren’t interested in the innards of your setup.
Should you care if the /usr/sbin/fubard process is running? Only if you’re trying to solve a problem with the Fubar service.
Probes checking whether fubard, fubard-spool and crond are running are misleading, to say the least. What if the fubard process exists but is frozen by a bug? What if crond exists but is stuck on I/O? Or fubard-spool seems to be there but is actually a zombie (Z state)?
What you should do is query services exactly as their clients would and check for valid output. Design your software with intrinsic support for instrumentation (probing). Even if it’s closed software or OS components you’re monitoring, there’s always a way to check if it’s actually working. In our example, you can monitor crond simply by setting up a job touching a file every minute and monitoring that file’s mtime in your probe.
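The crond check described above could be sketched roughly like this in Python. The heartbeat path, the 180-second threshold and the function name are all assumptions made up for illustration; adjust them to whatever your cron job and monitoring system actually use.

```python
import os
import time

# Assumed crontab entry (hypothetical path), run every minute:
#   * * * * * touch /var/run/cron-heartbeat


def heartbeat_is_fresh(path, max_age_seconds=180.0, now=None):
    """Return True if `path` was modified within the last `max_age_seconds`.

    A missing or unreadable file counts as a failure: if crond never ran
    the job, the probe must not report success.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return False
    return (now - mtime) <= max_age_seconds
```

The threshold is deliberately a few times longer than the job interval, so a single delayed run doesn’t trigger a false positive.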
That said, you should not avoid monitoring discrete components of your software. In our example, monitoring the client-side output of the Fubar service is obviously the goal, but you should also set up probes checking that fubard, fubard-spool and crond actually do their jobs.
If anything is wrong with the service, you will (hopefully) be able to determine which component is at fault simply by looking at your monitoring system’s dashboard. That’s usually one look at a web page – it does matter when your, um, mobile ventilation system has been, erm, impacted.
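To illustrate the idea, here is a minimal Python sketch of how a client-side probe and per-component probes might be combined into that one-look summary. The probe names, `run_probes` and `summarize` are hypothetical, and each probe is assumed to be a callable returning True when its check passes; a real monitoring system will have its own way of doing this.

```python
from typing import Callable, Dict


def run_probes(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every probe; a probe that raises counts as failed."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def summarize(results: Dict[str, bool], service_probe: str) -> str:
    """Report the client-side result first, then name the failing components."""
    suspects = sorted(
        name for name, ok in results.items()
        if not ok and name != service_probe
    )
    status = "service OK" if results.get(service_probe, False) else "service DOWN"
    if suspects:
        return status + "; failing components: " + ", ".join(suspects)
    return status
```

The client-side probe decides whether the service is up; the component probes only help you find the culprit.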
PS. All of the above also applies to checking PID files.