Feature #4447
System status monitor should also report on warden-filer status
Status: open, 0% done
Description
As was detected during the recent PostgreSQL upgrade to 11.x on mentat-hub, the Mentat system-status monitor does not take the warden_filer_{receiver,sender} status into consideration. At least the receiver is a critical system component in the current system architecture and should, in my opinion, be monitored.
(The actual problem was caused by missing write rights for the mentat user on /var/run/warden_filer/, so the .pid files could not be created - the directory was owned by root instead of mentat.)
Related issues
Updated by Pavel Kácha almost 5 years ago
What is the "api" between mentat-controller.py --command status and the real-time modules? Could, for example, the warden-filer init script provide the necessary outputs?
Updated by Pavel Kácha almost 5 years ago
- To be discussed changed from No to Yes
Updated by Pavel Kácha almost 5 years ago
- To be discussed changed from Yes to No
Updated by Pavel Kácha over 4 years ago
- Target version changed from 2.8 to Backlog
Updated by Pavel Kácha almost 4 years ago
- To be discussed changed from No to Yes
Updated by Pavel Kácha almost 4 years ago
- Assignee deleted (Jan Mach)
Mentat controller needs a PID file, a name, and a start script.

- PID files are expected at the common Mentat path, named after the daemon. Filer has a configurable PID file, so this could be just a configuration issue.
- The start script is used to start the daemon. This is configurable in the controller.
- Daemons are stopped by SIGINT. Filer reacts to SIGINT gracefully, no problem expected here.
- Also, for enumeration, the controller applies a regexp to processes. This needs to be done - probably by a configurable regexp in the controller.

So - we need to implement a configurable regexp (and add it to the controller config), change the PID file location in Filer, add a configuration clause for Filer in the controller's config, and we should be done.
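For illustration, a per-daemon clause along the lines listed above might look roughly like this. This is a hypothetical sketch only: the keys, paths, and the regexp are illustrative, not the actual Mentat controller configuration schema.

```python
import re

# Hypothetical per-daemon configuration clause for warden_filer (receiver),
# covering the four points above: PID file at the common path, start script,
# graceful stop via SIGINT, and a per-daemon process-enumeration regexp.
WARDEN_FILER_RECEIVER = {
    "name": "warden-filer-receiver",
    # PID file moved to the common Mentat path, named after the daemon:
    "pid_file": "/var/mentat/run/warden-filer-receiver.pid",
    # Start script used by the controller to launch the daemon:
    "start_command": "/etc/init.d/warden_filer_receiver start",
    # Filer reacts to SIGINT gracefully, as the controller expects:
    "stop_signal": "SIGINT",
    # The proposed configurable regexp for process enumeration:
    "process_regexp": r"warden_filer\.py.*--daemon\s+receiver",
}

# The regexp would match a command line like the one seen in production:
cmdline = ("/usr/bin/python /usr/local/bin/warden_filer.py "
           "-c /etc/warden_client/warden_filer_cesnet.cfg --daemon receiver")
print(bool(re.search(WARDEN_FILER_RECEIVER["process_regexp"], cmdline)))
```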
Updated by Radko Krkoš over 3 years ago
Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:
```
$ sudo systemctl status warden_filer_cesnet_receiver.service
● warden_filer_cesnet_receiver.service - Warden Filer - receiver (cesnet)
   Loaded: loaded (/etc/systemd/system/warden_filer_cesnet_receiver.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-02-19 22:45:52 CET; 1 months 0 days ago
 Main PID: 17832 (warden_filer.py)
    Tasks: 1 (limit: 4915)
   Memory: 84.9M
   CGroup: /system.slice/warden_filer_cesnet_receiver.service
           └─17832 /usr/bin/python /usr/local/bin/warden_filer.py -c /etc/warden_client/warden_filer_cesnet.cfg --pid_file /var/run/warden_filer/receiver_cesnet.pid --daemon receiver

Mar 22 10:32:46 mentat-alt warden_client.py[17832]: cz.cesnet.mentat_alt.warden_filer (ERROR) 00000000/getEvents Error(0) Sending of request to server failed (cause was SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:727))
Mar 22 10:32:46 mentat-alt warden_client.py[17832]: cz.cesnet.mentat_alt.warden_filer (INFO) 00000000/getEvents Detail: {"headers": {"Accept": "application/json"}, "data": null, "log": "/warden3/getEvents?count=5000&client=cz.cesnet.mentat_alt.warden_filer&id=1418200759"}
Mar 22 10:32:46 mentat-alt warden_client.py[17832]: cz.cesnet.mentat_alt.warden_filer (DEBUG) 00000000/getEvents Traceback:
  File "/usr/local/bin/warden_client.py", line 453, in sendRequest
    conn.request(method, loc, data, self.headers)
  File "/usr/lib/python2.7/httplib.py", line 1058, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1098, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 1054, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 892, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 854, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 1279, in connect
    server_hostname=server_hostname)
  File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib/python2.7/ssl.py", line 599, in __init__
    self.do_handshake()
  File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
    self._sslobj.do_handshake()
```
Maybe this could be fixed in the unit settings?
Alternatively, this can be ignored, as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).
Updated by Pavel Kácha over 3 years ago
Radko Krkoš wrote in #note-11:
Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:
Any ideas? Some errors are transient, some are permanent; only the filer might sometimes know which are which.
It might be caught by the warden lib or filer anyhow - but it would have to assume that the error it got from the server is final and definitive to warrant exiting with a failure. But I'm not sure any well-known daemon does it this way (exiting on a server error).
Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?
Maybe this could be fixed in the unit settings?
Systemd can decide based on log contents?
Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).
That was one of the reasons to create this check I guess.
Updated by Pavel Kácha over 3 years ago
Pavel Kácha wrote in #note-12:
Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?
Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.
Also, there is a nearing certificate expiration check on mentat-hub, I guess it has been considered overkill on mentat-alt.
Updated by Radko Krkoš over 3 years ago
Pavel Kácha wrote in #note-12:
Radko Krkoš wrote in #note-11:
Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:
Any ideas? Some errors are transient, some are permanent, only filer might sometimes know, which are which.
I agree. Maybe the output of ERROR line to the log is enough for detection from outside?
It might be caught by warden lib or filer anyhow - but it would have to assume that the error it got from server is final and definitive to warrant its exit with fail. But I'm not sure any well known daemon does it this way (exiting on server error).
I would definitely not go through the exit route. I just wanted to note that even if the unit does not really work, it is reported as if it did, which I consider a bug. From the mentat point of view it might not be important, as the monitoring of daemons is not based on systemd in any way.
Anyway, I just wanted to note a related aspect that might be interesting one day, when this is dealt with.
Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?
I am not sure. Then again, in case of non-recoverable errors (which an expired certificate is), there is nothing to do. Exiting there might have triggered systemd, making it easier to debug perhaps?
Maybe this could be fixed in the unit settings?
Systemd can decide based on log contents?
I have no idea. It already does almost everything not related to daemon management, so maybe something like this also slipped in by accident? {This reminds me of: "The $PRODUCT does @UNRELATED_FEATURE1. Of course, it does $UNRELATED_FEATURE2 very well. It is also a hard to use $PRIMARY_FEATURE."}
Seriously, I would hope so, but have no idea.
Alternatively this can be ignored as such a situation is eventually detected by
mentat-watchdog
(no new events reach the DB).That was one of the reasons to create this check I guess.
Yes, I suppose it was. Then it was never pressing enough, so maybe we can just leave it be (as we have for 2 years). In the end, mentat-watchdog does its job.
Updated by Pavel Kácha over 3 years ago
Radko Krkoš wrote in #note-14:
Pavel Kácha wrote in #note-12:
Radko Krkoš wrote in #note-11:
Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:
Any ideas? Some errors are transient, some are permanent, only filer might sometimes know, which are which.
I agree. Maybe the output of ERROR line to the log is enough for detection from outside?
We can deduce whether an error is server based or client based from the query id - client errors have 0. For server errors, HTTP 5XX are potentially transient (server based, we know nothing), while 4XX are permanent; however, it depends on the specific error and context whether it is event based (the next query will potentially be OK) or daemon based (the error will persist until admin interference). The same goes for client based errors - some go away, some are permanent. I'm at my wits' end.
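The triage described above could be sketched roughly like this. The function and its return labels are hypothetical illustrations, not part of the Warden client API, and as the comment notes, the transient/permanent split is only a first approximation.

```python
# Hypothetical sketch of the error triage described above (not Warden code).
# query_id == 0 marks a client-side error; otherwise the error came from
# the server, and the HTTP status class gives a rough (imperfect) hint.
def classify_error(query_id, http_status):
    if query_id == 0:
        return "client"            # may or may not go away on its own
    if 500 <= http_status < 600:
        return "server-transient"  # server based, we know nothing more
    if 400 <= http_status < 500:
        # Permanent-ish, but whether it is event based (next query OK)
        # or daemon based (needs admin interference) depends on context:
        return "server-permanent"
    return "unknown"

# The expired-certificate case seen in the log is a client-side failure
# (query id 00000000):
print(classify_error(0, 0))
```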
I would definitely not go through the exit route. I just wanted to note that even if the unit does not really work, it is reported as if it did, which I consider a bug. From the mentat point of view it might not be important, as the monitoring of daemons is not based on systemd in any way.
Anyway, I just wanted to note a related aspect that might be interesting one day, when this is dealt with.
From the point of view of systemd, the unit does work. When an apache site cert expires, apache runs happily ever after, and lennartd also spits rainbows. Even if the site is the sole purpose of the whole system, nobody cares - except maybe the admin, whose job is to put correct checks/processes in place on the application/contents level.
Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?
I am not sure. Then again, in case of non-recoverable errors (which an expired certificate is), there is nothing to do. Exiting there might have triggered systemd, making it easier to debug perhaps?
Yup, that's what I meant; however, I'm not sure any (significant) project does it that way. Still, it does not solve the problem of exactly which errors to handbrake on.
Maybe this could be fixed in the unit settings?
Systemd can decide based on log contents?
I have no idea. It already does almost everything not related to deamon management, maybe something like this also slipped in by accident? {This reminds me of: "The $PRODUCT does @UNRELATED_FEATURE1. Of course, it does $UNRELATED_FEATURE2 very well. It is also a hard to use $PRIMARY_FEATURE." }
Seriously, I would hope so, but have no idea.
Well, I've never heard of that, and a quick internet search yielded nuthin'; however, the error classification problem still remains.
Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.
Yes, I suppose it was. Then it was never pressing enough, so maybe we can just leave it be (as we have for 2 years). In the end, mentat-watchdog does its job.
We have two checks able to spot this particular problem quite quickly (one before it happens) on production, I'd call it quits.
Updated by Radko Krkoš over 3 years ago
Pavel Kácha wrote in #note-15:
Radko Krkoš wrote in #note-14:
I agree. Maybe the output of ERROR line to the log is enough for detection from outside?
We can deduce, whether error is server based, or client based, based on query id - client errors have 0. In case of server errors, HTTP 5XX are potentially transient (server based, we know nothing), 4XX are permanent, however it depends on specific error and context, whether it is event based (next query will potentially be ok), or daemon based (the error will sustain until admin interference). Similar goes for client based errors - some errors can go away, some are permanent. I'm at my wits' end.
I meant something along these lines:
If at any point in time in the application you know enough to log an ERROR, then the administrator is probably interested in knowing. If the specific failure is recoverable, then it probably is not an error and should not be logged as such (4xx and 5xx, although classified as errors, can be reinterpreted in context of what the application is doing). In that case there is special code for the recovery procedure, so it is easily discernible.
I generally design the logging in this way. Of course, if that is not the case (I believe it actually is), doing a redesign of Warden in this manner now might not be desirable (or at least should not be logged as a Mentat issue), and I do not advocate for that.
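The convention described above might be sketched like this (hypothetical code, not from Warden or Mentat; `schedule_retry` is an invented stand-in for whatever recovery path exists): ERROR is reserved for failures with no recovery procedure, so an outside monitor can simply alert on ERROR lines in the log.

```python
import logging

log = logging.getLogger("filer")

def schedule_retry():
    pass  # hypothetical hook standing in for the recovery code path

# Hypothetical sketch of the logging convention described above:
# recoverable failures are WARNINGs (a recovery procedure exists),
# non-recoverable ones are ERRORs (the administrator must act).
def handle_http_failure(status, retriable):
    if retriable:
        # Recoverable: not an ERROR, even if the HTTP status class
        # nominally says "error" - the next attempt may succeed.
        log.warning("request failed with HTTP %d, will retry", status)
        schedule_retry()
    else:
        # Non-recoverable (e.g. an expired certificate): log at ERROR
        # so external monitoring can pick it up and alert.
        log.error("request failed with HTTP %d, admin action required", status)
```

With this split, "grep ERROR over the log" becomes a meaningful health check, which is exactly the kind of outside detection suggested earlier in the thread.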
I would definitely not go through the exit route. I just wanted to note that even if the unit does not really work, it is reported as if it did, which I consider a bug. From the mentat point of view it might not be important, as the monitoring of daemons is not based on systemd in any way.
Anyway, I just wanted to note a related aspect that might be interesting one day, when this is dealt with.

From the point of view of systemd, the unit does work. When an apache site cert expires, apache runs happily ever after, and lennartd also spits rainbows. Even if the site is the sole purpose of the whole system, nobody cares
OK, maybe my expectations of a daemon manager are grossly exaggerated. Then of course I have never relied on it in that manner.
except maybe admin, whose work is to put correct checks/processes in place on application/contents level.
And this ticket is exactly about that in my opinion. Of course there is the robustness/code-simplicity tradeoff (and we discussed that before), so the added checks might not be worth it.
Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?
I am not sure. Then again, in case of non-recoverable errors (which an expired certificate is), there is nothing to do. Exiting there might have triggered systemd, making it easier to debug perhaps?

Yup, that's what I meant; however, I'm not sure any (significant) project does it that way. Still, it does not solve the problem of exactly which errors to handbrake on.
I do not think that can be decided in general. You have to go through possible errors and analyze them in detail (what is done during design or implementation).
Maybe this could be fixed in the unit settings?
Systemd can decide based on log contents?
I have no idea. It already does almost everything not related to deamon management, maybe something like this also slipped in by accident? {This reminds me of: "The $PRODUCT does @UNRELATED_FEATURE1. Of course, it does $UNRELATED_FEATURE2 very well. It is also a hard to use $PRIMARY_FEATURE." }
Seriously, I would hope so, but have no idea.

Well, I've never heard of that, and a quick internet search yielded nuthin'; however, the error classification problem still remains.
Yeah, that is unfortunate. How that is better than SysVinit with grep over logs (or even without) eludes me.
Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.
Yes, I suppose it was. Then it was never pressing enough, so maybe we can just leave it be (as we have for 2 years). In the end, mentat-watchdog does its job.

We have two checks able to spot this particular problem quite quickly (one before it happens) on production, I'd call it quits.
OK, this particular problem is worked around (or detected early) in production. I just added this as a note that relying on `systemctl status` is not sufficient for health monitoring, something that was a bit surprising to me but might be well known otherwise; also, knowing that the existing monitoring does not depend on `systemctl status` or anything similar, this is really just a note for future reference (I mean, I would be glad for such information if I ever had to solve an issue like this). I certainly did not expect any detailed discussion, especially as the issue is in the backlog.
Updated by Radko Krkoš over 3 years ago
- Related to Bug #7121: Spool dir is sometimes created with wrong privileges on start added
Updated by Rajmund Hruška over 1 year ago
- Related to Feature #4218: Hawat: Improve system status view module added
Updated by Pavel Kácha 5 months ago
Pavel Kácha wrote in #note-10:
So - we need to implement configurable regexp (and add to controller config), change PID file place in Filer, add configuration clause for Filer in controllers config, and we should be done.
The PID file is in the right spot, so altering the regexp should be enough... mostly. There is just one global regexp in the code. Making it per-process configurable would mean iterating through all the configured regexes (probably in addition to the default one).
There is another possibility: getting rid of process regexes entirely. We already have the PID files - if the controller gathers them by the already functioning regexp, it should (through the contained PIDs) be able to check for all corresponding running processes - much more deterministically than just enumerating all processes and filtering them by an arbitrary filter.
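The PID-file based check could be sketched as follows. The directory path and function names are assumptions for illustration, not the actual controller code; the trick is that sending signal 0 with os.kill performs only the existence/permission check, without signalling the process.

```python
import glob
import os

def _pid_alive(pid):
    """True if a process with this PID exists (signal 0 checks only)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True

# Hypothetical sketch of the PID-file based enumeration proposed above:
# instead of filtering all processes by a regexp, walk the known PID
# files and check whether each recorded PID corresponds to a live process.
def check_daemons(pid_dir="/var/mentat/run"):
    """Return {pid_file: alive?} for every *.pid file under pid_dir."""
    status = {}
    for pid_file in glob.glob(os.path.join(pid_dir, "*.pid")):
        try:
            with open(pid_file) as f:
                pid = int(f.read().strip())
        except (OSError, ValueError):
            status[pid_file] = False   # missing, unreadable, or garbled
            continue
        status[pid_file] = _pid_alive(pid)
    return status
```

This is deterministic in the sense discussed above: each daemon declares its own PID, so there is no arbitrary filter to maintain and no risk of matching unrelated processes.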