Feature #4447

open

System status monitor should also report on warden-filer status

Added by Radko Krkoš over 5 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Development - Tools
Target version:
Start date:
11/18/2018
Due date:
% Done:

0%

Estimated time:
To be discussed:

Description

As was detected during the recent PostgreSQL upgrade to 11.x on mentat-hub, the Mentat system-status monitor does not take the warden_filer_{receiver,sender} status into consideration.
At least the receiver is a critical system component in the current system architecture and should, in my opinion, be monitored.

(The actual problem was caused by missing write rights for the mentat user on /var/run/warden_filer/, so the .pid files could not be created - the directory was owned by root instead of mentat.)
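
A minimal, illustrative sketch of a pre-flight check for this failure mode (not part of Mentat; the path and user name are taken from the description above, everything else is hypothetical):

#!/usr/bin/env python3
# Illustrative sketch only: verify that the warden_filer PID directory exists
# and is owned by the 'mentat' user, so the .pid files can actually be created.
import os
import pwd
import sys

PID_DIR = "/var/run/warden_filer"
RUN_AS = "mentat"

def pid_dir_ok(path=PID_DIR, user=RUN_AS):
    try:
        uid = pwd.getpwnam(user).pw_uid
    except KeyError:
        return False, "user %s does not exist" % user
    if not os.path.isdir(path):
        return False, "%s does not exist" % path
    if os.stat(path).st_uid != uid:
        return False, "%s is not owned by %s" % (path, user)
    return True, "ok"

if __name__ == "__main__":
    ok, reason = pid_dir_ok()
    print(reason)
    sys.exit(0 if ok else 1)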


Related issues

Related to Mentat - Bug #7121: Spool dir is sometimes created with wrong privileges on start (New, Rajmund Hruška, 03/11/2021)

Related to Mentat - Feature #4218: Hawat: Improve system status view module (Waiting, 07/27/2018)

Actions #1

Updated by Pavel Kácha about 4 years ago

What is the "api" between mentat-controller.py --command status and the real-time modules? Could, for example, the warden-filer init script provide the necessary outputs?

Actions #2

Updated by Pavel Kácha about 4 years ago

  • To be discussed changed from No to Yes
Actions #3

Updated by Pavel Kácha about 4 years ago

  • Target version set to 2.7
Actions #4

Updated by Pavel Kácha about 4 years ago

  • To be discussed changed from Yes to No
Actions #5

Updated by Jan Mach about 4 years ago

  • Priority changed from Normal to High
Actions #6

Updated by Jan Mach almost 4 years ago

  • Target version changed from 2.7 to 2.8
Actions #7

Updated by Pavel Kácha almost 4 years ago

  • Target version changed from 2.8 to Backlog
Actions #8

Updated by Pavel Kácha over 3 years ago

  • To be discussed changed from No to Yes
Actions #9

Updated by Pavel Kácha about 3 years ago

  • To be discussed deleted (Yes)
Actions #10

Updated by Pavel Kácha about 3 years ago

  • Assignee deleted (Jan Mach)

The Mentat controller needs a PID file, a name, and a start script.

  1. PID files are expected at a common Mentat path, based on the daemon name.
    The Filer has a configurable PID file, so this could be just a configuration issue.
  2. The start script is used to start the daemon.
    This is configurable in the controller.
  3. Daemons are stopped by SIGINT.
    The Filer reacts to SIGINT gracefully, so no problem is expected here.
  4. Also, for enumeration, the controller applies a regexp to processes.
    This needs to be done - probably by a configurable regexp in the controller.

So we need to implement a configurable regexp (and add it to the controller config), change the PID file location in the Filer, add a configuration clause for the Filer in the controller's config, and we should be done.
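
For illustration, a minimal sketch of the per-daemon check outlined above: read the PID file, confirm the process is alive, and match its command line against a configurable regexp. The FILER_CHECK dictionary only mimics a hypothetical configuration clause - the real controller config keys may differ; the PID file path and command line follow the receiver example shown later in this ticket:

import os
import re

FILER_CHECK = {  # hypothetical configuration clause for the receiver
    "pid_file": "/var/run/warden_filer/receiver_cesnet.pid",
    "cmdline_regexp": r"warden_filer\.py.*--daemon receiver",
}

def daemon_status(cfg):
    try:
        with open(cfg["pid_file"]) as pidf:
            pid = int(pidf.read().strip())
    except (IOError, ValueError):
        return "no usable PID file (not running or misconfigured)"
    cmdline_path = "/proc/%d/cmdline" % pid
    if not os.path.exists(cmdline_path):
        return "stale PID file (process %d not running)" % pid
    with open(cmdline_path) as cmdf:
        cmdline = cmdf.read().replace("\0", " ")
    if not re.search(cfg["cmdline_regexp"], cmdline):
        return "PID %d is alive but does not match the regexp" % pid
    return "running (PID %d)" % pid

print(daemon_status(FILER_CHECK))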

Actions #11

Updated by Radko Krkoš almost 3 years ago

Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:

$ sudo systemctl status warden_filer_cesnet_receiver.service
● warden_filer_cesnet_receiver.service - Warden Filer - receiver (cesnet)
   Loaded: loaded (/etc/systemd/system/warden_filer_cesnet_receiver.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-02-19 22:45:52 CET; 1 months 0 days ago
 Main PID: 17832 (warden_filer.py)
    Tasks: 1 (limit: 4915)
   Memory: 84.9M
   CGroup: /system.slice/warden_filer_cesnet_receiver.service
           └─17832 /usr/bin/python /usr/local/bin/warden_filer.py -c /etc/warden_client/warden_filer_cesnet.cfg --pid_file /var/run/warden_filer/receiver_cesnet.pid --daemon receiver

Mar 22 10:32:46 mentat-alt warden_client.py[17832]: cz.cesnet.mentat_alt.warden_filer (ERROR) 00000000/getEvents Error(0) Sending of request to server failed (cause was SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:727))
Mar 22 10:32:46 mentat-alt warden_client.py[17832]: cz.cesnet.mentat_alt.warden_filer (INFO) 00000000/getEvents Detail: {"headers": {"Accept": "application/json"}, "data": null, "log": "/warden3/getEvents?count=5000&client=cz.cesnet.mentat_alt.warden_filer&id=1418200759"}
Mar 22 10:32:46 mentat-alt warden_client.py[17832]: cz.cesnet.mentat_alt.warden_filer (DEBUG) 00000000/getEventsTraceback:
                                                      File "/usr/local/bin/warden_client.py", line 453, in sendRequest
                                                        conn.request(method, loc, data, self.headers)
                                                      File "/usr/lib/python2.7/httplib.py", line 1058, in request
                                                        self._send_request(method, url, body, headers)
                                                      File "/usr/lib/python2.7/httplib.py", line 1098, in _send_request
                                                        self.endheaders(body)
                                                      File "/usr/lib/python2.7/httplib.py", line 1054, in endheaders
                                                        self._send_output(message_body)
                                                      File "/usr/lib/python2.7/httplib.py", line 892, in _send_output
                                                        self.send(msg)
                                                      File "/usr/lib/python2.7/httplib.py", line 854, in send
                                                        self.connect()
                                                      File "/usr/lib/python2.7/httplib.py", line 1279, in connect
                                                        server_hostname=server_hostname)
                                                      File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
                                                        _context=self)
                                                      File "/usr/lib/python2.7/ssl.py", line 599, in __init__
                                                        self.do_handshake()
                                                      File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
                                                        self._sslobj.do_handshake()

Maybe this could be fixed in the unit settings?
Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).
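
For illustration only, a sketch of how such ERROR lines could be spotted from the outside via the journal; the unit name is taken from the systemctl output above, while the ten-minute window and the script itself are hypothetical:

# Sketch: list recent journal messages from the filer unit that the warden
# client logged with "(ERROR)"; exit non-zero if any were found.
# May need to run as root or as a member of the systemd-journal group.
import subprocess
import sys

UNIT = "warden_filer_cesnet_receiver.service"

def recent_errors(unit=UNIT, window="10 min ago"):
    out = subprocess.check_output(
        ["journalctl", "-u", unit, "--since", window, "-o", "cat", "--no-pager"],
        universal_newlines=True)
    return [line for line in out.splitlines() if "(ERROR)" in line]

if __name__ == "__main__":
    errors = recent_errors()
    for line in errors:
        print(line)
    sys.exit(1 if errors else 0)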

Actions #12

Updated by Pavel Kácha almost 3 years ago

Radko Krkoš wrote in #note-11:

Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:

Any ideas? Some errors are transient, some are permanent, and only the filer might sometimes know which are which.

It might be caught by the warden lib or the filer anyhow - but it would have to assume that the error it got from the server is final and definitive to warrant exiting with a failure. But I'm not sure any well-known daemon does it this way (exiting on a server error).

Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?

Maybe this could be fixed in the unit settings?

Systemd can decide based on log contents?

Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.

Actions #13

Updated by Pavel Kácha almost 3 years ago

Pavel Kácha wrote in #note-12:

Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?

Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.

Also, there is a check for nearing certificate expiration on mentat-hub; I guess it has been considered overkill on mentat-alt.

Actions #14

Updated by Radko Krkoš almost 3 years ago

Pavel Kácha wrote in #note-12:

Radko Krkoš wrote in #note-11:

Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:

Any ideas? Some errors are transient, some are permanent, and only the filer might sometimes know which are which.

I agree. Maybe the output of an ERROR line to the log is enough for detection from outside?

It might be caught by the warden lib or the filer anyhow - but it would have to assume that the error it got from the server is final and definitive to warrant exiting with a failure. But I'm not sure any well-known daemon does it this way (exiting on a server error).

I would definitely not go through the exit route. I just wanted to note that even if the unit does not really work, it is reported as if it were, which I consider a bug. From the Mentat point of view it might not be important, as the monitoring of daemons is not based on systemd in any way.
Anyway, I just wanted to note a related aspect that might be interesting one day, when this is dealt with.

Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?

I am not sure. Then, in the case of non-recoverable errors (which an expired certificate is), there is nothing to do. Exiting there might have triggered systemd, making it easier to debug, perhaps?

Maybe this could be fixed in the unit settings?

Systemd can decide based on log contents?

I have no idea. It already does almost everything not related to daemon management, so maybe something like this also slipped in by accident? {This reminds me of: "The $PRODUCT does @UNRELATED_FEATURE1. Of course, it does $UNRELATED_FEATURE2 very well. It is also a hard to use $PRIMARY_FEATURE." }
Seriously, I would hope so, but have no idea.

Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.

Yes, I suppose it was. It has just never been pressing enough, so maybe we can just leave it be (as we have for 2 years). In the end, mentat-watchdog does its job.

Actions #15

Updated by Pavel Kácha almost 3 years ago

Radko Krkoš wrote in #note-14:

Pavel Kácha wrote in #note-12:

Radko Krkoš wrote in #note-11:

Also, the status of warden_filer must be monitored more deeply as debilitating errors that do not force the daemon to exit are still reported by systemctl as 'active (running)'. For example the recent certificate expiration:

Any ideas? Some errors are transient, some are permanent, and only the filer might sometimes know which are which.

I agree. Maybe the output of an ERROR line to the log is enough for detection from outside?

We can deduce whether an error is server-based or client-based from the query id - client errors have 0. In the case of server errors, HTTP 5XX are potentially transient (server-based, we know nothing), 4XX are permanent; however, it depends on the specific error and context whether it is event-based (the next query will potentially be OK) or daemon-based (the error will persist until admin interference). Similar goes for client-based errors - some can go away, some are permanent. I'm at my wits' end.
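
A rough sketch that only encodes the coarse heuristic above (client vs. server by query id, transient vs. permanent by HTTP class); the event-vs-daemon distinction is left out because, as noted, it depends on context:

def classify_error(query_id, http_status=None):
    # Per the comment above: client-side errors carry query id 0.
    if query_id == 0:
        return "client-side (may or may not clear on its own)"
    if http_status is not None and 500 <= http_status < 600:
        return "server-side, potentially transient"
    if http_status is not None and 400 <= http_status < 500:
        return "server-side, permanent until something changes"
    return "unknown"

# The expired-certificate failure above never reached the server, so the
# warden client reported it under query id 00000000:
print(classify_error(0))
# 12345 is a made-up query id, used only to illustrate the server-side branch:
print(classify_error(12345, http_status=503))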

I would definitely not go through the exit route. I just wanted to note that even if the unit does not really work, it is reported as if it were, which I consider a bug. From the Mentat point of view it might not be important, as the monitoring of daemons is not based on systemd in any way.
Anyway, I just wanted to note a related aspect that might be interesting one day, when this is dealt with.

From the point of view of systemd, the unit does work. When an apache site cert expires, apache runs happily ever after, and lennartd also spits rainbows. Even if the site is the sole purpose of the whole system, nobody cares - except maybe the admin, whose work is to put correct checks/processes in place on the application/contents level.

Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?

I am not sure. Then, in the case of non-recoverable errors (which an expired certificate is), there is nothing to do. Exiting there might have triggered systemd, making it easier to debug, perhaps?

Yup, that's what I meant; however, I'm not sure any (significant) project does it that way. Still, it does not solve the problem of exactly which errors to handbrake on.

Maybe this could be fixed in the unit settings?

Systemd can decide based on log contents?

I have no idea. It already does almost everything not related to daemon management, so maybe something like this also slipped in by accident? {This reminds me of: "The $PRODUCT does @UNRELATED_FEATURE1. Of course, it does $UNRELATED_FEATURE2 very well. It is also a hard to use $PRIMARY_FEATURE." }
Seriously, I would hope so, but have no idea.

Well, I've never heard of that, quick internet search yielded nuthin', however error classification problem still remains.

Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.

Yes, I suppose it was. It has just never been pressing enough, so maybe we can just leave it be (as we have for 2 years). In the end, mentat-watchdog does its job.

We have two checks able to spot this particular problem quite quickly (one before it happens) on production; I'd call it quits.

Actions #16

Updated by Radko Krkoš almost 3 years ago

Pavel Kácha wrote in #note-15:

Radko Krkoš wrote in #note-14:

I agree. Maybe the output of an ERROR line to the log is enough for detection from outside?

We can deduce whether an error is server-based or client-based from the query id - client errors have 0. In the case of server errors, HTTP 5XX are potentially transient (server-based, we know nothing), 4XX are permanent; however, it depends on the specific error and context whether it is event-based (the next query will potentially be OK) or daemon-based (the error will persist until admin interference). Similar goes for client-based errors - some can go away, some are permanent. I'm at my wits' end.

I meant something along these lines:
If at any point in time in the application you know enough to log an ERROR, then the administrator is probably interested in knowing. If the specific failure is recoverable, then it probably is not an error and should not be logged as such (4xx and 5xx, although classified as errors, can be reinterpreted in context of what the application is doing). In that case there is special code for the recovery procedure, so it is easily discernible.
I generally design the logging in this way. Of course if it is not the case (I believe it actually is), doing a redesign in this manner in Warden now might not be desirable (or at least should not be logged as a Mentat issue) and I do not advocate for that.
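
A minimal sketch of that convention; the exception classes are hypothetical and only stand in for whatever error classification the client would provide:

import logging

log = logging.getLogger("filer.sketch")

class TransientServerError(Exception):
    """Hypothetical: an error the next request may not hit again."""

class PermanentClientError(Exception):
    """Hypothetical: an error that persists until an admin intervenes."""

def fetch_events(client):
    try:
        return client.getEvents()
    except TransientServerError as exc:
        # Recoverable: simply retry on the next polling cycle, so no ERROR.
        log.warning("getEvents failed, will retry: %s", exc)
        return []
    except PermanentClientError as exc:
        # Not recoverable without admin action (e.g. an expired certificate).
        log.error("getEvents failed permanently: %s", exc)
        raise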

I would definitely not go through the exit route. I just wanted to note that even if the unit does not really work, it is reported as if it were, which I consider a bug. From the Mentat point of view it might not be important, as the monitoring of daemons is not based on systemd in any way.
Anyway, I just wanted to note a related aspect that might be interesting one day, when this is dealt with.

From the point of view of systemd, the unit does work. When an apache site cert expires, apache runs happily ever after, and lennartd also spits rainbows. Even if the site is the sole purpose of the whole system, nobody cares

OK, maybe my expectations of a daemon manager are grossly exaggerated. Then of course I have never relied on it in that manner.

except maybe the admin, whose work is to put correct checks/processes in place on the application/contents level.

And this ticket is exactly about that in my opinion. Of course there is the robustness/code-simplicity tradeoff (and we discussed that before), so the added checks might not be worth it.

Sure, this specific situation could be handled by warden lib/filer entirely by finding out cert is expired and refusing to use it, but is bailing out the right way?

I am not sure. Then, in the case of non-recoverable errors (which an expired certificate is), there is nothing to do. Exiting there might have triggered systemd, making it easier to debug, perhaps?

Yup, that's what I meant; however, I'm not sure any (significant) project does it that way. Still, it does not solve the problem of exactly which errors to handbrake on.

I do not think that can be decided in general. You have to go through the possible errors and analyze them in detail (which is done during design or implementation).

Maybe this could be fixed in the unit settings?

Systemd can decide based on log contents?

I have no idea. It already does almost everything not related to daemon management, so maybe something like this also slipped in by accident? {This reminds me of: "The $PRODUCT does @UNRELATED_FEATURE1. Of course, it does $UNRELATED_FEATURE2 very well. It is also a hard to use $PRIMARY_FEATURE." }
Seriously, I would hope so, but have no idea.

Well, I've never heard of that, quick internet search yielded nuthin', however error classification problem still remains.

Yeah, that is unfortunate. How that is better than SysVinit with grep over logs (or even without) eludes me.

Alternatively this can be ignored as such a situation is eventually detected by mentat-watchdog (no new events reach the DB).

That was one of the reasons to create this check I guess.

Yes, I suppose it was. It has just never been pressing enough, so maybe we can just leave it be (as we have for 2 years). In the end, mentat-watchdog does its job.

We have two checks able to spot this particular problem quite quickly (one before it happens) on production; I'd call it quits.

OK, this particular problem is worked around (or detected early) in production. I just added this as a note that relying on `systemd status` is not sufficient for health monitoring - something that was a bit surprising to me, but might be well known otherwise. I also know that the existing monitoring does not depend on `systemd status` or anything similar, so this is really just a note for future reference (I mean, I would be glad for such information if I ever had to solve an issue like this). I certainly did not expect any detailed discussion, especially as the issue is in the backlog.

Actions #17

Updated by Radko Krkoš almost 3 years ago

  • Related to Bug #7121: Spool dir is sometimes created with wrong privileges on start added
Actions #18

Updated by Pavel Kácha almost 2 years ago

  • Priority changed from High to Normal
Actions #19

Updated by Rajmund Hruška 11 months ago

  • Related to Feature #4218: Hawat: Improve system status view module added