Bug #7121

closed

Spool dir is sometimes created with wrong privileges on start

Added by Pavel Kácha about 3 years ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Category: Development - Core
Target version:
Start date: 03/11/2021
Due date:
% Done: 100%
Estimated time:
To be discussed:

Description

After a cold start (after reboot), when /var/mentat/spool is empty, the mentat-enricher.py directory was created with the wrong ownership: root:root.

All the other directories were fine (mentat:mentat).

This causes an outage at startup, because the previous daemon in the queue (mentat-inspector-b in our case) cannot output its events.
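The symptom described above can be checked for with a short script. The following is only a minimal sketch, not part of Mentat: the helper name `find_misowned` is hypothetical, and it assumes every queue directory sits directly under the spool directory and should share one owner and group.

```python
from pathlib import Path


def find_misowned(spool_dir, expected_uid, expected_gid):
    """Return (path, uid, gid) for each subdirectory with unexpected ownership.

    Hypothetical helper for illustration; only direct children are inspected.
    """
    wrong = []
    for entry in sorted(Path(spool_dir).iterdir()):
        if not entry.is_dir():
            continue
        st = entry.stat()
        if (st.st_uid, st.st_gid) != (expected_uid, expected_gid):
            wrong.append((str(entry), st.st_uid, st.st_gid))
    return wrong
```

On a deployed system the expected IDs could be looked up with `pwd.getpwnam("mentat").pw_uid` and `grp.getgrnam("mentat").gr_gid`, and the function called on /var/mentat/spool right after boot.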


Related issues

Related to Mentat - Feature #4447: System status monitor should also report on warden-filer status (New, 11/18/2018)

Related to Mentat - Config #4723: Access permisions prevent warden-filer start after system reboot (Closed, Pavel Kácha, 02/08/2019)

Actions #1

Updated by Radko Krkoš about 3 years ago

  • Related to Feature #4447: System status monitor should also report on warden-filer status added
Actions #2

Updated by Radko Krkoš about 3 years ago

  • Related to Config #4723: Access permisions prevent warden-filer start after system reboot added
Actions #3

Updated by Jan Mach over 2 years ago

  • Category set to Development - Core
  • Status changed from New to In Progress
  • Target version changed from Backlog to 2.9
Actions #4

Updated by Jan Mach over 2 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100
  • To be discussed changed from No to Yes

I was unable to replicate the problem locally, so I have chosen a different approach to fix this bug:

  • I have enforced that the queue work directories are created with the correct user/group ownership and permissions, using chown and chmod.
  • I have enhanced the logging around the creation of all queue work directories, so that if this happens again we can investigate it better. There is an intentionally unhandled exception with a traceback to help us locate the source of the problem. Both EUID and EGID are logged.
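The two points above could look roughly like the sketch below. This is not the actual patch: the function name `ensure_queue_dir`, the logger name, and the default mode `0o750` are all assumptions made for illustration. The directory is created, its ownership and mode are then set explicitly (so the result does not depend on the umask or on who started the daemon), and the effective IDs are logged first.

```python
import logging
import os
import stat

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("queue-dirs")


def ensure_queue_dir(path, uid, gid, mode=0o750):
    """Create a queue work directory and force its ownership and mode.

    Returns (uid, gid, mode) as read back from the filesystem.
    The default mode is an assumption, not a value from the patch.
    """
    logger.info(
        "Preparing queue dir %s (EUID=%d, EGID=%d)",
        path, os.geteuid(), os.getegid(),
    )
    os.makedirs(path, exist_ok=True)
    try:
        os.chown(path, uid, gid)
        os.chmod(path, mode)
    except OSError:
        # Log the traceback, then re-raise so the failure stays visible.
        logger.exception("Cannot set ownership/mode on %s", path)
        raise
    st = os.stat(path)
    return st.st_uid, st.st_gid, stat.S_IMODE(st.st_mode)
```

Re-raising after logging keeps the failure loud, in the spirit of the intentional unhandled exception mentioned above. Note that changing ownership to a different user generally requires root privileges.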

Unless someone can think of something else to help us cover our a**es, I suggest we merge this ASAP into the devel branch and deploy it to mentat-alt, so we start using it in a live environment and hopefully catch the next occurrence of this problem.

We might consider this bug resolved and close the task until the problem emerges again. In that case I would gather as much information as possible, including the relevant log lines, and create a new issue.

Actions #5

Updated by Pavel Kácha over 2 years ago

From today's meeting:

As this is not replicable on the dev environment, please use mentat-alt and try to hunt it down on real iron. (Do the first reboot, or a couple of them, without your patches to confirm it is replicable there.)

Actions #6

Updated by Jan Mach over 2 years ago

As per our agreement, I tried to reproduce the bug on mentat-alt using multiple restarts. I was not able to reproduce it either before or after updating the code with the attached patch. The system booted up correctly and both Mentat and the Warden client launched perfectly every time (with the exception of the first try, when a minor bug in the patch prevented mentat-storage from starting, but that was not related to the original problem).

If this bug reappears in the future, the enhanced logging might give us a better understanding of the problem, but at the moment I am not sure what to do next with this issue.

Actions #7

Updated by Pavel Kácha over 2 years ago

So let's set to deferred and reopen if issue reappears?

Actions #8

Updated by Jan Mach over 2 years ago

Pavel Kácha wrote in #note-7:

So let's set to deferred and reopen if issue reappears?

Your call. I suggest closing it, because I feel optimistic. Enforcing the queue directory ownership should work. If the issue reappears, we can try to gather more evidence and log information, file a new issue and link it back to this one. If we just set it to deferred, we will keep pushing it in front of us for god knows how long.

Actions #9

Updated by Pavel Kácha over 2 years ago

  • Status changed from Feedback to Closed

Jan Mach wrote in #note-8:

Pavel Kácha wrote in #note-7:

So let's set to deferred and reopen if issue reappears?

Your call. I suggest closing it, because I feel optimistic. Enforcing the queue directory ownership should work. If the issue reappears, we can try to gather more evidence and log information, file a new issue and link it back to this one. If we just set it to deferred, we will keep pushing it in front of us for god knows how long.

No hard opinion. It's merged and deployed, closing then.

Actions #10

Updated by Pavel Kácha over 2 years ago

  • To be discussed deleted (Yes)
Actions #11

Updated by Pavel Kácha almost 2 years ago

  • Status changed from Closed to New
  • Assignee changed from Jan Mach to Rajmund Hruška
  • Target version changed from 2.9 to Backlog

It seems the issue still persists: it happened at least on 2022-07-19, somewhere between 11:00 and 12:00, when the enricher was unable to create the /var/mentat/spool/mentat-enricher.py directory.

Actions #12

Updated by Pavel Kácha about 1 month ago

  • Status changed from New to Closed

No occurrence since; let's reopen if it creeps back.
