Bug #7759

closed

Reporter doesn't update thresholds in some cases

Added by Rajmund Hruška about 2 months ago. Updated 19 days ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Development - Core
Target version:
Start date:
07/12/2024
Due date:
% Done:

0%

Estimated time:
To be discussed:
No

Description

If some events are already thresholded, the TTL for events with the same source and event class will not be updated.


Files

relapsed.txt (8.9 KB) Rajmund Hruška, 07/18/2024 11:12 AM

Related issues

Related to Mentat - Bug #7775: Event aggregation in reports seems broken (recurrence mechanism) - New - Jakub Judiny - 09/06/2024

Actions #1

Updated by Rajmund Hruška about 2 months ago

  • Subject changed from Reporter doesn't update thresholds in some cases to Reporter always reports relapsed events
  • Description updated (diff)
  • Status changed from In Progress to New
  • Assignee deleted (Rajmund Hruška)
Actions #3

Updated by Pavel Kácha about 2 months ago

An attempt to describe the reporting mechanism is here.

Attempt in my words:

  • When a report is sent for a particular severity:class:ip, it gets a record in the thresholds table. All subsequent events during the threshold period are not reported, but are recorded in the events_thresholded table.
  • When no events arrive during the relapse period (which starts some time before the end of the threshold period and ends along with it), all the corresponding events are silently flushed from the events_thresholded table after the threshold period ends.
  • When some events do arrive during the relapse period, all the corresponding thresholded events are reported on the first reporter run after the threshold period, and a new threshold period should start.
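
The three rules above can be sketched roughly as follows. This is only an illustration, not Mentat's actual code: the Threshold class and both function names are hypothetical, the field names are borrowed from the thresholds table columns, and the exact boundary handling (inclusive or exclusive) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Threshold:
    # Hypothetical stand-in for one row of the thresholds table.
    thresholdtime: datetime  # report was sent, threshold period starts
    relapsetime: datetime    # relapse period starts
    ttltime: datetime        # threshold and relapse periods both end


def on_event(threshold, detect_time):
    """Decide what happens to a newly arrived event."""
    if threshold is None or detect_time >= threshold.ttltime:
        return "report"      # no active threshold: send a report
    return "threshold"       # store in events_thresholded instead


def on_threshold_expired(threshold, thresholded_times):
    """Decide what the first reporter run after ttltime does."""
    relapsed = any(
        threshold.relapsetime <= t < threshold.ttltime
        for t in thresholded_times
    )
    # Relapse: report the thresholded events and start a new threshold.
    # Otherwise: silently flush them from events_thresholded.
    return "report_relapse" if relapsed else "flush"
```

The point under review is the "report_relapse" branch: it has to create a fresh threshold record as well as send the report.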

It seems we need to review whether this algorithm is indeed implemented correctly (first idea: is re-thresholding forgotten when thresholded events are reported?).

Actions #4

Updated by Pavel Kácha about 2 months ago

  • Assignee set to Jakub Judiny
Actions #5

Updated by Jakub Judiny about 2 months ago

  • Subject changed from Reporter always reports relapsed events to Reporter doesn't update thresholds in some cases
  • Description updated (diff)
  • Status changed from New to In Progress
  • Target version changed from 2.13.1 to Backlog
Actions #6

Updated by Jakub Judiny about 2 months ago

  • Status changed from In Progress to Resolved
Actions #7

Updated by Jakub Judiny about 2 months ago

  • Status changed from Resolved to In Progress
  • Target version changed from Backlog to 2.13.2
Actions #8

Updated by Jakub Judiny about 1 month ago

  • Status changed from In Progress to Resolved
Actions #9

Updated by Jakub Judiny about 1 month ago

In the case described by the relapsed.txt file, some events were thresholded and then reported as a relapse a few seconds later. That means the thresholding period was still active - and while thresholding is still active, the caching mechanism ensures that the thresholding time will not be set again (because it was already set in this reporter run).

The problem was simple - events were reported as relapsed even when the current time was equal to the end of the relapse period. Combined with the caching mechanism described above, this caused the problem described in this issue. So I changed it to report events as relapsed only AFTER the thresholding (relapse) period is over (< instead of <=).
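
A minimal sketch of the boundary change, for illustration only. The function names and the timestamp are hypothetical, and the operand order in the real reporter code may differ; only the switch from <= to < is taken from the note above.

```python
from datetime import datetime


def relapse_due_old(ttltime, now):
    # Before the fix: a run exactly at ttltime already counts as "over".
    return ttltime <= now


def relapse_due_new(ttltime, now):
    # After the fix: the thresholding period must be strictly over.
    return ttltime < now


# Illustrative timestamp: the reporter runs at the exact end of the period.
end = datetime(2024, 7, 12, 2, 20)
print(relapse_due_old(end, end))  # True
print(relapse_due_new(end, end))  # False
```

With the strict comparison, a run that lands exactly on the end of the period leaves the relapse for a later run.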

Actions #10

Updated by Jakub Judiny about 1 month ago

So before my change, if the thresholding period ended at 2:20 and the reporter script was called at 2:20, events were thresholded and then reported a few seconds later as a relapse.

Now, the events will be thresholded (2:20) and reported in the next reporter run (4:20 for medium severity). Caching will work correctly in this case, because the thresholding will be long over, when the relapse is reported.

Actions #11

Updated by Rajmund Hruška about 1 month ago

  • Status changed from Resolved to Feedback

Jakub Judiny wrote in #note-10:

Now, the events will be thresholded (2:20) and reported in the next reporter run (4:20 for medium severity). Caching will work correctly in this case, because the thresholding will be long over, when the relapse is reported.

I think those events would be deleted before they could be reported.

Also, somewhat unrelated: the reporter only takes the events from the relapse period, but it should take all the thresholded events (even those from before the start of the relapse period) if any event arrives during the relapse period.
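
The selection proposed above could look roughly like this. It is a sketch under assumptions: events_to_report is a hypothetical helper, events are reduced to their timestamps, and the interval boundaries are a guess rather than the reporter's real query.

```python
from datetime import datetime


def events_to_report(event_times, relapsetime, ttltime):
    """Return every thresholded event if at least one falls in the relapse period."""
    relapsed = any(relapsetime <= t < ttltime for t in event_times)
    # A single event in the relapse period pulls in the earlier ones too.
    return sorted(event_times) if relapsed else []
```

For example, with one event before the relapse period and one inside it, both timestamps are returned, not just the later one.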

Actions #12

Updated by Rajmund Hruška about 1 month ago

  • Status changed from Feedback to In Review
Actions #14

Updated by Jakub Judiny about 1 month ago

Pavel Kácha wrote in #note-13:

Another example. Is it the same issue?

https://mentat-hub.cesnet.cz/mentat/events/search?dt_from=2024-07-04+05%3A26%3A26&dt_to=&source_addrs=158.194.5.37&source_ports=&groups=abuse%40upol.cz&not_groups=&not_protocols=&description=&categories=Test&not_categories=True&not_severities=&classes=vulnerable-config-ipmi&not_classess=&submit=Hledat

https://mentat-hub.cesnet.cz/mentat/reports/205924/show
https://mentat-hub.cesnet.cz/mentat/reports/206007/show (relapse)
https://mentat-hub.cesnet.cz/mentat/reports/206072/show (relapse)
https://mentat-hub.cesnet.cz/mentat/reports/206093/show (NO relapse)
https://mentat-hub.cesnet.cz/mentat/reports/206183/show (NO relapse)
https://mentat-hub.cesnet.cz/mentat/reports/206285/show (NO relapse)

Yes, this looks like the same issue - the third report did not prolong the thresholding period, because of this interval problem. So the fourth report should not have been created. Fifth and sixth reports are OK, because the events did not arrive during the relapse period (they arrived after the thresholding period ended).

Actions #15

Updated by Rajmund Hruška about 1 month ago

Seems to be working well on mentat-alt - https://mentat-alt.cesnet.cz/mentat/reports/211576/show.

It also created a new threshold record:

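                   id                    |    thresholdtime    |     relapsetime     |       ttltime
-----------------------------------------+---------------------+---------------------+---------------------
 vulnerable-config-xxx+++195.113.xxx.xxx | 2024-08-06 01:30:00 | 2024-08-10 01:30:00 | 2024-08-12 01:30:00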

Actions #16

Updated by Rajmund Hruška 19 days ago

  • Status changed from In Review to Closed
Actions #17

Updated by Jakub Judiny 1 day ago

  • Related to Bug #7775: Event aggregation in reports seems broken (recurrence mechanism) added