Feature #6257

Review calculated statistics in timeline

Added by Pavel Kácha 3 months ago. Updated 26 days ago.

Status: New
Priority: Normal
Assignee:
Category: Design
Target version:
Start date: 03/09/2020
Due date:
% Done: 20%
Estimated time:
To be discussed:

Description

It may well be that some calculated columns/aggregations in the timeline are superfluous or not useful. A review is needed.


Related issues

Related to Mentat - Feature #4609: Arbitrary grouping and sorting in Events (Closed, 01/30/2019)

Blocked by Mentat - Feature #6308: Reimplement timeline calculations on database (Closed, 04/09/2020)

History

#1 Updated by Pavel Kácha 3 months ago

  • Related to Feature #4609: Arbitrary grouping and sorting in Events added

#2 Updated by Pavel Kácha 2 months ago

My thoughts; only a proposition, open for discussion.

  • Totals
    • One of key measures.
    • For timeline can be calculated relatively quickly (count(*) ... group
      by).
    • For statistician data it’s small simple value.
    • Keep.
  • Recurrences
    • I’m not sure if it’s useful in its current form. The pie chart is
      almost always 100%. Bars may show that at some specific time there was
      a peak in recurrent events, alas there’s no possibility of finding out
      which ones (apart from trying to correlate with columns on other graphs
      with possibly different scales and then drilling into Alerts searches).
    • As I understand it, it needs some consultation with the relapse
      cache during calculation, which might be costly.
    • If we accept classes as first-class citizens (pun not intended), partial
      information can be visible from the graph (sans source).
    • I incline towards ditching it.
  • Abuses
    • Now makes sense only for constituency admins (like us).
    • Usable for abuse team admins only after we have hierarchical networks.
    • Keep, we’ll need it in the future.
  • Analyzers
    • Names of analyzers are completely arbitrary, often cryptic.
    • Useful only for constituency admins (like us).
    • Ditch.
  • ASNs
    • I believe this is used very seldom.
    • Not sure if useful for security/forensics team work.
    • However nice in graphs...
    • ... if accompanied with company name.
    • Not sure.
  • Categories
    • Straightforward (one category).
    • Problem with ambiguous statistics (the sum is more than the number of events).
    • Personally I’ve never used it; I’ve always used category sets.
    • Calculable from Category sets statistics, if needed.
    • For my part, ditch.
  • Category sets
    • One of key measures.
    • Statistics sums up.
    • A useful possibility would be to merge some category sets, like ‘omit
      Test, group the rest’.
    • Keep.
  • Countries
    • Same as ASNs.
    • Isn’t this derivable from ASNs? (For stats precalculation.)
    • Not sure.
  • Detectors
    • One of key measures.
    • Keep.
  • Detector software
    • Similar to analyzers.
    • Moreover, most of the clients have only one SW behind them (except Filers, which we don’t care for dearly).
    • Ditch.
  • IPs
    • One of key measures.
    • Rename to Sources?
    • Keep.
  • Classes
    • Mmmkay.
    • Not convinced strongly, but let’s keep.
  • Severities
    • Also not sure if it’s useful in its current form. Bars may show that at
      some specific time there was a peak in events of some severity, alas
      there’s no possibility of finding out which ones (apart from trying to
      correlate with columns on other graphs with possibly different scales
      and drilling into Alerts searches).
    • Has anybody ever used these stats?
    • Ditch?

We don’t have:

  • Ports
    • Would be nice for spotting rise of attacks to specific ports.
    • Add?
  • Services
    • (Un)fortunately it’s partially orthogonal to ports.
    • Would be nice for spotting rise of attacks to specific services.
    • Add?
  • Targets
    • Doesn’t it make sense for network admins to see their hotspots?
    • Add?

#3 Updated by Pavel Kácha 2 months ago

  • To be discussed changed from No to Yes

#4 Updated by Jan Mach 2 months ago

Following Pavel’s recommendation, I have run an example query, including time measurement and profiling. The results are in the attached files.

Number of events:     1,498,994
Time interval:        Mar 31, 2020, 11:00:00 PM - Apr 1, 2020, 4:00:00 PM
Time interval delta:  17 hours
Timeline steps:       170 x 0:06:00

Query:
SELECT * FROM events AS "_mentatq(88_gkbyww)_" WHERE "detecttime" >= '2020-03-31T21:00:00+00:00'::timestamptz

Time marks:
search:preprocess_begin            
search:preprocess_end     0:00:00.000624 s    0%    
search:search_begin       0:00:00.000426 s    0%    
search:search_end         0:16:19.091261 s    62%    
search:postprocess_begin  0:00:00.030789 s    0%    
search:postprocess_end    0:10:06.621327 s    38%    

Total duration: 0:26:25.744427
Duration to rendering time: 0:26:27.138112

#5 Updated by Pavel Kácha about 2 months ago

  • Assignee changed from Pavel Kácha to Radko Krkoš

From today’s video meeting:

  • From the time marks it seems the DB query is long and the calculations are shorter. However, the ‘search’ part also comprises creating lite Idea objects (typedcols, ipranges) and some jpath accesses, which seem to take most of the data fetching part. So the possibilities are:
    • Somehow get rid of the object creation and abstraction layers; however, even though it might shortcut most of the processing, it would also sidestep most of the logic and go against the Hawat architecture.
    • Somehow coerce PostgreSQL to do the calculations itself and hand back just the final data.

So, Radko will try to come up with some queries calculating data for the graph and roughly estimate their performance; then we’ll evaluate next steps.

#6 Updated by Pavel Kácha about 2 months ago

Also, discussion of grouped categories appreciated.

Considering recurrence - a combined aggregation of IP + class might show an interesting toplist of recurrent events.

#7 Updated by Radko Krkoš about 2 months ago

The analytic queries have been developed according to the currently presented results, and their performance was assessed.
As a baseline, let’s select a plain query for the then-current day, 00:00:00 - 06:00:00. The current result is available here:
https://mentat-alt.cesnet.cz/mentat/timeline/search?dt_from=2020-04-08+00%3A00%3A00&dt_to=2020-04-08+06%3A00%3A00&source_addrs=&source_ports=&submit=Search

Hawat decided on 3-minute intervals and the calculation took 2 m 50 s (170 s).
To obtain the same results, we need several queries, the primary one being:
SELECT '2020-04-08 00:00:00'::timestamp + '3 mins'::interval
       * (-1 + width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-04-08 00:00:00'::timestamp,
                                 '2020-04-08 06:00:00',
                                 '3 mins'::interval) AS buckets))) AS bucket,
       COUNT(*)
FROM events
WHERE detecttime > '2020-04-08 00:00:00' AND detecttime < '2020-04-08 06:00:00'
GROUP BY bucket ORDER BY bucket;
  • took ~150ms.
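As a side note on the mechanics (a minimal illustration, not part of the measurements above): the array form of width_bucket() returns the number of thresholds less than or equal to the operand, which is why the main query subtracts 1 before multiplying by the interval to recover the bucket start.

```sql
-- width_bucket(value, thresholds) returns the 1-based bucket number,
-- i.e. how many array elements are <= value; 0 if below the first one.
SELECT x, width_bucket(x, ARRAY[10, 20, 30]) AS bucket
FROM (VALUES (5), (15), (25), (35)) AS t(x);
-- x = 5 -> 0, 15 -> 1, 25 -> 2, 35 -> 3
```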
For abuses, this can be trivially extended:
SELECT '2020-04-08 00:00:00'::timestamp + '3 mins'::interval
       * (-1 + width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-04-08 00:00:00'::timestamp,
                                 '2020-04-08 06:00:00',
                                 '3 mins'::interval) AS buckets))) AS bucket,
       events.cesnet_resolvedabuses AS abuses,
       COUNT(*)
FROM events
WHERE detecttime > '2020-04-08 00:00:00' AND detecttime < '2020-04-08 06:00:00'
GROUP BY bucket, abuses ORDER BY bucket;
  • took ~800ms.
For multivalue arrays, unnest() can be used like this:
SELECT '2020-04-08 00:00:00'::timestamp + '3 mins'::interval
       * (-1 + width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-04-08 00:00:00'::timestamp,
                                 '2020-04-08 06:00:00',
                                 '3 mins'::interval) AS buckets))) AS bucket,
       unnest(events.category) AS cat,
       COUNT(*)
FROM events
WHERE detecttime > '2020-04-08 00:00:00' AND detecttime < '2020-04-08 06:00:00'
GROUP BY bucket, cat ORDER BY bucket;
  • took ~800ms.
Now for 1 month with 8-hour intervals (as a scaling test; this would not be practical with the current implementation):
SELECT '2020-03-01 00:00:00'::timestamp + '8 hours'::interval
       * (-1 + width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-03-01 00:00:00'::timestamp,
                                 '2020-04-01 00:00:00',
                                 '8 hours'::interval) AS buckets))) AS bucket,
       COUNT(*)
FROM events
WHERE detecttime > '2020-03-01 00:00:00' AND detecttime < '2020-04-01 00:00:00'
GROUP BY bucket ORDER BY bucket;
  • took ~6s.
Another column:
SELECT '2020-03-01 00:00:00'::timestamp + '8 hours'::interval
       * (-1 + width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-03-01 00:00:00'::timestamp,
                                 '2020-04-01 00:00:00',
                                 '8 hours'::interval) AS buckets))) AS bucket,
       cesnet_resolvedabuses AS abuses,
       COUNT(*)
FROM events
WHERE detecttime > '2020-03-01 00:00:00' AND detecttime < '2020-04-01 00:00:00'
GROUP BY bucket, abuses ORDER BY bucket;
  • took ~15s.
Multivalue arrays:
SELECT '2020-03-01 00:00:00'::timestamp + '8 hours'::interval
       * (-1 + width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-03-01 00:00:00'::timestamp,
                                 '2020-04-01 00:00:00',
                                 '8 hours'::interval) AS buckets))) AS bucket,
       unnest(category) AS cat,
       COUNT(*)
FROM events
WHERE detecttime > '2020-03-01 00:00:00' AND detecttime < '2020-04-01 00:00:00'
GROUP BY bucket, cat ORDER BY bucket;
  • took ~17s.
  • Remark: The run times also depend on the number of distinct array values (the aggregate dimensions).

The queries are parametrized by four values: the start and end times (user input), the interval length (presumably already determined by some magic), and the optional additional column. Based on the runtimes, it should be practical to compute the results per screen on user request.
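One way the parametrization might look, sketched as a prepared statement (the name timeline and the exact form are illustrative assumptions, not the actual implementation; the optional extra column would still have to be spliced into the query text per request):

```sql
-- Illustrative sketch: $1 = start time, $2 = end time (user input),
-- $3 = interval length.
PREPARE timeline (timestamp, timestamp, interval) AS
SELECT $1 + $3 * (-1 + width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series($1, $2, $3) AS buckets))) AS bucket,
       COUNT(*)
FROM events
WHERE detecttime > $1 AND detecttime < $2
GROUP BY bucket;

EXECUTE timeline ('2020-04-08 00:00:00', '2020-04-08 06:00:00', '3 mins');
```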

If just an interval index since the beginning is required (depending on how the graphing is implemented), the query can be simplified as follows:

SELECT width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-04-08 00:00:00'::timestamp,
                                 '2020-04-08 06:00:00',
                                 '6 minutes'::interval) AS buckets)) AS bucket,
       COUNT(*)
FROM events
WHERE detecttime > '2020-04-08 00:00:00' AND detecttime < '2020-04-08 06:00:00'
GROUP BY bucket ORDER BY bucket;

This version is somewhat faster due to simpler sorting.

Also, calculating inside the DB would help to get rid of yet another incarnation of the timezone bug. At least for me, the graph looks like the attachment: the bins are shifted 2 hours into the future relative to the data obtained.
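The observed shift is consistent with a timestamptz value being rendered in local time somewhere along the way; a minimal illustration (assuming a CEST (+2) session timezone, not taken from the ticket):

```sql
-- Casting timestamptz to a naive timestamp converts to the session
-- timezone, while AT TIME ZONE 'UTC' keeps the UTC wall-clock time.
SET timezone = 'Europe/Prague';
SELECT '2020-04-08 00:00:00+00'::timestamptz::timestamp AS local_naive,
       '2020-04-08 00:00:00+00'::timestamptz AT TIME ZONE 'UTC' AS utc_naive;
-- local_naive: 2020-04-08 02:00:00, utc_naive: 2020-04-08 00:00:00
```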

#8 Updated by Pavel Kácha about 2 months ago

Just a note - is the ORDER BY clause really necessary? I believe the result for the frontend will be a dict of dicts, so the order of data processing doesn’t matter.

#9 Updated by Radko Krkoš about 2 months ago

Pavel Kácha wrote:

Just a note - is the ORDER BY clause really necessary? I believe the result for the frontend will be a dict of dicts, so the order of data processing doesn’t matter.

Yes, I wanted to discuss that but forgot.
My thinking is also that it should be left out. Also, the bucket column is currently calculated as a timestamp; if just a bucket index were enough, that calculation could be left out too, shaving off some time.
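Putting both suggestions together, the stripped-down query might look like this (a sketch; the frontend could map index i back to start + i * step if a timestamp is ever needed):

```sql
-- No ORDER BY and no timestamp reconstruction; just bucket index + count.
SELECT width_bucket(detecttime,
           (SELECT array_agg(buckets)
            FROM generate_series('2020-04-08 00:00:00'::timestamp,
                                 '2020-04-08 06:00:00',
                                 '6 minutes'::interval) AS buckets)) AS bucket,
       COUNT(*)
FROM events
WHERE detecttime > '2020-04-08 00:00:00' AND detecttime < '2020-04-08 06:00:00'
GROUP BY bucket;
```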

#10 Updated by Pavel Kácha about 2 months ago

Radko Krkoš wrote:

Pavel Kácha wrote:

Just a note - is the ORDER BY clause really necessary? I believe the result for the frontend will be a dict of dicts, so the order of data processing doesn’t matter.

Yes, I wanted to discuss that but forgot.
My thinking is also that it should be left out. Also, the bucket column is currently calculated as a timestamp; if just a bucket index were enough, that calculation could be left out too, shaving off some time.

Okay, I’ll leave the decisions to your and Mek’s discussion during implementation.

#11 Updated by Pavel Kácha about 2 months ago

Also, somewhere in the course of implementing #6308 (before or after), we should incorporate the changes/removals/additions from #6257#note-2.

#12 Updated by Radko Krkoš about 2 months ago

  • Assignee changed from Radko Krkoš to Pavel Kácha

After discussion, now that I understand it more:

Pavel Kácha wrote:

  • Totals
    • One of key measures.
    • For timeline can be calculated relatively quickly (count(*) ... group
      by).
    • For statistician data it’s small simple value.
    • Keep.

Definitely keep, this is the first level of any statistics.
As for the pie (doughnut) chart, I would either leave it out (it is always 100%) or graph the individual time slots in it, displaying the relative distribution of events over the time period; that must be done with fewer slots, as a hundred is too many.

Pavel Kácha wrote:

  • Recurrences
    • I’m not sure if it’s useful in its current form. The pie chart is
      almost always 100%. Bars may show that at some specific time there was
      a peak in recurrent events, alas there’s no possibility of finding out
      which ones (apart from trying to correlate with columns on other graphs
      with possibly different scales and then drilling into Alerts searches).
    • As I understand it, it needs some consultation with the relapse
      cache during calculation, which might be costly.
    • If we accept classes as first-class citizens (pun not intended), partial
      information can be visible from the graph (sans source).
    • I incline towards ditching it.

I am not sure how to calculate that, not even algorithmically, and surely not in the DB, so I would be for ditching it as well.

Pavel Kácha wrote:

  • Abuses
    • Now makes sense only for constituency admins (like us).
    • Usable for abuse team admins only after we have hierarchical networks.
    • Keep, we’ll need it in the future.
  • Analyzers
    • Names of analyzers are completely arbitrary, often cryptic.
    • Useful only for constituency admins (like us).
    • Ditch.

I incline towards keeping them; they might be useful for some Warden-related outputs, Warden client cleanup work and such.

Pavel Kácha wrote:

  • ASNs
    • I believe this is used very seldom.
    • Not sure if useful for security/forensics team work.
    • However nice in graphs...
    • ... if accompanied with company name.
    • Not sure.
  • Categories
    • Straightforward (one category).
    • Problem with ambiguous statistics (the sum is more than the number of events).
    • Personally I’ve never used it; I’ve always used category sets.
    • Calculable from Category sets statistics, if needed.
    • For my part, ditch.

We discussed the “inverse Aristotelian dissonance”, and the calculation should be changed to the relative portion of the whole, therefore not inflating the total count. The result is then something that was requested in SABU. I would keep this, with the changes.
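A sketch of how the relative calculation could be done in the DB (schema and time bounds as in the earlier queries; the per-bucket event total is computed before unnest(), so each category’s share is relative to the number of events and multivalued events no longer inflate the sum):

```sql
WITH binned AS (
    SELECT width_bucket(detecttime,
               (SELECT array_agg(b)
                FROM generate_series('2020-04-08 00:00:00'::timestamp,
                                     '2020-04-08 06:00:00',
                                     '3 mins'::interval) AS b)) AS bucket,
           category
    FROM events
    WHERE detecttime > '2020-04-08 00:00:00'
      AND detecttime < '2020-04-08 06:00:00'
),
totals AS (
    SELECT bucket, COUNT(*) AS total FROM binned GROUP BY bucket
)
SELECT b.bucket, cat, COUNT(*)::numeric / t.total AS share
FROM binned AS b
CROSS JOIN LATERAL unnest(b.category) AS cat
JOIN totals AS t ON t.bucket = b.bucket
GROUP BY b.bucket, cat, t.total;
```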

Pavel Kácha wrote:

  • Category sets
    • One of key measures.
    • Statistics sums up.
    • A useful possibility would be to merge some category sets, like ‘omit
      Test, group the rest’.
    • Keep.

While adding conditions like that is trivial, I am not sure how to represent it in the UI. The Test category is special; maybe have another view without all the Test events? That could be useful for decisions about migrating Warden clients from test, or how much effort to allocate to them.

Pavel Kácha wrote:

  • Countries
    • Same as ASNs.
    • Isn’t this derivable from ASNs? (For stats precalculation.)
    • Not sure.

This looks nice in presentations. Then again, there were cases of geographic filtering on repackaged Warden sources in the past, so it might be (heavily) skewed; the utility is therefore questionable.

Pavel Kácha wrote:

  • Detectors
    • One of key measures.
    • Keep.
  • Detector software
    • Similar to analyzers.
    • Moreover, most of the clients have only one SW behind them (except Filers, which we don’t care for dearly).
    • Ditch.
  • IPs
    • One of key measures.
    • Rename to Sources?
    • Keep.

I agree with the rename, especially if the Target IP statistics would be added (and I agree with that).

Pavel Kácha wrote:

  • Classes
    • Mmmkay.
    • Not convinced strongly, but let’s keep.
  • Severities
    • Also not sure if it’s useful in its current form. Bars may show that at
      some specific time there was a peak in events of some severity, alas
      there’s no possibility of finding out which ones (apart from trying to
      correlate with columns on other graphs with possibly different scales
      and drilling into Alerts searches).
    • Has anybody ever used these stats?
    • Ditch?

This could be useful as a high level information in a manager graph somewhere.

Pavel Kácha wrote:

We don’t have:

  • Ports
    • Would be nice for spotting rise of attacks to specific ports.
    • Add?

I am for adding this.

Pavel Kácha wrote:

  • Services
    • (Un)fortunately it’s partially orthogonal to ports.
    • Would be nice for spotting rise of attacks to specific services.
    • Add?

True as far as the orthogonality argument goes, but it might be useful as well. It could also help with analysis if any discrepancies between ports and services exist, leading to improvements to the Warden clients and therefore to the system as a whole.

Pavel Kácha wrote:

  • Targets
    • Doesn’t it make sense for network admins to see their hotspots?
    • Add?

Definitely.

As a whole, I see the statistics and timelines as a useful tool. The only changes I would ask for are to present counts relative to the whole, instead of inflating the total count, as was discussed. The DB-based way of calculation is quite efficient, even for attributes with extreme value ranges (source/target IP arrays); there is a lot of room for expansion of this feature.

#13 Updated by Pavel Kácha about 1 month ago

  • Blocked by Feature #6308: Reimplement timeline calculations on database added

#14 Updated by Pavel Kácha about 1 month ago

Let’s postpone the discussion until after finishing #6308, as it involves adding some columns to the flat data table.

#15 Updated by Pavel Kácha 26 days ago

  • To be discussed deleted (Yes)
