Feature #6308


Reimplement timeline calculations on database

Added by Pavel Kácha almost 4 years ago. Updated almost 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Development - Core
Target version:
Start date:
04/09/2020
Due date:
% Done:

100%

Estimated time:
To be discussed:
No

Description

Stemming from #6257, let's reimplement the calculation of graphs/pie charts in the database, instead of fetching all the data and crunching it in Python.

For a start, all the necessary queries should be implemented and run sequentially to produce more or less identical data for the current frontend. More invasive changes (switching to event-related percentage values, switching to separate calculations) will be done later.

Radko, please prepare the necessary queries, then hand the ticket over to Mek.


Files


Related issues

Related to Mentat - Feature #6332: Improve searching with caching and JavaScript (Feedback, Jakub Maloštik, 07/08/2020 - 07/08/2020)

Related to Mentat - Feature #6413: Autarkic DB queries for timeline (Closed, Jakub Maloštik, 07/07/2020)

Blocks Mentat - Feature #6257: Review calculated statistics in timeline (Closed, Jakub Maloštik, 03/09/2020)

Actions #1

Updated by Pavel Kácha almost 4 years ago

After discussion: it would be preferable to keep the DB query as simple as necessary and leave the pie/sum calculations to Python, as high hundreds/low thousands of calculations will presumably be reasonably cheap. This could of course be reevaluated based on results.

Actions #2

Updated by Radko Krkoš almost 4 years ago

  • Assignee changed from Radko Krkoš to Jan Mach

The general query, covering all cases discussed in #6257 as well as the statistician, is as follows:

SELECT
    {since} + {interval} * (width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) - 1) AS bucket, /* if more than one bucket and time-based data required; leave out for statistician */
    width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) AS bucket, /* if more than one bucket and bucket index only is required - a bit faster; leave out for statistician */
    {column} AS set, /* if second dimension is used */
    unnest({column}) AS set, /* if second dimension is used and based on array values */
    lower(unnest({column})) AS set, /* to work around case issues, but this should be fixed on import - remember P2D2? */
    COUNT(*)
FROM events WHERE
    detecttime > {since} AND detecttime < {until}
GROUP BY
    bucket, /* leave out for statistician */
    set /* if second dimension is used */
ORDER BY bucket /* if ordered set is important (i.e. a dict is not built from the results), otherwise leave out - a bit faster */
;

The {}-quoted strings should be replaced by parameters: {since} and {until} are user input, {interval} is calculated somehow, and {column} is the second-dimension column. Beware: {column} must be a psycopg2.sql.Identifier() and cannot be supplied as an ordinary query parameter at execution time; it has to be passed to psycopg2.sql.SQL.format().
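
For illustration, here is a minimal sketch (my assumption, not the actual Mentat code) of how such a query could be composed with psycopg2.sql; the column name "source_ip", the connection DSN and the parameter values are illustrative placeholders:

# Minimal sketch, assuming the general query above; names and DSN are illustrative.
import datetime

import psycopg2
from psycopg2 import sql

query = sql.SQL("""
    SELECT
        {since} + {interval} * (width_bucket(detecttime,
            (SELECT array_agg(buckets)
             FROM generate_series({since}, {until}, {interval}) AS buckets)) - 1) AS bucket,
        unnest({column}) AS set,
        COUNT(*)
    FROM events
    WHERE detecttime > {since} AND detecttime < {until}
    GROUP BY bucket, set
""").format(
    column=sql.Identifier("source_ip"),   # identifiers must go through SQL.format()
    since=sql.Placeholder("since"),       # plain values are bound later by execute()
    until=sql.Placeholder("until"),
    interval=sql.Placeholder("interval"),
)

with psycopg2.connect("dbname=mentat_events") as conn, conn.cursor() as cur:
    cur.execute(query, {
        "since": datetime.datetime(2020, 4, 9),
        "until": datetime.datetime(2020, 4, 10),
        "interval": datetime.timedelta(hours=1),
    })
    rows = cur.fetchall()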

So, the specializations are as follows.
The simplest case, for the statistician:

SELECT
    COUNT(*)
FROM events WHERE
    detecttime > {since} AND detecttime < {until}
;

The time-binned counts for graphing, with timestamps:

SELECT
    {since} + {interval} * (width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) - 1) AS bucket,
    COUNT(*)
FROM events WHERE
    detecttime > {since} AND detecttime < {until}
GROUP BY
    bucket
;

Any additional dimension for aggregation has to be specified both in the column specifier (including any decomposition using unnest()) and in the GROUP BY part (so naming it with AS is advised).

Actions #3

Updated by Radko Krkoš almost 4 years ago

  • To be discussed changed from No to Yes

One additional remark: for attributes with very large value ranges, consider adding a HAVING clause limiting the results to values with more than one occurrence. For example, for the source IPs:

SELECT
    {since} + {interval} * (width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) - 1) AS bucket,
    unnest(events.source_ip) AS src,
    COUNT(*)
FROM events WHERE
    detecttime > {since} AND detecttime < {until}
GROUP BY bucket, src
HAVING COUNT(*) > 1;

I find such filtering quite useful as far as frequent offenders go. It would also tremendously limit the amount of data returned, so the performance benefit might be substantial for longer timeframes.

Actions #4

Updated by Pavel Kácha almost 4 years ago

Radko Krkoš wrote:

One additional remark: for attributes with very large value ranges, consider adding a HAVING clause limiting the results to values with more than one occurrence. For example, for the source IPs:
[...]

I find such filtering quite useful as far as frequent offenders go. It would also tremendously limit the amount of data returned, so the performance benefit might be substantial for longer timeframes.

Or sort and limit at some arbitrary cutoff, as the current implementation does?

Actions #5

Updated by Jan Mach almost 4 years ago

  • Target version changed from 2.8 to 2.7
Actions #6

Updated by Pavel Kácha almost 4 years ago

Pavel Kácha wrote:

Radko Krkoš wrote:

One additional remark: for attributes with very large value ranges, consider adding a HAVING clause limiting the results to values with more than one occurrence. For example, for the source IPs:
[...]

I find such filtering quite useful as far as frequent offenders go. It would also tremendously limit the amount of data returned, so the performance benefit might be substantial for longer timeframes.

Or sort and limit at some arbitrary cutoff, as the current implementation does?

The current way is to limit the results to some arbitrary number for high-cardinality attributes (like IPs). Radko, could you please look into the possibility of doing this cutoff in the DB? Otherwise we'll end up fetching large amounts of data from the DB again (and parsing them, albeit more cheaply than full Idea events).

Actions #7

Updated by Pavel Kácha almost 4 years ago

Also, Mek noted that there are attributes which are not in the flat table, so we are not able to do DB calculations for those. OK, let's just use what we have; we'll discuss adding the others as necessary/feasible after the first iterations.

Actions #8

Updated by Radko Krkoš almost 4 years ago

Pavel Kácha wrote:

Radko, could you please look into the possibility of doing this cutoff in the DB? Otherwise we'll end up fetching large amounts of data from the DB again (and parsing them, albeit more cheaply than full Idea events).

The best identified way to approach this problem is by extending the query as follows:

SELECT dist.bucket AS bucket, dist.set AS set, dist.count AS count
FROM
    (/* here comes the original query */) AS dist
INNER JOIN
    (SELECT unnest(events.{column}) AS set, COUNT(*) FROM events WHERE detecttime > {since} AND detecttime < {until} GROUP BY set ORDER BY COUNT(*) DESC LIMIT {topN}) AS toplist
USING (set);

This adds a new variable, topN; its meaning is hopefully obvious.

The presented solution is the straightforward extension, but it gives an execution plan that is optimal for the work needed. I can look into doing the calculation in a single pass, but that would lead to quite a complicated query. The performance comes to ~2 s for the 6-hour scenario and ~32 s for the one-month scenario (double the time, as two passes over the data are made); not extra low, but probably still acceptable. The obvious optimizations, like moving the sorting or computations outside the original SELECT, have negligible performance impact, therefore simplicity should win here.
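
As a minimal sketch (my assumption, not the actual Mentat code) of how this could be wired up in Python, the outer toplist join can be composed around an already-built inner timeline query with psycopg2.sql; the column name "source_ip" is illustrative, and note that topN is an ordinary value placeholder, unlike the {column} identifier:

from psycopg2 import sql

# Inner timeline query (bucket-index variant), composed as in the earlier sketch.
inner_query = sql.SQL(
    "SELECT width_bucket(detecttime, (SELECT array_agg(b) "
    "FROM generate_series({since}, {until}, {interval}) AS b)) AS bucket, "
    "unnest(events.{column}) AS set, COUNT(*) AS count "
    "FROM events WHERE detecttime > {since} AND detecttime < {until} "
    "GROUP BY bucket, set"
).format(
    column=sql.Identifier("source_ip"),
    since=sql.Placeholder("since"),
    until=sql.Placeholder("until"),
    interval=sql.Placeholder("interval"),
)

# Wrap it in the toplist join; nested Composable objects are accepted by format().
toplisted_query = sql.SQL(
    "SELECT dist.bucket AS bucket, dist.set AS set, dist.count AS count "
    "FROM ({inner}) AS dist "
    "INNER JOIN (SELECT unnest(events.{column}) AS set, COUNT(*) FROM events "
    "WHERE detecttime > {since} AND detecttime < {until} "
    "GROUP BY set ORDER BY COUNT(*) DESC LIMIT {topn}) AS toplist USING (set)"
).format(
    inner=inner_query,
    column=sql.Identifier("source_ip"),
    since=sql.Placeholder("since"),
    until=sql.Placeholder("until"),
    topn=sql.Placeholder("topn"),  # LIMIT value, bound at execute() time
)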

Actions #9

Updated by Pavel Kácha almost 4 years ago

Done. A weekly query takes about 2 minutes to compute all the aggregations for events, abuse groups, categories, detectors, IPs, classes and severities.

  • Add source ports, target ports, services, detector tags, source tags and target tags; we'll see, we'll discuss.
  • Top-N limiting on the PostgreSQL side, compare with the unlimited output (maybe a checkbox to choose?).
  • Try replacing the aggregation query with summation from the graph data in Python.
Actions #10

Updated by Jan Mach almost 4 years ago

Notes about recent development:

  • I have added a new multi-select box to the search form for restricting the set of aggregations that will be performed.
  • The "limit in database" search option is not implemented yet.
  • I have added the following additional calculations: source and target ports, source and target types, detector types and protocols.
  • I have added in-memory calculation of overall aggregations from the timeline data, to compare with the database query providing the same result.
  • The attached file illustrates current performance:
    • Time window was 3 days.
    • Total time was 00:02:54.
    • 00:02:11 - searching and fetching rows.
    • 00:00:42 - postprocessing.
      • 00:00:12 - converting rows to statistical structures.
      • 00:00:20 - truncating all timelines and pies.
      • 00:00:10 - recalculating pies from untruncated timelines and truncating (for comparison with db).

You can see that the aggregation queries for IPs, source ports and target ports were the most demanding and took 5 s, 14 s and 2 s respectively, while calculating the same data from the timeline datasets took only 2 s, 5 s and 0.6 s respectively. So I guess the conclusion here is to drop the aggregation aggregations (overall per-set totals) in the database and calculate them from the timeline aggregations.
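
For illustration, a minimal sketch (mine, not the actual Mentat code) of deriving an overall aggregation in Python by summing timeline rows; the (bucket, set, count) row shape is assumed from the queries above:

from collections import Counter

def overall_from_timeline(timeline_rows):
    """Sum timeline rows of (bucket, set, count) into an overall {set: count} mapping."""
    totals = Counter()
    for _bucket, key, count in timeline_rows:
        totals[key] += count
    return totals

# Example: three timeline rows collapse into per-key totals.
rows = [
    ("2020-04-09 00:00", "192.0.2.1", 5),
    ("2020-04-09 01:00", "192.0.2.1", 3),
    ("2020-04-09 01:00", "198.51.100.7", 2),
]
print(overall_from_timeline(rows).most_common())
# [('192.0.2.1', 8), ('198.51.100.7', 2)]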

Pavel Kácha wrote:

Done. A weekly query takes about 2 minutes to compute all the aggregations for events, abuse groups, categories, detectors, IPs, classes and severities.

  • Add source ports, target ports, services, detector tags, source tags and target tags; we'll see, we'll discuss.
  • Top-N limiting on the PostgreSQL side, compare with the unlimited output (maybe a checkbox to choose?).
  • Try replacing the aggregation query with summation from the graph data in Python.
Actions #11

Updated by Jan Mach almost 4 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30

The following steps remain to be done:

  • Change the pie chart for multivalue sets like IPs according to Radko's recommendation.
  • Consider again limiting/truncating on the database side. In this approach, however, we lose the "REST" and the total sum, as some of the rows will be omitted. Also, the outer query will be much more complex in the WHERE section; in my opinion it should in fact be the same as the WHERE section of the inner query. So is this OK?
  • Implement caching.
  • Consider, choose and implement some chart lazy-loading mechanism, so that the user is not forced to wait 2 minutes for all datasets when they are only interested in one or two.
    1. On request presearch and display first chart, others will be fetched and calculated on tab switch.
    2. On request present empty page and fetch all charts on tab switch. This is a variant of the previous, because the first chart will be immediately requested and fetched.
    3. Let user choose which charts to calculate and render.
    4. Keep the things the way they are now, always calculate and display all data.
Actions #12

Updated by Jan Mach almost 4 years ago

Jan Mach wrote:

The following steps remain to be done:

  • Change the pie chart for multivalue sets like IPs according to Radko's recommendation.
  • Consider again limiting/truncating on the database side. In this approach, however, we lose the "REST" and the total sum, as some of the rows will be omitted. Also, the outer query will be much more complex in the WHERE section; in my opinion it should in fact be the same as the WHERE section of the inner query. So is this OK?

Conclusions from today's meeting:

  • For multivalue aggregations we do not care about the REST or the TOTAL. We will drop pie charts as visualisations and use sorted bar charts with absolute values.
  • Consider the following calculation options (two independent switches, four combinations):
    • Limiting in the database or in Python (on/off)
    • Getting the aggregation aggregations from the DB or in Python (on/off)
  • Perform tests and comparisons over a longer time period to check scaling.
  • Implement caching.
  • Consider, choose and implement some chart lazy-loading mechanism, so that the user is not forced to wait 2 minutes for all datasets when they are only interested in one or two.
    1. On request presearch and display first chart, others will be fetched and calculated on tab switch.
    2. On request present empty page and fetch all charts on tab switch. This is a variant of the previous, because the first chart will be immediately requested and fetched.
    3. Let user choose which charts to calculate and render.
    4. Keep the things the way they are now, always calculate and display all data.
Actions #13

Updated by Pavel Kácha almost 4 years ago

  • Blocks Feature #6257: Review calculated statistics in timeline added
Actions #14

Updated by Jan Mach almost 4 years ago

Conclusions from yesterday's meeting:
  • Based on some measurements we will use the following approach:
    • Aggregation and timeline aggregations will always be calculated; those data are needed for all other visualisations.
    • Perform aggregation aggregations in Python code: calculate them by traversing the timeline calculations.
    • Perform timeline aggregations in the database. For datasets with a high number of keys (IPs, ports, ...), apply the limit directly in the database to restrict the number of rows that need to be fetched and processed.
    • Enforce an upper time interval threshold for all searches to enable caching.
    • Use server-side caching for search results.
    • Use client-side caching in the browser's JavaScript.
    • By default, display only the total event count statistics; all other tabs will be fetched on demand.
    • Keep the configurable toplist limit and the list of precalculated statistics, but remove them from the form to reduce the number of options the user has to understand.
These decisions are based on the following key data:
  • un-toplisted database search for source-port statistics, 2 weeks of data:
    • timeline search and fetch: 1:18 min, 11 122 841 rows
    • convert rows -> dataset: 28 s
    • toplisting: 15 s
  • toplisted database search for source-port statistics, 2 weeks of data:
    • search and fetch: 1:05 min, 17 148 rows
    • convert rows -> dataset: 0.1 s
  • aggregation search and fetch:
    • in database: 16 s
    • in code: 10 s from the untoplisted data, 0.1 s from the toplisted data

According to the measurements, database toplisting scales much better than Python-code toplisting.

Actions #15

Updated by Jan Mach almost 4 years ago

  • % Done changed from 30 to 70

Jan Mach wrote:

Conclusions from yesterday's meeting:
  • Based on some measurements we will use the following approach:
    • Aggregation and timeline aggregations will always be calculated; those data are needed for all other visualisations.

Done. Overall event count aggregation and timeline aggregations are always searched and calculated.

  • Perform aggregation aggregations in Python code: calculate them by traversing the timeline calculations.

Done. I have disabled all aggregation aggregation queries. Only timeline aggregation queries are executed; aggregation aggregations are calculated from them afterwards.

  • Perform timeline aggregations in the database. For datasets with a high number of keys (IPs, ports, ...), apply the limit directly in the database to restrict the number of rows that need to be fetched and processed.

Done. IP, port and abuse group aggregations are toplisted to a configurable limit; all other aggregations are unlimited.

  • Enforce an upper time interval threshold for all searches to enable caching.

Done. There is a default upper time interval threshold. It is calculated by taking the current timestamp and zeroing everything below hours, to achieve a nice-looking time window.
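
For illustration, a minimal sketch (assumed, not the actual Mentat code) of such a default: round the current timestamp down to the whole hour so that all searches within the same hour share a cacheable upper bound:

import datetime

def default_upper_threshold(now=None):
    """Zero out everything below hours in the current timestamp."""
    now = now or datetime.datetime.now()
    return now.replace(minute=0, second=0, microsecond=0)

# Example: any search issued at 14:37:21 gets the 14:00:00 upper bound.
print(default_upper_threshold(datetime.datetime(2020, 4, 9, 14, 37, 21)))
# 2020-04-09 14:00:00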

  • Use server-side caching for search results.

TBD

  • Use client-side caching in the browser's JavaScript.

TBD

  • By default, display only the total event count statistics; all other tabs will be fetched on demand.

TBD

  • Keep the configurable toplist limit and the list of precalculated statistics, but remove them from the form to reduce the number of options the user has to understand.

Done. These parameters are available to the API via URL parameters and are displayed in the HTML form only to administrators.

These decisions are based on the following key data:
  • un-toplisted database search for source-port statistics, 2 weeks of data:
    • timeline search and fetch: 1:18 min, 11 122 841 rows
    • convert rows -> dataset: 28 s
    • toplisting: 15 s
  • toplisted database search for source-port statistics, 2 weeks of data:
    • search and fetch: 1:05 min, 17 148 rows
    • convert rows -> dataset: 0.1 s
  • aggregation search and fetch:
    • in database: 16 s
    • in code: 10 s from the untoplisted data, 0.1 s from the toplisted data

According to the measurements, database toplisting scales much better than Python-code toplisting.

Actions #16

Updated by Jan Mach almost 4 years ago

All changes mentioned in the previous note are already deployed on the mentat-alt server and available for testing.

Actions #17

Updated by Jan Mach almost 4 years ago

  • Related to Feature #6332: Improve searching with caching and JavaScript added
Actions #18

Updated by Jan Mach almost 4 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 70 to 100
  • To be discussed changed from Yes to No

We are happy with the current state of matters for now. New issue #6332 was created as a follow-up to this one, to implement the remaining features like caching and on-demand loading.

Actions #19

Updated by Radko Krkoš over 3 years ago

  • Related to Feature #6413: Autarkic DB queries for timeline added