Homeproj: Redmine for CESNET
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26675 | 2020-04-09T15:04:19Z | Pavel Kácha (ph@cesnet.cz)
<p>After discussion: it would be preferable to keep the DB query as simple as necessary and leave the pie/sum calculations to Python, as high hundreds/low thousands of calculations will presumably be reasonably cheap. This can of course be re-evaluated based on results.</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26687 | 2020-04-14T06:35:57Z | Radko Krkoš (krkos@cesnet.cz)
<ul><li><strong>Assignee</strong> changed from <i>Radko Krkoš</i> to <i>Jan Mach</i></li></ul><p>The general query, covering all cases discussed in <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Review calculated statistics in timeline (Closed)" href="https://homeproj.cesnet.cz/issues/6257">#6257</a> and statistician, is as follows:<br /><pre>
SELECT
{since} + {interval} * (width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) - 1) AS bucket, /* if more than one bucket and time-based data required; leave out for statistician */
width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) AS bucket, /* if more than one bucket and bucket index only is required - a bit faster; leave out for statistician */
{column} AS set, /* if second dimension is used */
unwrap({column}) AS set, /* if second dimension is used and based on array values */
lower(unwrap({column})) AS set, /* to work around case issues, but this should be fixed on import - remember P2D2? */
COUNT(*)
FROM events WHERE
detecttime > {since} AND detecttime < {until}
GROUP BY
bucket, /* leave out for statistician */
set /* if second dimension is used */
ORDER BY bucket /* if ordered set is important (i.e. a dict is not built from the results), otherwise leave out - a bit faster */
;
</pre></p>
<p>The <code>{}</code> quoted strings should be replaced by parameters. <code>{since}</code> and <code>{until}</code> are user input, <code>{interval}</code> is calculated somehow, and <code>{column}</code> is the second-dimension column. Beware: it must be a <code>psycopg2.sql.Identifier()</code> and cannot be supplied to <code>psycopg2.query()</code>; it has to go through <code>psycopg2.sql.SQL.format()</code> instead.</p>
<p>So, the specializations are as follows.<br />The simplest case for statistician:<br /><pre>
SELECT
COUNT(*)
FROM events WHERE
detecttime > {since} AND detecttime < {until}
;
</pre></p>
<p>The time binned counts for graphing with timestamps:<br /><pre>
SELECT
{since} + {interval} * (width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) - 1) AS bucket,
COUNT(*)
FROM events WHERE
detecttime > {since} AND detecttime < {until}
GROUP BY
bucket
;
</pre></p>
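<p>The bucket arithmetic in the query above can be sketched in Python. This is a minimal, illustrative emulation of PostgreSQL's <code>width_bucket()</code> over a <code>generate_series()</code> threshold array (the function name and sample values are assumptions, not Mentat code):</p>

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def time_bucket(detecttime, since, until, interval):
    """Emulate width_bucket(detecttime, thresholds) with thresholds =
    generate_series(since, until, interval): bisect_right yields the
    1-based index of the bucket whose lower bound is <= detecttime,
    so `since + interval * (index - 1)` is the bucket's start time,
    exactly as computed in the SELECT list above."""
    thresholds = []
    t = since
    while t <= until:
        thresholds.append(t)
        t += interval
    index = bisect_right(thresholds, detecttime)
    return since + interval * (index - 1)

since = datetime(2020, 4, 1, 0, 0)
until = datetime(2020, 4, 2, 0, 0)
interval = timedelta(hours=6)
# 07:30 falls into the second bucket, which starts at 06:00
print(time_bucket(datetime(2020, 4, 1, 7, 30), since, until, interval))
```

<p>Grouping the fetched rows by this value reproduces the <code>GROUP BY bucket</code> semantics on the Python side.</p>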
<p>Any additional dimension for aggregation has to be specified both in the column specifier, with any decomposition using <code>unwrap()</code>, and in the <code>GROUP BY</code> clause (so naming it using <code>AS</code> is advised).</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26692 | 2020-04-14T08:09:01Z | Radko Krkoš (krkos@cesnet.cz)
<ul><li><strong>To be discussed</strong> changed from <i>No</i> to <i>Yes</i></li></ul><p>One additional remark, for attributes with very large value ranges, consider adding a <code>HAVING</code> clause limiting to those with more than one occurrence. For example, for the source IPs:<br /><pre>
SELECT
{since} + {interval} * (width_bucket(detecttime, (SELECT array_agg(buckets) FROM generate_series({since}, {until}, {interval}) AS buckets)) - 1) AS bucket,
unnest(events.source_ip) AS src,
COUNT(*)
FROM events WHERE
detecttime > {since} AND detecttime < {until}
GROUP BY bucket, src
HAVING COUNT(*) > 1;
</pre></p>
<p>I find such filtering quite useful as far as frequent offenders go. It would also tremendously limit the amount of data returned, so the performance benefit might be substantial for longer timeframes.</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26717 | 2020-04-15T16:04:25Z | Pavel Kácha (ph@cesnet.cz)
<ul></ul><p><a class="user active" href="https://homeproj.cesnet.cz/users/391">Radko Krkoš</a> wrote:</p>
<blockquote>
<p>One additional remark, for attributes with very large value ranges, consider adding a <code>HAVING</code> clause limiting to those with more than one occurrence. For example, for the source IPs:<br />[...]</p>
<p>I find such filtering quite useful as far as frequent offenders go. It would also tremendously limit the amount of data returned, so the performance benefit might be substantial for longer timeframes.</p>
</blockquote>
<p>Or sort and limit at some arbitrary cutoff, as the current implementation does?</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26729 | 2020-04-16T09:15:38Z | Jan Mach (jan.mach@cesnet.cz)
<ul><li><strong>Target version</strong> changed from <i>2.8</i> to <i>2.7</i></li></ul>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26734 | 2020-04-16T09:38:35Z | Pavel Kácha (ph@cesnet.cz)
<ul></ul><p><a class="user active" href="https://homeproj.cesnet.cz/users/17">Pavel Kácha</a> wrote:</p>
<blockquote>
<p><a class="user active" href="https://homeproj.cesnet.cz/users/391">Radko Krkoš</a> wrote:</p>
<blockquote>
<p>One additional remark, for attributes with very large value ranges, consider adding a <code>HAVING</code> clause limiting to those with more than one occurrence. For example, for the source IPs:<br />[...]</p>
<p>I find such filtering quite useful as far as frequent offenders go. It would also tremendously limit the amount of data returned, so the performance benefit might be substantial for longer timeframes.</p>
</blockquote>
<p>Or sort and limit at some arbitrary cutoff, as current implementation does?</p>
</blockquote>
<p>Current way is to limit results by some arbitrary number on high cardinality attributes (like IPs). Radko, could you please look into possibility of cutting this on the db? Otherwise we'll end up with fetching high numbers of data (and parsing them, albeit cheaper then full Idea) from db again.</p> Mentat - Feature #6308: Reimplement timeline calculations on databasehttps://homeproj.cesnet.cz/issues/6308?journal_id=267352020-04-16T09:40:51ZPavel Káchaph@cesnet.cz
<p>Also, Mek noted that there are attributes which are not in the flat table, so we are not able to do DB calculations for them. OK, just use what we have; we'll discuss adding others as necessary/feasible after the first iterations.</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26743 | 2020-04-17T09:28:39Z | Radko Krkoš (krkos@cesnet.cz)
<ul></ul><p><a class="user active" href="https://homeproj.cesnet.cz/users/17">Pavel Kácha</a> wrote:</p>
<blockquote>
<p>Radko, could you please look into possibility of cutting this on the db? Otherwise we'll end up with fetching high numbers of data (and parsing them, albeit cheaper then full Idea) from db again.</p>
</blockquote>
<p>The best identified way to approach this problem is by extending the query as follows:<br /><pre>
SELECT dist.bucket AS bucket, dist.set AS set, dist.count AS count
FROM
(/* here comes the original query */) AS dist
INNER JOIN
(SELECT unnest(events.{column}) AS set, COUNT(*) FROM events WHERE detecttime > {since} AND detecttime < {until} GROUP BY set ORDER BY COUNT(*) DESC LIMIT {topN}) AS toplist
USING (set);
</pre></p>
<p>This adds a new variable, <code>topN</code>, whose meaning is hopefully obvious. :)</p>
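<p>The effect of the join can be sketched in Python. This helper is illustrative only (not part of Mentat); it keeps just the timeline rows whose <code>set</code> value belongs to the top N values by total count, mirroring the <code>INNER JOIN</code> against the <code>toplist</code> subquery:</p>

```python
from collections import Counter

def toplist_filter(rows, top_n):
    """rows are (bucket, set_value, count) tuples as produced by the
    timeline query; keep only the rows whose set_value is among the
    top_n values by total count over the whole time range."""
    totals = Counter()
    for _bucket, set_value, count in rows:
        totals[set_value] += count
    keep = {value for value, _ in totals.most_common(top_n)}
    return [row for row in rows if row[1] in keep]

rows = [(1, "10.0.0.1", 5), (1, "10.0.0.2", 1),
        (2, "10.0.0.1", 7), (2, "10.0.0.3", 2)]
# only rows for the two most frequent IPs survive
print(toplist_filter(rows, 2))
```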
<p>The presented solution is the straightforward extension, but it gives an execution plan optimal for the work needed. I could look into doing the calculation in one pass, but that would lead to quite a complicated query. The performance comes to ~2 s for the 6-hour scenario and ~32 s for the one-month scenario (double the time, as two passes over the data are made); not especially fast, but probably still acceptable. Obvious optimizations such as moving the sorting or computations outside the original <code>SELECT</code> have negligible performance impact, so simplicity should win here.</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26789 | 2020-04-23T07:53:03Z | Pavel Kácha (ph@cesnet.cz)
<p>Done. A one-week query takes about 2 minutes to compute all aggregations for events, abuse groups, categories, detectors, IPs, classes, and severities.</p>
<ul>
<li>Add source ports, target ports, services, detector tags, source tags, and target tags; we will see and discuss.</li>
<li>Top-N limiting on the PQ side; compare with the unlimited output (maybe a checkbox to choose?).</li>
<li>Try replacing the aggregation query by summing the graph data in Python.</li>
</ul>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26815 | 2020-04-30T07:35:41Z | Jan Mach (jan.mach@cesnet.cz)
<ul><li><strong>File</strong> <a href="/attachments/3531">screencapture-mentat-alt-cesnet-cz-mentat-timeline-search-2020-04-30-08_57_29.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/3531/screencapture-mentat-alt-cesnet-cz-mentat-timeline-search-2020-04-30-08_57_29.png">screencapture-mentat-alt-cesnet-cz-mentat-timeline-search-2020-04-30-08_57_29.png</a> added</li></ul><p>Notes about recent development:</p>
<ul>
<li>I have added a new multi-select box to the search form for restricting the set of aggregations that will be performed.</li>
<li>The "limit in database" search option is not implemented yet.</li>
<li>I have added the following additional calculations: source and target ports, source and target types, detector types, and protocols.</li>
<li>I have added in-memory calculation of the overall aggregations from the timeline data, to allow comparison with the database query providing the same result.</li>
<li>The attached file illustrates current performance:
<ul>
<li>Time window was 3 days.</li>
<li>Total time was 00:02:54.</li>
<li>00:02:11 - searching and fetching rows.</li>
<li>00:00:42 - postprocessing.
<ul>
<li>00:00:12 - converting rows to statistical structures.</li>
<li>00:00:20 - truncating all timelines and pies.</li>
<li>00:00:10 - recalculating pies from untruncated timelines and truncating (for comparison with db).</li>
</ul></li>
</ul></li>
</ul>
<p>You can see that the aggregation queries for IPs, source ports, and target ports were the most demanding, taking 5 s, 14 s, and 2 s respectively, while calculating the same data from the timeline datasets took only 2 s, 5 s, and 0.6 s respectively. So I guess the conclusion here is to drop the aggregation aggregations in the database and calculate them from the timeline aggregations.</p>
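<p>The in-memory calculation amounts to summing the timeline rows per set value. A minimal sketch (the function name and the <code>"__REST__"</code> key are illustrative assumptions, not Mentat's actual API):</p>

```python
from collections import Counter

def pie_from_timeline(rows, top_n):
    """Build the overall ("pie") aggregation from timeline rows
    (bucket, set_value, count) in memory, instead of issuing a second
    aggregation query; values beyond top_n collapse into a REST slice."""
    totals = Counter()
    for _bucket, set_value, count in rows:
        totals[set_value] += count
    top = totals.most_common(top_n)
    rest = sum(totals.values()) - sum(c for _, c in top)
    pie = dict(top)
    if rest:
        pie["__REST__"] = rest
    return pie

rows = [(1, "22", 40), (1, "23", 5), (2, "22", 60), (2, "445", 3)]
print(pie_from_timeline(rows, 2))
```

<p>A single pass over rows already fetched for the timeline chart avoids the second scan over the events table entirely.</p>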
<p><a class="user active" href="https://homeproj.cesnet.cz/users/17">Pavel Kácha</a> wrote:</p>
<blockquote>
<p>Done. A one-week query takes about 2 minutes to compute all aggregations for events, abuse groups, categories, detectors, IPs, classes, and severities.</p>
<ul>
<li>Add source ports, target ports, services, detector tags, source tags, and target tags; we will see and discuss.</li>
<li>Top-N limiting on the PQ side; compare with the unlimited output (maybe a checkbox to choose?).</li>
<li>Try replacing the aggregation query by summing the graph data in Python.</li>
</ul>
</blockquote>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26816 | 2020-04-30T07:47:38Z | Jan Mach (jan.mach@cesnet.cz)
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>30</i></li></ul><p>The following steps remain to be done:</p>
<ul>
<li>Change the pie chart for multivalue sets like IPs according to Radko's recommendation.</li>
<li>Consider again limiting/truncating on the database side. With this approach, however, we lose the "REST" slice and the total sum, as some of the rows will be omitted. Also, the outer query becomes much more complex in the WHERE section; in my opinion it should in fact be the same as the WHERE section of the inner query. So is this OK?</li>
<li>Implement caching.</li>
<li>Consider, choose, and implement a lazy-loading mechanism for charts, so that the user is not forced to wait 2 minutes for all datasets when they are only interested in one or two.
<ol>
<li>On request presearch and display first chart, others will be fetched and calculated on tab switch.</li>
<li>On request present empty page and fetch all charts on tab switch. This is a variant of the previous, because the first chart will be immediately requested and fetched.</li>
<li>Let user choose which charts to calculate and render.</li>
<li>Keep the things the way they are now, always calculate and display all data.</li>
</ol></li>
</ul>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26817 | 2020-04-30T08:48:07Z | Jan Mach (jan.mach@cesnet.cz)
<ul></ul><p><a class="user active" href="https://homeproj.cesnet.cz/users/3">Jan Mach</a> wrote:</p>
<blockquote>
<p>The following steps remain to be done:</p>
<ul>
<li>Change the pie chart for multivalue sets like IPs according to Radko's recommendation.</li>
<li>Consider again limiting/truncating on the database side. With this approach, however, we lose the "REST" slice and the total sum, as some of the rows will be omitted. Also, the outer query becomes much more complex in the WHERE section; in my opinion it should in fact be the same as the WHERE section of the inner query. So is this OK?</li>
</ul>
</blockquote>
<p>Conclusions from today's meeting:</p>
<ul>
<li>In multivalue aggregations we do not care about the REST or TOTAL. We will drop pie charts as visualisations and use sorted bar charts with absolute values.</li>
<li>Consider the following four calculation options (two independent on/off switches):
<ul>
<li>Limiting in the database or in Python (on/off)</li>
<li>Getting the aggregation aggregations from the DB or computing them in Python (on/off)</li>
</ul>
</li>
<li>Perform tests and comparisons for a longer time period to verify scaling.</li>
</ul>
<blockquote>
<ul>
<li>Implement caching.</li>
<li>Consider, choose, and implement a lazy-loading mechanism for charts, so that the user is not forced to wait 2 minutes for all datasets when they are only interested in one or two.
<ol>
<li>On request presearch and display first chart, others will be fetched and calculated on tab switch.</li>
<li>On request present empty page and fetch all charts on tab switch. This is a variant of the previous, because the first chart will be immediately requested and fetched.</li>
<li>Let user choose which charts to calculate and render.</li>
<li>Keep the things the way they are now, always calculate and display all data.</li>
</ol></li>
</ul>
</blockquote>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26819 | 2020-04-30T11:10:16Z | Pavel Kácha (ph@cesnet.cz)
<ul><li><strong>Blocks</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/6257">Feature #6257</a>: Review calculated statistics in timeline</i> added</li></ul>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26853 | 2020-05-08T07:27:13Z | Jan Mach (jan.mach@cesnet.cz)
Conclusions from yesterday's meeting:
<ul>
<li>Based on some measurements, we will use the following approach:
<ul>
<li>The overall aggregation and the timeline aggregations will always be calculated; those data are needed for all the other visualisations.</li>
<li>Perform <em>aggregation</em> aggregations in Python code - calculate them by traversing timeline calculations.</li>
<li>Perform <em>timeline</em> aggregations in the database. For datasets with a high number of keys (IPs, ports, ...) use limiting directly within the database to restrict the number of rows that need to be fetched and processed.</li>
<li>Enforce upper time interval threshold for all searches to enable caching.</li>
<li>Use server side caching for search results.</li>
<li>Use client-side caching in the browser's JavaScript.</li>
<li>By default display only total event count statistics, all other tabs will be fetched on demand.</li>
<li>Keep the configurable toplist limit and the list of precalculated statistics, but remove them from the form to reduce the number of options the user has to understand.</li>
</ul></li>
</ul>
These decisions are based on the following key data:
<ul>
<li>un-toplisted database search for source-port statistics, 2 weeks of data
<ul>
<li>timeline search and fetch: 1 min 18 s, 11 122 841 rows</li>
<li>convert rows -> dataset: 28s</li>
<li>toplisting: 15s</li>
</ul>
</li>
<li>toplisted database search for source-port statistics, 2 weeks of data:
<ul>
<li>search and fetch: 1 min 5 s, 17 148 rows</li>
<li>convert rows -> dataset: 0.1s</li>
</ul>
</li>
<li>aggregation search and fetch:
<ul>
<li>in database: 16s</li>
<li>in code: 10s from untoplisted, 0.1s from toplisted</li>
</ul></li>
</ul>
<p>According to the measurements, database toplisting scales much better than toplisting in Python code.</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26877 | 2020-05-13T14:51:17Z | Jan Mach (jan.mach@cesnet.cz)
<ul><li><strong>% Done</strong> changed from <i>30</i> to <i>70</i></li></ul><p><a class="user active" href="https://homeproj.cesnet.cz/users/3">Jan Mach</a> wrote:</p>
<blockquote>
Conclusions from yesterday's meeting:
<ul>
<li>Based on some measurements, we will use the following approach:
<ul>
<li>Aggregation and timeline aggregations will always be calculated, those data are needed for all other visualisations.</li>
</ul></li>
</ul>
</blockquote>
<p>Done. Overall event count aggregation and timeline aggregations are always searched and calculated.</p>
<blockquote>
<ul>
<li>Perform <em>aggregation</em> aggregations in Python code - calculate them by traversing timeline calculations.</li>
</ul>
</blockquote>
<p>Done. I have disabled all aggregation aggregation queries. Only timeline aggregation queries are executed, aggregation aggregations are calculated from them afterwards.</p>
<blockquote>
<ul>
<li>Perform <em>timeline</em> aggregations in the database. For datasets with a high number of keys (IPs, ports, ...) use limiting directly within the database to restrict the number of rows that need to be fetched and processed.</li>
</ul>
</blockquote>
<p>Done. IP, port and abuse group aggregations are toplisted by configurable limit number, all other aggregations are unlimited.</p>
<blockquote>
<ul>
<li>Enforce upper time interval threshold for all searches to enable caching.</li>
</ul>
</blockquote>
<p>Done. There is a default upper time interval threshold. It is calculated by taking the current timestamp and zeroing everything below hours, to achieve a nice-looking time window.</p>
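<p>A sketch of that default (illustrative only; the actual Mentat code may differ):</p>

```python
from datetime import datetime

def default_upper_threshold(now=None):
    """Zero out everything below hours in the current timestamp, so the
    search window's upper bound only changes once per hour; stable
    bounds make otherwise identical searches cacheable."""
    now = now or datetime.utcnow()
    return now.replace(minute=0, second=0, microsecond=0)

print(default_upper_threshold(datetime(2020, 5, 13, 14, 51, 17)))
```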
<blockquote>
<ul>
<li>Use server side caching for search results.</li>
</ul>
</blockquote>
<p>TBD</p>
<blockquote>
<ul>
<li>Use client-side caching in the browser's JavaScript.</li>
</ul>
</blockquote>
<p>TBD</p>
<blockquote>
<ul>
<li>By default display only total event count statistics, all other tabs will be fetched on demand.</li>
</ul>
</blockquote>
<p>TBD</p>
<blockquote>
<ul>
<li>Keep the configurable toplist limit and list of precalculated statistics, but remove them from form to reduce the number of options user has to understand.</li>
</ul>
</blockquote>
<p>Done. These parameters are available to the API via URL parameters and are displayed in the HTML form only to the administrator.</p>
<blockquote>
These decisions are based on the following key data:
<ul>
<li>un-toplisted database search for source-port statistics, 2 weeks of data
<ul>
<li>timeline search and fetch: 1 min 18 s, 11 122 841 rows</li>
<li>convert rows -> dataset: 28s</li>
<li>toplisting: 15s</li>
</ul>
</li>
<li>toplisted database search for source-port statistics, 2 weeks of data:
<ul>
<li>search and fetch: 1 min 5 s, 17 148 rows</li>
<li>convert rows -> dataset: 0.1s</li>
</ul>
</li>
<li>aggregation search and fetch:
<ul>
<li>in database: 16s</li>
<li>in code: 10s from untoplisted, 0.1s from toplisted</li>
</ul></li>
</ul>
<p>According to the measurements, database toplisting scales much better than toplisting in Python code.</p>
</blockquote>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26878 | 2020-05-13T14:52:25Z | Jan Mach (jan.mach@cesnet.cz)
<p>All changes mentioned in the previous note are already deployed on the <strong>mentat-alt</strong> server and available for testing.</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26920 | 2020-05-19T11:07:44Z | Jan Mach (jan.mach@cesnet.cz)
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-4 priority-4 priority-default overdue behind-schedule" href="/issues/6332">Feature #6332</a>: Improve searching with caching and JavaScript</i> added</li></ul>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=26922 | 2020-05-19T11:09:24Z | Jan Mach (jan.mach@cesnet.cz)
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Closed</i></li><li><strong>% Done</strong> changed from <i>70</i> to <i>100</i></li><li><strong>To be discussed</strong> changed from <i>Yes</i> to <i>No</i></li></ul><p>We are happy with the current state of affairs for now. New issue <a class="issue tracker-2 status-4 priority-4 priority-default overdue behind-schedule" title="Feature: Improve searching with caching and JavaScript (Feedback)" href="https://homeproj.cesnet.cz/issues/6332">#6332</a> was created as a follow-up to this one, to implement the remaining features such as caching and on-demand loading.</p>
Mentat - Feature #6308: Reimplement timeline calculations on database | https://homeproj.cesnet.cz/issues/6308?journal_id=27303 | 2020-07-07T12:35:11Z | Radko Krkoš (krkos@cesnet.cz)
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed child" href="/issues/6413">Feature #6413</a>: Autarkic DB queries for timeline</i> added</li></ul>