Project

General

Profile

Feature #6413

Feature #6256: Review possibilities of support of timeline calculation on db

Autarkic DB queries for timeline

Added by Radko Krkoš about 1 month ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Development - Database
Target version:
Start date:
07/07/2020
Due date:
% Done:

0%

Estimated time:
To be discussed:
Yes

Description

The current implementation of timeline calculation, based on #6308, is divided into the database part and Python part. The main issue is that there are multiple DB queries performed for any of the views, combined afterwards in Python, what is unfit for caching (#6310, #6332). The queries should be self-contained to support those use-cases. The common table expressions (CTE) syntax should help with this.


Files

timeline_cte.txt (204 KB) timeline_cte.txt Radko Krkoš, 07/07/2020 02:56 PM

Related issues

Related to Mentat - Feature #6308: Reimplement timeline calculations on databaseClosedJan Mach04/09/2020

Actions
Related to Mentat - Feature #6257: Review calculated statistics in timelineNewPavel Kácha03/09/2020

Actions
Precedes Mentat - Feature #6310: Implement result cachingNewJan Mach07/08/202007/08/2020

Actions
Precedes Mentat - Feature #6332: Improve searching with caching and JavaScriptNewJan Mach07/08/202007/08/2020

Actions
#1

Updated by Radko Krkoš about 1 month ago

#2

Updated by Radko Krkoš about 1 month ago

  • Precedes Feature #6332: Improve searching with caching and JavaScript added
#3

Updated by Radko Krkoš about 1 month ago

  • Related to Feature #6308: Reimplement timeline calculations on database added
#4

Updated by Radko Krkoš about 1 month ago

  • Related to Feature #6257: Review calculated statistics in timeline added
#5

Updated by Radko Krkoš about 1 month ago

Using CTE, the basic query was extended as follows:

WITH
timeline AS (SELECT generate_series(\since, \until - INTERVAL '1ms', \interval) AS bucket),
total AS (SELECT COUNT(*) AS total FROM events WHERE "detecttime" >= \since AND "detecttime" < \until),
raw AS (SELECT \since + \interval * (width_bucket(detecttime, (SELECT array_agg(bucket) FROM timeline)) - 1) AS bucket, unnest(\set) AS set, COUNT(*) AS count FROM events WHERE "detecttime" >= \since AND "detecttime" < \until GROUP BY bucket, set),
sums AS (SELECT raw.set AS set, SUM(raw.count) AS sum FROM raw GROUP BY raw.set ORDER BY sum DESC)
(SELECT bucket, set, COALESCE(count, 0) AS count, NULL AS share FROM (timeline FULL JOIN sums ON TRUE) LEFT JOIN raw USING (bucket, set) ORDER BY bucket ASC, sums.sum DESC) UNION ALL SELECT NULL, sums.set, sums.sum, sums.sum/total.total::decimal FROM sums, total UNION ALL SELECT NULL, NULL, total.total, NULL FROM total;

with parameters \since, \until, \interval and \set, which generates all the data required for the specific view, with ARRAY and no LIMITing (e.g. category), zero-filled and sorted. See attachment timeline_cte.txt for example output. An optimization is possible for non-array column statistics and for LIMITing the query must be extended.
The runtime (on mentat-alt) is about 21.5s (median estimate) for 6 days and 1 hour interval as compared to the original query that took ~20.2s (median estimate).
Also an error is fixed here, where the interval upper bound was inclusive, resulting in several events getting in with detecttime = \until, quite commonly as some detectors report times with second granularity.

#6

Updated by Radko Krkoš about 1 month ago

  • Subject changed from Autarchic DB queries for timeline to Autarkic DB queries for timeline
#7

Updated by Radko Krkoš about 1 month ago

The total count query would be as follows:

WITH
timeline AS (SELECT generate_series(\since, \until - INTERVAL '1ms', \interval) AS bucket),
raw AS (SELECT \since + \interval * (width_bucket(detecttime, (SELECT array_agg(bucket) FROM timeline)) - 1) AS bucket, COUNT(*) AS count FROM events WHERE "detecttime" >= \since AND "detecttime" < \until GROUP BY bucket)
(SELECT bucket, COALESCE(count, 0) AS count FROM timeline LEFT JOIN raw USING (bucket) ORDER BY bucket ASC) UNION ALL SELECT NULL, SUM(count) FROM raw;

The run times are indistinguishable from the original version (both about 2s) on mentat-alt.

#8

Updated by Radko Krkoš about 1 month ago

The limiting query should be as follows;

WITH
timeline AS (SELECT generate_series(\since, \until - INTERVAL '1ms', \interval) AS bucket),
toplist AS (SELECT unnest(\set) AS set, COUNT(*) AS sum FROM events WHERE "detecttime" >= \since AND "detecttime" < \until GROUP BY set ORDER BY sum DESC LIMIT \limit),
total AS (SELECT COUNT(*) AS total FROM events WHERE "detecttime" >= \since AND "detecttime" < \until),
raw AS (SELECT \since + \interval * (width_bucket(detecttime, (SELECT array_agg(bucket) FROM timeline)) - 1) AS bucket, unnest(\set) AS set, COUNT(*) AS count FROM events WHERE "detecttime" >= \since AND "detecttime" < \until GROUP BY bucket, set)
(SELECT bucket, set, COALESCE(count, 0) AS count, NULL AS share FROM (timeline FULL JOIN toplist ON TRUE) LEFT JOIN raw USING (bucket, set) ORDER BY bucket ASC, toplist.sum DESC) UNION ALL SELECT NULL, toplist.set, toplist.sum, toplist.sum/total.total::decimal FROM toplist, total UNION ALL SELECT NULL, NULL, total.total, NULL FROM total;

This adds the parameter \limit. The run time grew from about 46.5s to about 49s (median estimate) for 6 days, 1 hour intervals on mentat-alt.

Also available in: Atom PDF