Feature #4348: Better support for sparse columns - Mentat - Homeproj: Redmine for CESNET

Actions

Copy link

Feature #4348

closed

Better support for sparse columns

Added by Radko Krkoš about 6 years ago. Updated almost 6 years ago.

Status:

Closed

Priority:

Low

Assignee:

Radko Krkoš

Category:

Design

Target version:

Rejected

Start date:

10/05/2018

Due date:

% Done:

Estimated time:

To be discussed:

Description

Disclaimer¶

This is an RFC type issue, concrete recommendations will follow the discussion.

Related issues¶

Required for: #4277 (hard), #4301 (soft)

Background¶

Sparse columns come in two flavors, using NULL values (for native data types) and using empty arrays. The first variant just works while the second leaves a bit to be desired.

Example: Indexing

1. NULL value¶

CREATE INDEX on events USING btree (cesnet_eventseverity) WHERE cesnet_eventseverity IS NOT NULL;

A typical query:

SELECT * FROM events WHERE cesnet_eventseverity == 'medium';

The index is used for this query. The index is a bit smaller than the full variant (unconditional).

2. Empty arrays¶

CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';

A typical query:

SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'];

cannot use this index due to what seems to be a bug in PostgreSQL (similar issue already reported in [1] with no response from the developers). The conditional index is preferable because of the size differences (210MB vs 12MB, but much bigger difference for other, more sparse columns, for example source_type: 3.5GB vs 3MB).

There is already a version of conditional index in use in Mentat:

CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;

but this does not exclude any rows for array columns.

A workaround has been found:

SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}';

This seemingly redundant addition allows the planner to use the index. The speed improvement is actually negative for cesnet_resolvedabuses (2300ms vs 1900ms for unconditional, actual machine load is on the other hand lower) but this improves considerably for more sparse columns (or rarer values). The index size difference is huge overall.

Multicolumn indices¶

As GIN indices can be very efficiently combined into multicolumn versions and this property is used extensively in Mentat, it would be beneficial to allow for this option also for sparse columns.
An analysis has shown that this can be done efficiently, for example an combined index for both source_type and tartget_type would be created like this:

CREATE INDEX ON events USING GIN (source_type, target_type) WHERE source_type <> '{}' OR target_type <> '{}';

Such an index is very efficient as the aggregate of source and target types is still very sparse, the index size is currently about 5MB.

Conclusion¶

To efficiently support sparse columns in Mentat's DB design, all queries using filtering by a sparse column should have the condition changed as follows:

cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz']
to
cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}'

and the indices should be recreated as follows:

CREATE INDEX ON events USING GIN (cesnet_resolvedabuses);
or
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;
to
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';

References¶

[1] https://www.postgresql.org/message-id/152242988260.6322.11237886300088068445@wrigleys.postgresql.org

Related issues

Actions

Copy link

Updated by Radko Krkoš about 6 years ago

Status changed from New to Closed

As this did not (probably righfully) attract any strong feelings or any discussion, let's ignore this proposal. An alternative solution requiring much less programming effort was proposed #4363.

Actions

Copy link

Updated by Pavel Kácha about 6 years ago

Status changed from Closed to Feedback

Do you think #4363 is in overall better? I had not much time to grasp this RFC and thought of having live discussion sometimes.

Actions

Copy link

Updated by Radko Krkoš about 6 years ago

Assignee changed from Radko Krkoš to Pavel Kácha

The solution proposed in #4363 is not better overall, it is just sufficient for our use case and requires less changes to code. There are distinct situations when approach described in this RFC is preferable. Now that #4363 was already implemented, it would still be an interesting discussion but I would give it a low priority as it is probably not worth redoing.

Actions

Copy link

Updated by Pavel Kácha about 6 years ago

Assignee changed from Pavel Kácha to Radko Krkoš
Priority changed from Normal to Low

Ack, let it sink for a while. No harm in leaving it open for some talk and reference in 'Future' for some time.

Actions

Copy link

Updated by Pavel Kácha about 6 years ago

Related to Feature #4385: Unify empty arrays and NULLs added

Actions

Copy link

Updated by Pavel Kácha about 6 years ago

Status changed from Feedback to Closed

Closing after discussion, we don't have a lot of sparse indices (and the big GIN one is compound from a lot of columns and effectively not sparse) to benefit from this.

Actions

Copy link

Updated by Jan Mach almost 6 years ago

Target version changed from Backlog to Rejected

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Mentat

Custom queries

Feature #4348

Better support for sparse columns

Disclaimer¶

Related issues¶

Background¶

1. NULL value¶

2. Empty arrays¶

Multicolumn indices¶

Conclusion¶

References¶

Updated by Radko Krkoš about 6 years ago

Updated by Pavel Kácha about 6 years ago

Updated by Radko Krkoš about 6 years ago

Updated by Pavel Kácha about 6 years ago

Updated by Pavel Kácha about 6 years ago

Updated by Pavel Kácha about 6 years ago

Updated by Jan Mach almost 6 years ago