Project

General

Profile

Actions

Feature #4348

closed

Better support for sparse columns

Added by Radko Krkoš about 6 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
Low
Assignee:
Category:
Design
Target version:
Start date:
10/05/2018
Due date:
% Done:

0%

Estimated time:
To be discussed:

Description

Disclaimer

This is an RFC type issue, concrete recommendations will follow the discussion.

Related issues

Required for: #4277 (hard), #4301 (soft)

Background

Sparse columns come in two flavors, using NULL values (for native data types) and using empty arrays. The first variant just works while the second leaves a bit to be desired.

Example: Indexing

1. NULL value

CREATE INDEX on events USING btree (cesnet_eventseverity) WHERE cesnet_eventseverity IS NOT NULL;

A typical query:

SELECT * FROM events WHERE cesnet_eventseverity == 'medium';

The index is used for this query. The index is a bit smaller than the full variant (unconditional).

2. Empty arrays

CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';

A typical query:

SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'];

cannot use this index due to what seems to be a bug in PostgreSQL (similar issue already reported in [1] with no response from the developers). The conditional index is preferable because of the size differences (210MB vs 12MB, but much bigger difference for other, more sparse columns, for example source_type: 3.5GB vs 3MB).

There is already a version of conditional index in use in Mentat:

CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;

but this does not exclude any rows for array columns.

A workaround has been found:

SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}';

This seemingly redundant addition allows the planner to use the index. The speed improvement is actually negative for cesnet_resolvedabuses (2300ms vs 1900ms for unconditional, actual machine load is on the other hand lower) but this improves considerably for more sparse columns (or rarer values). The index size difference is huge overall.

Multicolumn indices

As GIN indices can be very efficiently combined into multicolumn versions and this property is used extensively in Mentat, it would be beneficial to allow for this option also for sparse columns.
An analysis has shown that this can be done efficiently, for example an combined index for both source_type and tartget_type would be created like this:

CREATE INDEX ON events USING GIN (source_type, target_type) WHERE source_type <> '{}' OR target_type <> '{}';

Such an index is very efficient as the aggregate of source and target types is still very sparse, the index size is currently about 5MB.

Conclusion

To efficiently support sparse columns in Mentat's DB design, all queries using filtering by a sparse column should have the condition changed as follows:

cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz']
to
cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}'

and the indices should be recreated as follows:

CREATE INDEX ON events USING GIN (cesnet_resolvedabuses);
or
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;
to
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';

References

[1] https://www.postgresql.org/message-id/152242988260.6322.11237886300088068445@wrigleys.postgresql.org


Related issues

Related to Mentat - Feature #4385: Unify empty arrays and NULLsRejectedRadko Krkoš10/19/2018

Actions
Actions #1

Updated by Radko Krkoš about 6 years ago

  • Status changed from New to Closed

As this did not (probably righfully) attract any strong feelings or any discussion, let's ignore this proposal. An alternative solution requiring much less programming effort was proposed #4363.

Actions #2

Updated by Pavel Kácha about 6 years ago

  • Status changed from Closed to Feedback

Do you think #4363 is in overall better? I had not much time to grasp this RFC and thought of having live discussion sometimes.

Actions #3

Updated by Radko Krkoš about 6 years ago

  • Assignee changed from Radko Krkoš to Pavel Kácha

The solution proposed in #4363 is not better overall, it is just sufficient for our use case and requires less changes to code. There are distinct situations when approach described in this RFC is preferable. Now that #4363 was already implemented, it would still be an interesting discussion but I would give it a low priority as it is probably not worth redoing.

Actions #4

Updated by Pavel Kácha about 6 years ago

  • Assignee changed from Pavel Kácha to Radko Krkoš
  • Priority changed from Normal to Low

Ack, let it sink for a while. No harm in leaving it open for some talk and reference in 'Future' for some time.

Actions #5

Updated by Pavel Kácha about 6 years ago

Actions #6

Updated by Pavel Kácha about 6 years ago

  • Status changed from Feedback to Closed

Closing after discussion, we don't have a lot of sparse indices (and the big GIN one is compound from a lot of columns and effectively not sparse) to benefit from this.

Actions #7

Updated by Jan Mach almost 6 years ago

  • Target version changed from Backlog to Rejected
Actions

Also available in: Atom PDF