Feature #4348
closedBetter support for sparse columns
0%
Description
Disclaimer¶
This is an RFC type issue, concrete recommendations will follow the discussion.
Related issues¶
Required for: #4277 (hard), #4301 (soft)
Background¶
Sparse columns come in two flavors, using NULL values (for native data types) and using empty arrays. The first variant just works while the second leaves a bit to be desired.
Example: Indexing
1. NULL value¶
CREATE INDEX on events USING btree (cesnet_eventseverity) WHERE cesnet_eventseverity IS NOT NULL;
A typical query:
SELECT * FROM events WHERE cesnet_eventseverity == 'medium';
The index is used for this query. The index is a bit smaller than the full variant (unconditional).
2. Empty arrays¶
CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';
A typical query:
SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'];
cannot use this index due to what seems to be a bug in PostgreSQL (similar issue already reported in [1] with no response from the developers). The conditional index is preferable because of the size differences (210MB vs 12MB, but much bigger difference for other, more sparse columns, for example source_type: 3.5GB vs 3MB).
There is already a version of conditional index in use in Mentat:
CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;
but this does not exclude any rows for array columns.
A workaround has been found:
SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}';
This seemingly redundant addition allows the planner to use the index. The speed improvement is actually negative for cesnet_resolvedabuses (2300ms vs 1900ms for unconditional, actual machine load is on the other hand lower) but this improves considerably for more sparse columns (or rarer values). The index size difference is huge overall.
Multicolumn indices¶
As GIN indices can be very efficiently combined into multicolumn versions and this property is used extensively in Mentat, it would be beneficial to allow for this option also for sparse columns.
An analysis has shown that this can be done efficiently, for example an combined index for both source_type and tartget_type would be created like this:
CREATE INDEX ON events USING GIN (source_type, target_type) WHERE source_type <> '{}' OR target_type <> '{}';
Such an index is very efficient as the aggregate of source and target types is still very sparse, the index size is currently about 5MB.
Conclusion¶
To efficiently support sparse columns in Mentat's DB design, all queries using filtering by a sparse column should have the condition changed as follows:
cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] to cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}'
and the indices should be recreated as follows:
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses); or CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL; to CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';
References¶
[1] https://www.postgresql.org/message-id/152242988260.6322.11237886300088068445@wrigleys.postgresql.org
Related issues
Updated by Radko Krkoš about 6 years ago
- Status changed from New to Closed
As this did not (probably righfully) attract any strong feelings or any discussion, let's ignore this proposal. An alternative solution requiring much less programming effort was proposed #4363.
Updated by Pavel Kácha about 6 years ago
- Status changed from Closed to Feedback
Do you think #4363 is in overall better? I had not much time to grasp this RFC and thought of having live discussion sometimes.
Updated by Radko Krkoš about 6 years ago
- Assignee changed from Radko Krkoš to Pavel Kácha
The solution proposed in #4363 is not better overall, it is just sufficient for our use case and requires less changes to code. There are distinct situations when approach described in this RFC is preferable. Now that #4363 was already implemented, it would still be an interesting discussion but I would give it a low priority as it is probably not worth redoing.
Updated by Pavel Kácha about 6 years ago
- Assignee changed from Pavel Kácha to Radko Krkoš
- Priority changed from Normal to Low
Ack, let it sink for a while. No harm in leaving it open for some talk and reference in 'Future' for some time.
Updated by Pavel Kácha about 6 years ago
- Related to Feature #4385: Unify empty arrays and NULLs added
Updated by Pavel Kácha about 6 years ago
- Status changed from Feedback to Closed
Closing after discussion, we don't have a lot of sparse indices (and the big GIN one is compound from a lot of columns and effectively not sparse) to benefit from this.
Updated by Jan Mach almost 6 years ago
- Target version changed from Backlog to Rejected