Feature #4348
closedBetter support for sparse columns
0%
Description
Disclaimer¶
This is an RFC type issue, concrete recommendations will follow the discussion.
Related issues¶
Required for: #4277 (hard), #4301 (soft)
Background¶
Sparse columns come in two flavors, using NULL values (for native data types) and using empty arrays. The first variant just works while the second leaves a bit to be desired.
Example: Indexing
1. NULL value¶
CREATE INDEX on events USING btree (cesnet_eventseverity) WHERE cesnet_eventseverity IS NOT NULL;
A typical query:
SELECT * FROM events WHERE cesnet_eventseverity == 'medium';
The index is used for this query. The index is a bit smaller than the full variant (unconditional).
2. Empty arrays¶
CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';
A typical query:
SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'];
cannot use this index due to what seems to be a bug in PostgreSQL (similar issue already reported in [1] with no response from the developers). The conditional index is preferable because of the size differences (210MB vs 12MB, but much bigger difference for other, more sparse columns, for example source_type: 3.5GB vs 3MB).
There is already a version of conditional index in use in Mentat:
CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;
but this does not exclude any rows for array columns.
A workaround has been found:
SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}';
This seemingly redundant addition allows the planner to use the index. The speed improvement is actually negative for cesnet_resolvedabuses (2300ms vs 1900ms for unconditional, actual machine load is on the other hand lower) but this improves considerably for more sparse columns (or rarer values). The index size difference is huge overall.
Multicolumn indices¶
As GIN indices can be very efficiently combined into multicolumn versions and this property is used extensively in Mentat, it would be beneficial to allow for this option also for sparse columns.
An analysis has shown that this can be done efficiently, for example an combined index for both source_type and tartget_type would be created like this:
CREATE INDEX ON events USING GIN (source_type, target_type) WHERE source_type <> '{}' OR target_type <> '{}';
Such an index is very efficient as the aggregate of source and target types is still very sparse, the index size is currently about 5MB.
Conclusion¶
To efficiently support sparse columns in Mentat's DB design, all queries using filtering by a sparse column should have the condition changed as follows:
cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] to cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}'
and the indices should be recreated as follows:
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses); or CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL; to CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';
References¶
[1] https://www.postgresql.org/message-id/152242988260.6322.11237886300088068445@wrigleys.postgresql.org
Related issues