Project

General

Profile

Actions

Feature #4348

closed

Better support for sparse columns

Added by Radko Krkoš about 6 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
Low
Assignee:
Category:
Design
Target version:
Start date:
10/05/2018
Due date:
% Done:

0%

Estimated time:
To be discussed:

Description

Disclaimer

This is an RFC type issue, concrete recommendations will follow the discussion.

Related issues

Required for: #4277 (hard), #4301 (soft)

Background

Sparse columns come in two flavors, using NULL values (for native data types) and using empty arrays. The first variant just works while the second leaves a bit to be desired.

Example: Indexing

1. NULL value

CREATE INDEX on events USING btree (cesnet_eventseverity) WHERE cesnet_eventseverity IS NOT NULL;

A typical query:

SELECT * FROM events WHERE cesnet_eventseverity == 'medium';

The index is used for this query. The index is a bit smaller than the full variant (unconditional).

2. Empty arrays

CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';

A typical query:

SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'];

cannot use this index due to what seems to be a bug in PostgreSQL (similar issue already reported in [1] with no response from the developers). The conditional index is preferable because of the size differences (210MB vs 12MB, but much bigger difference for other, more sparse columns, for example source_type: 3.5GB vs 3MB).

There is already a version of conditional index in use in Mentat:

CREATE INDEX on events USING btree (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;

but this does not exclude any rows for array columns.

A workaround has been found:

SELECT * FROM events WHERE cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}';

This seemingly redundant addition allows the planner to use the index. The speed improvement is actually negative for cesnet_resolvedabuses (2300ms vs 1900ms for unconditional, actual machine load is on the other hand lower) but this improves considerably for more sparse columns (or rarer values). The index size difference is huge overall.

Multicolumn indices

As GIN indices can be very efficiently combined into multicolumn versions and this property is used extensively in Mentat, it would be beneficial to allow for this option also for sparse columns.
An analysis has shown that this can be done efficiently, for example an combined index for both source_type and tartget_type would be created like this:

CREATE INDEX ON events USING GIN (source_type, target_type) WHERE source_type <> '{}' OR target_type <> '{}';

Such an index is very efficient as the aggregate of source and target types is still very sparse, the index size is currently about 5MB.

Conclusion

To efficiently support sparse columns in Mentat's DB design, all queries using filtering by a sparse column should have the condition changed as follows:

cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz']
to
cesnet_resolvedabuses && ARRAY['abuse@cesnet.cz'] AND cesnet_resolvedabuses <> '{}'

and the indices should be recreated as follows:

CREATE INDEX ON events USING GIN (cesnet_resolvedabuses);
or
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses IS NOT NULL;
to
CREATE INDEX ON events USING GIN (cesnet_resolvedabuses) WHERE cesnet_resolvedabuses <> '{}';

References

[1] https://www.postgresql.org/message-id/152242988260.6322.11237886300088068445@wrigleys.postgresql.org


Related issues

Related to Mentat - Feature #4385: Unify empty arrays and NULLsRejectedRadko Krkoš10/19/2018

Actions
Actions

Also available in: Atom PDF