Feature #4383
Status: Closed
Consider switching string classification fields to enums
Added by Pavel Kácha about 6 years ago. Updated almost 6 years ago.
30%
Description
Currently at least Category, *.Type, Severity and Class are stored as strings. Changing them to enums might gain us:
- space reduction
- performance (very likely)
- possibility of GIN index
- possibility of range queries ("gimme everything from Abusive.Spam to Abusive.Violence", i.e. "all under Abusive")
Warden does it in a similar way:
warden:source:warden3/warden_server/catmap_mysql.json
warden:source:warden3/warden_server/tagmap_mysql.json
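A minimal sketch of the range-query idea, assuming a hypothetical eventcategory enum whose values are declared in hierarchical order (the names below are illustrative, not the actual Mentat schema):

```sql
-- Enum values compare according to their declaration order.
CREATE TYPE eventcategory AS ENUM (
    'Abusive.Spam', 'Abusive.Harassment', 'Abusive.Violence',
    'Malware.Virus', 'Malware.Worm'
);

-- "Everything under Abusive" then becomes a plain range predicate:
SELECT * FROM events
WHERE category BETWEEN 'Abusive.Spam' AND 'Abusive.Violence';
```

This only works if the declaration order is kept consistent with the hierarchy, which constrains how new values can be added later.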
Related issues
Updated by Radko Krkoš about 6 years ago
After testing on mentat-dev, replacing cesnet_eventseverity (the most obvious candidate) with an enum using:
CREATE TYPE eventseverity AS ENUM ('low', 'medium', 'high', 'critical');
ALTER TABLE events ADD COLUMN cesnet_eventseverity_enum eventseverity;
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'low' WHERE cesnet_eventseverity = 'low';
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'medium' WHERE cesnet_eventseverity = 'medium';
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'high' WHERE cesnet_eventseverity = 'high';
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'critical' WHERE cesnet_eventseverity = 'critical';
VACUUM VERBOSE events;
ALTER TABLE events DROP COLUMN cesnet_eventseverity;
VACUUM VERBOSE events;
CLUSTER VERBOSE events;
(autovacuum disabled for all DB alteration tests as mentat-dev has no disk IO to spare, hence the VACUUMs)
the table contains 9.65M rows; the impact on table size is ~4MB and the index grew by ~5MB. The performance seems a bit better (from memory, no hard numbers).
Updated by Radko Krkoš about 6 years ago
- Related to Feature #4398: Use established lists of possible values for suitable columns instead of dynamic enumerations added
Updated by Radko Krkoš about 6 years ago
- % Done changed from 0 to 30
Just for reference, this approach was used for Source/Target Type conversion (array):
CREATE TYPE src_tgt_type AS ENUM ('Backscatter', 'Botnet', 'CC', 'FastFlux', 'Incomplete', 'Malware', 'MITM', 'Open', 'OriginMalware', 'OriginSandbox', 'OriginSpam', 'Phishing', 'Poisoned', 'Proxy', 'Spam', 'TOR');
ALTER TABLE events ADD COLUMN source_type_enum src_tgt_type[];
ALTER TABLE events ADD COLUMN target_type_enum src_tgt_type[];
VACUUM VERBOSE events;
UPDATE events SET source_type_enum = source_type::src_tgt_type[];
VACUUM VERBOSE events;
UPDATE events SET target_type_enum = target_type::src_tgt_type[];
VACUUM VERBOSE events;
ALTER TABLE events DROP COLUMN source_type;
ALTER TABLE events DROP COLUMN target_type;
VACUUM VERBOSE events;
CLUSTER VERBOSE events;
The limitations as discussed in #4398 still apply.
Updated by Pavel Kácha almost 6 years ago
Note that I was not ruling out "homemade" integer enums, as those allow inserting new values and range queries (Vulnerable.First <= something <= Vulnerable.Last). However, as Radko notes, integers cannot be indexed by GIN, which leaves us out in the cold.
But, per another of Radko's ideas, enums also support ranges as well as renaming, so we could create predefined placeholders for future use (Vulnerable.Open, Vulnerable.Config, Vulnerable.Reserved1, Vulnerable.Reserved2, Vulnerable.Reserved3).
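A sketch of the placeholder idea, assuming a hypothetical eventclass enum and PostgreSQL 10 or newer (which introduced ALTER TYPE ... RENAME VALUE):

```sql
CREATE TYPE eventclass AS ENUM (
    'Vulnerable.Open', 'Vulnerable.Config',
    'Vulnerable.Reserved1', 'Vulnerable.Reserved2', 'Vulnerable.Reserved3'
);

-- Later, a reserved slot can be repurposed without rewriting the table;
-- the renamed value keeps its position in the enum's sort order, so
-- existing range predicates stay valid:
ALTER TYPE eventclass RENAME VALUE 'Vulnerable.Reserved1' TO 'Vulnerable.Weak';
```

The trade-off is that the number of reserved slots per prefix has to be guessed up front.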
As far as I grasp, the only advantage is the possibility to have "events_cesnet_eventseverity_idx" under "events_combined_idx", which might speed up combined queries using them.
So - to wrap up - is it worth the effort?
Updated by Pavel Kácha almost 6 years ago
(Also - one index less = more hot data in memory)
Updated by Radko Krkoš almost 6 years ago
- Status changed from New to Feedback
Pavel Kácha wrote:
As far as I grasp, the only advantage is the possibility to have "events_cesnet_eventseverity_idx" under "events_combined_idx", which might speed up combined queries using them.
This might be useful for the reporter, which often filters based on severity. Then again, those queries are limited to a very recent time window, so using the detecttime index is probably a better choice.
So - to wrap up - is it worth the effort?
From the DB perspective, it is not a complicated procedure. Also, it is already thought out.
However, if the enums are not used to enforce keywords, the only real benefits are getting rid of the "events_cesnet_eventseverity_idx" index (4GB on production) plus a slight reduction in row size (not much: the most frequent value by far is 'low', which as a short text already occupies the same 4 bytes an enum takes).
Therefore it is probably not worth it. Let's discuss and resolve this online.
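The row-size point above can be checked directly in a scratch PostgreSQL session (the type name below is illustrative):

```sql
-- A short text value uses a 1-byte varlena header plus its characters,
-- while an enum value is stored as a 4-byte OID, so 'low' costs the
-- same either way:
SELECT pg_column_size('low'::text);   -- 4
CREATE TYPE sev AS ENUM ('low', 'medium', 'high', 'critical');
SELECT pg_column_size('low'::sev);    -- 4
```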
Updated by Radko Krkoš almost 6 years ago
- Category changed from Design to Development - Core
- Status changed from Feedback to Deferred
Based on the VC discussion, we will aim for lower-hanging fruit for now. Let's keep this in the arsenal for the future.
Updated by Radko Krkoš almost 6 years ago
- Related to deleted (Feature #4398: Use established lists of possible values for suitable columns instead of dynamic enumerations)
Updated by Radko Krkoš almost 6 years ago
- Has duplicate Feature #4398: Use established lists of possible values for suitable columns instead of dynamic enumerations added
Updated by Radko Krkoš almost 6 years ago
- Status changed from Deferred to Rejected
Updated by Pavel Kácha almost 6 years ago
- Blocked by Feature #4230: Make use of existing or implement own mechanism for handling SQL schema migrations. added
Updated by Pavel Kácha almost 6 years ago
- Assignee changed from Radko Krkoš to Jan Mach
I think the resolution was that changing Severity to enums seems ok (no changes in code, only schema + addition to GIN index, and we get rid of one index). So reopening with dependency on #4230 and reassigning directly to Mek.
Updated by Jan Mach almost 6 years ago
- Target version changed from Backlog to Rejected