Feature #4383


Consider switching string classification fields to enums

Added by Pavel Kácha almost 3 years ago. Updated over 2 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Development - Core
Target version:
Start date:
10/19/2018
Due date:
% Done:

30%

Estimated time:
To be discussed:

Description

Currently at least Category, *.Type, Severity and Class are stored as strings. Changing them to enums might gain us:

  • space reduction
  • performance (very likely)
  • possibility of GIN index
  • possibility of range queries ("gimme everything from Abusive.Spam to Abusive.Violence", i.e. "all under Abusive")

Warden does it in a similar way:
warden:source:warden3/warden_server/catmap_mysql.json
warden:source:warden3/warden_server/tagmap_mysql.json
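
PostgreSQL enums are ordered by their declaration order, so the range query above can be expressed directly with comparison operators. A minimal sketch (the `event_class` type and the `Abusive.*` labels are hypothetical, chosen only to mirror the example in the description):

CREATE TYPE event_class AS ENUM (
    'Abusive.Spam', 'Abusive.Harassment', 'Abusive.Violence',
    'Malware.Virus', 'Malware.Worm');

-- "All under Abusive": enum comparison follows declaration order.
SELECT * FROM events
WHERE class BETWEEN 'Abusive.Spam' AND 'Abusive.Violence';

The range works only as long as related labels are kept adjacent in the declaration, which is what the later placeholder idea (comment #4) addresses.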


Related issues

Has duplicate: Mentat - Feature #4398: Use established lists of possible values for suitable columns instead of dynamic enumerations (Closed, Radko Krkoš, 10/24/2018)

Blocked by: Mentat - Feature #4230: Make use of existing or implement own mechanism for handling SQL schema migrations (Closed, Pavel Kácha, 07/27/2018)

Actions #1

Updated by Radko Krkoš almost 3 years ago

After testing on mentat-dev, cesnet_eventseverity, the most obvious candidate, was replaced with an enum using:

CREATE TYPE eventseverity AS ENUM ('low', 'medium', 'high', 'critical');
ALTER TABLE events ADD COLUMN cesnet_eventseverity_enum eventseverity;
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'low' WHERE cesnet_eventseverity = 'low';
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'medium' WHERE cesnet_eventseverity = 'medium';
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'high' WHERE cesnet_eventseverity = 'high';
VACUUM VERBOSE events;
UPDATE events SET cesnet_eventseverity_enum = 'critical' WHERE cesnet_eventseverity = 'critical';
VACUUM VERBOSE events;
ALTER TABLE events DROP COLUMN cesnet_eventseverity;
VACUUM VERBOSE events;
CLUSTER VERBOSE events;

(Autovacuum was disabled for all DB alteration tests as mentat-dev has no disk I/O to spare, hence the explicit VACUUMs.)
The table contains 9.65M rows; the impact on table size is ~4 MB, while the index grew by ~5 MB. Performance seems slightly better (from memory, no hard numbers).
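
To quantify such size impact, the table and index footprints can be compared before and after the migration using standard PostgreSQL size functions; a sketch:

-- Heap size, total index size, and grand total (incl. TOAST) for the table.
SELECT pg_size_pretty(pg_relation_size('events'))       AS table_size,
       pg_size_pretty(pg_indexes_size('events'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('events')) AS total_size;

-- Per-index breakdown, to see which index grew.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE relname = 'events';

Running these before the ALTER and after the final CLUSTER gives hard numbers instead of estimates.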

Actions #2

Updated by Radko Krkoš almost 3 years ago

  • Related to Feature #4398: Use established lists of possible values for suitable columns instead of dynamic enumerations added
Actions #3

Updated by Radko Krkoš almost 3 years ago

  • % Done changed from 0 to 30

Just for reference, this approach was used for Source/Target Type conversion (array):

CREATE TYPE src_tgt_type AS ENUM ('Backscatter', 'Botnet', 'CC', 'FastFlux', 'Incomplete', 'Malware', 'MITM', 'Open', 'OriginMalware', 'OriginSandbox', 'OriginSpam', 'Phishing', 'Poisoned', 'Proxy', 'Spam', 'TOR');
ALTER TABLE events ADD COLUMN source_type_enum src_tgt_type[];
ALTER TABLE events ADD COLUMN target_type_enum src_tgt_type[];
VACUUM VERBOSE events;
UPDATE events SET source_type_enum = source_type::src_tgt_type[];
VACUUM VERBOSE events;
UPDATE events SET target_type_enum = target_type::src_tgt_type[];
VACUUM VERBOSE events;
ALTER TABLE events DROP COLUMN source_type;
ALTER TABLE events DROP COLUMN target_type;
VACUUM VERBOSE events;
CLUSTER VERBOSE events;

The limitations as discussed in #4398 still apply.
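
One benefit listed in the description is GIN indexability. GIN's built-in array operator class works for enum arrays like the ones created above, so an index can be added directly (the index name is illustrative):

-- GIN over an enum array column; serves containment/overlap queries.
CREATE INDEX events_source_type_enum_idx
    ON events USING GIN (source_type_enum);

-- Typical containment query the index can answer.
SELECT count(*) FROM events
WHERE source_type_enum @> ARRAY['Botnet'::src_tgt_type];
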

Actions #4

Updated by Pavel Kácha over 2 years ago

Note that I was not ruling out "homemade" integer enums, as those allow for inserting new values and range queries (Vulnerable.First <= something <= Vulnerable.Last). However, as Radko notes, integers cannot be indexed by GIN, which leaves us out in the cold.
But (another of Radko's ideas): enums also support range queries, and support renaming as well, so we could create predefined placeholders for future use (Vulnerable.Open, Vulnerable.Config, Vulnerable.Reserved1, Vulnerable.Reserved2, Vulnerable.Reserved3).
As far as I can tell, the only advantage is the possibility of folding "events_cesnet_eventseverity_idx" into "events_combined_idx", which might speed up combined queries using them.
So - to wrap up - is it worth the effort?
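
The placeholder idea maps onto PostgreSQL's type-alteration commands; a sketch (the `vulnerable_class` type and its labels are hypothetical, following the comment above):

-- Reserve slots at creation time...
CREATE TYPE vulnerable_class AS ENUM (
    'Vulnerable.Open', 'Vulnerable.Config',
    'Vulnerable.Reserved1', 'Vulnerable.Reserved2', 'Vulnerable.Reserved3');

-- ...and repurpose one later without rewriting the table (PostgreSQL 10+).
ALTER TYPE vulnerable_class
    RENAME VALUE 'Vulnerable.Reserved1' TO 'Vulnerable.DDoS';

-- New values can also be inserted at a specific position, preserving range
-- semantics (not allowed inside a transaction block before PostgreSQL 12).
ALTER TYPE vulnerable_class
    ADD VALUE 'Vulnerable.Leak' BEFORE 'Vulnerable.Reserved2';
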

Actions #5

Updated by Pavel Kácha over 2 years ago

(Also - one index less = more hot data in memory)

Actions #6

Updated by Radko Krkoš over 2 years ago

  • Status changed from New to Feedback

Pavel Kácha wrote:

As far as I grasp, the only advantage is the possibility to have “events_cesnet_eventseverity_idx” under “events_combined_idx”, which might speed up combined queries using them.

This might be useful for the reporter, which often filters based on severity. Then again, those queries are limited to a very recent time window, so using the detecttime index is probably a better choice.

So - to wrap up - is it worth the effort?

From the DB perspective, it is not a complicated procedure, and it is already thought out.
However, if the enums are not used to enforce keywords, the only real benefits are getting rid of the "events_cesnet_eventseverity_idx" index (4 GB on production) plus a slight reduction in row size (not much, as the most frequent value by far is 'low', while an enum also takes 4 bytes).
Therefore it is probably not worth it. Let's discuss and resolve this online.
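
The row-size argument can be checked directly: a short string like 'low' and an enum value occupy comparable space, since a 3-character varlena string fits in 4 bytes (1-byte header + 3 bytes) and an enum is stored as a 4-byte OID. A sketch, assuming the eventseverity type from comment #1:

-- Compare on-disk datum sizes of the string vs. the enum representation.
SELECT pg_column_size('low'::text)          AS text_size,
       pg_column_size('low'::eventseverity) AS enum_size;

Longer labels ('medium', 'critical') would show a saving, but 'low' dominates the data, so the aggregate gain stays small.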

Actions #7

Updated by Radko Krkoš over 2 years ago

  • Category changed from Design to Development - Core
  • Status changed from Feedback to Deferred

Based on the VC discussion, we will aim for lower-hanging fruit for now. Let's keep this in the arsenal for the future.

Actions #8

Updated by Radko Krkoš over 2 years ago

  • Related to deleted (Feature #4398: Use established lists of possible values for suitable columns instead of dynamic enumerations)
Actions #9

Updated by Radko Krkoš over 2 years ago

  • Has duplicate Feature #4398: Use established lists of possible values for suitable columns instead of dynamic enumerations added
Actions #10

Updated by Radko Krkoš over 2 years ago

  • Status changed from Deferred to Rejected
Actions #11

Updated by Pavel Kácha over 2 years ago

  • Blocked by Feature #4230: Make use of existing or implement own mechanism for handling SQL schema migrations. added
Actions #12

Updated by Pavel Kácha over 2 years ago

  • Assignee changed from Radko Krkoš to Jan Mach

I think the resolution was that changing Severity to enums seems ok (no changes in code, only schema + addition to GIN index, and we get rid of one index). So reopening with dependency on #4230 and reassigning directly to Mek.

Actions #13

Updated by Jan Mach over 2 years ago

  • Target version changed from Backlog to Rejected
