Feature #3364

Task #3374: Migrate all core modules from legacy Mentat

Implement mentat-storage.py module

Added by Jan Mach over 2 years ago. Updated about 1 year ago.

Status: Closed
Start date: 03/21/2017
Priority: High
Due date:
Assignee: Jan Mach
% Done: 100%
Category: Development - Core
Target version: 2.0

Description

Implement daemon module for storing IDEA messages into database.

Key features:
  • Support for multiple database engines (at least enable quick change)
  • Enable configurable target database (database and collection names should be part of configuration file)
  • Enable storing of multiple messages in batches possibly for better database performance
  • Asynchronous querying (with possibility to ask for query state or partial results) would be nice
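
The engine-independence and batching requirements above could be approached along the following lines. This is only a minimal sketch: `StorageBackend`, `MemoryBackend`, `BatchingWriter` and the `batch_size` parameter are hypothetical names chosen for illustration and do not come from the Mentat code base.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical engine-independent interface; a real backend would wrap a database driver."""

    @abstractmethod
    def insert_many(self, messages):
        """Store a list of messages in one operation."""


class MemoryBackend(StorageBackend):
    """Stand-in backend that keeps messages in a list instead of a real database."""

    def __init__(self):
        self.stored = []
        self.calls = 0

    def insert_many(self, messages):
        self.stored.extend(messages)
        self.calls += 1


class BatchingWriter:
    """Buffers messages and hands them to the backend in batches of batch_size."""

    def __init__(self, backend, batch_size=100):
        self.backend = backend
        self.batch_size = batch_size
        self.buffer = []

    def store(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered; called on a full batch and on shutdown.
        if self.buffer:
            self.backend.insert_many(self.buffer)
            self.buffer = []


backend = MemoryBackend()
writer = BatchingWriter(backend, batch_size=3)
for number in range(7):
    writer.store({"ID": "msg-{}".format(number)})
writer.flush()  # push the incomplete last batch
```

Switching the database engine would then mean supplying another `StorageBackend` subclass, while the batching logic stays unchanged.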

Associated revisions

Revision 5b01f6fa
Added by Jan Mach over 2 years ago

Implemented internal IDEA message representation class.

The default idea.lite library for representing IDEA messages does not take into account custom subkeys that can appear in messages handled by Mentat. This patch introduces a prototype of a new library, which is based on idea.lite and takes custom keys into account. (Redmine issues: #3364 and #1017)

Revision 4164e95a
Added by Jan Mach over 2 years ago

Improvements in mentat.idea.internal library.

Greatly improved code reusability by employing the typedef generator approach used in the underlying typedcol and idea.lite libraries. Additionally, the documentation and unit tests were both improved; the documentation can now be generated using the Sphinx-doc tool. (Redmine issues: #3364, #3361 and #1017)

Revision 53dd8f62
Added by Jan Mach over 2 years ago

Added missing __init__.py file for module mentat.idea.

(Redmine issue: #3364)

Revision daafd215
Added by Jan Mach over 2 years ago

Implemented module for converting IDEA messages to and from MongoDB.

There are classes in the mentat.idea.mongodb module for easy conversion of IDEA messages to and from their MongoDB representation. Appropriate documentation and unit tests were also created. (Redmine issues: #3364, #3361 and #1017)

NOTE: There is probably a bug in the current Perl-based library for message conversion, because all unit test conversions related to timestamps from database messages stored by legacy code fail. There is always a 1 or 2 hour time difference. This issue is not yet fixed.

Revision 9865c900
Added by Jan Mach over 2 years ago

Issue with bson.BSON.encode being unable to encode typedcol.TypedList objects.

The issue is further described in Redmine issue #3364 comment 6.

Revision 6830a9e4
Added by Jan Mach over 2 years ago

Fixed the problem with bson.BSON.encode being unable to encode typedcol.TypedList objects.

In commit 9865c900af39a98c2c5254b06ed4f132dded6929 the mentat.idea.mongodb library was unable to store IDEA messages into MongoDB. The issue was with the bson.BSON encoder, which was hardcoded in a way that handled any unknown object as a dict. We were not able to convince the encoder to treat TypedList objects as lists, so we had to use a different approach and supply an appropriate data structure. The mentat.idea.mongodb.IdeaIn converter now produces a data structure composed of simple dicts and lists instead of TypedDicts and TypedLists.

The current implementation should, however, be considered a prototype and proof of concept, because it will probably be possible to write it in a more elegant way. The current problem is that idea.base.idea_typedef contains hardcoded calls to typedcol.typed_list(), which are not customizable from outside the module via the flavour mechanism. The addon feature was used to monkeypatch these definitions. This is of course not an optimal solution, because any changes in the underlying library must be propagated manually into the mongodb library.

Additionally, IDEA messages stored in the database contain some additional attributes that are database specific and internal and should be stripped upon retrieval from the database. Currently this must be done manually using the truncate() function call. A better solution would be to incorporate this into the typedcol library and strip these attributes during the object instantiation/conversion process.

(Redmine issues: #3364 and #1017)

Revision 25b51380
Added by Jan Mach over 2 years ago

Finished prototype of mentat-storage.py module.

This commit introduces a finished working prototype of the mentat-storage.py real-time message processing module, including appropriate unit tests and basic documentation work. Key features are customization of the target database and collection, and usage of the core database configuration file, which can be overridden with a local config file or command line options. Messages are currently stored in the database one by one; however, batch processing will possibly be implemented in the future.
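
The layered configuration described above can be illustrated with a short sketch. The precedence order (command line over local file over core file) is an assumption, and all keys and values are made up for illustration; the actual option names used by mentat-storage.py are not given here.

```python
from collections import ChainMap

# All keys and values below are hypothetical.
core_config = {"db_host": "localhost", "db_name": "mentat", "db_collection": "alerts"}
local_config = {"db_name": "mentat_test"}
cli_options = {"db_collection": "alerts_pilot"}

# ChainMap searches its maps left to right, so earlier maps take precedence.
config = ChainMap(cli_options, local_config, core_config)

print(config["db_host"])
print(config["db_name"])
print(config["db_collection"])
```

A lookup falls through to the core configuration only when neither the command line nor the local file defines the key.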

(Redmine issues: #3364, #1017 and #3361)

Revision b8c44f38
Added by Jan Mach over 2 years ago

Reimplemented the mentat.idea.internal and mentat.idea.mongodb libraries using latest idea library features.

The latest version of the IDEA library (v0.1.6) resolved both issues mentioned in commit 6830a9e47ef914fc3aac8db9ce594fc40f235223. The first issue was the hardcoded call to typed_list inside the idea library, which made customizations challenging. The second issue was the inability to discard certain message keys using an internal library feature.

The current version of both libraries now uses the features mentioned above, and the implementation is thus cleaner. (Redmine issue: #3364)

Revision d0c65fc9
Added by Jan Mach over 2 years ago

Prepared configuration files for pilot installation.

(Redmine issue: #3364)

Revision 01a731e4
Added by Jan Mach over 2 years ago

Changed default configuration for mentat-storage.py module.

The mentat-storage.py module will most likely be the last module in the processing chain, so it should delete the messages by default, otherwise the whole processing chain will hang. Additionally, there was a small fix that needed to be done in the module unit test file. (Redmine issues: #3364 and #1017)

Revision d3656333
Added by Jan Mach over 2 years ago

Changed default value for output queue for mentat-storage.py.

Default value must be none, otherwise it is not possible to turn the output queue off. (Redmine issues: #3364 and #3387)
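
One way to read this constraint is sketched below. It assumes, purely for illustration, that option values are merged so that `None` means "not given" and the default then applies; `resolve_option` and the paths are hypothetical and not taken from the module.

```python
def resolve_option(cli_value, default):
    """Treat None as 'not given' and fall back to the default (an assumed merging rule)."""
    return default if cli_value is None else cli_value


# With a path as the default the queue cannot be disabled,
# because passing None is indistinguishable from passing nothing.
with_path_default = resolve_option(None, "/var/tmp/queue-out")

# With None as the default, "nothing configured" means "no output queue".
with_none_default = resolve_option(None, None)
explicit_queue = resolve_option("/srv/queue-out", None)

print(with_path_default, with_none_default, explicit_queue)
```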

Revision 6c9d3e14
Added by Jan Mach over 2 years ago

Fix: Storage daemon component must re-serialize processed IDEA messages.

The legacy Mentat system retrieves messages from the database by parsing the 'raw_msg' attribute of the MongoDB object. The storage daemon component was not updating the IDEA serialization after an update, and messages retrieved by the legacy component did not reflect the updates. (Redmine issue: #3364)
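
The idea of the fix can be sketched as follows. The document layout is an assumption made for illustration (every key except 'raw_msg' is treated as message content), and `update_and_reserialize` is a hypothetical helper, not the daemon's actual code.

```python
import json


def update_and_reserialize(document, changes):
    """Apply changes to the structured fields and refresh the serialized copy.

    Assumes that everything except 'raw_msg' is message content.
    """
    document.update(changes)
    message = {key: value for key, value in document.items() if key != "raw_msg"}
    document["raw_msg"] = json.dumps(message, sort_keys=True)
    return document


doc = {
    "ID": "abc",
    "Category": ["Test"],
    "raw_msg": '{"Category": ["Test"], "ID": "abc"}',
}
update_and_reserialize(doc, {"Category": ["Test", "Recon.Scanning"]})
```

A reader that parses only 'raw_msg' now sees the same data as a reader of the structured fields.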

Revision d7849af9
Added by Jan Mach over 1 year ago

Changed mentat-storage.py module to store IDEA messages into PostgreSQL instead of MongoDB.

(Redmine issue: #3364)

Revision a873a2d7
Added by Jan Mach over 1 year ago

Removed unnecessary configurations and command line options from mentat-storage.py module.

(Redmine issue: #3364)

History

#1 Updated by Jan Mach over 2 years ago

  • Parent task set to #3374

#2 Updated by Jan Mach over 2 years ago

  • Priority changed from Normal to High

#3 Updated by Pavel Kácha over 2 years ago

  • Description updated (diff)

#4 Updated by Jan Mach over 2 years ago

  • Status changed from New to In Progress

#5 Updated by Jan Mach over 2 years ago

  • Status changed from In Progress to Feedback
  • Assignee changed from Jan Mach to Pavel Kácha

Implemented base libraries for representing IDEA messages in Mentat system and converting them from that internal representation to appropriate representation in MongoDB and back.

Because the implementation is based on Pavel's typedcol and idea.lite libraries, I would like to ask him to please make a quick review and provide feedback on whether the implementation makes sense to the author of the original library.

Most relevant files for convenience:

source:lib/mentat/idea/internal.py
source:lib/mentat/idea/test_internal.py
source:lib/mentat/idea/mongodb.py
source:lib/mentat/idea/test_mongodb.py

#6 Updated by Jan Mach over 2 years ago

I have encountered a showstopper that is currently preventing us from successfully using the mentat.idea.mongodb library for storing messages into the database. The issue seems to be with the native BSON encoder, which is unable to encode objects of type typedcol.TypedList into BSON:

Traceback (most recent call last):
  File "test_mongodb.py", line 354, in test_04_basic
    result_b = self.collection.insert_one(idea_mongo_in_l)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 630, in insert_one
    bypass_doc_val=bypass_document_validation),
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 535, in _insert
    check_keys, manipulate, write_concern, op_id, bypass_doc_val)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 516, in _insert_one
    check_keys=check_keys)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/pool.py", line 244, in command
    self._raise_connection_failure(error)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/pool.py", line 372, in _raise_connection_failure
    raise error
  File "/usr/local/lib/python3.5/dist-packages/pymongo/pool.py", line 239, in command
    read_concern)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/network.py", line 82, in command
    None, codec_options, check_keys)
bson.errors.InvalidDocument: Cannot encode object: ['abuse@cesnet.cz']

----------------------------------------------------------------------

I have implemented a horrible hack function primitivize() in the module mentat.idea.mongodb to verify this. The appropriate code can be found in the separate branch dev_mek in the repository, in the unit test file mentat.idea.test_mongodb. The primitivize() function is directly in the module mentat.idea.mongodb. I was unable to primitivize only TypedList to list and, due to implementation constraints, also had to primitivize TypedDict to dict.

The bson.BSON.encode documentation sadly confirms that the encode method can cope only with MutableMapping objects:

https://api.mongodb.com/python/3.4.0/api/bson/index.html
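
For illustration, a conversion in the spirit of the primitivize() function described above might look like the sketch below. It is not the code from the repository: `TypedListStandIn` and `TypedDictStandIn` are stand-ins built on the standard library so that the example runs without typedcol, and the sample message is made up.

```python
from collections import UserDict, UserList
from collections.abc import Mapping, Sequence


class TypedListStandIn(UserList):
    """Stand-in for typedcol.TypedList, which is not imported here."""


class TypedDictStandIn(UserDict):
    """Stand-in for typedcol.TypedDict, which is not imported here."""


def primitivize(obj):
    """Recursively convert mapping-like and list-like objects to plain dict and list."""
    if isinstance(obj, Mapping):
        return {key: primitivize(value) for key, value in obj.items()}
    # Strings and bytes are sequences too, but must stay untouched.
    if isinstance(obj, Sequence) and not isinstance(obj, (str, bytes, bytearray)):
        return [primitivize(item) for item in obj]
    return obj


message = TypedDictStandIn({
    "ID": "abc",
    "Source": TypedListStandIn([
        TypedDictStandIn({"Email": TypedListStandIn(["abuse@cesnet.cz"])}),
    ]),
})
plain = primitivize(message)
```

The result contains only built-in containers, which is the kind of structure the encoder accepts.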

#7 Updated by Pavel Kácha over 2 years ago

Quick brainblender:
  • wrap TypedList derivatives in simple
    return tl.data
  • try adding
    bson._ENCODERS[TypedList] = _encode_list

#8 Updated by Pavel Kácha over 2 years ago

Pavel Kácha wrote:

Quick brainblender:
  • try adding bson._ENCODERS[TypedList] = _encode_list

Self answer: won’t work, bson has also fast C version, which is used by default, and is not monkey-patchable.

#9 Updated by Jan Mach over 2 years ago

The latest commit 6830a9e4 in development branch mek_dev fixed the problem with bson.BSON.encode being unable to encode typedcol.TypedList objects. In commit 9865c900 the source:lib/mentat/idea/mongodb.py library was unable to store IDEA messages into MongoDB. The issue was with the bson.BSON encoder, which was hardcoded in a way that handled any unknown object as a dict. We were not able to convince the encoder to treat TypedList objects as lists, so we had to use a different approach and supply an appropriate data structure. The mentat.idea.mongodb.IdeaIn converter now produces a data structure composed of simple dicts and lists instead of TypedDicts and TypedLists.

The current implementation should, however, be considered a prototype and proof of concept, because it will probably be possible to write it in a more elegant way. The current problem is that idea.base.idea_typedef contains hardcoded calls to typedcol.typed_list(), which are not customizable from outside the module via the flavour mechanism. The addon feature was used to monkeypatch these definitions. This is of course not an optimal solution, because any changes in the underlying library must be propagated manually into the source:lib/mentat/idea/mongodb.py library.

Additionally, IDEA messages stored in the database contain some additional attributes that are database specific and internal and should be stripped upon retrieval from the database. Currently this must be done manually using the truncate() function call. A better solution would be to incorporate this into the typedcol library and strip these attributes during the object instantiation/conversion process.
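
A manual stripping step of the kind mentioned above could look roughly like this. The sketch is an assumption, not the actual truncate() implementation: MongoDB does add the `_id` key to stored documents, but the `internal_keys` parameter and the `ts` key are hypothetical.

```python
def truncate(document, internal_keys=("_id",)):
    """Return a copy of the document without database-internal keys.

    Which keys count as internal is an assumption made for this sketch.
    """
    return {key: value for key, value in document.items() if key not in internal_keys}


stored = {"_id": 1, "ID": "abc", "ts": 1490000000}
message = truncate(stored, internal_keys=("_id", "ts"))
```

The original document is left untouched, so the internal attributes remain available to the caller if needed.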

#10 Updated by Jan Mach over 2 years ago

Finished prototype of mentat-storage.py module.

Commit 25b51380 introduces a finished working prototype of the mentat-storage.py real-time message processing module, including appropriate unit tests and basic documentation work. Key features are customization of the target database and collection, and usage of the core database configuration file, which can be overridden with a local config file or command line options. Messages are currently stored in the database one by one; however, batch processing will possibly be implemented in the future.

Next work:
  • test deployment on development server with continuous processing of randomly generated messages
  • test deployment on production server with continuous processing of real messages and storing them to different database and collection
  • production deployment

#11 Updated by Pavel Kácha over 2 years ago

Jan Mach wrote:

Current implementation should however be considered as prototype and proof of concept, because it probably will be possible to write it in more elegant way. The current problem is, that the idea.base.idea_typedef contains hardcoded calls for typedcol.typed_list(), which are not customizable from outside of the module via flavour mechanism. The addon feature was used to monkeypatch these definitions. This is of course not optimal solution, because any changes in underlying library must be propagated manually into source:lib/mentat/idea/mongodb.py library.

Ahha, hardcoded typedcol in idea.base.idea_typedef. Good point. Ok, how about changing:

def idea_typedef(flavour, list_flavour, defaults_flavour, source_target_dict, attach_dict, node_dict, addon=None)

to explicit

def idea_typedef(flavour, list_flavour, defaults_flavour, source_list, target_list, attach_list, node_list, addon=None)

Usage then would be something akin to:

    typedef = base.idea_typedef(
        idea_types,
        idea_lists,
        idea_defaults,
        typedcol.typed_list("SourceList", SourceTargetDict),
        typedcol.typed_list("TargetList", SourceTargetDict),
        typedcol.typed_list("AttachList", AttachDict),
        typedcol.typed_list("NodeList", NodeDict),
    )

or, for type-stripped version:

    class SourceTargetList(typedcol.TypedList):
        item_type = simplify(SourceTargetDict)

    class AttachList(typedcol.TypedList):
        item_type = simplify(AttachDict)

    class NodeList(typedcol.TypedList):
        item_type = simplify(NodeDict)

    typedef = base.idea_typedef(
        idea_types,
        idea_lists,
        idea_defaults,
        simplify(SourceTargetList),
        simplify(SourceTargetList),
        simplify(AttachList),
        simplify(NodeList),
    )

Would that be ok?

#12 Updated by Jan Mach over 2 years ago

Pavel Kácha wrote:

Would that be ok?

Yes, that was also my initial idea. That would definitely solve our issue and all custom libraries would be more robust and more customizable.

#13 Updated by Pavel Kácha over 2 years ago

  • Assignee changed from Pavel Kácha to Jan Mach

Jan Mach wrote:

Additionally, IDEA messages stored in the database contain some additional attributes that are database specific and internal and should be stripped upon retrieval from the database. Currently this must be done manually using the truncate() function call. A better solution would be to incorporate this into the typedcol library and strip these attributes during the object instantiation/conversion process.

Are you positively sure that you want them stripped completely? You can’t get to them later.

Is it necessary only statically and only in TypedDict (not TypedList)?

Possible solutions:

typedef = {
    "unwanted_one": {
        "drop": True
    }
}

Or, slightly more flexible (however seems out of scope of typedcol for me) possibility, usable also in TypedList:

# More pythonic
def UnwantedType(s):
    raise typedcol.Drop

or

# probably faster
def UnwantedType(s):
    return typedcol.Drop

and

typedef = {
    "unwanted_one": {
        "type": UnwantedType
    }
}
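
To make the sentinel variant concrete, here is a minimal sketch of how a converter could honour such a marker. It is not the typedcol implementation: `Drop`, `unwanted_type` and `convert` are stand-ins written for this example, and the typedef format is reduced to a plain "type" callable per key.

```python
class Drop:
    """Hypothetical sentinel; a type callable returns it to signal that the key should be dropped."""


def unwanted_type(value):
    # Sentinel variant: return the marker instead of raising it.
    return Drop


def convert(data, typedef):
    """Apply the per-key "type" callables from typedef and skip keys that yield Drop."""
    result = {}
    for key, value in data.items():
        converter = typedef.get(key, {}).get("type", lambda item: item)
        converted = converter(value)
        if converted is Drop:
            continue
        result[key] = converted
    return result


typedef = {
    "ID": {"type": str},
    "unwanted_one": {"type": unwanted_type},
}
cleaned = convert({"ID": 42, "unwanted_one": "internal", "Note": "kept as-is"}, typedef)
```

Because the decision is made per value, the same marker could also be honoured while converting list items, which matches the remark that this variant is usable in TypedList as well.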

#14 Updated by Pavel Kácha over 2 years ago

Both in.

idea:102034e87b794fcb2f5c5ca2c225e167ebe4fcda

  • explicit list args in idea_typedef
  • list_factory callable in list_types

idea:3a43637b9b6c24cdee66a564b91bd68a8f0d924e

  • Discarding of elements in TypedDict (most common use: "type": Discard)

#15 Updated by Pavel Kácha over 2 years ago

Please check the correctness of the generated structure for IPs - min, max, NO ip for networks, min max ip for single ips.

#16 Updated by Jan Mach almost 2 years ago

  • Status changed from Feedback to In Progress
  • % Done changed from 0 to 80

#17 Updated by Jan Mach about 1 year ago

  • Status changed from In Progress to Closed
  • % Done changed from 80 to 100

The current state of this module is sufficient for the production environment. We are finally releasing version 2.0 of the Mentat system, so the period of frantic coding and implementation chaos is over. Any further improvements of this module will be done, as they should be, in separate Redmine issues.
