Task #4570

closed

Use RAM based file system for message queue directory

Added by Jan Mach over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Installation
Target version:
Start date:
01/17/2019
Due date:
% Done:

100%

Estimated time:
To be discussed:

Description

We are encountering some very heavy IO on our servers due to the nature of the message exchange protocol (filesystem-based queues). Try to implement and test a RAM-based filesystem for the message queue directory. Start on our test server and monitor before possibly deploying the solution to the production server.
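For the monitoring part, the write load can be watched for example with iostat from the sysstat package; this is only a generic illustration, not a prescribed procedure:

    # extended per-device statistics every 5 seconds; watch w/s, wkB/s and %util
    # on the device holding the queue directory
    iostat -x 5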

Actions #1

Updated by Jan Mach over 5 years ago

  • Status changed from New to In Progress
  • Assignee changed from Jan Mach to Radko Krkoš
  • % Done changed from 0 to 50

I have just set up the RAM-based filesystem (tmpfs) for the message exchange queues on our test server mentat-alt. I have also added the appropriate section to the documentation (see the attached commit for details; at this time it is not yet available on our build server) that describes the steps taken in the process.
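For illustration only, a tmpfs mount of this kind is typically configured along these lines; the mount point, size and mode below are assumptions, not the values from the attached commit:

    # /etc/fstab - tmpfs for the message queue directory (illustrative values)
    tmpfs  /var/mentat/spool  tmpfs  defaults,size=2G,mode=0750  0  0

    # or mounted manually for testing:
    # mount -t tmpfs -o size=2G,mode=0750 tmpfs /var/mentat/spool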

I am switching this task temporarily to Radko to let him monitor the performance on the target system. Please move the task back to me when the time is right to implement the solution on our production server (or to abandon the idea and move back to the previous setup).

Actions #2

Updated by Jan Mach over 5 years ago

  • Status changed from In Progress to Feedback
Actions #3

Updated by Radko Krkoš over 5 years ago

  • Status changed from Feedback to Deferred

The amount of stored data on mentat-alt is not enough to model the out-of-cache situation. The cleanup limit was increased to 12 weeks, which should be enough. We need to wait for the data to flow in, so deferring until then.

Actions #4

Updated by Radko Krkoš over 5 years ago

Just a remark: according to [1], tmpfs content can be swapped to disk if low on memory, so in the end we might only save on inode manipulation. Nevertheless, our import pipeline is the intended use case of tmpfs, so this was a good idea. Also, according to [1], no RAM is actually wasted if the ramdisk is empty; the ramdisk size is technically just an upper limit.

[1] https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt

Actions #5

Updated by Jan Mach over 5 years ago

Radko Krkoš wrote:

Just a remark: according to [1], tmpfs content can be swapped to disk if low on memory, so in the end we might only save on inode manipulation. Nevertheless, our import pipeline is the intended use case of tmpfs, so this was a good idea. Also, according to [1], no RAM is actually wasted if the ramdisk is empty; the ramdisk size is technically just an upper limit.

[1] https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt

Thank you for the remarks. I think I will add them to the documentation page, because then it will be clear we have taken this into consideration.

Radko, please feel free to extend and update the documentation as you see fit. This is not the first time you have provided valuable information, and I am not sure we have written all of it into the appropriate documentation pages (I am talking specifically about your outstanding database work).

Actions #6

Updated by Jan Mach over 5 years ago

  • Status changed from Deferred to Feedback

So what do you think about this, guys? Should we try to implement it on our production server, or should I move it to the next release? The process is documented here and I haven't encountered any problems on our mentat-alt server.

Actions #7

Updated by Pavel Kácha over 5 years ago

I think we won't get many more "observations" about benefits/drawbacks on mentat-alt. I'd go for pushing this to production and observing behaviour during crises (maybe ask Dan to turn the Proki on again *grin*). Your opinion, Radko?

Actions #8

Updated by Radko Krkoš over 5 years ago

Pavel Kácha wrote:

I think we won't get many more "observations" about benefits/drawbacks on mentat-alt. I'd go for pushing this to production and observing behaviour during crises (maybe ask Dan to turn the Proki on again *grin*). Your opinion, Radko?

I would still like to test this. Quite a bit can be learned by looking at io-wait and user times. This got deferred because the amount of data on mentat-alt fitted into memory and I could not model the out-of-cache scenario reliably. The amount of data has risen since, so we can proceed with the observations. Also, reusing the "Proki load-test" and monitoring the import behaviour on mentat-hub and mentat-alt in parallel could be highly beneficial.
That being said, the drawbacks have been looked into extensively and the only one remaining is ensuring that import is re-run since the last committed event so as not to lose data. I would like a solution for that. Apart from this, I see no other drawbacks, as was already discussed.
If this issue blocks the 2.3 release, please move it to the next one. I think it is important to have an understanding of the performance difference before rolling it out to world/production.

Actions #9

Updated by Pavel Kácha over 5 years ago

Radko Krkoš wrote:

Also, reusing the "Proki load-test" and monitoring the import behaviour on mentat-hub and mentat-alt in parallel could be highly beneficial.

Ok. I have chatted with Dan just now, he is eager to hit us again. Radko, please fire up some mail to him (dans), Mek (it would be fine to have him around for possible mitigation on Mentat-hub) and me, and let's come up with a time.

That being said, the drawbacks have been looked into extensively and the only one remaining is ensuring that import is re-run since the last committed event so as not to lose data. I would like a solution for that.

Not sure I get it, what do you mean by ensuring import re-run?

Actions #10

Updated by Radko Krkoš over 5 years ago

Pavel Kácha wrote:

Radko Krkoš wrote:

Also, reusing the "Proki load-test" and monitoring the import behaviour on mentat-hub and mentat-alt in parallel could be highly beneficial.

Ok. I have chatted with Dan just now, he is eager to hit us again. Radko, please fire up some mail to him (dans), Mek (it would be fine to have him around for possible mitigation on Mentat-hub) and me, and let's come up with a time.

Will do.

That being said, the drawbacks have been looked into extensively and the only one remaining is ensuring that import is re-run since the last committed event so as not to lose data. I would like a solution for that.

Not sure I get it, what do you mean by ensuring import re-run?

If the data in the Mentat import pipeline is stored in volatile memory, all of it is lost on reboot. A mechanism needs to be in place that allows reading the data from Warden since the last successfully committed event. Otherwise there will be a window of lost events.

Actions #11

Updated by Pavel Kácha over 5 years ago

Radko Krkoš wrote:

If the data in the Mentat import pipeline is stored in volatile memory, all of it is lost on reboot. A mechanism needs to be in place that allows reading the data from Warden since the last successfully committed event. Otherwise there will be a window of lost events.

Ah, I see what you mean. This is not easy; that would need some mechanism which feeds the last Warden id from mentat-storage back to warden-filer - but there is no info, or rather no notion, of the Warden event id in the pipeline after warden-filer.

Actions #12

Updated by Jan Mach over 5 years ago

  • Target version changed from 2.3 to 2.4

Pavel Kácha wrote:

Radko Krkoš wrote:

If the data in the Mentat import pipeline is stored in volatile memory, all of it is lost on reboot. A mechanism needs to be in place that allows reading the data from Warden since the last successfully committed event. Otherwise there will be a window of lost events.

Ah, I see what you mean. This is not easy; that would need some mechanism which feeds the last Warden id from mentat-storage back to warden-filer - but there is no info, or rather no notion, of the Warden event id in the pipeline after warden-filer.

Yes, there is no way for Mentat to pair an event from the database with a Warden ID. We could think of some solutions, but I am not sure this mechanism is even necessary. Such a mechanism would most likely slow down processing even when using regular filesystem queues. I think that in this case the volatility of RAM and the possibility of data loss in case of a power outage should be considered a feature, not a bug. The administrator must decide, based on the performance of the host system, whether the possibility of data loss is a worthy trade-off for the performance increase. I am not sure it is possible to guarantee prevention of data loss in every possible scenario.

I do not have a problem moving this to the next version, if you think there is still a reason for that.

Actions #13

Updated by Pavel Kácha over 5 years ago

We could do some dances with init scripts (like saving the ramdisks' contents when shutting down and recovering them on start), but I am not sure if it's worth it. If the system hits the pavement, we cannot do much.

Actions #14

Updated by Radko Krkoš over 5 years ago

I have quite a nice solution in mind, but please let's discuss this on a VC; it is too lengthy to explain properly in writing. This issue is on the agenda for the next meeting anyway.

Actions #15

Updated by Pavel Kácha over 5 years ago

Radko Krkoš wrote:

I have quite a nice solution in mind, but please let's discuss this on a VC; it is too lengthy to explain properly in writing. This issue is on the agenda for the next meeting anyway.

Wrap-up from 2019-02-07: We could teach the Warden server to look up an id by event id, or even replace the Warden numeric last-id with the event id. Then we could push some sophistication into the filer initscripts - forget your own notion of the id and get the last event id for querying the Warden server from the Mentat alerts database.

But - this seems like a lot of work for the future, so let's stay realistic for now - if we lose, we lose.

An easier (and less reliable) way could be to save the non-empty contents of the ramdisks during system shutdown and restore them during boot. Saving must run after all Mentat daemons (using the filesystem queues) are already down; restoring can run in parallel. Not sure we have made a decision. Is it worth it? Opinions?
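A minimal sketch of such a save/restore pair; the paths are hypothetical and the ordering would have to match the real installation:

    #!/bin/sh
    # save-queues.sh - run at shutdown, after all Mentat daemons are stopped.
    # QUEUE_DIR (tmpfs) and BACKUP_DIR (persistent disk) are hypothetical paths.
    QUEUE_DIR="/var/mentat/spool"
    BACKUP_DIR="/var/mentat/spool-backup"
    mkdir -p "$BACKUP_DIR"
    cp -a "$QUEUE_DIR"/. "$BACKUP_DIR"/

    # restore-queues.sh - run at boot, before the Mentat daemons are started:
    #   cp -a "$BACKUP_DIR"/. "$QUEUE_DIR"/ && rm -rf "$BACKUP_DIR"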

Actions #16

Updated by Radko Krkoš over 5 years ago

Another option that was tested during the mentat-alt preparation for stress-testing (PROKI) is shutting down warden-filer first and waiting for the Mentat import pipeline to finish processing. This generally takes only a few seconds and covers the same use cases as saving the RAMdisk to non-volatile storage. It should also be comparatively easy to implement. Maybe by adding another shutdown signal?
System restarts are infrequent and the added shutdown latency will not matter significantly. Of course this still leaves the "recovery from improper shutdown" use case, which would require a more complex solution such as the one discussed.
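A rough sketch of that drain-first shutdown order; the service names and the queue path are assumptions for illustration, not the actual unit names:

    #!/bin/sh
    # stop the feeder first, wait for the queues to empty, then stop the pipeline
    QUEUE_DIR="/var/mentat/spool"
    systemctl stop warden_filer_receiver.service   # hypothetical unit name
    for i in $(seq 1 60); do                       # wait up to 60 s for the drain
        [ -z "$(find "$QUEUE_DIR" -type f -print -quit)" ] && break
        sleep 1
    done
    systemctl stop mentat.service                  # hypothetical unit name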

Actions #17

Updated by Radko Krkoš over 5 years ago

An analysis of system behaviour was performed during the PROKI stress test. The data is not terribly insightful, as the amount of stored events was quite low and the "out of cache" state was not triggered. Furthermore, the performance characteristics of the disk subsystems on mentat-hub and mentat-alt are very different, with the same IOps resulting in much lower total utilization on mentat-alt, which skews the results. Two lessons have been learned:
  1. The RAM disk approach seems to lead to lower total writes (about 1/2). This seems counter-intuitive and has not been analyzed further yet.
  2. The sheer number of events per second is not the defining factor for overload. The case of extra large events seems to be the true problem. This was not encountered during the test. The expectation is that the RAMdisk would help in this case.

Let's discuss this sometime, but it should not be viewed as a blocker for deployment to production.

Actions #18

Updated by Radko Krkoš over 5 years ago

  • Assignee changed from Radko Krkoš to Jan Mach
Actions #19

Updated by Pavel Kácha over 5 years ago

Radko Krkoš wrote:

Another option that was tested during the mentat-alt preparation for stress-testing (PROKI) is shutting down warden-filer first and waiting for the Mentat import pipeline to finish processing. This generally takes only a few seconds and covers the same use cases as saving the RAMdisk to non-volatile storage. It should also be comparatively easy to implement. Maybe by adding another shutdown signal?
System restarts are infrequent and the added shutdown latency will not matter significantly. Of course this still leaves the "recovery from improper shutdown" use case, which would require a more complex solution such as the one discussed.

That's it: one hung daemon in the pipeline would mean it does not work. When writing/changing the start/shutdown scripts already, a simple cp -r (or find ... | cp ...) seems quite simple to me...

Actions #20

Updated by Radko Krkoš over 5 years ago

Pavel Kácha wrote:

That's it: one hung daemon in the pipeline would mean it does not work.

True. Is that worse than the current state? Does this ever happen? (I am just curious).

When writing/changing the start/shutdown scripts already, a simple cp -r (or find ... | cp ...) seems quite simple to me...

I cannot disagree.

Actions #21

Updated by Pavel Kácha over 5 years ago

Radko Krkoš wrote:

Pavel Kácha wrote:

That's it: one hung daemon in the pipeline would mean it does not work.

True. Is that worse than the current state? Does this ever happen? (I am just curious).

I hope not during standard operation. However, a reboot can sometimes be used as a last-resort measure in the case of hard-to-debug-or-solve problems, where something in the system is congested or hung.

When writing/changing the start/shutdown scripts already, a simple cp -r (or find ... | cp ...) seems quite simple to me...

I cannot disagree.

... aand it seems to me that the simpler "copy out - copy in" solves both cases.

Actions #22

Updated by Pavel Kácha over 5 years ago

After a brief talk about draining queues:

  • we could use a different signal for graceful stop (finish the queue first, then shut down) - a rough sketch follows below
    • either replace SIGUSR1
    • or use SIGINT (Ctrl-C) for immediate shutdown (so the user is able to stop the script immediately when run from the command line as a one-shot) and change SIGTERM to do immediate shutdown (finish what is in memory or roll back)
  • then we can have a default target in the controller for graceful stop and another for immediate stop
  • the filer would need some more care - then we can teach it to also use these signals; until then some timeouts may work
  • we would have to create dependencies (the fetching filer shuts down first, Mentat second, the sending filer last)

(Doesn't cp out and cp in look better now? )
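A minimal sketch of the graceful vs. immediate stop idea from the list above; the signal assignments, queue path and processing step are purely illustrative (the real daemons are Python and the final semantics were not decided here):

    #!/bin/bash
    QUEUE_DIR="/var/mentat/spool/example/incoming"  # hypothetical queue directory
    RUNNING=1
    DRAIN=0
    trap 'DRAIN=1; RUNNING=0' USR1      # graceful stop: drain the queue, then exit
    trap 'DRAIN=0; RUNNING=0' TERM INT  # immediate stop: finish the current item only

    queue_nonempty() { [ -n "$(find "$QUEUE_DIR" -type f -print -quit)" ]; }

    while [ "$RUNNING" -eq 1 ] || { [ "$DRAIN" -eq 1 ] && queue_nonempty; }; do
        # process one message file here (placeholder)
        sleep 1
    done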

Actions #23

Updated by Jan Mach over 5 years ago

  • Status changed from Feedback to Closed
  • % Done changed from 50 to 100

I have worked according to the prepared documentation and everything went smoothly. RAM-based filesystem message queues are now deployed on the mentat-hub server.

We have been testing this concept for quite some time now on the mentat-alt test server without any problems, so I am fairly confident in closing this issue as resolved.
