Task #4570: Use RAM based file system for message queue directory - Mentat - Homeproj: Redmine for CESNET

Task #4570

closed

Use RAM based file system for message queue directory

Added by Jan Mach almost 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: Jan Mach
Category: Installation
Target version: 2.4
Start date: 01/17/2019
Due date:
% Done: 100%
Estimated time:
To be discussed:

Description

We are encountering very heavy IO on our servers due to the nature of our message exchange protocol (filesystem-based queues). Try to implement and test a RAM based filesystem for the message queue directory. Start on our test server and monitor it before possibly deploying the solution to the production server.

Updated by Jan Mach almost 6 years ago

Status changed from New to In Progress
Assignee changed from Jan Mach to Radko Krkoš
% Done changed from 0 to 50

I have just set up the RAM based filesystem (tmpfs) for message exchange queues on our test server mentat-alt. I have also added the appropriate section to documentation (see attached commit for details, at this time it is not yet available on our build server), that describes the steps taken in the process.
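
A quick way to verify that a queue directory really sits on tmpfs is to look it up in `/proc/mounts`. The following is only a minimal sketch: the path `/var/mentat/spool` is a hypothetical example and not taken from the documentation mentioned above, and the helper name `filesystem_type` is made up for illustration.

```python
from typing import Optional


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts (Linux), where each line
    reads "device mountpoint fstype options dump pass".
    """
    best_mountpoint = ""
    best_fstype = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        # Match the mount point itself or any path below it.
        prefix = mountpoint.rstrip("/") + "/"
        if path == mountpoint or path.startswith(prefix):
            # The longest matching mount point is the one in effect;
            # on a tie, the later entry in the file wins.
            if len(mountpoint) >= len(best_mountpoint):
                best_mountpoint, best_fstype = mountpoint, fstype
    return best_fstype


# Hypothetical sample of /proc/mounts content (assumed paths and sizes).
SAMPLE_MOUNTS = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "tmpfs /var/mentat/spool tmpfs rw,size=2097152k 0 0\n"
)
```

On a real host the second argument would be the result of `open("/proc/mounts").read()`; the function returns `"tmpfs"` only if the closest enclosing mount is a tmpfs one.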

I am switching this task temporarily to Radko to let him monitor the performance on the target system. Please move the task back to me when the time is right to implement the solution on our production server (or abandon the idea and move back to previous setup).

Updated by Jan Mach almost 6 years ago

Status changed from In Progress to Feedback

Updated by Radko Krkoš almost 6 years ago

Status changed from Feedback to Deferred

The amount of stored data on mentat-alt is not enough to model the out-of-cache situation. The cleanup limit was increased to 12 weeks, which should be enough. We need to wait for the data to flow in, so deferring until then.

Updated by Radko Krkoš almost 6 years ago

Just a remark, according to [1], tmpfs content can be swapped to disk if the system is low on memory, so in the end we might only save on inode manipulation. Nevertheless, our import pipeline is the intended use case of tmpfs, so this was a good idea. Also, according to [1], no RAM is actually wasted if the ramdisk is empty; the ramdisk size is technically just an upper limit.

[1] https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt
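
To see how much of that upper limit is actually in use, the standard `statvfs` call is enough. This is a small sketch, not part of the deployed setup; the function name `usage_bytes` is hypothetical.

```python
import os
from typing import Tuple


def usage_bytes(path: str) -> Tuple[int, int]:
    """Return (used, limit) in bytes for the filesystem holding `path`.

    For a tmpfs mount the reported total is the configured size limit,
    not memory that has already been allocated.
    """
    st = os.statvfs(path)
    limit = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return used, limit
```

Calling `usage_bytes()` on the queue mount point periodically would show how close the queues come to the configured size during load peaks.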

Updated by Jan Mach almost 6 years ago

Radko Krkoš wrote:

Just a remark, according to [1], tmpfs content can be swapped to disk if the system is low on memory, so in the end we might only save on inode manipulation. Nevertheless, our import pipeline is the intended use case of tmpfs, so this was a good idea. Also, according to [1], no RAM is actually wasted if the ramdisk is empty; the ramdisk size is technically just an upper limit.

[1] https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt

Thank you for the remarks, I think I will add them to the documentation page, because then it will be clear we have taken this into consideration.

Radko, please feel free to extend and update the documentation as you see fit. This is not the first time you have provided valuable information and I am not sure we have written all of it to the appropriate documentation pages (I am talking specifically about your outstanding database work).

Updated by Jan Mach almost 6 years ago

Status changed from Deferred to Feedback

So what do you think about this, guys? Should we try to implement it on our production server, or should I move it to the next release? The process is documented here and I haven't encountered any problems on our mentat-alt server.

Updated by Pavel Kácha almost 6 years ago

I think we won't get much more "observations" about benefits/drawbacks on mentat-alt. I'd go for pushing this to production and observe behaviour during crises (maybe ask Dan to turn the Proki on again *grin*). Your opinion, Radko?

Updated by Radko Krkoš almost 6 years ago

Pavel Kácha wrote:

I think we won't get much more "observations" about benefits/drawbacks on mentat-alt. I'd go for pushing this to production and observe behaviour during crises (maybe ask Dan to turn the Proki on again *grin*). Your opinion, Radko?

I would still like to test this. Quite a bit can be learned by looking at io-wait and user times. This got deferred because the amount of data on mentat-alt fitted into memory and I could not model the out-of-cache scenario reliably. The amount of data has risen since, so we can proceed with the observations. Also, reusing the "Proki load-test" and monitoring the import behaviour on mentat-hub and mentat-alt in parallel could be highly beneficial.
That being said, the drawbacks have been looked into extensively and the only one remaining is ensuring that import is re-run since the last committed event so as not to lose data. I would like a solution for that. Apart from this, I see no other drawbacks, as was already discussed.
If this issue blocks the 2.3 release, please move it to the next. I think it is important to have an understanding of the performance difference before rolling it out to world/production.
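
For the io-wait and user time comparison mentioned above, the aggregate `cpu` line of `/proc/stat` can be sampled before and after a test window. A minimal sketch follows; the function names are hypothetical and the field order follows the Linux proc(5) documentation.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Extract the aggregate CPU counters (in clock ticks) from /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Return the percentage of time spent in each state between two samples."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    if total <= 0:
        return {key: 0.0 for key in deltas}
    return {key: 100.0 * delta / total for key, delta in deltas.items()}
```

Taking one sample from `open("/proc/stat").read()` on each of mentat-hub and mentat-alt at the start and end of the load test, and comparing the `iowait` and `user` shares, would give the kind of numbers meant here.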

Updated by Pavel Kácha almost 6 years ago

Radko Krkoš wrote:

Also, reusing the "Proki load-test" and monitoring the import behaviour on mentat-hub and mentat-alt in parallel could be highly beneficial.

Ok. I have chatted with Dan right now, he is eager to hit us again. Radko, please, fire up some mail to him (dans), Mek (would be fine to have him around for possible mitigation on Mentat-hub) and me and let's come up with a time.

That being said, the drawbacks have been looked into extensively and the only one remaining is ensuring that import is re-run since the last committed event so as not to lose data. I would like a solution for that.

Not sure I get it, what do you mean by ensuring import re-run?

#10

Updated by Radko Krkoš almost 6 years ago

Pavel Kácha wrote:

Radko Krkoš wrote:

Also, reusing the "Proki load-test" and monitoring the import behaviour on mentat-hub and mentat-alt in parallel could be highly beneficial.

Ok. I have chatted with Dan right now, he is eager to hit us again. Radko, please, fire up some mail to him (dans), Mek (would be fine to have him around for possible mitigation on Mentat-hub) and me and let's come up with a time.

Will do.

That being said, the drawbacks have been looked into extensively and the only one remaining is ensuring that import is re-run since the last committed event so as not to lose data. I would like a solution for that.

Not sure I get it, what do you mean by ensuring import re-run?

If the data in the Mentat import pipeline is stored in volatile memory, all is lost on reboot. A mechanism needs to be in place that allows reading the data from Warden since the last successfully committed event. Otherwise there will be a window of lost events.

#11

Updated by Pavel Kácha almost 6 years ago

Radko Krkoš wrote:

If the data in the Mentat import pipeline is stored in volatile memory, all is lost on reboot. A mechanism needs to be in place that allows reading the data from Warden since the last successfully committed event. Otherwise there will be a window of lost events.

Ah, see what you mean. This is not easy, that would need some mechanism which feeds last Warden id from mentat-storage back to warden-filer - but there is no info, or rather notion of warden event id in the pipeline after warden-filer.

#12

Updated by Jan Mach almost 6 years ago

Target version changed from 2.3 to 2.4

Pavel Kácha wrote:

Radko Krkoš wrote:

If the data in the Mentat import pipeline is stored in volatile memory, all is lost on reboot. A mechanism needs to be in place that allows reading the data from Warden since the last successfully committed event. Otherwise there will be a window of lost events.

Ah, see what you mean. This is not easy, that would need some mechanism which feeds last Warden id from mentat-storage back to warden-filer - but there is no info, or rather notion of warden event id in the pipeline after warden-filer.

Yes, there is no way for Mentat to pair an event from the database with a Warden ID. We could think of some solutions, but I am not sure if this mechanism is even necessary. Such a mechanism would most likely slow the processing even when using regular file system queues. I think that in this case the volatility of RAM and the possibility of data loss in case of a power outage should be considered a feature, not a bug. The administrator must decide, based on the performance of the host system, whether the possibility of data loss is a worthy trade-off for the performance increase. I am not sure it is possible to guarantee that data loss is prevented in every possible scenario.

I do not have a problem with moving this to the next version, if you think there is still a reason for that.

#13

Updated by Pavel Kácha almost 6 years ago

We could do some dances with init scripts (like save ramdisks' contents when shutting down, recover on start), but not sure if it's worth it. If the system hits the pavement, we cannot do much.

#14

Updated by Radko Krkoš almost 6 years ago

I have a quite nice solution in mind, but please let's discuss this on a VC, it is too lengthy to explain properly in writing. This issue is on the agenda for the next meeting anyways.

#15

Updated by Pavel Kácha almost 6 years ago

Radko Krkoš wrote:

I have a quite nice solution in mind, but please let's discuss this on a VC, it is too lengthy to explain properly in writing. This issue is on the agenda for the next meeting anyways.

Wrap-up from 2019-02-07: We could teach the Warden server to get id by event id, or even replace the Warden numeric last-id by event id. Then we could push some sophistication into the filer initscripts - forget your own notion of id and get the last event id for querying the Warden server from the Mentat alerts database.

But - seems like a lot of work for the future, let's stay realistic for now - if we lose, we lose.

Easier (and less reliable) way could be to save non-empty contents of ramdisks during system shutdown and restore during boot. Saving must run after all Mentat daemons (using filesystem queues) are already down, restoring can be parallel. Not sure we have made a decision. Is it worth it? Opinions?
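
The save and restore idea could look roughly like the sketch below. It is only an illustration under assumptions: the function names are hypothetical, the directory locations would be supplied by the init scripts, and `shutil.copytree` does not preserve file ownership, which a real script would have to handle.

```python
import shutil
from pathlib import Path


def save_queues(ramdisk: Path, backup: Path) -> bool:
    """Copy a non-empty ramdisk tree to persistent storage.

    Meant to run during shutdown, after all daemons using the queues have
    stopped. Returns False if there was nothing to save.
    """
    if not any(p.is_file() for p in ramdisk.rglob("*")):
        return False  # no queued messages, nothing to save
    if backup.exists():
        shutil.rmtree(backup)  # drop a stale copy from an earlier shutdown
    shutil.copytree(ramdisk, backup)
    return True


def restore_queues(backup: Path, ramdisk: Path) -> bool:
    """Copy saved messages back into the freshly mounted ramdisk at boot."""
    if not backup.is_dir():
        return False
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(backup, ramdisk, dirs_exist_ok=True)
    shutil.rmtree(backup)  # avoid restoring the same messages twice
    return True
```

Removing the backup after a successful restore is a design choice of this sketch, so that a later crash followed by a reboot does not replay old messages.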

#16

Updated by Radko Krkoš almost 6 years ago

Another option that was tested during mentat-alt preparation for stress-testing (PROKI) is shutting down warden-filer first and waiting for the Mentat import pipeline to finish processing. This generally takes only a few seconds and covers the same use cases as saving the RAMdisk to non-volatile storage. It should also be comparatively easy to implement. Maybe by adding another shutdown signal?
System restarts are infrequent and the added shutdown latency will not matter significantly. Of course this still leaves the "recovery from improper shutdown" use case which would require a more complex solution such as the one discussed.
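
Waiting for the pipeline to drain can be approximated by polling the queue directories until they hold no files. This is a rough sketch with hypothetical names; which directories belong to which daemon is an assumption left to the caller, and the timeout is there so that shutdown cannot block forever.

```python
import time
from pathlib import Path
from typing import Iterable


def pending_messages(queue_dirs: Iterable[Path]) -> int:
    """Count regular files directly inside the given queue directories."""
    return sum(
        1
        for directory in queue_dirs
        if directory.is_dir()
        for entry in directory.iterdir()
        if entry.is_file()
    )


def wait_for_drain(queue_dirs: Iterable[Path],
                   timeout: float = 30.0,
                   poll: float = 0.5) -> bool:
    """Return True once all queues are empty, False if `timeout` expires first."""
    dirs = list(queue_dirs)
    deadline = time.monotonic() + timeout
    while True:
        if pending_messages(dirs) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A shutdown script would stop warden-filer, call `wait_for_drain()` on the pipeline's queue directories, and only then stop the remaining daemons.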

#17

Updated by Radko Krkoš almost 6 years ago

An analysis of system behaviour was performed during the PROKI stress test. The data is not terribly insightful as the amount of stored events was quite low and the "out of cache" state was not triggered. Furthermore, the performance characteristics of the disk subsystems on mentat-hub and mentat-alt are very different, with the same IOps resulting in much lower total utilization on mentat-alt, which skews the results. Two lessons have been learned:

The RAM disk approach seems to lead to lower total writes (about 1/2). This seems counter-intuitive and has not been analyzed further yet.
The sheer number of events per second is not the defining factor for overload. The case of extra large events seems to be the true problem. This was not encountered during the test. The expectation is that the RAMdisk would help in this case.

Let's discuss this sometime, but that should not be viewed as a blocker for deployment on production.

#18

Updated by Radko Krkoš almost 6 years ago

Assignee changed from Radko Krkoš to Jan Mach

#19

Updated by Pavel Kácha almost 6 years ago

Radko Krkoš wrote:

Another option that was tested during mentat-alt preparation for stress-testing (PROKI) is shutting down warden-filer first and waiting for the Mentat import pipeline to finish processing. This generally takes only a few seconds and covers the same use cases as saving the RAMdisk to non-volatile storage. It should also be comparatively easy to implement. Maybe by adding another shutdown signal?
System restarts are infrequent and the added shutdown latency will not matter significantly. Of course this still leaves the "recovery from improper shutdown" use case which would require a more complex solution such as the one discussed.

That's it, one hung daemon in the pipeline would mean it does not work. When writing/changing start/shutdown scripts already, simple cp -r (or find ... | cp ...) seems quite simple to me...

#20

Updated by Radko Krkoš almost 6 years ago

Pavel Kácha wrote:

That's it, one hung daemon in the pipeline would mean it does not work.

True. Is that worse than current state? Does this ever happen? (I am just curious).

When writing/changing start/shutdown scripts already, simple cp -r (or find ... | cp ...) seems quite simple to me...

I cannot disagree.

#21

Updated by Pavel Kácha almost 6 years ago

Radko Krkoš wrote:

Pavel Kácha wrote:

That's it, one hung daemon in the pipeline would mean it does not work.

True. Is that worse than current state? Does this ever happen? (I am just curious).

I hope not during standard operation. However, a reboot can sometimes be used as a last resort measure in the case of hard-to-debug-or-solve problems, where something in the system is congested or hung.

When writing/changing start/shutdown scripts already, simple cp -r (or find ... | cp ...) seems quite simple to me...

I cannot disagree.

... aand seems to me that simpler "copy out - copy in" solves both cases.

#22

Updated by Pavel Kácha almost 6 years ago

After a brief talk about draining queues:

- we could use a different signal for graceful stop (finish queue first, then shutdown)
  - either replace SIGUSR1
  - or use SIGINT (Ctrl-C) for immediate shutdown (so user is able to stop the script immediately when run from command line as one-shot) and change SIGTERM to do immediate shutdown (finish what is in memory or rollback)
- then we can have a default target in controller for graceful stop and another for immediate stop
- filer would need some more care - then we can teach it to also use these signals, until then some timeouts may work
- we would have to create dependencies (fetching filer shuts first, mentat second, sending filer last)

(Doesn't cp out and cp in look better now? )
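
For reference, the graceful versus immediate stop distinction could be sketched in a daemon loop as below. The mapping used here (SIGTERM for graceful, SIGINT for immediate) is an assumption picked for illustration, not a decision from the discussion, and the class and function names are hypothetical.

```python
import signal
import time
from typing import Any, Callable, Optional


class StopController:
    """Remember which kind of stop was requested via signals."""

    def __init__(self) -> None:
        self.mode: Optional[str] = None  # None, "graceful" or "immediate"

    def install(self) -> None:
        # Assumed mapping: SIGTERM drains the queue, SIGINT stops at once.
        signal.signal(signal.SIGTERM, self._graceful)
        signal.signal(signal.SIGINT, self._immediate)

    def _graceful(self, signum: int, frame: Any) -> None:
        if self.mode is None:  # never downgrade an immediate stop
            self.mode = "graceful"

    def _immediate(self, signum: int, frame: Any) -> None:
        self.mode = "immediate"


def run(fetch: Callable[[], Optional[Any]],
        process: Callable[[Any], None],
        stop: StopController,
        idle_sleep: float = 0.1) -> None:
    """Process messages until a stop is requested.

    `fetch` returns the next queued message or None when the queue is empty.
    """
    while stop.mode != "immediate":
        message = fetch()
        if message is not None:
            process(message)
        elif stop.mode == "graceful":
            break  # queue drained, safe to exit
        else:
            time.sleep(idle_sleep)  # nothing to do yet, keep waiting
```

With this shape, the message being processed when an immediate stop arrives is still finished, and everything else stays in the queue.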

#23

Updated by Jan Mach over 5 years ago

Status changed from Feedback to Closed
% Done changed from 50 to 100

I have worked according to the prepared documentation and everything went smoothly. RAM based filesystem message queues are now deployed on mentat-hub server.

We have been testing this concept for quite some time now on mentat-alt test server without any problems, so I am fairly confident and closing this issue as resolved.
