Project

General

Profile

Actions

Feature #4572

closed

Minimise number of commits

Added by Pavel Kácha over 3 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Development - Core
Target version:
Start date:
01/18/2019
Due date:
% Done:

100%

Estimated time:
To be discussed:
No

Description

Storage daemon could commit stored events in bunches, after suitable timeout and/or number of events emitted.


Files

giphy.gif (761 KB) giphy.gif Sorry Jan Mach, 11/21/2019 11:51 AM
Actions #1

Updated by Pavel Kácha over 3 years ago

Radko, could you please search up some backing (like what you said you've already read), like what might be good initial guess for bunch sizes, or what should be taken into consideration?

Actions #2

Updated by Radko Krkoš over 3 years ago

I have been searching for the resources, [1] is the best document about the topic that I remember. Unfortunately it only deals with loading the data into an empty table with no indices. That is not the case for us. There was a measurement indicating that with indices, the advantage of multiple-value INSERTs is diminishing, not so for multiple-INSERT transactions. I cannot find it.

Then there is [2], showing (Images 3, 4) that for tables with many columns (such as ours), the multiple-value INSERTs are less efficient. It goes into some details of this, but basically it boils down to the size of cleartext representation of the query, which can be large in our case (such as the LaBrea events).

To sum it up, multiple INSERTs per transaction will help, prepared statements might also (although this will mostly lower the CPU load, not disk IO). Multiple-row INSERTs might not be as efficient in our case. Should we make measurements to confirm this?

[1] https://www.depesz.com/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/
[2] https://www.red-gate.com/simple-talk/sql/performance/comparing-multiple-rows-insert-vs-single-row-insert-with-three-data-load-methods/

Actions #3

Updated by Pavel Kácha over 3 years ago

However, second source is for MS SQL Server, isn't it?

How about COPY data FROM STDIN WITH BINARY? poc here

Programatically it would be similar concept, except instead of inserts, binary data would get appended to ByteIO and when counters expire, written into db in one COPY statement. However, depends on complexity of data conversion. Maybe simpler and still feasible could be COPY with CSV/TSV data.

At least crude test would be cool, to know we're not chasing ghosts, how much work would it take?

Actions #4

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

However, second source is for MS SQL Server, isn't it?

Yeah, obviously it is. Sorry, I made a quick search and thought to have found the sources I used before (quite some time ago). I will search for the relevant sources, pity I cannot find my notes on this topic.

How about COPY data FROM STDIN WITH BINARY? poc here

Programatically it would be similar concept, except instead of inserts, binary data would get appended to ByteIO and when counters expire, written into db in one COPY statement. However, depends on complexity of data conversion. Maybe simpler and still feasible could be COPY with CSV/TSV data.

I do not like the binary mode, our source data is in text format, not in binary as in the poc case. Also, I am not sure COPY is actually that much less load if indices are present. I will research this more.

At least crude test would be cool, to know we're not chasing ghosts, how much work would it take?

Not that much. And it can be generally done in parallel with some other, more involved, work.

Actions #5

Updated by Radko Krkoš over 3 years ago

  • Assignee changed from Jan Mach to Radko Krkoš

Taking this over for testing. A test script is in the mentat-dbbench repo, preliminary results are interesting (grouping multiple events for one INSERT is more efficient even with indices and our table structure but the difference to grouping INSERTs in transaction is not big and as that is much easier to implement, I would choose it). Detailed results soon.

Actions #6

Updated by Radko Krkoš over 3 years ago

  • Status changed from New to Feedback

The tests were run at mentat-dev, here are the numbers: (the tests were done for metadata-only table - no event column)

Inserting 40000 rows in 40000 transactions, 1 INSERTs per transaction, 1 rows per INSERT, with indices
 > Took 231.945 s, 223.771 s, 235.124 s
 > WALwriter: 2.05 M
 > IO-wait: 6-7%, 200-250 writes/s, Utilization: 80%
 > User: 1-2%

This is modeling the current approach - all INSERTs are done in it's own transaction. The tests were done on 40k events as that is roughly the amassed amount during overload. Insertion like this can load the disk subsystem considerably and is heavily IO bound.

Inserting 40000 rows in 2000 transactions, 20 INSERTs per transaction, 1 rows per INSERT, with indices
 > Took 29.112 s, 28.976 s
 > WALwriter: 2.06 M, 2032.00 K
 > IO-wait: 2-3%, 70-80 writes/s, Utilization: 25-30%
 > User: 5-6%

First approach is to perform multiple INSERTs in one transaction. With 20, about the number received per second, we can see great reduction in both run time and disk load. Much less time is spent waiting for disk operations to finish.

Inserting 40000 rows in 1000 transactions, 40 INSERTs per transaction, 1 rows per INSERT, with indices
 > Took 24.888 s, 24.736 s
 > WALwriter: 2.88 M, 2.14 M
 > IO-wait: 2%, 40-50 writes/s, Utilization: 20%
 > User: 6%

With the increase to 2 seconds of data in each transaction, there is still some improvement, albeit just a marginal.

Inserting 40000 rows in 2000 transactions, 1 INSERTs per transaction, 20 rows per INSERT, with indices
 > Took 24.335 s, 24.145 s
 > WALwriter: 1176.00 K, 1200.00 K
 > IO-wait: 3-4%, 80-90 writes/s, Utilization: 35-40%
 > User: 4-5%

Another approach is to bundle multiple events into one INSERT statement. This allows for lower write-ahead log (sequential writes only) overhead as the amount of modification commands is lower but does not offer the disk-load reduction of the first approach. The process is still somewhat IO-bound at this level.

Inserting 40000 rows in 1000 transactions, 1 INSERTs per transaction, 40 rows per INSERT, with indices
 > Took 17.627 s, 17.297 s
 > WALwriter: 1216.00 K
 > IO-wait: 2-3%, 60 writes/s, Utilization: 25%
 > User: 6%

With the increase to 2 seconds of data, the disk is no longer the limiting factor, the disk-load is still somewhat higher than with the first approach.

Conclusion:
Performing bulk INSERTs is definitely a good idea, they offer much better performance. As for the two alternative ways, they look roughly the same with multiple INSERTs per transaction allowing for lower disk load and multiple rows per INSERT leading to better total run time. The difference is small but measurable, the biggest argument should be the ease of implementation. For the problem we are solving, bundling multiple INSERTs into one transaction seems to work somewhat better.

Remarks:
  • A combination of the two methods is also possible, the improvements are in line with the provided data. The implementation complication is not worth the effort in my opinion.
  • Another way of implementing data import in PostgreSQL is using COPY FROM. According to [1] (the whole thread is a good read; although older - 2005, no indication that this has since changes was found), this is not very different from bundling all the data into one transaction, so similar to the first approach. The differences are in harder error handling, skipping the parser (not that useful for IO-bound cases) and on-wire protocol (related to latency - not an issue on localhost).

[1] https://www.postgresql.org/message-id/1115246490.1709.233393619%40webmail.messagingengine.com

Actions #7

Updated by Radko Krkoš over 3 years ago

This was not discussed on the last meeting due to lack of time. Moving to next meeting.

Actions #8

Updated by Pavel Kácha over 3 years ago

  • Assignee changed from Radko Krkoš to Jan Mach

Conclusion from meeting: Write performance during peaks might go down one order of magnitude, so let's implement.

  • Let's go for easier variant: txn start, N * execute insert, txn commit/rollback
  • Reasonable would be to have two limits - max event inserts per txn, max time per txn; close txn in whatever comes first. That should be doable by means of event driven daemon architecture,
  • Limits should be configurable, one second and some corresponding pretty high number of inserts (50? 100?)

Reassigning for implementation.

Actions #9

Updated by Pavel Kácha over 2 years ago

  • Assignee changed from Jan Mach to Radko Krkoš

On today's meeting valid question of error handling was raised.

If we can get the cause of the error after aborted transaction, we can log (as we do now) and recover.

If we cannot, we could back off, bisect, or use another costly strategies, question is if it is viable.

If we can catch the error during processing, log it, and continue without transaction breaking, event better. Might be possible by means of WHENEVER.

Reassigning back to DB department for investigation. (Not urgent.)

Actions #10

Updated by Radko Krkoš over 2 years ago

  • Assignee changed from Radko Krkoš to Pavel Kácha
  • To be discussed changed from No to Yes

The WHENEVER approach is an overkill, it would complicate the python code needlessly.
The DB department suggests using the following strategy:

BEGIN;
SAVEPOINT last_known_good;

// Loop block for pending INSERTions:
try:
    INSERT INTO events ...;
    INSERT INTO events_json ...;
    SAVEPOINT last_known_good;
except:
    // Error info provided as usual
    ROLLBACK TO SAVEPOINT last_known_good;
    log_collision(id, ...)

if (insert_count_over or insert_delay_over):
    COMMIT;

This provides the same error reporting and recovery capabilities as the currently deployed approach, just the COMMITs will be batched, allowing for benefits mentioned in detail above.

Actions #11

Updated by Pavel Kácha over 2 years ago

  • Assignee changed from Pavel Kácha to Radko Krkoš

Radko Krkoš wrote:

The WHENEVER approach is an overkill, it would complicate the python code needlessly.

Why so? My understanding was that WHENEVER could fire stored procedure, which could swallow the error and note it somewhere, unaffecting other writes. Python code could then just pick up noted info after the transaction, log it and move affected events from tmp to error dir. Am I misguided?

The DB department suggests using the following strategy:
[...]

This provides the same error reporting and recovery capabilities as the currently deployed approach, just the COMMITs will be batched, allowing for benefits mentioned in detail above.

Mmmmkay, souns fine.

Actions #12

Updated by Jan Mach over 2 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from Radko Krkoš to Jan Mach
  • Target version changed from Backlog to 2.6
  • To be discussed changed from Yes to No

I`ll take it from here, thank you Radko.

Actions #13

Updated by Radko Krkoš over 2 years ago

  • Assignee changed from Jan Mach to Radko Krkoš

NO WAY! I still have to do the SAVEPOINT performance impact measurements! Mine! Mine! Mine!

Actions #14

Updated by Jan Mach over 2 years ago

Sorry

Actions #15

Updated by Radko Krkoš over 2 years ago

  • Assignee changed from Radko Krkoš to Pavel Kácha
  • To be discussed changed from No to Yes

The measurement script was extended to allow SAVEPOINTs. The measurements were again performed on mentat-dev:

krkos@mentat-dev:~$ ./insert_perf.py -t 1000 -i 40 -r 1
Inserting 40000 rows in 1000 transactions, 40 INSERTs per transaction, 1 rows per INSERT, with indices, without SAVEPOINTs
 > Took 22.061 s
krkos@mentat-dev:~$ ./insert_perf.py -t 1000 -i 40 -r 1
Inserting 40000 rows in 1000 transactions, 40 INSERTs per transaction, 1 rows per INSERT, with indices, without SAVEPOINTs
 > Took 22.034 s
krkos@mentat-dev:~$ ./insert_perf.py -t 1000 -i 40 -r 1 -s
Inserting 40000 rows in 1000 transactions, 40 INSERTs per transaction, 1 rows per INSERT, with indices, with SAVEPOINTs
 > Took 25.851 s
krkos@mentat-dev:~$ ./insert_perf.py -t 1000 -i 40 -r 1 -s
Inserting 40000 rows in 1000 transactions, 40 INSERTs per transaction, 1 rows per INSERT, with indices, with SAVEPOINTs
 > Took 25.710 s
krkos@mentat-dev:~$ ./insert_perf.py -t 40000 -i 1 -r 1
Inserting 40000 rows in 40000 transactions, 1 INSERTs per transaction, 1 rows per INSERT, with indices, without SAVEPOINTs
 > Took 219.242 s
krkos@mentat-dev:~$ ./insert_perf.py -t 40000 -i 1 -r 1
Inserting 40000 rows in 40000 transactions, 1 INSERTs per transaction, 1 rows per INSERT, with indices, without SAVEPOINTs
 > Took 198.840 s

1) As can be seen, there is some, <20% performance impact from using SAVEPOINTs, yet it is still much better than isolating each INSERT (pair) in its own transaction - roughly 8x faster.

2) If compared with the old results, there is a visible ~10% performance improvement across the board attributable to PostgreSQL, kernel and whatnot upgrade in the meantime. Keep in mind that this is an AMD system, with little negative impact from CPU vulnerability mitigations in the meantime (such an improvement cannot be expected on mentat-hub or mentat-alt - both Intel systems).

3) Non-scripted tests indicate that ROLLBACK TO SAVEPOINT is quite efficient when rolling back even successful INSERTs (another +10% time over SAVEPOINT above) - it definitely does not break the transaction. With unsuccessful INSERTs, this is even lower (as less work has to be done) - roughly +5% over SAVEPOINT time above.

Actions #16

Updated by Pavel Kácha over 2 years ago

Radko Krkoš wrote:

The measurement script was extended to allow SAVEPOINTs. The measurements were again performed on mentat-dev:
[...]

1) As can be seen, there is some, <20% performance impact from using SAVEPOINTs, yet it is still much better than isolating each INSERT (pair) in its own transaction - roughly 8x faster.

So still very good.

2) If compared with the old results, there is a visible ~10% performance improvement across the board attributable to PostgreSQL, kernel and whatnot upgrade in the meantime. Keep in mind that this is an AMD system, with little negative impact from CPU vulnerability mitigations in the meantime (such an improvement cannot be expected on mentat-hub or mentat-alt - both Intel systems).

Sure, we have to count with some performance losses in Meltdown/Spectre/whatnot mitigations, but this we'll have to pay anyway.

3) Non-scripted tests indicate that ROLLBACK TO SAVEPOINT is quite efficient when rolling back even successful INSERTs (another +10% time over SAVEPOINT above) - it definitely does not break the transaction. With unsuccessful INSERTs, this is even lower (as less work has to be done) - roughly +5% over SAVEPOINT time above.

Very good. So we just have to take care to take the INSERT, which is more probable to fail, first. I'd guess its 'events' because of possible duplicities. But no big deal perhaps.

So what's exactly do discuss? Let's go for it, or not?

Actions #17

Updated by Radko Krkoš over 2 years ago

  • Assignee changed from Pavel Kácha to Jan Mach
  • To be discussed changed from Yes to No

Pavel Kácha wrote:

Radko Krkoš wrote:

The measurement script was extended to allow SAVEPOINTs. The measurements were again performed on mentat-dev:
[...]

1) As can be seen, there is some, <20% performance impact from using SAVEPOINTs, yet it is still much better than isolating each INSERT (pair) in its own transaction - roughly 8x faster.

So still very good.

Yes.

2) If compared with the old results, there is a visible ~10% performance improvement across the board attributable to PostgreSQL, kernel and whatnot upgrade in the meantime. Keep in mind that this is an AMD system, with little negative impact from CPU vulnerability mitigations in the meantime (such an improvement cannot be expected on mentat-hub or mentat-alt - both Intel systems).

Sure, we have to count with some performance losses in Meltdown/Spectre/whatnot mitigations, but this we'll have to pay anyway.

Yes, this was more of a interesting fact I noticed, so to let you know too.

3) Non-scripted tests indicate that ROLLBACK TO SAVEPOINT is quite efficient when rolling back even successful INSERTs (another +10% time over SAVEPOINT above) - it definitely does not break the transaction. With unsuccessful INSERTs, this is even lower (as less work has to be done) - roughly +5% over SAVEPOINT time above.

Very good. So we just have to take care to take the INSERT, which is more probable to fail, first. I'd guess its 'events' because of possible duplicities. But no big deal perhaps.

The INSERT into `events` must go first because of the non-conflicting case - events_json(id) references events(id). As for the conflict, it would probably not matter which would go first.

So what's exactly do discuss? Let's go for it, or not?

I agree. I posted this to give you the hard numbers you asked for. Mek already wanted to implement it yesterday.

Actions #18

Updated by Pavel Kácha over 2 years ago

Radko Krkoš wrote:

Pavel Kácha wrote:

Very good. So we just have to take care to take the INSERT, which is more probable to fail, first. I'd guess its 'events' because of possible duplicities. But no big deal perhaps.

The INSERT into `events` must go first because of the non-conflicting case - events_json(id) references events(id). As for the conflict, it would probably not matter which would go first.

Mmmkay, understood.

So what's exactly do discuss? Let's go for it, or not?

I agree. I posted this to give you the hard numbers you asked for. Mek already wanted to implement it yesterday.

Cool, he can still implement it yesterday, if he insists.

Actions #19

Updated by Radko Krkoš over 2 years ago

Pavel Kácha wrote:

Cool, he can still implement it yesterday, if he insists.

Yes, fine by me. Mek?

Actions #20

Updated by Jan Mach over 2 years ago

  • To be discussed changed from No to Yes
Actions #21

Updated by Jan Mach over 2 years ago

  • % Done changed from 0 to 30

Based on first implementation I have written and executed first benchmark with following results:

 BENCHMARKING MENTAT.EVENTSTORAGE MODULE (v2.5.17)
 2019-12-03 12:44:01.977904

====================================================================================
 Name                   | Iterations (#)   | Duration (s)         | Speed (#/s)         
====================================================================================
 commit                 |          100,000 |       161.3952113680 |         619.597
 bulk commit            |          100,000 |        91.4733231670 |       1,093.215
====================================================================================

Each time 100 000 events are stored into the database. In the case of 'commit' each event insertion is immediately committed, in case of 'bulk commit' the commit is made after 1000 event insertions with savepoints between each event insertion. The performance improvement is significant. In real application there will be some additional overhead of course, because there must be additional mechanism to perform commits based on timeout (in case events are not coming in fast enough).

Actions #22

Updated by Jan Mach over 2 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 30 to 100

Attached commits should fully resolve this issue. I have updated the mentat.services.eventservice DBAL to support PGSQL savepoints and bulk commits. I have then adjusted the mentat.daemon.component.storage module to support these new features. The bulk commit mode is optional, but enabled by default. It is possible to disable this feature altogether, at which point the module reverts back to current mode of operation (commit after each insert). In other case the transaction savepoints, rollbacks to savepoint end delayed commits are used to enhance the event insertion performance. The whole mechanism is configurable with two new additional configuration parameters commit_bulk_threshold (number of events to savepoint before full commit is performed, defaults to 1000) and commit_bulk_timeout (timeout in seconds after which to force commit of currently uncommitted events, defaults to 5s).

Following statistics are gathered and periodically written to log:

2019-12-06 11:47:07,867 mentat-storage.py [20761] INFO: Component 'storage': *** Processing statistics ***
    cnt_stored                    67,811 (    +2,410,   120.50 #/s)
    cnt_errors                         0 (        +0,     0.00 #/s)
    cnt_eci_timeout               57,741 (      +340,    17.00 #/s)
    cnt_eci_threshold             10,000 (    +2,000,   100.00 #/s)
    cnt_cis_timeout                  339 (        +1,     0.05 #/s)
    cnt_cis_threshold                 10 (        +2,     0.10 #/s)

  • cnt_eci_timeout - Number of events committed via timeout mechanism
  • cnt_eci_threshold - Number of events committed via threshold mechanism
  • cnt_cis_timeout - Number of commits performed by timeout mechanism
  • cnt_cis_threshold - Number of commits performed by threshold mechanism

These new statistics enable administrators to fine tune the configuration for best performance depending on the event processing workload of their particular installation.

The feature is already deployed on our mentat-alt server, you may monitor it in /var/mentat/log/mentat-storage.py.log file.

Actions #23

Updated by Pavel Kácha over 2 years ago

Just wow, you're realy quick. Very cool.

Actions #24

Updated by Pavel Kácha over 2 years ago

Please, check error handling, and if ok, we can close.

Actions #25

Updated by Jan Mach over 2 years ago

  • Status changed from Feedback to In Progress
  • % Done changed from 100 to 90

I think one more thing needs to be addressed before closing this issue. Currently it is not guaranteed, that all currently uncommitted IDEA events will be committed on daemon shutdown. It is necessary to implement some sort of stop event, that will trigger this cleanup to avoid data loss.

Actions #26

Updated by Jan Mach over 2 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

I have reviewed the implementation of a mechanism for graceful daemon shutdown, it now commits all currently uncommitted IDEA messages and switches to immediate commit mode in case some other messages end up in the queue before final shutdown (it is not guaranteed the stop event will be last event handled).

Actions #27

Updated by Jan Mach over 2 years ago

  • To be discussed changed from Yes to No
Actions

Also available in: Atom PDF