Bug #4515


Runaway query?

Added by Pavel Kácha over 3 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Development - Core
Target version:
Start date:
01/14/2019
Due date:
% Done:

100%

Estimated time:
To be discussed:

Description

https://mentat-hub.cesnet.cz/mentat/events/search?dt_from=2018-01-01+10%3A48%3A49&dt_to=&source_addrs=195.113.180.82&source_ports=&submit=Hledat

SELECT "id", "detecttime", "category", "description", "source_ip", "target_ip", "source_port", "target_port", "source_type", "target_type", "protocol", "node_name", "node_type", "cesnet_resolvedabuses", "cesnet_storagetime", "cesnet_eventclass", "cesnet_eventseverity", "cesnet_inspectionerrors", "event" 
FROM events 
WHERE "detecttime" >= '2018-01-01T10:48:49+00:00'::timestamptz AND ('195.113.180.82' && ANY("source_ip"))
ORDER BY "detecttime" DESC LIMIT 100

Page generated at: 14. 1. 2019 10:23:07 | Page generated in: 0:34:08.597094

Can you please take a look at it?


Related issues

Related to Mentat - Bug #4384: Possibility of DoS by repeating long query (Closed, Jan Mach, 10/19/2018)

Related to Mentat - Feature #4277: Test split ranges in metadata table (Closed, Radko Krkoš, 08/21/2018)

Related to Mentat - Feature #4275: Split jsonb column into its own table (Closed, Pavel Kácha)

Related to Mentat - Task #4531: Tweak the AUTOVACUUM and AUTOANALYZE parameters for lower peak load (Closed, Jan Mach, 01/14/2019)
Actions #1

Updated by Radko Krkoš over 3 years ago

  • Related to Bug #4384: Possibility of DoS by repeating long query added
Actions #2

Updated by Radko Krkoš over 3 years ago

The query is IO-heavy and was launched multiple times in parallel. That is exactly the attack vector described in bug #4384. The events table is 209 GB in size; the indices are a further 35 GB. The sum is ~244 GB, which, together with application code, will not fit in RAM. For queries requiring table reads (like IP filters), disk access is the bottleneck. If multiple such queries are run at the same time, thrashing occurs. A solution could be indexing as proposed in #4277 (not written yet), but that is not ready for deployment now.
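The sizing argument above can be sketched as simple arithmetic; the RAM figure below is a hypothetical placeholder, not a measured value from this system:

```python
# Rough working-set check based on the sizes quoted above.
table_gb = 209
index_gb = 35
ram_gb = 192  # assumption for illustration; substitute the real server RAM

working_set_gb = table_gb + index_gb
print(working_set_gb)            # total that would need to be cached
print(working_set_gb <= ram_gb)  # does it fit, even ignoring application memory?
```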

Actions #3

Updated by Radko Krkoš over 3 years ago

  • Related to Feature #4277: Test split ranges in metadata table added
Actions #4

Updated by Radko Krkoš over 3 years ago

  • Related to Feature #4275: Split jsonb column into its own table added
Actions #5

Updated by Radko Krkoš over 3 years ago

Another solution would be splitting the events into a separate table as discussed before in #4275.

Actions #6

Updated by Pavel Kácha over 3 years ago

The first query, at 10:48:49, ran as a single instance; the concurrent ones in the afternoon were plain user error. (Nonetheless, we should handle such cases as well, as you noted.)

So, the question is: is it now time to end the "let's test what fits" experiment and go back to a safe time window, one that fits into memory now? And in fact, what is this window?

Actions #7

Updated by Radko Krkoš over 3 years ago

  • Related to Task #4531: Tweak the AUTOVACUUM and AUTOANALYZE parameters for lower peak load added
Actions #8

Updated by Radko Krkoš over 3 years ago

The original run (which finished at 10:23) hit some unusually heavy INSERT activity (including at least some large events) and a subsequent AUTOANALYZE of the events table. The enum update also ran at the time. All these IO-heavy activities running at the same time were most probably the reason for the query run time.
I will tweak the AUTOVACUUM/AUTOANALYZE values; see #4531 for details.
As for the large events, this was discussed before; maybe we can further optimize the event reports. One of the inserted events was this:
https://mentat-hub.cesnet.cz/mentat/events/118c5d57-69eb-4958-bd78-bf99814a711e/show
Maybe we can discuss dropping source ports on Recon.Scanning, as they provide little useful information (in the form in which they are reported, they cannot even be used to identify the flows). Also, the address ranges could be aggregated; at a quick glance I see a few holes.
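The range aggregation mentioned here can be sketched with the Python standard library; the addresses below are illustrative only, not taken from the event:

```python
import ipaddress

# Collapse individually listed source addresses into minimal CIDR blocks.
nets = [ipaddress.ip_network("192.0.2.%d/32" % i) for i in range(8)]
collapsed = list(ipaddress.collapse_addresses(nets))
print(collapsed)  # the eight /32s merge into a single 192.0.2.0/29
```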

Actions #9

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

So, the question is: is it now time to end the "let's test what fits" experiment and go back to a safe time window, one that fits into memory now? And in fact, what is this window?

This is already the case; the "let's test what fits" experiment ended a long time ago. The event count kept in the DB is roughly steady now. We are at the limit of what we can keep before applying at least some of the discussed optimizations. The current optimal window is about the current size, with some unlucky queries hitting high IO loads like this one. Can we tolerate that?
Note that reducing the window size will not impact table and index sizes immediately; a CLUSTER (which requires a table lock) must be run after trimming the window outliers.
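The trim-then-CLUSTER sequence described here might look like the following sketch; the 8-week interval is illustrative, and CLUSTER without an index argument assumes the table has been clustered on an index before:

```sql
-- Trim the window first (detecttime per the query in the description),
-- then rebuild the physical layout and refresh planner statistics.
DELETE FROM events WHERE "detecttime" < now() - interval '8 weeks';
CLUSTER events;   -- takes a table lock; assumes a clustering index was set previously
ANALYZE events;
```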

Actions #10

Updated by Pavel Kácha over 3 years ago

Radko Krkoš wrote:

The original run (which finished at 10:23) hit some unusually heavy INSERT activity (including at least some large events) and a subsequent AUTOANALYZE of the events table. The enum update also ran at the time. All these IO-heavy activities running at the same time were most probably the reason for the query run time.
I will tweak the AUTOVACUUM/AUTOANALYZE values; see #4531 for details.

Thx.

As for the large events, this was discussed before; maybe we can further optimize the event reports. One of the inserted events was this:
https://mentat-hub.cesnet.cz/mentat/events/118c5d57-69eb-4958-bd78-bf99814a711e/show
Maybe we can discuss dropping source ports on Recon.Scanning, as they provide little useful information (in the form in which they are reported, they cannot even be used to identify the flows). Also, the address ranges could be aggregated; at a quick glance I see a few holes.

Yup, the ranges could be aggregated. Concerning source ports: ostensibly some typical malware can be identified by source port (or source port pattern). Of course, the question is how much we care.

Actions #11

Updated by Pavel Kácha over 3 years ago

Radko Krkoš wrote:

Pavel Kácha wrote:

So, the question is: is it now time to end the "let's test what fits" experiment and go back to a safe time window, one that fits into memory now? And in fact, what is this window?

This is already the case; the "let's test what fits" experiment ended a long time ago. The event count kept in the DB is roughly steady now. We are at the limit of what we can keep before applying at least some of the discussed optimizations. The current optimal window is about the current size, with some unlucky queries hitting high IO loads like this one. Can we tolerate that?

I'd rather not. Minutes are acceptable, tens of minutes are... bad.

Note that reducing the window size will not impact table and index sizes immediately; a CLUSTER (which requires a table lock) must be run after trimming the window outliers.

Aren't the caches block-based? So unused blocks (i.e., outside the new, smaller window) will slip out and the memory will be reused by more recent blocks, regardless of what is or is not on the disk, won't they?

Actions #12

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

Radko Krkoš wrote:

Pavel Kácha wrote:

So, the question is: is it now time to end the "let's test what fits" experiment and go back to a safe time window, one that fits into memory now? And in fact, what is this window?

This is already the case; the "let's test what fits" experiment ended a long time ago. The event count kept in the DB is roughly steady now. We are at the limit of what we can keep before applying at least some of the discussed optimizations. The current optimal window is about the current size, with some unlucky queries hitting high IO loads like this one. Can we tolerate that?

I'd rather not. Minutes are acceptable, tens of minutes are... bad.

OK.

Note that reducing the window size will not impact table and index sizes immediately; a CLUSTER (which requires a table lock) must be run after trimming the window outliers.

Aren't the caches block-based? So unused blocks (i.e., outside the new, smaller window) will slip out and the memory will be reused by more recent blocks, regardless of what is or is not on the disk, won't they?

Roughly, yes. Of course, the data is organized according to storage time but deleted based on detect time (if I remember correctly). Then again, these two only rarely diverge substantially. The problem is that all my performance estimates are based on these values being (semi-)accurate. I guess I can live with that.

Actions #13

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

Radko Krkoš wrote:

Note that reducing the window size will not impact table and index sizes immediately; a CLUSTER (which requires a table lock) must be run after trimming the window outliers.

Aren't the caches block-based? So unused blocks (i.e., outside the new, smaller window) will slip out and the memory will be reused by more recent blocks, regardless of what is or is not on the disk, won't they?

Then there are the indices, where the problem remains. So at least REINDEX TABLE events; should be run after the DELETE, ideally at a time of lower reporter activity (an index being rebuilt is locked and cannot be used for queries).

Actions #14

Updated by Pavel Kácha over 3 years ago

Another example from today (run exactly ONCE):

https://mentat-hub.cesnet.cz/mentat/events/search?dt_from=2018-10-17+00%3A00%3A00&dt_to=2019-01-12+10%3A40%3A14&source_addrs=104.133.17.82&source_ports=&submit=Search

It ended up with a 500 server error, presumably because the query or process got killed for running too long?

How many events will fit into memory together with the indices?

Actions #15

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

Another example from today (run exactly ONCE):

https://mentat-hub.cesnet.cz/mentat/events/search?dt_from=2018-10-17+00%3A00%3A00&dt_to=2019-01-12+10%3A40%3A14&source_addrs=104.133.17.82&source_ports=&submit=Search

It ended up with a 500 server error, presumably because the query or process got killed for running too long?

How many events will fit into memory together with the indices?

Yes, I was already looking at that one. It finished at 9:55 and took 831 seconds, but it completed successfully (that is under 20 minutes, even shorter than the one from yesterday, so no idea what caused the 500).

It seems that the INSERT load on disk IO has increased substantially recently. I am looking into the cause.

Actions #16

Updated by Jan Mach over 3 years ago

I have inspected the log file /var/mentat/log/mentat-hawat.py.log and found this in the relevant time interval:

2019-01-15 10:41:09,790 INFO: Mark preprocess_end:search (Done preprocessing for /events/search?dt_from=2018-10-17+00%3A00%3A00&dt_to=2019-01-12+10%3A40%3A14&source_addrs=104.133.17.82&source_ports=&submit=Search) delta=0:00:00.000560 [in /usr/lib/python3/dist-packages/hawat/base.py:923]
2019-01-15 10:41:09,791 INFO: Mark search_begin:search (Begin searching for /events/search?dt_from=2018-10-17+00%3A00%3A00&dt_to=2019-01-12+10%3A40%3A14&source_addrs=104.133.17.82&source_ports=&submit=Search) delta=0:00:00.000346 [in /usr/lib/python3/dist-packages/hawat/base.py:923]
2019-01-15 10:55:01,302 ERROR: Exception on /events/search [GET] [in /usr/local/lib/python3.5/dist-packages/flask/app.py:1761]
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/mentat/services/eventstorage.py", line 970, in search_events
    count, result = self.cursor.search_events(parameters, event_factory, columns)
  File "/usr/lib/python3/dist-packages/mentat/services/eventstorage.py", line 458, in search_events
    events_raw  = self.cursor.fetchall()
  File "/usr/local/lib/python3.5/dist-packages/psycopg2/extras.py", line 342, in fetchall
    ts = super(NamedTupleCursor, self).fetchall()
psycopg2.ProgrammingError: no results to fetch

I am not sure what to make of it. The search started correctly, but it seems something timed out within psycopg2 or PostgreSQL.

Actions #17

Updated by Radko Krkoš over 3 years ago

Jan Mach wrote:

I have inspected the log file /var/mentat/log/mentat-hawat.py.log and found this in the relevant time interval:

[...]

I am not sure what to make of it. The search started correctly, but it seems something timed out within psycopg2 or PostgreSQL.

Possibly related to:
https://github.com/psycopg/psycopg2/issues/346

It seems there is an obscure bug in psycopg2 that triggers under high commit load. I can only advise catching the exception and logging the content of cursor.statusmessage for reference. A retry of cursor.fetchall() may work, based on the description (it looks like some sort of race condition to me).
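A minimal sketch of that advice; the wrapper name is mine, and the exception type is passed in as a parameter so the sketch does not depend on psycopg2 being installed (in Mentat it would be psycopg2.ProgrammingError):

```python
import logging

def fetchall_with_retry(cursor, exc_type, retries=1):
    """Log cursor.statusmessage when fetchall() fails, then retry it."""
    for attempt in range(retries + 1):
        try:
            return cursor.fetchall()
        except exc_type as exc:
            # Record the server-side status for later analysis of the race.
            logging.warning("fetchall() failed (%s); statusmessage=%r",
                            exc, getattr(cursor, "statusmessage", None))
            if attempt == retries:
                raise
```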

Actions #18

Updated by Pavel Kácha over 3 years ago

Back to the disk congestion: how many items will safely fit into memory together with the indices? A month? Two?

Actions #19

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

Back to the disk congestion: how many items will safely fit into memory together with the indices? A month? Two?

We are now storing 260 days (based on cesnet_storagetime), so 8 months should fit safely. (A CLUSTER would even suffice to optimize the table layout and shrink the indices.)

Actions #20

Updated by Pavel Kácha over 3 years ago

Radko Krkoš wrote:

Pavel Kácha wrote:

Back to the disk congestion: how many items will safely fit into memory together with the indices? A month? Two?

We are now storing 260 days (based on cesnet_storagetime), so 8 months should fit safely. (A CLUSTER would even suffice to optimize the table layout and shrink the indices.)

Beware, it's not 260 days of full data. We now have a cutoff only for non-CESNET data (and I don't know what the most recent value is; I thought you had worked that out with Mek); CESNET-related data (a fraction of the full dataset) is not deleted. So finding min(cesnet_storagetime) is not the correct way.

Actions #21

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

Beware, it's not 260 days of full data. We now have a cutoff only for non-CESNET data (and I don't know what the most recent value is; I thought you had worked that out with Mek); CESNET-related data (a fraction of the full dataset) is not deleted. So finding min(cesnet_storagetime) is not the correct way.

Yes, of course.
According to /etc/mentat/mentat-cleanup.py.conf and /var/mentat/log/mentat-cleanup.py.log, the values are 1 year for CESNET data and 2 weeks for other data. That is quite different from what I recall. Cutting back by 1 week should be more than enough.
That being said, with the recent INSERT load, all bets are off:
https://mentat-hub.cesnet.cz/mentat/performance/view (2019-01-15 13:00-15:00)

Actions #22

Updated by Jan Mach over 3 years ago

Radko Krkoš wrote:

Pavel Kácha wrote:

Beware, it's not 260 days of full data. We now have a cutoff only for non-CESNET data (and I don't know what the most recent value is; I thought you had worked that out with Mek); CESNET-related data (a fraction of the full dataset) is not deleted. So finding min(cesnet_storagetime) is not the correct way.

Yes, of course.
According to /etc/mentat/mentat-cleanup.py.conf and /var/mentat/log/mentat-cleanup.py.log, the values are 1 year for CESNET data and 2 weeks for other data. That is quite different from what I recall. Cutting back by 1 week should be more than enough.
That being said, with the recent INSERT load, all bets are off:
https://mentat-hub.cesnet.cz/mentat/performance/view (2019-01-15 13:00-15:00)

Actually, the current cleanup configuration is 1 year of CESNET data and 8 weeks of full data. We might go back to 4 weeks of full data to buy some time to fix these issues and optimize the database schema; then we can start increasing the data volume again. In my opinion, 4 weeks of full data is more than enough for our purposes. 1 year of CESNET data is insignificant in comparison, so I would keep it that way for now (last week Mentat received roughly 12 million events and only 80,000 were CESNET-related).

Should I proceed and lower the cleanup threshold to only 4 weeks of full data?

Actions #23

Updated by Radko Krkoš over 3 years ago

Jan Mach wrote:

Radko Krkoš wrote:

Pavel Kácha wrote:

Beware, it's not 260 days of full data. We now have a cutoff only for non-CESNET data (and I don't know what the most recent value is; I thought you had worked that out with Mek); CESNET-related data (a fraction of the full dataset) is not deleted. So finding min(cesnet_storagetime) is not the correct way.

Yes, of course.
According to /etc/mentat/mentat-cleanup.py.conf and /var/mentat/log/mentat-cleanup.py.log, the values are 1 year for CESNET data and 2 weeks for other data. That is quite different from what I recall. Cutting back by 1 week should be more than enough.
That being said, with the recent INSERT load, all bets are off:
https://mentat-hub.cesnet.cz/mentat/performance/view (2019-01-15 13:00-15:00)

Actually, the current cleanup configuration is 1 year of CESNET data and 8 weeks of full data. We might go back to 4 weeks of full data to buy some time to fix these issues and optimize the database schema; then we can start increasing the data volume again. In my opinion, 4 weeks of full data is more than enough for our purposes. 1 year of CESNET data is insignificant in comparison, so I would keep it that way for now (last week Mentat received roughly 12 million events and only 80,000 were CESNET-related).

Should I proceed and lower the cleanup threshold to only 4 weeks of full data?

Yes, sorry, I meant 2 months = 8 weeks for non-CESNET data. In my opinion, cutting back by 1 week is enough. As for how much data is useful, I have little idea. If 4 weeks are enough, cut back to that; it will save a lot of time with DB optimizations. Going to approximately half the data will give a lot of breathing room.

Actions #24

Updated by Jan Mach over 3 years ago

Radko Krkoš wrote:

Jan Mach wrote:

Radko Krkoš wrote:

Pavel Kácha wrote:

Beware, it's not 260 days of full data. We now have a cutoff only for non-CESNET data (and I don't know what the most recent value is; I thought you had worked that out with Mek); CESNET-related data (a fraction of the full dataset) is not deleted. So finding min(cesnet_storagetime) is not the correct way.

Yes, of course.
According to /etc/mentat/mentat-cleanup.py.conf and /var/mentat/log/mentat-cleanup.py.log, the values are 1 year for CESNET data and 2 weeks for other data. That is quite different from what I recall. Cutting back by 1 week should be more than enough.
That being said, with the recent INSERT load, all bets are off:
https://mentat-hub.cesnet.cz/mentat/performance/view (2019-01-15 13:00-15:00)

Actually, the current cleanup configuration is 1 year of CESNET data and 8 weeks of full data. We might go back to 4 weeks of full data to buy some time to fix these issues and optimize the database schema; then we can start increasing the data volume again. In my opinion, 4 weeks of full data is more than enough for our purposes. 1 year of CESNET data is insignificant in comparison, so I would keep it that way for now (last week Mentat received roughly 12 million events and only 80,000 were CESNET-related).

Should I proceed and lower the cleanup threshold to only 4 weeks of full data?

Yes, sorry, I meant 2 months = 8 weeks for non-CESNET data. In my opinion, cutting back by 1 week is enough. As for how much data is useful, I have little idea. If 4 weeks are enough, cut back to that; it will save a lot of time with DB optimizations. Going to approximately half the data will give a lot of breathing room.

Pavel, do you support lowering the threshold for full data to 4 weeks until we optimize the database schema, so that we (Radko) do not have to monitor and optimize the database so closely?

Actions #25

Updated by Pavel Kácha over 3 years ago

Jan Mach wrote:

Radko Krkoš wrote:

Jan Mach wrote:

Radko Krkoš wrote:

Pavel Kácha wrote:

Beware, it's not 260 days of full data. We now have a cutoff only for non-CESNET data (and I don't know what the most recent value is; I thought you had worked that out with Mek); CESNET-related data (a fraction of the full dataset) is not deleted. So finding min(cesnet_storagetime) is not the correct way.

Yes, of course.
According to /etc/mentat/mentat-cleanup.py.conf and /var/mentat/log/mentat-cleanup.py.log, the values are 1 year for CESNET data and 2 weeks for other data. That is quite different from what I recall. Cutting back by 1 week should be more than enough.
That being said, with the recent INSERT load, all bets are off:
https://mentat-hub.cesnet.cz/mentat/performance/view (2019-01-15 13:00-15:00)

Actually, the current cleanup configuration is 1 year of CESNET data and 8 weeks of full data. We might go back to 4 weeks of full data to buy some time to fix these issues and optimize the database schema; then we can start increasing the data volume again. In my opinion, 4 weeks of full data is more than enough for our purposes. 1 year of CESNET data is insignificant in comparison, so I would keep it that way for now (last week Mentat received roughly 12 million events and only 80,000 were CESNET-related).

Should I proceed and lower the cleanup threshold to only 4 weeks of full data?

Yes, sorry, I meant 2 months = 8 weeks for non-CESNET data. In my opinion, cutting back by 1 week is enough. As for how much data is useful, I have little idea. If 4 weeks are enough, cut back to that; it will save a lot of time with DB optimizations. Going to approximately half the data will give a lot of breathing room.

Pavel, do you support lowering the threshold for full data to 4 weeks until we optimize the database schema, so that we (Radko) do not have to monitor and optimize the database so closely?

Yes, go for it. Thanks.

Actions #26

Updated by Pavel Kácha over 3 years ago

Radko Krkoš wrote:

That being said, with the recent INSERT load, all bets are off:
https://mentat-hub.cesnet.cz/mentat/performance/view (2019-01-15 13:00-15:00)

Seems like Dan and Edvard have set up the CZ.NIC Proki channel (cz.cesnet.supplier.proki). I don't believe we need/want all the Proki data; I notified Dan to try to cut it down.

Actions #27

Updated by Jan Mach over 3 years ago

Pavel Kácha wrote:

Jan Mach wrote:

Radko Krkoš wrote:

Jan Mach wrote:

Radko Krkoš wrote:

Pavel Kácha wrote:

Beware, it's not 260 days of full data. We now have a cutoff only for non-CESNET data (and I don't know what the most recent value is; I thought you had worked that out with Mek); CESNET-related data (a fraction of the full dataset) is not deleted. So finding min(cesnet_storagetime) is not the correct way.

Yes, of course.
According to /etc/mentat/mentat-cleanup.py.conf and /var/mentat/log/mentat-cleanup.py.log, the values are 1 year for CESNET data and 2 weeks for other data. That is quite different from what I recall. Cutting back by 1 week should be more than enough.
That being said, with the recent INSERT load, all bets are off:
https://mentat-hub.cesnet.cz/mentat/performance/view (2019-01-15 13:00-15:00)

Actually, the current cleanup configuration is 1 year of CESNET data and 8 weeks of full data. We might go back to 4 weeks of full data to buy some time to fix these issues and optimize the database schema; then we can start increasing the data volume again. In my opinion, 4 weeks of full data is more than enough for our purposes. 1 year of CESNET data is insignificant in comparison, so I would keep it that way for now (last week Mentat received roughly 12 million events and only 80,000 were CESNET-related).

Should I proceed and lower the cleanup threshold to only 4 weeks of full data?

Yes, sorry, I meant 2 months = 8 weeks for non-CESNET data. In my opinion, cutting back by 1 week is enough. As for how much data is useful, I have little idea. If 4 weeks are enough, cut back to that; it will save a lot of time with DB optimizations. Going to approximately half the data will give a lot of breathing room.

Pavel, do you support lowering the threshold for full data to 4 weeks until we optimize the database schema, so that we (Radko) do not have to monitor and optimize the database so closely?

Yes, go for it. Thanks.

DONE.

Actions #28

Updated by Radko Krkoš over 3 years ago

  • Status changed from New to Resolved
  • Assignee changed from Radko Krkoš to Pavel Kácha

VACUUM FREEZE; seems to have restored much of the previous performance. The indices are still bloated, but we can live with that until the next downtime.
As for the whole issue, it was caused by the DB growing beyond RAM and heavy disk load occurring at the same time. Steps have been taken to limit the impact of such a coincidence in the future. A true solution is the DB structure reorganization, as discussed elsewhere.

Actions #29

Updated by Pavel Kácha over 3 years ago

Thanks to both.

Anyway, it shows that while focusing on search performance we should not disregard write performance. Do we have any options here? Like cutting down on indices (#4383), somehow reducing PostgreSQL fsyncing, bulk writes, ...?

Actions #30

Updated by Jan Mach over 3 years ago

Radko Krkoš wrote:

VACUUM FREEZE; seems to have restored much of the previous performance. The indices are still bloated, but we can live with that until the next downtime.
As for the whole issue, it was caused by the DB growing beyond RAM and heavy disk load occurring at the same time. Steps have been taken to limit the impact of such a coincidence in the future. A true solution is the DB structure reorganization, as discussed elsewhere.

I looked into the mentat-cleanup log file and can now confirm that the new configuration is in effect and has already been executed several times. 60 million events were removed during the initial cleanup run; we are now down to circa 60 million. I have kept the configuration for CESNET-related data at 1 year, because its size is insignificant (~1%).

Actions #31

Updated by Pavel Kácha over 3 years ago

Pavel Kácha wrote:

Thanks to both.

Anyway, it shows that while focusing on search performance we should not disregard write performance. Do we have any options here? Like cutting down on indices (#4383), somehow reducing PostgreSQL fsyncing, bulk writes, ...?

Also after quick googling: checkpoint twiddling, synchronous_commit? (link)

Actions #32

Updated by Radko Krkoš over 3 years ago

As a note, the CLUSTER command acquires an ACCESS EXCLUSIVE lock [1], so read access is also blocked. The web interface should therefore also be taken down for the clustering.

[1] https://www.postgresql.org/docs/11/explicit-locking.html

Actions #33

Updated by Pavel Kácha over 3 years ago

  • Status changed from Resolved to Closed

Pavel Kácha wrote:

Pavel Kácha wrote:

Thanks to both.

Anyway, it shows that while focusing on search performance we should not disregard write performance. Do we have any options here? Like cutting down on indices (#4383), somehow reducing PostgreSQL fsyncing, bulk writes, ...?

From discussion: Apart from what is already underway, #4571 might help with trimming down big events, #4572 with bulk writes, and #4570 with testing a ramdrive for disk queues (minimizing other IO).

Also after quick googling: checkpoint twiddling, synchronous_commit? (link)

From discussion: Already tweaked; it does not show up among the IO hogs.

We've squeezed out of this issue what we could; time to close.

Actions #34

Updated by Radko Krkoš over 3 years ago

Pavel Kácha wrote:

Pavel Kácha wrote:

Also after quick googling: checkpoint twiddling, synchronous_commit? (link)

From discussion: Already tweaked; it does not show up among the IO hogs.

Some fresh tests were done. While the WAL writes are not negligible during normal operation (~500 MB/hour), they are sequential in nature and low in IOPS (0.1% of all IOPS when a large SELECT is running in parallel). Nevertheless, there might be ways to reduce the amount written to disk (wal_level=minimal), which might also help with the performance of operations such as CLUSTER, meaning lower downtime. It could be worth a try.

Actions #35

Updated by Pavel Kácha over 3 years ago

Radko Krkoš wrote:

Pavel Kácha wrote:

Pavel Kácha wrote:

Also after quick googling: checkpoint twiddling, synchronous_commit? (link)

From discussion: Already tweaked; it does not show up among the IO hogs.

Some fresh tests were done. While the WAL writes are not negligible during normal operation (~500 MB/hour), they are sequential in nature and low in IOPS (0.1% of all IOPS when a large SELECT is running in parallel). Nevertheless, there might be ways to reduce the amount written to disk (wal_level=minimal), which might also help with the performance of operations such as CLUSTER, meaning lower downtime. It could be worth a try.

The docs say 'minimal' is the default, so it seems we cannot go any lower?

Actions #36

Updated by Radko Krkoš over 3 years ago

The default was changed to replica in v10 (doc). We are probably not going to use replication any time soon, so we can drop it back to minimal. The reduction in WAL size for operations such as CLUSTER is substantial. The difference in total operation time is not that large (no hard numbers yet).
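The change under discussion would be a postgresql.conf edit along these lines (a sketch; on PostgreSQL 10+, wal_level = minimal also requires disabling WAL senders, and the server must be restarted for it to take effect):

```
# postgresql.conf -- sketch of the discussed change; requires a full restart
wal_level = minimal
max_wal_senders = 0   # wal_level = minimal is incompatible with WAL senders
```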

Actions #37

Updated by Pavel Kácha over 3 years ago

Radko Krkoš wrote:

The default was changed to replica in v10 (doc). We are probably not going to use replication any time soon, so we can drop it back to minimal. The reduction in WAL size for operations such as CLUSTER is substantial. The difference in total operation time is not that large (no hard numbers yet).

Ahha... let's go back to 'minimal' then. We were running with it for some time before v10, and we can bump it up again if we start considering replication at some point in the future.

Actions #38

Updated by Radko Krkoš over 3 years ago

Test results for the impact of wal_level = {replica, minimal} on CLUSTER (the test was done on mentat-dev, so ignore the absolute values):

minimal:
Vacuum part: U: 20.05s, S: 10.22s, E: 62.46s
Total time: 456046 ms (7m 36s)
WAL writes: 2.08 GB

replica:
Vacuum part: U: 27.01s, S: 13.19s, E:107.23s
Total time: 514674 ms (8m 34s)
WAL writes: 4.90 GB

Time reduction: 11.39%
Write reduction: 57.55%
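The reduction percentages follow directly from the raw numbers above; as a quick check:

```python
# Recompute the reported reductions from the measured totals above.
time_replica_ms, time_minimal_ms = 514674, 456046
wal_replica_gb, wal_minimal_gb = 4.90, 2.08

time_reduction = 100 * (time_replica_ms - time_minimal_ms) / time_replica_ms
write_reduction = 100 * (wal_replica_gb - wal_minimal_gb) / wal_replica_gb
print(round(time_reduction, 2), round(write_reduction, 2))  # 11.39 57.55
```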

Based on the numbers, the drop to minimal WAL would considerably lower the background maintenance disk writes and also help in the out-of-cache situation. Unfortunately, a DB (and therefore Mentat) restart is required for the change.

Actions #39

Updated by Pavel Kácha over 3 years ago

Radko Krkoš wrote:

Based on the numbers, the drop to minimal WAL would considerably lower the background maintenance disk writes and also help in the out-of-cache situation. Unfortunately, a DB (and therefore Mentat) restart is required for the change.

If that doesn't need any rebuild or other long outage, I believe a restart (even in production) can be done pretty much anytime (just take a look around to make sure no important cronjob is running and Mek isn't conjuring).

Actions #40

Updated by Jan Mach over 3 years ago

Pavel Kácha wrote:

Radko Krkoš wrote:

Based on the numbers, the drop to minimal WAL would considerably lower the background maintenance disk writes and also help in the out-of-cache situation. Unfortunately, a DB (and therefore Mentat) restart is required for the change.

If that doesn't need any rebuild or other long outage, I believe a restart (even in production) can be done pretty much anytime (just take a look around to make sure no important cronjob is running and Mek isn't conjuring).

I have just restarted PostgreSQL, Apache and the Mentat daemons, so the new configuration should now be in effect.

Actions #41

Updated by Jan Mach over 3 years ago

  • Target version changed from Backlog to 2.3
  • % Done changed from 0 to 100