Bug #4569: Hawat DB connection is not recreated if dropped - Mentat - Homeproj: Redmine for CESNET

Actions

Copy link

Bug #4569

closed

Hawat DB connection is not recreated if dropped

Added by Radko Krkoš almost 6 years ago. Updated about 5 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Radko Krkoš

Category:

Development - GUI

Target version:

2.5

Start date:

01/17/2019

Due date:

% Done:

100%

Estimated time:

To be discussed:

Description

The database connection of the web interface is not reopened if the server becomes temporarily unavailable. This happens if the DB server is restarted (reliably reproducible) and in some other cases as well (possibly procedural timeouts or even race conditions?). To improve overall robustness, this should be fixed.

The issue has low priority as the problem does not manifest itself under normal operation.

Files

Download all files

0001-DRAFT-v3-Fix-eternal-500-in-Hawat-after-DB-restart.patch (21 KB) 0001-DRAFT-v3-Fix-eternal-500-in-Hawat-after-DB-restart.patch		Radko Krkoš, 02/22/2019 07:53 PM
0002-Fix-Fixed-enricher-failure-on-DB-restart.patch (2.04 KB) 0002-Fix-Fixed-enricher-failure-on-DB-restart.patch		Radko Krkoš, 05/21/2019 12:32 PM
0003-Fix-Fixed-mentat-storage-failing-on-DB-restart.patch (1.16 KB) 0003-Fix-Fixed-mentat-storage-failing-on-DB-restart.patch		Radko Krkoš, 05/21/2019 12:32 PM
0004-Fix-Use-RetryingQuery-where-possible-also-in-Hawat.patch (2.75 KB) 0004-Fix-Use-RetryingQuery-where-possible-also-in-Hawat.patch		Radko Krkoš, 07/12/2019 01:50 PM

Related issues

Actions

Copy link

Updated by Radko Krkoš almost 6 years ago

This may also be an issue if cancelling long-running queries using pg_cancel_backend(), what is considered to be implemented for #4384.

Actions

Copy link

Updated by Radko Krkoš almost 6 years ago

File 0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch added
Status changed from New to Feedback

Several issues leading to this problem have been identified. The first one stems from how SQLAlchemy handles reconnection and will be tricky to fix. Anyways, after a couple (usually 2) retries, the connection is renewed. I would leave this for later. (Also, this one would not affect the pg_cancel_backend() use case.)
That leads to how we handle the DB connection in EventStorageService. The attached draft patch shows how this problem can be fixed. It is an ugly hack so I expect several iterations before a merge. Only the method search_events() is fixed. Other data access methods must be fixed as well. As the exception handling is the same in all of them, may I suggest using a decorator for this? This would save us a lot of lines of code. Another advantage would be that we could very easily (without changing even more lines) reconnect and retry if the connection was found to be lost, what would be a proper fix as far as the user is concerned.
As for the patch, I am not sure what does the psycopg2.DatabaseError handling fix, but psycopg2.OperationalError is derived from that and therefore must be given priority. Maybe we should replace the psycopg2.DatabaseError with something more specific?

Actions

Copy link

Updated by Radko Krkoš almost 6 years ago

An obvious candidate for the fix is the EventStorageService where the data access methods reside. The approach using a decorator looks quite elegant. Unfortunately (for an automatic heal), EventStorageCursor needs a modification also to support cursor recreation.

Actions

Copy link

Updated by Radko Krkoš almost 6 years ago

File deleted (~~0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch~~)

Actions

Copy link

Updated by Radko Krkoš almost 6 years ago

File 0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch added

In the end, no modification of EventStorageCursor was necessary to fix the non-SQLAlchemy part of Mentat. The updated attached patch unifies exception handling in data access methods of EventStorageService using a decorator and recreates the DB connection after PostgreSQL restart, reducing the number of page refreshes to heal (and as a bonus, removes 100 lines of code). This is still a draft, has a few rough edges, but please let's discuss this on the next meeting.

Actions

Copy link

Updated by Pavel Kácha almost 6 years ago

Lotta work. Mek, please, take a look sometimes, preferably before meeting, if possible.

Notes to discussion:
If I read it correctly, it retries right away three times. I think that second and subsequent retries should be delayed couple of seconds.

And probably would be fine to put these values somewhere into config.

Actions

Copy link

Updated by Pavel Kácha almost 6 years ago

Related to Bug #4251: User data access through SQLAlchemy keeps a transaction open added

Actions

Copy link

Updated by Radko Krkoš almost 6 years ago

This was partially (not enough time) discussed online, Mek likes the decorator approach, so this is good to go for further development and polishing.

The outstanding issues are:

Retry count for renewing the DB connection should be configurable.
A sleep time after unsuccessful connection renewal should be added and made configurable.
The exception thrown after all DB reconnects time out should be based on original psycopg2.OperationalError (with details) exception.
The data access methods should be looked into. They explicitly state that in case of exception a rollback is performed, this is:
a) not the case anymore as the decorator is responsible for the task;
b) not the only way to handle exceptions as the new added approach is DB reconnection;
The issue of psycopg2.DatabaseError should be analyzed. This is a super-class of psycopg2.OperationalError that indicates DB disconnect in the background. What other uses are there?

Any other comments?

The SQLAlchemy part was not diagnosed yet. This regards the event data access only.

Actions

Copy link

Updated by Pavel Kácha almost 6 years ago

After meeting:

no retry/timeout, try just once (hence not blocking... too much)
docstrings don't need to be updated
own exception (with encapsulated original exception) (StorageConnectionException)
OperationalError - reconnect, DatabaseError - rollback

Actions

Copy link

#10

Updated by Radko Krkoš over 5 years ago

File deleted (~~0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch~~)

Actions

Copy link

#11

Updated by Radko Krkoš over 5 years ago

File 0001-DRAFT-v3-Fix-eternal-500-in-Hawat-after-DB-restart.patch 0001-DRAFT-v3-Fix-eternal-500-in-Hawat-after-DB-restart.patch added

Here is v3 of the patch with the feedback incorporated. It comes with additional testing, a thorough review of all replaced exception handling to make sure nothing was missed and some cosmetic improvements.

Actions

Copy link

#12

Updated by Radko Krkoš over 5 years ago

Assignee changed from Radko Krkoš to Jan Mach

This should be good to go. Mek?

Actions

Copy link

#13

Updated by Jan Mach over 5 years ago

Status changed from Feedback to In Progress
Assignee changed from Jan Mach to Radko Krkoš

I have just deployed the attached patch to the mentat-alt test system and sadly I have to report, that the issue remains. If you restart postgresql database without restarting apache and all Mentat daemons, the connection does not get recreated and system fails (daemons terminate and apache returns 500). Could you please look into that again? You are free to test this on mentat-alt system.

Mentat is now installed in its own virtual environment and I have verified that the patch code was successfully applied to /var/mentat/venv/lib/python3.5/site-packages/mentat/services/eventstorage.py file.

Actions

Copy link

#14

Updated by Radko Krkoš over 5 years ago

Status changed from In Progress to Feedback

Jan Mach wrote:

I have just deployed the attached patch to the mentat-alt test system and sadly I have to report, that the issue remains. If you restart postgresql database without restarting apache and all Mentat daemons, the connection does not get recreated and system fails (daemons terminate and apache returns 500). Could you please look into that again? You are free to test this on mentat-alt system.

Apache indeed does return 500 for the first time, this is expected. The patch was supposed to fix the failure on event database access. The first failure is due to SQLAlchemy (user data access) and will be more complicated to fix. SQLAlchemy however recovers on retry (sometimes 2 retries are needed - seems to be related to some timing of failed transaction rollback in SQLAlchemy, this will be a true delight to debug), handing execution to the event table access code, where this patch comes into action, allowing for instant recovery if the connection was lost. (Without the patch, the recovery would only happen on apache restart - hence eternal 500.)

Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.

As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?

Mentat is now installed in its own virtual environment and I have verified that the patch code was successfully applied to /var/mentat/venv/lib/python3.5/site-packages/mentat/services/eventstorage.py file.

Thanks, this was useful info.

Actions

Copy link

#15

Updated by Jan Mach over 5 years ago

Radko Krkoš wrote:

Jan Mach wrote:

I have just deployed the attached patch to the mentat-alt test system and sadly I have to report, that the issue remains. If you restart postgresql database without restarting apache and all Mentat daemons, the connection does not get recreated and system fails (daemons terminate and apache returns 500). Could you please look into that again? You are free to test this on mentat-alt system.

Apache indeed does return 500 for the first time, this is expected. The patch was supposed to fix the failure on event database access. The first failure is due to SQLAlchemy (user data access) and will be more complicated to fix. SQLAlchemy however recovers on retry (sometimes 2 retries are needed - seems to be related to some timing of failed transaction rollback in SQLAlchemy, this will be a true delight to debug), handing execution to the event table access code, where this patch comes into action, allowing for instant recovery if the connection was lost. (Without the patch, the recovery would only happen on apache restart - hence eternal 500.)

Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.

Enricher has internal reload timer, that triggers fetching fresh enrichment data from database to internal in-memory cache. I have observed daemon failure always after each database restart, it just does not occur always immediately after the restart.

As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?

If you restart postgresql without restarting apache, for example the event search form https://mentat-alt.cesnet.cz/mentat/events/search?dt_from=&dt_to=&source_addrs=&source_ports=&submit=Search starts throwing 500s. But I now see, that the problem described in traceback is related to not yet patched sqlalchemy code, not the already fixed psycopg2 code. Sorry for the misdirection.

Mentat is now installed in its own virtual environment and I have verified that the patch code was successfully applied to /var/mentat/venv/lib/python3.5/site-packages/mentat/services/eventstorage.py file.

Thanks, this was useful info.

Actions

Copy link

#16

Updated by Radko Krkoš over 5 years ago

Jan Mach wrote:

Radko Krkoš wrote:

Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.

Enricher has internal reload timer, that triggers fetching fresh enrichment data from database to internal in-memory cache. I have observed daemon failure always after each database restart, it just does not occur always immediately after the restart.

Yes, that makes sense. Should we create a different issue for that as it is not strictly Hawat related or do you want this issue converted to an umbrella DB conn drop recovery one?
In any case, do you want me to look into that?

As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?

If you restart postgresql without restarting apache, for example the event search form https://mentat-alt.cesnet.cz/mentat/events/search?dt_from=&dt_to=&source_addrs=&source_ports=&submit=Search starts throwing 500s. But I now see, that the problem described in traceback is related to not yet patched sqlalchemy code, not the already fixed psycopg2 code. Sorry for the misdirection.

OK, but those disappear on their own after a few requests. Is apache restart really necessary?

Actions

Copy link

#17

Updated by Jan Mach over 5 years ago

Radko Krkoš wrote:

Jan Mach wrote:

Radko Krkoš wrote:

Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.

Enricher has internal reload timer, that triggers fetching fresh enrichment data from database to internal in-memory cache. I have observed daemon failure always after each database restart, it just does not occur always immediately after the restart.

Yes, that makes sense. Should we create a different issue for that as it is not strictly Hawat related or do you want this issue converted to an umbrella DB conn drop recovery one?

I think we should continue to resolve the matter in this issue to reduce the management overhead. I think, that the issue title is ok, when we resolve the matter for Hawat, we resolve it also for Mentat backend.

In any case, do you want me to look into that?

Yes please, look into it. I am occupied elsewhere and I would be grateful, if you could look into it further.

As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?

If you restart postgresql without restarting apache, for example the event search form https://mentat-alt.cesnet.cz/mentat/events/search?dt_from=&dt_to=&source_addrs=&source_ports=&submit=Search starts throwing 500s. But I now see, that the problem described in traceback is related to not yet patched sqlalchemy code, not the already fixed psycopg2 code. Sorry for the misdirection.

OK, but those disappear on their own after a few requests. Is apache restart really necessary?

Actions

Copy link

#18

Updated by Radko Krkoš over 5 years ago

The two attached patches are aimed at newfound issues regarding recovery after DB restart. See commit messages for details. Fixes are for mentat-enricher and for mentat-storage (partial fix).
The enricher one was tricky to figure out and it uses the method devised by Mek - overriding SQLAlchemy's Query class. It does not fix Hawat issues regarding SQLAlchemy as described before. Those will be dealt with separately.

Actions

Copy link

#19

Updated by Jan Mach over 5 years ago

Target version changed from Backlog to 2.5

Actions

Copy link

#20

Updated by Jan Mach over 5 years ago

% Done changed from 0 to 100

Took a while, but I have just deployed your work on our mentat-alt server.

Actions

Copy link

#21

Updated by Radko Krkoš over 5 years ago

I am glad it helped. With these patches, all known points of failure regarding DB connection are addressed, resulting in automatic recovery after connection loss. The only edge case is the user data access code using SQLAlchemy, which only recovers after the second try. We should solve that at some point.
In the meantime, I would like an intensive test to validate the patches and possibly find other points of failure. Could we run a series of regular DB restarts (using cron probably) to see how the Mentat as a whole would perform? This have to be timed to not interfere with your work so maybe a weekend is a good idea?
Also, I am unsure whether mentat-alt performs all the tasks that production mentat does, particularly the reporting. It would be beneficial to make it as similar to mentat-hub as possible, of course with a different set of addressees.
Unfortunately this would be more work for you Mek, I can help very little as my knowledge of the whole Mentat is lacking severely, I would not know where to look (well, I would probably monitor the logs for errors). So take this as a suggestion and decide at your discretion. Of course I will help in any way I can.

Actions

Copy link

#22

Updated by Pavel Kácha over 5 years ago

Radko Krkoš wrote:

Also, I am unsure whether mentat-alt performs all the tasks that production mentat does, particularly the reporting. It would be beneficial to make it as similar to mentat-hub as possible, of course with a different set of addressees.

AFAIK, mentat-alt Reporter does indeed generate reports, however mails are intentionally sent only to admin (for spot inspection). Reports are not exactly same as on mentat-hub, I presume it's because of slight differences in start/relapse windows.

Actions

Copy link

#23

Updated by Jan Mach over 5 years ago

Radko, you are free to experiment as you like, at least we will possibly identify some other points of failure after unexpected restart. So feel free to add some restarting jobs to cron, just do not leave them indefinitely and perhaps make them output some content, perhaps even with some echo comment, so that I am aware that the restart was intentional.

Actions

Copy link

#24

Updated by Jan Mach over 5 years ago

Note: Reporter is indeed running on mentat-alt, but all reports are redirected to local root for possible inspection and to prevent from spamming. Different report contents on -alt and -hub are due to different content/state of thresholding caches, configuration is the same.

Actions

Copy link

#25

Updated by Jan Mach over 5 years ago

Priority changed from Low to Normal

Actions

Copy link

#26

Updated by Radko Krkoš over 5 years ago

File 0004-Fix-Use-RetryingQuery-where-possible-also-in-Hawat.patch 0004-Fix-Use-RetryingQuery-where-possible-also-in-Hawat.patch added

The attached patch ensures that the RetryingQuery from 0002 is actually used in some other places. Especially in Hawat's DB shim, which is declared independently. After applying this patch, the original issue seems to go away, regardless the DB restarts, the Hawat interface (and also some other Mentat components fixed in the meantime) recovers automatically, without interruption to the user.
There is still the issue of tests by cron DB restarts as discussed before. I would like a quick VC sometimes next week to work out the details.

Actions

Copy link

#27

Updated by Jan Mach over 5 years ago

Latest patch is currently deployed to mentat-alt.

Actions

Copy link

#28

Updated by Jan Mach about 5 years ago

Status changed from Feedback to Closed

I now consider this issue to be closed.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Mentat

Custom queries

Bug #4569

Hawat DB connection is not recreated if dropped

Updated by Radko Krkoš almost 6 years ago

Updated by Radko Krkoš almost 6 years ago

Updated by Radko Krkoš almost 6 years ago

Updated by Radko Krkoš almost 6 years ago

Updated by Radko Krkoš almost 6 years ago

Updated by Pavel Kácha almost 6 years ago

Updated by Pavel Kácha almost 6 years ago

Updated by Radko Krkoš almost 6 years ago

Updated by Pavel Kácha almost 6 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Pavel Kácha over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Radko Krkoš over 5 years ago

Updated by Jan Mach over 5 years ago

Updated by Jan Mach about 5 years ago