Bug #4569
closedHawat DB connection is not recreated if dropped
100%
Description
The database connection of the web interface is not reopened if the server becomes temporarily unavailable. This happens if the DB server is restarted (reliably reproducible) and in some other cases as well (possibly procedural timeouts or even race conditions?). To improve overall robustness, this should be fixed.
The issue has low priority as the problem does not manifest itself under normal operation.
Files
Related issues
Updated by Radko Krkoš over 5 years ago
This may also be an issue if cancelling long-running queries using pg_cancel_backend()
, what is considered to be implemented for #4384.
Updated by Radko Krkoš over 5 years ago
- File 0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch added
- Status changed from New to Feedback
Several issues leading to this problem have been identified. The first one stems from how SQLAlchemy handles reconnection and will be tricky to fix. Anyways, after a couple (usually 2) retries, the connection is renewed. I would leave this for later. (Also, this one would not affect the pg_cancel_backend()
use case.)
That leads to how we handle the DB connection in EventStorageService
. The attached draft patch shows how this problem can be fixed. It is an ugly hack so I expect several iterations before a merge. Only the method search_events()
is fixed. Other data access methods must be fixed as well. As the exception handling is the same in all of them, may I suggest using a decorator for this? This would save us a lot of lines of code. Another advantage would be that we could very easily (without changing even more lines) reconnect and retry if the connection was found to be lost, what would be a proper fix as far as the user is concerned.
As for the patch, I am not sure what does the psycopg2.DatabaseError
handling fix, but psycopg2.OperationalError
is derived from that and therefore must be given priority. Maybe we should replace the psycopg2.DatabaseError
with something more specific?
Updated by Radko Krkoš over 5 years ago
An obvious candidate for the fix is the EventStorageService
where the data access methods reside. The approach using a decorator looks quite elegant. Unfortunately (for an automatic heal), EventStorageCursor
needs a modification also to support cursor recreation.
Updated by Radko Krkoš over 5 years ago
- File deleted (
0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch)
Updated by Radko Krkoš over 5 years ago
- File 0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch added
In the end, no modification of EventStorageCursor
was necessary to fix the non-SQLAlchemy part of Mentat. The updated attached patch unifies exception handling in data access methods of EventStorageService
using a decorator and recreates the DB connection after PostgreSQL restart, reducing the number of page refreshes to heal (and as a bonus, removes 100 lines of code). This is still a draft, has a few rough edges, but please let's discuss this on the next meeting.
Updated by Pavel Kácha over 5 years ago
Lotta work. Mek, please, take a look sometimes, preferably before meeting, if possible.
Notes to discussion:
If I read it correctly, it retries right away three times. I think that second and subsequent retries should be delayed couple of seconds.
And probably would be fine to put these values somewhere into config.
Updated by Pavel Kácha over 5 years ago
- Related to Bug #4251: User data access through SQLAlchemy keeps a transaction open added
Updated by Radko Krkoš over 5 years ago
This was partially (not enough time) discussed online, Mek likes the decorator approach, so this is good to go for further development and polishing.
The outstanding issues are:- Retry count for renewing the DB connection should be configurable.
- A sleep time after unsuccessful connection renewal should be added and made configurable.
- The exception thrown after all DB reconnects time out should be based on original
psycopg2.OperationalError
(with details) exception. - The data access methods should be looked into. They explicitly state that in case of exception a rollback is performed, this is:
a) not the case anymore as the decorator is responsible for the task;
b) not the only way to handle exceptions as the new added approach is DB reconnection; - The issue of
psycopg2.DatabaseError
should be analyzed. This is a super-class ofpsycopg2.OperationalError
that indicates DB disconnect in the background. What other uses are there?
Any other comments?
The SQLAlchemy part was not diagnosed yet. This regards the event data access only.
Updated by Pavel Kácha over 5 years ago
- no retry/timeout, try just once (hence not blocking... too much)
- docstrings don't need to be updated
- own exception (with encapsulated original exception) (StorageConnectionException)
- OperationalError - reconnect, DatabaseError - rollback
Updated by Radko Krkoš over 5 years ago
- File deleted (
0001-DRAFT-Fix-eternal-500-in-Hawat-after-DB-restart.patch)
Updated by Radko Krkoš over 5 years ago
- File 0001-DRAFT-v3-Fix-eternal-500-in-Hawat-after-DB-restart.patch 0001-DRAFT-v3-Fix-eternal-500-in-Hawat-after-DB-restart.patch added
Here is v3 of the patch with the feedback incorporated. It comes with additional testing, a thorough review of all replaced exception handling to make sure nothing was missed and some cosmetic improvements.
Updated by Radko Krkoš over 5 years ago
- Assignee changed from Radko Krkoš to Jan Mach
This should be good to go. Mek?
Updated by Jan Mach over 5 years ago
- Status changed from Feedback to In Progress
- Assignee changed from Jan Mach to Radko Krkoš
I have just deployed the attached patch to the mentat-alt test system and sadly I have to report, that the issue remains. If you restart postgresql database without restarting apache and all Mentat daemons, the connection does not get recreated and system fails (daemons terminate and apache returns 500). Could you please look into that again? You are free to test this on mentat-alt system.
Mentat is now installed in its own virtual environment and I have verified that the patch code was successfully applied to /var/mentat/venv/lib/python3.5/site-packages/mentat/services/eventstorage.py file.
Updated by Radko Krkoš over 5 years ago
- Status changed from In Progress to Feedback
Jan Mach wrote:
I have just deployed the attached patch to the mentat-alt test system and sadly I have to report, that the issue remains. If you restart postgresql database without restarting apache and all Mentat daemons, the connection does not get recreated and system fails (daemons terminate and apache returns 500). Could you please look into that again? You are free to test this on mentat-alt system.
Apache indeed does return 500 for the first time, this is expected. The patch was supposed to fix the failure on event database access. The first failure is due to SQLAlchemy (user data access) and will be more complicated to fix. SQLAlchemy however recovers on retry (sometimes 2 retries are needed - seems to be related to some timing of failed transaction rollback in SQLAlchemy, this will be a true delight to debug), handing execution to the event table access code, where this patch comes into action, allowing for instant recovery if the connection was lost. (Without the patch, the recovery would only happen on apache restart - hence eternal 500.)
Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.
As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?
Mentat is now installed in its own virtual environment and I have verified that the patch code was successfully applied to /var/mentat/venv/lib/python3.5/site-packages/mentat/services/eventstorage.py file.
Thanks, this was useful info.
Updated by Jan Mach over 5 years ago
Radko Krkoš wrote:
Jan Mach wrote:
I have just deployed the attached patch to the mentat-alt test system and sadly I have to report, that the issue remains. If you restart postgresql database without restarting apache and all Mentat daemons, the connection does not get recreated and system fails (daemons terminate and apache returns 500). Could you please look into that again? You are free to test this on mentat-alt system.
Apache indeed does return 500 for the first time, this is expected. The patch was supposed to fix the failure on event database access. The first failure is due to SQLAlchemy (user data access) and will be more complicated to fix. SQLAlchemy however recovers on retry (sometimes 2 retries are needed - seems to be related to some timing of failed transaction rollback in SQLAlchemy, this will be a true delight to debug), handing execution to the event table access code, where this patch comes into action, allowing for instant recovery if the connection was lost. (Without the patch, the recovery would only happen on apache restart - hence eternal 500.)
Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.
Enricher has internal reload timer, that triggers fetching fresh enrichment data from database to internal in-memory cache. I have observed daemon failure always after each database restart, it just does not occur always immediately after the restart.
As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?
If you restart postgresql without restarting apache, for example the event search form https://mentat-alt.cesnet.cz/mentat/events/search?dt_from=&dt_to=&source_addrs=&source_ports=&submit=Search starts throwing 500s. But I now see, that the problem described in traceback is related to not yet patched sqlalchemy code, not the already fixed psycopg2 code. Sorry for the misdirection.
Mentat is now installed in its own virtual environment and I have verified that the patch code was successfully applied to /var/mentat/venv/lib/python3.5/site-packages/mentat/services/eventstorage.py file.
Thanks, this was useful info.
Updated by Radko Krkoš over 5 years ago
Jan Mach wrote:
Radko Krkoš wrote:
Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.
Enricher has internal reload timer, that triggers fetching fresh enrichment data from database to internal in-memory cache. I have observed daemon failure always after each database restart, it just does not occur always immediately after the restart.
Yes, that makes sense. Should we create a different issue for that as it is not strictly Hawat related or do you want this issue converted to an umbrella DB conn drop recovery one?
In any case, do you want me to look into that?
As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?
If you restart postgresql without restarting apache, for example the event search form https://mentat-alt.cesnet.cz/mentat/events/search?dt_from=&dt_to=&source_addrs=&source_ports=&submit=Search starts throwing 500s. But I now see, that the problem described in traceback is related to not yet patched sqlalchemy code, not the already fixed psycopg2 code. Sorry for the misdirection.
OK, but those disappear on their own after a few requests. Is apache restart really necessary?
Updated by Jan Mach over 5 years ago
Radko Krkoš wrote:
Jan Mach wrote:
Radko Krkoš wrote:
Sadly I can confirm that some Mentat daemons (I have seen enricher) sometimes fail after DB restart. I was not able to reproduce this reliably however, it seems to happen very rarely for me.
Enricher has internal reload timer, that triggers fetching fresh enrichment data from database to internal in-memory cache. I have observed daemon failure always after each database restart, it just does not occur always immediately after the restart.
Yes, that makes sense. Should we create a different issue for that as it is not strictly Hawat related or do you want this issue converted to an umbrella DB conn drop recovery one?
I think we should continue to resolve the matter in this issue to reduce the management overhead. I think, that the issue title is ok, when we resolve the matter for Hawat, we resolve it also for Mentat backend.
In any case, do you want me to look into that?
Yes please, look into it. I am occupied elsewhere and I would be grateful, if you could look into it further.
As for the apache restart, it does not seem necessary for me in any case (after applying the patch). Can you describe the case?
If you restart postgresql without restarting apache, for example the event search form https://mentat-alt.cesnet.cz/mentat/events/search?dt_from=&dt_to=&source_addrs=&source_ports=&submit=Search starts throwing 500s. But I now see, that the problem described in traceback is related to not yet patched sqlalchemy code, not the already fixed psycopg2 code. Sorry for the misdirection.
OK, but those disappear on their own after a few requests. Is apache restart really necessary?
Updated by Radko Krkoš over 5 years ago
- File 0002-Fix-Fixed-enricher-failure-on-DB-restart.patch 0002-Fix-Fixed-enricher-failure-on-DB-restart.patch added
- File 0003-Fix-Fixed-mentat-storage-failing-on-DB-restart.patch 0003-Fix-Fixed-mentat-storage-failing-on-DB-restart.patch added
The two attached patches are aimed at newfound issues regarding recovery after DB restart. See commit messages for details. Fixes are for mentat-enricher and for mentat-storage (partial fix).
The enricher one was tricky to figure out and it uses the method devised by Mek - overriding SQLAlchemy's Query class. It does not fix Hawat issues regarding SQLAlchemy as described before. Those will be dealt with separately.
Updated by Jan Mach over 5 years ago
- Target version changed from Backlog to 2.5
Updated by Jan Mach over 5 years ago
- % Done changed from 0 to 100
Took a while, but I have just deployed your work on our mentat-alt server.
Updated by Radko Krkoš over 5 years ago
I am glad it helped. With these patches, all known points of failure regarding DB connection are addressed, resulting in automatic recovery after connection loss. The only edge case is the user data access code using SQLAlchemy, which only recovers after the second try. We should solve that at some point.
In the meantime, I would like an intensive test to validate the patches and possibly find other points of failure. Could we run a series of regular DB restarts (using cron probably) to see how the Mentat as a whole would perform? This have to be timed to not interfere with your work so maybe a weekend is a good idea?
Also, I am unsure whether mentat-alt performs all the tasks that production mentat does, particularly the reporting. It would be beneficial to make it as similar to mentat-hub as possible, of course with a different set of addressees.
Unfortunately this would be more work for you Mek, I can help very little as my knowledge of the whole Mentat is lacking severely, I would not know where to look (well, I would probably monitor the logs for errors). So take this as a suggestion and decide at your discretion. Of course I will help in any way I can.
Updated by Pavel Kácha over 5 years ago
Radko Krkoš wrote:
Also, I am unsure whether mentat-alt performs all the tasks that production mentat does, particularly the reporting. It would be beneficial to make it as similar to mentat-hub as possible, of course with a different set of addressees.
AFAIK, mentat-alt Reporter does indeed generate reports, however mails are intentionally sent only to admin (for spot inspection). Reports are not exactly same as on mentat-hub, I presume it's because of slight differences in start/relapse windows.
Updated by Jan Mach over 5 years ago
Radko, you are free to experiment as you like, at least we will possibly identify some other points of failure after unexpected restart. So feel free to add some restarting jobs to cron, just do not leave them indefinitely and perhaps make them output some content, perhaps even with some echo comment, so that I am aware that the restart was intentional.
Updated by Jan Mach over 5 years ago
Note: Reporter is indeed running on mentat-alt, but all reports are redirected to local root for possible inspection and to prevent from spamming. Different report contents on -alt and -hub are due to different content/state of thresholding caches, configuration is the same.
Updated by Radko Krkoš over 5 years ago
- File 0004-Fix-Use-RetryingQuery-where-possible-also-in-Hawat.patch 0004-Fix-Use-RetryingQuery-where-possible-also-in-Hawat.patch added
The attached patch ensures that the RetryingQuery from 0002 is actually used in some other places. Especially in Hawat's DB shim, which is declared independently. After applying this patch, the original issue seems to go away, regardless the DB restarts, the Hawat interface (and also some other Mentat components fixed in the meantime) recovers automatically, without interruption to the user.
There is still the issue of tests by cron DB restarts as discussed before. I would like a quick VC sometimes next week to work out the details.
Updated by Jan Mach over 5 years ago
Latest patch is currently deployed to mentat-alt.
Updated by Jan Mach about 5 years ago
- Status changed from Feedback to Closed
I now consider this issue to be closed.