Task #6096
Server and database performance and monitoring
Status: Closed
% Done: 100%
Description
When working on issues #4384, #4274 and #4275 I was playing a lot with the PostgreSQL database and with the server monitoring configuration. I have discovered we were neglecting some parts of our database. We paid a lot of attention to the mentat_events database, which takes a lot of load. However, we have not properly configured the mentat_main database, which contains periodical statistics and reports and also deserves our attention.
Let's try to improve this. I can see room for improvement in the following areas:
- Improve Nagios monitoring and system-status capabilities to better check system health.
- Consider better server and database performance monitoring options to detect possible problems, bottlenecks, or just have better knowledge about system resource usage.
- Consider the following tools: Grafana, TimescaleDB.
- Improve tools and documentation for system administrators.
Updated by Jan Mach almost 5 years ago
- Description updated (diff)
- Status changed from New to In Progress
- % Done changed from 0 to 20
I have already done the following work:
- When logged in on mentat-alt|hub, check the new version of the system-status script. I have improved the section for PostgreSQL monitoring. These new health checks are done with the help of the Nagios plugin check_postgres, so once we tune them we can turn the checks on in our organization-wide Nagios instance. Both the mentat_events and mentat_main databases are now checked for the number of connections, hit ratio, worst query time, bloat and the last (auto)vacuum and (auto)analyze operations (a sketch of the underlying statistics queries is at the end of this list). I have discovered the following problems/issues to discuss:
  - [A] What are the optimal thresholds for each check? I have used made-up values to start with, but we should tune them if necessary.
  - [B] Current results suggest that the mentat_main database is not optimally configured to run periodical vacuums and analyzes.
  - [C] Was the PostgreSQL configuration tweaked to accommodate recent schema changes, namely the recent split and addition of the events_json table? Doesn't the database documentation page need some improvements to reflect these changes?
- I have installed a new server monitoring stack to play with. It is accessible at https://griffin.cesnet.cz/ (access protected with eduID) and built on top of Telegraf - TimescaleDB - Grafana. I played for a while with setting up some basic server monitoring dashboards, but there is still a lot of work before they are really usable. The idea is to have a better server monitoring solution/platform than the current Nagios system, to better spot system performance bottlenecks, to better judge the effect of various system/database changes and/or upgrades on system performance, etc. I think that with the monitoring system fully under our control the possibilities are endless. I also base the usefulness of such a system on the lectures from the P2D2 conference (https://p2d2.cz/files/zeleny-stehule-monitoring.pdf). I see another benefit in the opportunity to check out the TimescaleDB extension, which could potentially also be used in the Mentat system database for storing IDEA events, reports or periodical statistics.
- I have improved the following documentation pages:
  - administration: upgraded the information about monitoring and maintenance, implemented a convenient database maintenance script sqldb-maintenance.sh
  - upgrading: updated the update procedures
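For reference, the kind of statistics these new check_postgres based health checks look at can be expressed as plain SQL against the PostgreSQL statistics views. This is only an illustration of what is being measured, not the actual queries used by check_postgres, and the warning/critical thresholds discussed above are passed to the plugin itself, not part of these queries:

    -- Number of client connections per database (the backends check).
    SELECT datname, numbackends
      FROM pg_stat_database
     WHERE datname IN ('mentat_events', 'mentat_main');

    -- Buffer cache hit ratio per database (the hit-ratio check).
    SELECT datname,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_ratio_pct
      FROM pg_stat_database
     WHERE datname IN ('mentat_events', 'mentat_main');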
Updated by Pavel Kácha almost 5 years ago
- Assignee changed from Jan Mach to Radko Krkoš
Could you please look into which checks make sense, which are duplicates, whether they have sane limits, and whether they show real problems (autovacuum etc.)?
Is mentat_main sanely configured?
Is mentat_events (specifically events_json) sanely configured after db split?
Updated by Radko Krkoš almost 5 years ago
- File 0001-Add-Alembic-revision-to-enable-CLUSTER-on-Hawat-tabl.patch added
The database mentat_main was checked. For the sqldb-maintenance.sh script to work, there must be indices set for the tables to be CLUSTERed according to; right now the script would do nothing on mentat_main. This was fixed on mentat-alt. Also, an Alembic revision is prepared to implement this. It does not perform the CLUSTERing itself, it just sets the appropriate indices; see the attached patch. As the tables in mentat_main are much smaller (than in mentat_events) and INSERT-only, there is not much more to do there. Just make sure to perform an occasional CLUSTER during maintenance.
In mentat_events, the autovacuum configuration was set for the events_json table. This was documented and the patches were sent by e-mail.
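To illustrate both changes (the table name, index name and numeric values below are placeholders of my own, not the ones from the attached patch or the e-mailed configuration), setting a clustering index and per-table autovacuum parameters looks roughly like this:

    -- Tell PostgreSQL which index subsequent CLUSTER runs should use
    -- (hypothetical table and index names, for illustration only).
    ALTER TABLE some_main_table CLUSTER ON ix_some_main_table_createtime;

    -- Per-table autovacuum tuning for the large events_json table
    -- (the scale factors shown are illustrative, not the deployed values).
    ALTER TABLE events_json SET (
        autovacuum_vacuum_scale_factor  = 0.01,
        autovacuum_analyze_scale_factor = 0.005
    );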
Updated by Radko Krkoš almost 5 years ago
Pavel Kácha wrote:
Could you please look into which checks make sense, which are duplicates, whether they have sane limits, and whether they show real problems (autovacuum etc.)?
The checks have been reviewed. That resulted in some adjustments to the thresholds. The checks are deployed by an Ansible role that is maintained on GitHub, so a pull request [1] has been sent. The checks for autovacuum are useless for mentat_main, as changes there are manual. For the new threshold values, see the pull request.
The check check_postgres_query_time is useless now, as it has not been updated for current PostgreSQL versions. The queries used to gather the information have not worked since v9.1. As upstream does not seem to be active, we could fix up a quick and dirty solution, since we probably don't care about multi-version support.
[1] https://github.com/honzamach/ansible-role-postgresql/pull/1
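Since vacuuming and analyzing of mentat_main is done manually (by the maintenance script), one way to verify by hand that it is actually happening is to look at the statistics view directly; a minimal sketch:

    -- When was each table last vacuumed/analyzed, manually or by autovacuum?
    SELECT relname,
           last_vacuum, last_autovacuum,
           last_analyze, last_autoanalyze
      FROM pg_stat_user_tables
     ORDER BY relname;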
Updated by Radko Krkoš almost 5 years ago
Radko Krkoš wrote:
The check check_postgres_query_time is useless now, as it has not been updated for current PostgreSQL versions. The queries used to gather the information have not worked since v9.1. As upstream does not seem to be active, we could fix up a quick and dirty solution, since we probably don't care about multi-version support.
Ignore this, it was a mistake on my part (the query generation is multi-stage and a correct query for the current version is generated in the end).
On the other hand, check_postgres_query_time only reports the execution time of queries running at the moment of the check, not historical ones, so you have to hit the long-running query while it is executing for it to be reported.
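In other words, the check only sees what is currently in pg_stat_activity, roughly equivalent to something like the following (a sketch for illustration, not the actual query used by check_postgres):

    -- Currently running queries and how long they have been executing.
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 80) AS query_text
      FROM pg_stat_activity
     WHERE state <> 'idle'
     ORDER BY runtime DESC NULLS LAST;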
Updated by Pavel Kácha almost 5 years ago
- To be discussed changed from Yes to No
Updated by Jan Mach almost 5 years ago
- Assignee changed from Radko Krkoš to Jan Mach
- Target version changed from Backlog to 2.6
- % Done changed from 20 to 80
As Radko noted, the check_postgres_query_time check is pretty much useless. Instead, we will implement a simple Nagios plugin that will parse the PostgreSQL log file and look for long-running queries there.
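For a log-based check to see anything, PostgreSQL has to be logging slow statements in the first place; the usual knob for this is log_min_duration_statement (the threshold below is just an example value, not necessarily what we deploy):

    -- Log every statement that runs longer than 10 seconds (example value).
    ALTER SYSTEM SET log_min_duration_statement = '10s';
    SELECT pg_reload_conf();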
Updated by Jan Mach almost 5 years ago
- Status changed from In Progress to Closed
- % Done changed from 80 to 100
I have created a simple shell script/NRPE plugin for PostgreSQL log file monitoring. It is already deployed on all our installations and was incorporated into our Ansible PostgreSQL role.
I have incorporated the patch attached by Radko Krkoš containing the ALTER TABLE commands for optimizing the tables in the mentat_main database, but I have moved the commands from the migration script to the documentation. It is not possible to define ALTER TABLE commands in SQLAlchemy ORM models, and it would be necessary to run migrations for fresh installations, which is not supported by our workflow; there would be problems with stamping the database schema version.
I think this issue can be closed now.
Updated by Radko Krkoš almost 5 years ago
- To be discussed changed from No to Yes