Task #6096

Server and database performance and monitoring

Added by Jan Mach over 4 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Installation
Target version:
Start date: 11/27/2019
Due date:
% Done: 100%
Estimated time:
To be discussed: No

Description

When working on issues #4384, #4274 and #4275 I was playing a lot with the PostgreSQL database and with the server monitoring configuration. I have discovered that we were neglecting some parts of our database setup. We paid a lot of attention to the mentat_events database, which takes a lot of load. However, we have not properly configured the mentat_main database, which contains periodical statistics and reports and also deserves our attention.

Let's try to improve this. I can see room for improvement in the following areas:

  • Improve Nagios monitoring and system-status capabilities to better check system health.
  • Consider better server and database performance monitoring options to detect possible problems, bottlenecks, or just have better knowledge about system resource usage.
  • Consider following tools: Grafana, TimescaleDB.
  • Improve tools and documentation for system administrators.

Actions #1

Updated by Jan Mach over 4 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • % Done changed from 0 to 20

I have already done the following work:

  • When logged in on mentat-alt|hub, check out the new version of the system-status script. I have improved the section for PostgreSQL monitoring. The new health checks are performed with the help of the Nagios plugin check_postgres, so once we tune them we can turn them on in our organization-wide Nagios instance. Both the mentat_events and mentat_main databases are now checked for the number of connections, hit ratio, worst query time, bloat, and the time of the last (auto)vacuum and (auto)analyze operations (see the sketch after this list). I have discovered the following problems/issues to discuss:
    • [A] What are the optimal thresholds for each check? I have used made-up values to start with, but we should tune them as necessary.
    • [B] Current results suggest that the mentat_main database is not optimally configured to run periodic vacuum and analyze operations.
    • [C] Was the PostgreSQL configuration tweaked to accommodate the recent schema changes, namely the split and addition of the events_json table? Doesn't the database documentation page need some improvements to reflect these changes?
  • I have installed a new server monitoring stack to play with. It is accessible at https://griffin.cesnet.cz/ (access protected by eduID) and built on top of Telegraf - TimescaleDB - Grafana. I have set up some basic server monitoring dashboards, but a lot of work remains before they are really usable. The idea is to have a better server monitoring solution/platform than the current Nagios system, so that we can spot performance bottlenecks and better judge the effect of various system/database changes and upgrades on system performance. With the monitoring system fully under our control, the possibilities are endless. I also base the usefulness of such a system on the lectures from the P2D2 conference (https://p2d2.cz/files/zeleny-stehule-monitoring.pdf). Another benefit is the opportunity to check out the TimescaleDB extension, which could potentially also be used in the Mentat system database for storing IDEA events, reports, or periodical statistics.
  • I have improved the following documentation pages:
    • administration - updated the information about monitoring and maintenance, implemented the convenient database maintenance script sqldb-maintenance.sh
    • upgrading - updated the upgrade procedures
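
For illustration, the checks listed above map onto check_postgres actions roughly as in the following sketch. This is only a sketch: the plugin path, connection options and threshold values are made-up examples, not the ones actually deployed on mentat-alt|hub.

    #!/bin/sh
    # Sketch of the check_postgres based health checks described above.
    # Plugin path and thresholds are illustrative assumptions only.
    CHECK=/usr/lib/nagios/plugins/check_postgres.pl

    for DB in mentat_events mentat_main; do
        # Number of client connections (as a percentage of max_connections).
        $CHECK --action=backends         --dbname="$DB" --warning='80%' --critical='95%'
        # Shared buffer cache hit ratio.
        $CHECK --action=hitratio         --dbname="$DB" --warning='90%' --critical='80%'
        # Worst execution time among currently running queries.
        $CHECK --action=query_time       --dbname="$DB" --warning='2 minutes' --critical='5 minutes'
        # Table and index bloat.
        $CHECK --action=bloat            --dbname="$DB" --warning='100 M' --critical='1 G'
        # Time since the last (auto)vacuum and (auto)analyze.
        $CHECK --action=last_autovacuum  --dbname="$DB" --warning='3d' --critical='7d'
        $CHECK --action=last_autoanalyze --dbname="$DB" --warning='3d' --critical='7d'
    done
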
Actions #2

Updated by Pavel Kácha over 4 years ago

  • Assignee changed from Jan Mach to Radko Krkoš

Could you please look into which checks make sense, which are duplicates, whether they have sane limits, and whether they show real problems (autovacuum etc.)?
Is mentat_main sanely configured?
Is mentat_events (specifically events_json) sanely configured after the DB split?

Actions #3

Updated by Radko Krkoš over 4 years ago

The mentat_main database was checked. For the sqldb-maintenance.sh script to work, there must be indices set for the tables to be CLUSTERed according to; right now the script would do nothing on mentat_main. This was fixed on mentat-alt. Also, an alembic revision is prepared to implement this. It does not perform the CLUSTERing itself, it just sets the appropriate indices (see the attachment for the patch). As the tables in mentat_main are much smaller (than those in mentat_events) and INSERT-only, there is not much more to do there; just make sure to perform an occasional CLUSTER during maintenance.
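
For illustration, the change boils down to assigning a clustering index to each table so that sqldb-maintenance.sh has something to CLUSTER by. A minimal sketch follows; the table and index names are hypothetical placeholders, the real ones are in the attached patch.

    #!/bin/sh
    # Sketch only: assign clustering indices in mentat_main. The table and
    # index names below are hypothetical; see the attached patch for the
    # real ones.
    psql -d mentat_main -c 'ALTER TABLE statistics_events CLUSTER ON ix_statistics_events_dt_from;'
    psql -d mentat_main -c 'ALTER TABLE reports_events    CLUSTER ON ix_reports_events_createtime;'

    # The CLUSTER itself is then performed occasionally during maintenance:
    psql -d mentat_main -c 'CLUSTER VERBOSE statistics_events;'
    psql -d mentat_main -c 'CLUSTER VERBOSE reports_events;'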

In mentat_events, the autovacuum configuration was set for the events_json table. This was documented and the patches were sent by e-mail.
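
A minimal sketch of what such per-table autovacuum tuning looks like; the scale factors below are illustrative assumptions, not the values from the patch sent by e-mail.

    #!/bin/sh
    # Sketch only: make autovacuum/autoanalyze trigger much earlier than the
    # global defaults on the large, insert-heavy events_json table. The
    # concrete values are assumptions for illustration.
    psql -d mentat_events -c 'ALTER TABLE events_json SET (
        autovacuum_vacuum_scale_factor  = 0.01,
        autovacuum_analyze_scale_factor = 0.005
    );'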

Actions #4

Updated by Radko Krkoš over 4 years ago

Pavel Kácha wrote:

Could you please look into which checks make sense, which are duplicates, whether they have sane limits, and whether they show real problems (autovacuum etc.)?

The checks have been reviewed, which resulted in some adjustments of the thresholds. The checks are deployed by an Ansible role that is maintained on GitHub, so a pull request [1] has been sent. The checks for autovacuum are useless for mentat_main, as changes there are done manually. For the new threshold values, see the pull request.

The check check_postgres_query_time is useless now, as it has not been updated for current PostgreSQL versions. The queries used to gather the information have not worked since v9.1. As upstream does not seem to be active, we could fix up a quick and dirty solution, since we probably do not care about multi-version support.

[1] https://github.com/honzamach/ansible-role-postgresql/pull/1

Actions #5

Updated by Radko Krkoš over 4 years ago

Radko Krkoš wrote:

The check check_postgres_query_time is useless now, as it has not been updated for current PostgreSQL versions. The queries used to gather the information have not worked since v9.1. As upstream does not seem to be active, we could fix up a quick and dirty solution, since we probably do not care about multi-version support.

Ignore this, it is a mistake on my part (the query generation is multi-stage and a correct query for the current version is generated in the end).
On the other hand, check_postgres_query_time only reports the execution time of queries running at the moment of the check, not historical ones, so the check has to hit a long-running query while it is still executing for it to be reported.
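
In other words, a worst-query-time check of this kind boils down to looking at pg_stat_activity at the moment of the check, so only statements still executing at that instant can be seen. A simplified approximation follows; this is not the exact query used by check_postgres.

    #!/bin/sh
    # Sketch only: report the age (in seconds) of the longest currently
    # running client query, which is roughly all a query-time check can see.
    psql -d mentat_events -At -c "
        SELECT COALESCE(max(EXTRACT(EPOCH FROM now() - query_start)), 0)
          FROM pg_stat_activity
         WHERE state = 'active'
           AND backend_type = 'client backend'
           AND pid <> pg_backend_pid();"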

Actions #6

Updated by Pavel Kácha over 4 years ago

  • To be discussed changed from Yes to No
Actions #7

Updated by Jan Mach over 4 years ago

  • Assignee changed from Radko Krkoš to Jan Mach
  • Target version changed from Backlog to 2.6
  • % Done changed from 20 to 80

As Radko noted, the check_postgres_query_time check is pretty much useless for us. Instead, we will implement a simple Nagios plugin that will parse the PostgreSQL log file and look for long-running queries there (see the sketch below).
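
A rough sketch of the intended approach, assuming log_min_duration_statement is enabled so that slow statements end up in the log with a "duration: ... ms" prefix. The log path and the thresholds are illustrative assumptions, and this is not necessarily how the final plugin was implemented.

    #!/bin/sh
    # Sketch only: NRPE-style check that scans the PostgreSQL log for
    # "duration: ... ms" entries and reports the slowest one found.
    LOGFILE="${1:-/var/log/postgresql/postgresql-11-main.log}"  # assumed path
    WARN_MS="${2:-60000}"    # warn above 1 minute
    CRIT_MS="${3:-300000}"   # critical above 5 minutes

    # Log lines look like: "... LOG:  duration: 123456.789 ms  statement: ..."
    SLOWEST=$(sed -n 's/.*duration: \([0-9]*\)\.[0-9]* ms.*/\1/p' "$LOGFILE" | sort -n | tail -1)
    SLOWEST=${SLOWEST:-0}

    if [ "$SLOWEST" -ge "$CRIT_MS" ]; then
        echo "CRITICAL: slowest logged query took ${SLOWEST} ms"; exit 2
    elif [ "$SLOWEST" -ge "$WARN_MS" ]; then
        echo "WARNING: slowest logged query took ${SLOWEST} ms"; exit 1
    fi
    echo "OK: slowest logged query took ${SLOWEST} ms"; exit 0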

Actions #8

Updated by Jan Mach over 4 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 80 to 100

I have created a simple shell script/NRPE plugin for PostgreSQL log file monitoring. It is already deployed on all our installations and has been incorporated into our Ansible PostgreSQL role.

I have incorporated the patch attached by Radko Krkoš containing the ALTER TABLE commands for optimization of tables in the mentat_main database, but I have moved the commands from the migration script to the documentation. It is not possible to define such ALTER TABLE commands in SQLAlchemy ORM models, and it would be necessary to run migrations even for fresh installations, which is not supported by our workflow; there would be problems with stamping the database schema version.

I think this issue can be closed now.

Actions #9

Updated by Radko Krkoš about 4 years ago

  • To be discussed changed from No to Yes
Actions #10

Updated by Jan Mach about 4 years ago

  • To be discussed changed from Yes to No