Bug #6252: Timeline graphing is causing mayhem on production - Mentat - Homeproj: Redmine for CESNET

Added by Radko Krkoš almost 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee: Jan Mach
Category: Research and analysis
Target version: 2.7
Start date: 03/05/2020
Due date:
% Done:
Estimated time:
To be discussed:

Description

The current implementation of timeline graphing, which runs a broad SELECT against the database and post-processes the results in Python inside Apache, is causing serious problems. It leads to the Apache process being OOM-killed and, as a side effect, to the disk cache being flushed, which degrades the performance and user experience of the whole system.
The Python processing required is currently extensive and does not scale to non-trivial time intervals. The kernel log shows numerous cases of the Apache process allocating all available memory (250 GB), only to be OOM-killed after 30+ minutes of work. Recovery from this takes an extremely long time, because effectively the whole disk cache is evicted and we rely on it heavily for performance.

We need to decrease the amount of work done in Python. There are several ways to reach that target, for example:
1) Identify outputs that are not useful and stop calculating them.
2) Split the one large calculation of everything into parts, as the user is very rarely interested in all possible outputs.
3) Move the calculation into the DB, which will save a lot of duplicated iteration over the data. The DB is designed to answer analytical queries, and the most efficient way to use it is to query for exactly the results required, not for source data to be processed afterwards.

Item 1 can be done at any time; items 2 and 3 are best done together, after item 1 is finished.
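
To illustrate point 3, here is a minimal sketch of the difference between fetching source rows and bucketing them in Python, versus asking the database for the finished aggregate. It is not the actual Mentat code: the `events` table, the `detecttime` and `category` columns, the sample rows, and the one-hour bucket size are all assumptions made for the example, and SQLite (from the Python standard library) only stands in for the production database so that the sketch is runnable on its own.

```python
import sqlite3
from collections import Counter

BUCKET_SECONDS = 3600  # hypothetical timeline resolution: one hour per bucket


def make_demo_db():
    """Build an in-memory table standing in for the event store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (detecttime INTEGER, category TEXT)")
    rows = [  # illustrative sample data only
        (0, "Recon.Scanning"),
        (10, "Recon.Scanning"),
        (3599, "Attempt.Login"),
        (3600, "Recon.Scanning"),
        (7300, "Attempt.Login"),
        (7400, "Attempt.Login"),
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def timeline_in_python(conn, since, until):
    """Broad SELECT, then bucket every row in Python (the costly pattern)."""
    counts = Counter()
    cursor = conn.execute(
        "SELECT detecttime, category FROM events "
        "WHERE detecttime >= ? AND detecttime < ?",
        (since, until),
    )
    for detecttime, category in cursor:
        bucket = (detecttime // BUCKET_SECONDS) * BUCKET_SECONDS
        counts[(bucket, category)] += 1
    return dict(counts)


def timeline_in_db(conn, since, until):
    """Let the database aggregate; one row per (bucket, category) comes back."""
    cursor = conn.execute(
        "SELECT (detecttime / ?) * ? AS bucket, category, COUNT(*) "
        "FROM events WHERE detecttime >= ? AND detecttime < ? "
        "GROUP BY bucket, category",
        (BUCKET_SECONDS, BUCKET_SECONDS, since, until),
    )
    return {(bucket, category): n for bucket, category, n in cursor}


if __name__ == "__main__":
    conn = make_demo_db()
    print(timeline_in_db(conn, 0, 3 * BUCKET_SECONDS))
```

Both functions produce the same timeline, but the second transfers a number of rows proportional to the number of buckets and categories rather than to the number of events, so the memory needed on the Python side no longer grows with the length of the queried interval. Requesting only one aggregation per query in this way also fits point 2, since each output can then be calculated separately and only when asked for.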

Updated by Pavel Kácha almost 5 years ago

At the meeting, Mek mentioned that the patches introducing default search limits are not yet deployed on Mentat-hub. If that is true, we should check again once they are in place, to see whether we still need a more immediate solution.

(But all of Radko's points of course still hold.)
