Timeline graphing is causing mayhem in production
The current implementation of timeline graphing runs a broad SELECT against the database and post-processes the result in Python inside Apache. This causes serious problems: the Apache process gets OOM-killed and, in effect, the disk cache is flushed, which degrades performance and user experience across the whole system.
The processing currently required in Python is extensive and does not scale to non-trivial time intervals. The kernel log shows numerous cases of the Apache process allocating all available memory (250GB), only to be OOM-killed after 30+ minutes of work. Recovery takes an extremely long time, because effectively the whole disk cache is evicted and we rely on it heavily for performance.
We need to decrease the amount of work done in Python. There are several ways to reach that target, for example:
1) Identify non-useful outputs and stop calculating them.
2) Split the one large calculation of everything into parts, as the user is very rarely interested in all possible outputs at once.
3) Move the calculation into the DB, which will save a lot of duplicated iteration over the data. The DB is designed to answer analytical queries, and the most efficient way to use it is to query for exactly the results required, not for source data to be processed afterwards.
1 can be done at any time; 2 and 3 are best done together, after 1 is finished.
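
To illustrate options 2 and 3, here is a minimal sketch contrasting the two patterns for a single output. It is not the actual timeline code: the `results` table, its `finished_at` and `status` columns, and both function names are hypothetical, and an in-memory SQLite database stands in for the production DB purely so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema and data: the real tables and columns are not
# described in this report. SQLite is only a stand-in for the real DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (finished_at TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [
        ("2024-01-01 10:00:00", "pass"),
        ("2024-01-01 11:00:00", "fail"),
        ("2024-01-02 09:00:00", "pass"),
        ("2024-01-02 09:30:00", "pass"),
    ],
)


def daily_counts_in_python(conn):
    """Current pattern: pull every row into Python, then aggregate there."""
    counts = {}
    for finished_at, status in conn.execute(
        "SELECT finished_at, status FROM results"
    ):
        key = (finished_at[:10], status)  # first 10 chars are YYYY-MM-DD
        counts[key] = counts.get(key, 0) + 1
    return sorted((day, status, n) for (day, status), n in counts.items())


def daily_counts_in_db(conn, start, end):
    """Proposed pattern: ask the DB for exactly the aggregated result,
    restricted to the requested interval and to this one output."""
    query = """
        SELECT date(finished_at) AS day, status, COUNT(*) AS n
        FROM results
        WHERE finished_at >= ? AND finished_at < ?
        GROUP BY day, status
        ORDER BY day, status
    """
    return conn.execute(query, (start, end)).fetchall()
```

Both functions produce the same rows for the sample data, but in the second one only one row per day and status crosses into Python, so the memory used by the Apache process no longer grows with the number of source rows in the interval.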