Grafana 8 - slow performance loading dashboards - postgresql

I have been having performance issues since updating Grafana from version 7.5.7 to version 8.2.3. I have several large dashboards, each with roughly 200 panels or more and about 10 variables. These dashboards now take several minutes to load, which makes for a poor user experience; on the previous version they loaded in a few seconds.
I tried to investigate the dashboard’s webpage performance when opening one of them, and noticed that for the same dashboard:
In version 7:
the activity Run Microtasks took 796ms
total blocking time = 2726ms
In version 8:
the activity Run Microtasks took 8562ms
total blocking time = 9620ms
I couldn’t find in Grafana’s documentation what Grafana does differently in version 8 that would impact performance in the way described here, or whether the issue is something that I can avoid.
Any help will be highly appreciated!!

Related

Locust response time high latencies at the start

I'm doing some load testing on a microservice application. I collected the percentile statistics and plotted them. The application is running in a shared K8s cluster. What I don't quite understand is why there is a latency spike at the start. Is this an issue with a cold boot?
[Figure: Locust plot showing response time (RT) over time]
"Is this an issue with a cold boot?"
Yes, this is the most likely explanation. There's no way of knowing without digging into your application and its logs though.
Most applications, especially ones that do automatic scaling, perform very poorly when suddenly hit with a large amount of load. If your actual expected user load does not have this behaviour, then maybe a slower ramp-up is more appropriate.
If you haven't already read this, then maybe have a look at https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps
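To make the "slower ramp-up" idea concrete, here is a minimal sketch of a stepped schedule in plain Python. The function name users_at and all the numbers are made up for illustration. In Locust itself a schedule like this would normally be returned from the tick() method of a custom load shape class, but that wiring is left out here so the sketch runs without Locust installed.

```python
def users_at(elapsed_s, target_users, ramp_s, step_s):
    """Return how many simulated users should be active after elapsed_s seconds.

    Users are added in equal steps every step_s seconds, so that
    target_users is reached during the last step before ramp_s.
    """
    if elapsed_s >= ramp_s:
        return target_users
    steps_total = max(1, ramp_s // step_s)
    # +1 so that the very first step already runs with some users
    steps_done = int(elapsed_s // step_s) + 1
    return min(target_users, (target_users * steps_done) // steps_total)


if __name__ == "__main__":
    # Hypothetical test: ramp to 100 users over 5 minutes in 30 second steps
    for t in range(0, 331, 30):
        print(t, users_at(t, target_users=100, ramp_s=300, step_s=30))
```

If the spike at the start disappears with a gradual schedule like this, that points towards warm-up or scaling behaviour rather than a steady-state problem.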

How long do you fine tune false positives with mod_security and OWASP rules?

I just started using the OWASP rules and got tons of false positives. For example, someone wrote in the description field:
"we are going to select some users tomorrow for our job platform."
This is detected as an SQL injection attack (id 950007). It is not; it is a valid comment. I have tons of false positives of this kind.
First I set SecRuleEngine DetectionOnly to gather information.
Then I started using "SecRuleUpdateTargetById 950007 !ARGS:desc" or "SecRuleRemoveById 950007", and I have already spent a day on this. modsec_audit.log is already larger than 100MB.
I am interested in your experience: roughly how long did you spend fine-tuning it? After you turn it on, do you still get false positives, and how do you manage to add whitelist entries in time (do you analyze the logs daily)?
I need this information to give my boss an estimate for this task. It seems that it will take a long time.
Totally depends on your site, your technology and your test infrastructure. The OWASP CRS is very noisy by default and does require a LOT of tweaking. Incidentally, there is some work going on to address this, and the next version might have a normal and a paranoid mode, to hopefully reduce false positives.
To give an example, I look after a reasonably sized site with a mixture of static pages and a number of apps written in a wide variety of technologies (legacy code - urgh!) and a fair number of visitors.
Luckily I had a nightly regression run in our preproduction environment with good coverage, so that was my first port of call. After some initial testing I released ModSecurity there in DetectionOnly mode and tweaked it over maybe a month, until I'd addressed all of the issues and was comfortable moving to prod. This wasn't a full month of continuous work of course, but 30-60 mins on most days to check the previous night's run, tweak the rules appropriately and set it up for the next night's run (damn cookies with their random strings!).
Next up I did the same in production, and pretty much immediately ran into issues with free text feedback fields like you have (of course I didn't see most of these in regression runs). That took a lot of tweaking (I had to turn off a lot of SQL Injection rules for those fields). I also got a lot of insight into how many bots and scripts run against our site! Most were harmless or WordPress exploit attempts (luckily I don't run WordPress), so no real risk to my site, but still an eye opener. I monitored the logs hourly initially (paranoid!), then daily, and then weekly.
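For the daily log check, a small script that tallies which rule ids fire most often can save a lot of scrolling. The following is only a sketch: it assumes the log lines carry the rule id as an [id "..."] tag, which is how ModSecurity messages normally show it, and the sample lines are made-up placeholders, not real log output.

```python
import re
from collections import Counter

# Matches the rule id tag, e.g. [id "950007"]
RULE_ID = re.compile(r'\[id "(\d+)"\]')


def count_rule_ids(lines):
    """Count how often each rule id appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(RULE_ID.findall(line))
    return counts


# Made-up placeholder lines, only to show the shape of the input
sample = [
    'Message: ... [id "950007"] ...',
    'Message: ... [id "960015"] ...',
    'Message: ... [id "950007"] ...',
]

if __name__ == "__main__":
    for rule_id, hits in count_rule_ids(sample).most_common():
        print(rule_id, hits)
```

For a real run you could pass an open file instead of the sample list, for example open("modsec_audit.log", errors="replace"), and start tuning with the rules at the top of the list.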
I would say from memory that it took another 3 months or so until I was comfortable turning it on fully, and I checked it a lot over the next few days. Luckily all the hard work paid off and there were very few false positives.
Since then it's been fairly stable with very few false alerts, mostly due to bad data (e.g. email##example.com entered as an email address for a field which didn't validate email addresses properly), and I often left those in place and fixed the field validation instead.
Some of the common issues and rules I had to tweak are given here: Modsecurity: Excessive false positives (note you may not need or want to turn off all these rules in your site).
We have Splunk installed on our web servers (basically a tool which sucks up log files, which can then be searched or used for automatic alerts and reports on issues). So I set up a few alerts for when the more troublesome free text fields caused a ModSecurity block (and have corrected one or two more false positives there), alerts on volume (so we get an alert when a threshold is passed and can see we are under a sustained attack, which happens a few times a year), and weekly/monthly reporting.
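If you don't have a log tool like Splunk, the volume alert is simple enough to sketch by hand. This is not how Splunk does it; it is just a sliding-window counter in Python, and the class name VolumeAlert, the threshold and the window length are all hypothetical.

```python
from collections import deque


class VolumeAlert:
    """Signal when more than `threshold` events fall inside a sliding window."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()

    def record(self, timestamp_s):
        """Record one block event; return True if the alert should fire."""
        self._times.append(timestamp_s)
        # Drop events that have fallen out of the window
        while self._times and self._times[0] <= timestamp_s - self.window_s:
            self._times.popleft()
        return len(self._times) > self.threshold


if __name__ == "__main__":
    alert = VolumeAlert(threshold=3, window_s=60)
    for t in (0, 10, 20, 30, 100):
        print(t, alert.record(t))
```

The threshold has to come from your own traffic: set it above what normal background bot noise produces, otherwise the alert fires all day.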
So a good 4-5 months to implement from scratch end to end, with maybe 30-40 man days of work over that time. But it was a very complicated site and I had no prior ModSecurity/WAF experience. On the plus side I learned a lot about web technologies and ModSecurity, and got regexpr-blindness from staring at some of the rules! :-)

How to reduce build times with Algolia

I set up Algolia a few weeks ago and have been liking it a lot. But today I started to notice that it would take a while for updates in my Rails application to be shown in the Algolia index.
Some investigation shows that for whatever reason, the build time shot up yesterday from around 20s to 750s. There doesn't seem to be anything else particularly strange about that time, except for the fact that there were a bunch of 'Get Settings' search operations for some reason. Any ideas on why this happened or how to fix it?
[Figure: build time graph]
[Figure: index operation graph]
All FREE, Starter, Growth or Pro accounts run on clusters shared with other users. Only Enterprise accounts have their own cluster(s).
In your case, it sounds like the cluster you are on was busier than normal at the time you indexed your data, resulting in the increase in the average build time. As with many other hosted services, this is totally normal and expected. Nothing to worry about.

Google Cloud SQL very slow from time to time

It's been almost 3 months since I switched my platform to Google Cloud (Compute Engine + Cloud SQL + Cloud Storage).
I am very happy with it, but from time to time I notice high latency on the Cloud SQL server. My Compute Engine VMs and my Cloud SQL instance are all in the same datacenter location (us-1).
Since my Java backend makes a lot of SQL queries to generate a server response, the response times may vary from 250-300ms (normal) up to 2s!
In the console, I notice absolutely nothing: no CPU peaks, no read/write peaks, no backup running, nothing. No alert. Last time it happened, it lasted for a few days and then the response times went suddenly better than ever.
I am pretty sure Google works on the infrastructure behind the scenes, but I have no way to confirm that.
So here are my questions:
Has anybody else ever noticed the same kind of problem?
It is really annoying for me because my web pages get very slow and I have absolutely no control over it. Plus I lose a lot of time, because I generally don't first suspect a hardware problem or maintenance but rather something that we introduced in our app. Is it normal or do I have a problem on my SQL instance?
Is there anywhere I can get visibility into what Google is doing on the hardware? I know there are maintenance alerts, but for my zone the list always seems to be empty when it happens.
The only option I have for now is to wait and that is really not acceptable.
I suspect that Google does some sort of IO throttling and their algorithm is not very sophisticated. We have a build server which slows down to a crawl if we do more than two builds within an hour. The build that normally takes 15 minutes will run for more than an hour and we usually terminate it and re-run manually later. This question describes a similar problem and the recommended solution is to use larger volumes as they come with more IO allowance.
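Whatever the cause turns out to be, it helps to record query latency yourself, so that a slowdown can be pinned to a time window and compared with your own deployments. Here is a minimal sketch of the idea in Python (the backend in the question is Java, so treat this as an illustration only). The names timed and latencies_ms are made up, and the percentile uses the simple nearest-rank method.

```python
import math
import time
from contextlib import contextmanager

latencies_ms = []  # one entry per timed block, in milliseconds


@contextmanager
def timed():
    """Time the enclosed block and append the duration to latencies_ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies_ms.append((time.perf_counter() - start) * 1000.0)


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    with timed():
        time.sleep(0.01)  # stand-in for running a SQL query
    print("p50:", percentile(latencies_ms, 50), "ms")
```

If the median stays flat while the 95th percentile jumps on days when nothing was deployed, that is at least some evidence that the problem is outside your application.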

How should I benchmark a system to determine the overall best architecture choice?

This is a bit of an open ended question, but I'm looking for an open ended answer. I'm looking for a resource that can help explain how to benchmark different systems, but more importantly how to analyze the data and make intelligent choices based on the results.
In my specific case, I have a 4 server setup that includes Mongo and serves as the backend for an iOS game. All servers are running Ubuntu 11.10. I've read numerous articles that make suggestions like "if CPU utilization is high, make this change." As a newcomer to backend architecture, I have no concept of what "high CPU utilization" is.
I am using Mongo's monitoring service (MMS), and I am gathering some information about it, but I don't know how to make choices or identify bottlenecks. Other servers serve requests from the game client to mongo and back, but I'm not quite sure how I should be benchmarking or logging important information from them. I'm also using Amazon's EC2 to host all of my instances, which also provides some information.
So, some questions:
What statistics are important to log on a backend setup? (CPU, RAM, etc)
What is a good way to monitor those statistics?
How do I analyze the statistics? (RAM usage is high/read requests are low, etc)
What tips should I know before trying to create a stress-test or benchmarking script for my architecture?
Again, if there is a resource that answers many of these questions, I don't need an explanation here, I was just unable to find one on my own.
If more details regarding my setup are helpful, I can provide those as well.
Thanks!
I like to think of performance testing as a mini-project that is undertaken because there is a real-world need. Start with the problem to be solved: is the concern that users will have a poor gaming experience if the response time is too slow? Or is the concern that too much money will be spent on unnecessary server hardware?
In short, what is driving the need for the performance testing? This exercise is sometimes called "establishing the problem to be solved." It is about the goal to be achieved: if there is no goal, why go through all the work of testing the performance? Establishing the problem to be solved will eventually drive what to measure and how to measure it.
After the problem is established, the next step is to write down what questions have to be answered to know when the goal is met. For example, if the goal is to ensure the response times are low enough to provide a good gaming experience, some questions that come to mind are:
What is the maximum response time before the gaming experience becomes unacceptably bad?
What is the maximum response time that is indistinguishable from zero? That is, if 200 ms response time feels the same to a user as a 1 ms response time, then the lower bound for response time is 200 ms.
What client hardware must be considered? For example, if the game only runs on iOS 5 devices, then testing an original iPhone is not necessary because the original iPhone cannot run iOS 5.
These are just a few questions I came up with as examples. A full, thoughtful list might look a lot different.
After writing down the questions, the next step is to decide what metrics will provide answers to the questions. You have probably come across a lot of metrics already: response time, transactions per second, RAM usage, CPU utilization, and so on.
After choosing some appropriate metrics, write some test scenarios. These are the plain English descriptions of the tests. For example, a test scenario might involve simulating a certain number of games simultaneously with specific devices or specific versions of iOS for a particular combination of game settings on a particular level of the game.
Once the scenarios are written, consider writing the test scripts for whatever tool is simulating the server work loads. Then run the scripts to establish a baseline for the selected metrics.
After a baseline is established, change parameters and chart the results. For example, if one of the selected metrics is CPU utilization versus the number of TCP packets entering the server per second, make a graph to find out how utilization changes as packets/second goes from 0 to 10,000.
In general, observe what happens to performance as the independent variables of the experiment are adjusted. Use this hard data to answer the questions created earlier in the process.
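As a small illustration of turning such a chart into an answer, the sketch below takes (packets per second, CPU %) measurements and estimates the load at which CPU reaches a chosen limit, using linear interpolation between neighbouring points. The function name load_at_limit, the 80% limit and the sample numbers are all made up; real measurements would come from your own test runs.

```python
def load_at_limit(samples, cpu_limit):
    """Estimate the load at which CPU first reaches cpu_limit.

    samples is a list of (load, cpu_percent) pairs. Returns None if the
    limit is never reached within the measured range.
    """
    ordered = sorted(samples)
    if ordered and ordered[0][1] >= cpu_limit:
        return ordered[0][0]
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        if y0 < cpu_limit <= y1:
            # Linear interpolation between the two neighbouring points
            return x0 + (cpu_limit - y0) * (x1 - x0) / (y1 - y0)
    return None


# Made-up measurements: (TCP packets per second, CPU utilization in %)
measurements = [(0, 2.0), (2500, 20.0), (5000, 45.0), (7500, 70.0), (10000, 95.0)]

if __name__ == "__main__":
    print(load_at_limit(measurements, cpu_limit=80.0))
```

What counts as "too high" is exactly the kind of number that should come out of the questions written down earlier, not from a generic rule of thumb.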
I did a Google search on "software performance testing methodology" and found a couple of good links:
Check out this white paper Performance Testing Methodology by Johann du Plessis
Have a look at the Methodology section of this Wikipedia article.