Today our application got launched, meaning it started receiving more traffic than usual. But the increase isn't huge. Double of what it was before at most.
But since a few hours, our Sentry logs are full or errors with code DEADLINE_EXCEEDED. When I look at the trace, all of them refer to read operations, most of them on single documents (no queries, just singe doc reads), for example: const res = await fs.collection('coll').doc('doc').get();
When I google for this error message, I get a lot of results about issues with writing, especially in batches, but barely anything is written to our database, it's almost exclusively reads.
To give an indication of the amount of reads our database has to handle, we've had 1.2M reads in the past 30 days, with a peak of 60k per day, a number which we haven't exceeded yet today (41k).
What could be the issue in our application?
As usual, I find the answer right after posting the question to StackOverflow. What we saw here was a symptom of our VM running out of memory! After scaling up the server, the problem disappeared.
Related
For firestore I find under the 'soft limits' section in the docs following info
Maximum sustained write rate to a document - 1 per second
Sustaining a write rate above once per second increases latency and causes contention errors. This is not a hard limit, and you can surpass the limit in short bursts.
I have a rather big file (~800KB) in firestore at the moment which I'm quite frequently writing what gives me a warning (not as often as one time per second, but I think that might be due to the size...) and I'm wondering if it was better to switch to storage. I can't find any infos for storage though. Is it more 'robust' and no such restrictions to care about?
I have a simple HTTP server that I was testing. This server interacts with other HTTP servers and Cassandra DB.
Currently I was using 100 users with 1 request/s, so totally 100 tps was on the server. What I noticed with the Docker stats was that the CPU usage became higher and higher and ~ 2-3 hours later the CPU usage reaches the 90% mark, and even more. After that I got a notice from Locust, stating that the measurement may be inconsistent. But the latencies were not increased, so I do not know why this has been happening.
Can you please suggest possible cause(s) of the problem? I think 100 tps should be handled by one vCPU.
Thanks,
AM
There's no way for us to know exactly what's wrong without at very least seeing some code, and even then other factors like the environment or data or server you're running it on or against could have additional factors we wouldn't know about.
It's possible you have a problem with your code for your Locust users, such as a memory leak or they're just doing too much for a single worker to handle that many users. For users only doing simple HTTP calls, a single CPU typically can handle upwards of thousands of requests per second. Do anything more than that and you'll start to expect to reduce what a worker can handle. It's also possible you may just need a more powerful CPU (or more RAM or bandwidth) to do what you want it to do at the scale you want.
Do some profiling to see if you can find any inefficiencies in your code. Run smaller tests to see if the same behavior is evident with smaller loads. Run the same load but with additional Locust workers on other CPUs.
It's also just as possible your DB can't handle the load. The increasing CPU usage could be due to how your code is handling waiting on the connection from the DB. Perhaps the DB could sustain, say, 80 users at an acceptable rate but any additional users makes it fall further and further behind and your Locust users are then waiting longer and longer for the requested data.
For more suggestions, check out the Locust FAQ https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps
I have been noticing a bad behavior with my Postrgre database, but I still can't find any solution or improvement to apply.
The context is simple, let's say I have two tables: CUSTOMERS and ITEMS. During certain days, the number of concurrent customers increase and so the request of items, they can consult, add or remove the amount from them. However in the APM I can see how any new request goes slower than the previous, pointing at the query response from those tables as the highest time consumer.
If the normal execution of the query is about 200 milliseconds, few moments later it can be about 20 seconds.
I understand the lock process in PostgreSQL as many users can be checking over the same item, they could be even affecting the values of it, but the response from the database it's too slow.
So I would like to know if there are ways to improve the performance in the database.
The first time I used PGTune to get the initial settings and it worked well. I have version 11 with 20Gb for RAM, 4 vCPU and SAN storage, the simultaneous customers (no sessions) can reach over 500.
Question title says it all. We have TRIGGERS set up on a database table that are being consumed by an offsite worker. However, there are times that it appears the worker falls behind in updating records. Is there a SQL query or call to determine the existing length of the message queue? (without popping any of the items). I see numerous mentions of this queue in the postgres documentation and other stackoverflow questions, but can't find anything regarding actually determining the length. Any help is appreciated!
For those of you familiar with redis, I'm looking for the equivalent of an LLEN command for this postgres message queue.
I don't think there's anything which directly reports the number of queue entries.
The closest thing is probably pg_notification_queue_usage(), which tells you what fraction of the queue storage is currently used (out of 8GB in total in a standard installation, according to the NOTIFY docs).
The memory usage is going to depend a lot on the payload, of course, but if you can figure out your average notification size, you should be able to translate this to an approximate queue length.
I am using Scala, Reactive Mongo 0.10.5 and Mongo 2.6.4 running on Ubuntu. I have tested on a few machine configurations but right now I am working with 15gb of memory, 2 cores and 60gb of SSD storage (AWS)
I have just set up a test mongo instance and have been using it to benchmark a few things, however I am seeing some inconsistency that I can't explain.
I am writing a consistent amount of data using 10 separate threads to a single collection. Each write consists of a document containing an array which contains 1000 elements. Each element is a complex document consisting of several fields and nested fields. I have tested with arrays of 1000, 10000 and 100 and have seen the same behavior with all. Each write is unique (i.e. I never write to the same document twice)
The write speed tends to be around 100-200ms per write with the current hardware I am using. I would like better but that isn't my main issue.
My main issue is that sometimes the write times will spike. When they do, it can take a single write several seconds to complete. They do eventually complete but it takes a while. I have timeouts built into the app doing the writing (10 seconds) and when the spikes happen it will frequently hit that timeout. I have increased the timeout and verified that the write does eventually complete but it can take a long time (30+ seconds).
I have worked with Mongo before using the Mongo Java Driver in Scala and have not noticed this problem. However it is unclear whether the issue is a result of the driver, or my Mongo setup.
I have looked at the logs and while they report when the query is taking longer, they don't actually provide any information about why it is taking longer. I have done the same with profiling and again they report a long query but don't say why it is long.
I have run mongostat while running and it seems that when the writes start taking a long time I notice a similar slow down in mongostat. I.E. mongostat will pause for several seconds before continuing.
The mongo machine itself is bored while this is happening. Load averages are minimal as are CPU and memory usage. It does not appear to be going into swap.
I suspect I just have something configured incorrectly in the Mongo but I haven't been able to find anything that indicates what.
Has anyone seen this behavior before? Is it something in my configuration or perhaps something with the Reactive Mongo driver?
UPDATE:
Using iostat I was able to determine that the normal writes/second is hitting around 1Mb/second. However during the slow periods it spikes to 6-7Mb/second.
I also found the following in the mongo logs.
[DataFileSync] flushing mmaps took 15621ms for 35 files
[DataFileSync] flushing mmaps took 14816ms for 22 files
In at least one case this log statement corresponds exactly with one of the slow downs.
This definitely seems to be a disk flush problem based on these observations.
Does this imply that I am pushing more data than the current Mongo configuration can handle? Or is there some other configuration that can be done to reduce the impact of those flushes?
It appears that in this case the problem may actually have been related to thread locking within the application itself. Once I resolved the issues with thread locking these other issues seemed to go away.
To be honest I don't know why thread locking would result in the observed behavior in Mongo, but if the problem is gone I am not going to complain.