A couple buddies and I have been having a serious issue with MongoDB. Before I get in to this lets just get some stats out of the way.
Running on a Dedicated VPS with Softlayer
-Ubuntu 13.04 inside an ESX VM
-4 Virtual cores running at 2.0 GHz each
-48GB of RAM
-All on an SSD
-MongoDB Version 2.6.5 (and issue happening on the last 6 releases)
Were doing polygon geo queries using a 2dsphere index, as well a few other variables with ints and booleans added. There's about 7 keys total we'll query at a time, and then sort them afterwards depending what the user requests to sort by (Price, closest point to location, ect).
Now here's where things gets stupidly complicated. At random time in the days, specific locations will stop returning queries in a reasonable time, instead of the 300 to 2000 milliseconds average it returns in like 30 to 150 seconds, and this is ONLY for specific locations. Searching for New York City will return, but searching for London England will take ages.
The second we change what key were sorting by (we've even tried sorting by _id) everything goes back to normal. Then later on, that sort key will break for random locations, and by flipping it it repairs. Another method we found to fix is completely reindexing the key were sorting by (deleting and recreating), but then our search goes down for the time it takes to recreate the index.
This method works, sure, but changing the sorting method defeats the purpose, and reindexing is just more and more downtime for something that shouldn't be happening.
This issue could be a bunch of things, so let me just go through quickly the ones we've diagnosed and found not to be the issue.
-Does not relate to amount of users accessing database (Could be 0, could be 1000).
-Lock is averaged at 2.5% or below.
-Btree is normal between 0 and 200.
-Happens even when not writing to database (and haven't done so in hours)
-CPU usage is minimal
-RAM usage is only 60% average.
-Data in listings are valid.
-All keys within the indexes are the proper types. No string in int indexes, no datetimes in float indexs, ect.
-Has nothing to do with how many results are returned. Issue happens in locations where there is 10000 results, or just 5.
-No errors in logs, no errors in MMS, no errors anywhere... Just slow.
I'm out of ideas here, I simply can't figure out what it may be. Has anyone ever come across something like this? Or have any ideas on what it may be? Anything is appreciated. If I've forgotten to mention something or need to redo how I've explained it please just ask, no issues rewriting. Thanks again.
Related
I'm looking at increasing the number of days historical events that are stored in the Tableau Server database from the default 183 to +365 days and I'm trying to understand what the performance impact to Tableau Server itself would be since the database and backup sizes also start increasing. Would it cause the overall Tableau Server running 2019.1.1 over time to slow to a crawl or begin to have a noticeable impact with respect to performance?
I think the answer here depends on some unknowns and makes it pretty subjective:
How much empty space is on your PostGres node.
How many events typically occur on your server in a 6-12 month period.
Maybe more importantly than a yes or a no (which should be taken with a grain of salt) would be things to consider prior to making the change.
Have you found value in the 183 default days? Is it worth the risk adding 365? It sounds like you might be doing some high level auditing, and a longer period is becoming a requirement. If that's the case, the answer is that you have no choice but to go ahead with the change. See steps below.
Make sure that the change is made in a Non-prod environment first. Ideally one with high traffic. Even though you will not get an exact replica - it would certainly be worth the practice in switching it over. Also you want to make sure that Non-prod and prod environments match exactly.
Make sure the change is very well documented. For example, if you were to change departments without anyone having knowledge of a non-standard config setting, it might make for a difficult situation if ever Support is needed or if there is a question as to what might be causing slow behavior.
Things to consider after a change:
Monitor the size of backups.
Monitor the size of the historical table(s) (See Data-Dictionary for table names if not already known.)
Be ready to roll back the config change if the above starts to inflate.
Overall:
I have not personally seen much troubleshooting value from these tables being over a certain number of days (ie: if there is trouble on the server it is usually investigated immediately and not referenced to 365+ days ago.) Perhaps the value for you lies in determining the amount of usage/expansion on Tableau Server.
I have not seen this table get so large that it brings down a server or slows it down. Especially if the server is sized appropriately.
If you're regularly/heavily working with and examining PostGres data, it may be wise to extract at a low traffic time of the day. This will prevent excess usage from outside sources during peak times. Remember that adhoc querying of PostGres is Technically unsupported. This leads to awkward situations if things go awry.
My postgres was running really slow lately, an aggregation for a month it usually ended up taking more than 1 minute (to be more exact the last one took 7 mins and 23 secs).
Last friday i recreated the servers (master and replica) and reimported the database.
First thing I noticed is that from 133gb now the database is 42gb (the actual data is around 12gb, i guess the rest are the indexes).
Everything was fast as hell for a day, after that the indexing finished (26gb on indexes) and now I'm back to square 1.
A count on ~5 million rows takes 3 mins 42 secs.
Made the autovacuum more aggressive and it looks like it's doing it's job now but the DB is still slow.
I am using the db for an API so it's constantly growing. Atm i have 2 tables one that has around 5 mil rows and the other 28 mil.
So if the master has a lot of activity and let's say that i'm expecting some performance loss, i don't expect it from the replica.
What's curios is that after a restart it's really fast for an hour or so.
Also another thing that i noticed was that on every query I do the IO is 100% while the memory and cpu are almost not used at all.
Any help would be greatly appreciated.
Update
Same database on a smaller machine works like a charm.
Same queries, same indexes.
The only difference is the traffic, not writing or updating that much.
Also i forgot to mention one thing, one of my indexes is clustered.
The live machine is a 5 core with 64gb and 3k IO.
The test machine is a 2 core with 4gb and an SSD.
Update
Found my issue.
Apparently the autovacuum can't get a lock and by the time it gets it the dead tuples increased.
Made the autovacuum more aggresive for now and deleted a bunch of unused indexes.
Still don't know how to fix the lock issue tho.
Update
Looks like something is increasing the estimated row count.
Since my last update here the row count increased by 2 mil.
I guess that by tomorrow the row count will be again around 12 mil and the count will be slow as hell again.
Could this be related to autovacuum?
Update
Well found my issue.
Looks like postgres is losing a lot of speed on a write intensive database.
Had a column that was used as a flag and updated a lot of times per day.
Everything looks really good after the flag and update was removed.
Any clue on how to fix this issue on a write intensive table?
May be the following pointers help:
Are you really sure you want to do a 5mil row Aggregation for an API? Everytime ? Can't you split the data into chunks such that only a small number of chunks actually get most of the new rows (and so the aggregation of all the previous chunks can be reused for the next Query)? Time is one such measure, serial numbers could be another, etc. If so, partitioning the data is an obvious solution you should investigate, it really has a good chance of giving you sub-second query times (assuming you store aggregations for previous chunks smartly).
A hunch about that first hour magic is that although this data fits RAM, concurrent querying pushes that data-set out and then its purely disk I/O... and in that case, CPU / RAM being idle isn't a surprise.
Finally, I think this setup is asking for a re-design where there is only so much you could do with a single SQL, and in that expecting sub-second Query times for data that is not within RAM for a 5mil data-set is probably being too optimistic!
(Nonetheless, do post your findings, if possible)
I'm currently setting up a mongodb and so far I have a DB with 25million objects, and some index. When I try use find() with 4 arguments and I know that one of the arguments are "wrong" or that the result of the query will be nothing found, then what happens is that mongodb takes all my RAM (8gb) and the time to get the "nothing found" varies a lot.
BUT the biggest problem is that once it's done it never lets go of the RAM and I need to restart mongodb to get back the RAM.
I've tried many other tests, such as adding huge amounts of objects (1million+), use the find() for something that I know exist and other operations and they work fine.
Also want to know that this works fine if I have a smaller DB like 1million objects.
I will keep on doing test on this, but if anyone have anything they know right away that could help that would be great.
EDIT: Tested some more with 1m DB. The first failed found took about 4sec and took some RAM, the proceding fails took only 0.6sec and didn't seem to affect the RAM. Does mongodb store the DB in RAM when trying to find a document and with very huge DB it can't get all in the cache?
Btw. on the 25mil it first took about 60sec to not find it, but the 2nd try took 174sec and after that find it had taken all the 8gb on RAM.
I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write heavy operation, but still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents on the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes up to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RAN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so if only a few users have actually changed ranks, we still need to update the rest of the users ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K->1.2 million documents while supporting ties.
My question is: how can we improve the flush and slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the i/o) help, as data loss isn't something I'm worried about as the database is updated frequently regardless?
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis, Redis have something called a "Sorted List" designed just for this job, with it you can have 100 millions entries and still fetch the 5000000th to 5001000th at sub millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can setup a replicaset and send your read queries there. This should help you a fair bit, but not as much as the disks.
I have the following setup:
Mac Pro with 2 GB of RAM (yes, not that much)
MongoDB 1.1.3 64-bit
8 million entries in a single collection
index for one field (integer) wanted
Calling .ensureIndex(...) takes more than an hour, actually I killed the process after that. My impression is, that it takes far too long. Also, I terminated the process but the index can be seen with .getIndexes() afterwards.
Anybody knows what is going wrong here?
Adding an index on an existing data set is expected to take a while, as the entire BTree needs to be constructed. If you think this is taking an unreasonable amount of time, or you've seen a regression in performance the best bet is to ask about it on the list.
I would just like to point out the command:
db.currentOp()
which prints the current operations running on the server, and also shows the indexing process.
The foreground indexing is done in 3 steps, and the background one in 2 steps (if I remember correctly), but the background one is alot slower. The foreground one on the other hand locks the collection while indexing it (ie not very useful on a running application server).
As said before, google BTree if you are interested in how they work.
Anybody knows what is going wrong here?
Are you running via ssh or connecting remotely in some way? Sound a bit like a broken pipe issue. Did you create the index with {background : true} or not?