MongoDB update is slower with relevant indexes set

I am testing a small example for a sharded setup, and I notice that updating an embedded field is slower when the search fields are indexed.
I know that indexes are updated during inserts, but are the indexes used by the query part of an update also updated?
The fields used in the update's query and the fields that are updated are not related in any way.
e.g. (tested with toy data) :
{
id:... (sharded on the id)
embedded :[{ 'a':..,'b':...,'c':.... (indexed on a,b,c),
data:.... (data is what gets updated)
},
...
]
}
In the example above the query for the update is on a,b,c
and the values for the update affect only the data.
The only reason I can think of is that indexes are updated even if the updates are not on the indexed fields. The search part of the update does seem to use the indexes when I issue the equivalent "find" query with explain.
Could there be another reason?

I think wdberkeley (in the comments) gives the best explanation.
The document moves because it grows larger, and the indexes are updated every time it moves.
As he also notes, updating multiple keys is "bad", so I think I will avoid this design for now.

Related

Index bloat on GIN index for INSERT only table

I have a table with a column that is an array hstore. In JSON notation, the data for this field looks like:
[
{ type: "Pickup", id: "49593034" },
{ type: "User", id: "5903" },
...
]
The number of hstores in the array for a single record might be as high as 10,000, although it may be less. This table only receives inserts. No data is removed or updated once a record has been added.
There are currently about 90K records in this table, and it will generally grow by a few hundred each day. The values in later records will likely have different IDs than the values in earlier records.
I need to be able to do searches on this column. For example, I might want to know all records where this column includes 'type=>User,id=>5903'::hstore. I found that if I created a GIN index on this column and queried with the && operator, it is able to search this data very quickly.
I had a problem where the bloat on this index had gotten very high. PG stopped using the index and switched to a table scan, which made the query very slow. I fixed this by running REINDEX on the index.
I'm finding that the bloat comes back after a while. Although it hasn't switched to table scans again (even with high bloat), I worry that it will, so I have been preemptively reindexing whenever I see the bloat get high again. Whenever I do this I have to take a function offline, as reindexing blocks usage of that index according to the documentation.
I am using Heroku's PG service and they provide a tool to query for bloat. The query it executes can be found here.
After doing the REINDEX the index size is about 1GB. When it bloats it grows to about 11GB according to Heroku's query.
My questions are:
What exactly is causing the bloat? I understand that with tables a DELETE or UPDATE will cause bloat. Does adding data force the index to rebalance a tree, or something similar, leaving dead pages in the index?
Is the data from Heroku's query accurate? I've read that some bloat queries are written for btrees and not GIN indexes. Perhaps it's pointing to a nonexistent problem. The only time PG actually stopped using the index was when I first did a mass insert to populate the table. After that it has bloated but kept using the index. Maybe I no longer have an issue and am just preemptively reindexing for no reason.
Is there something I should change in my schema to make this maintenance free?
I was thinking of refactoring this so that the data in this column is stored in a separate table. This separate table would store the type, the id and the ID of my main table as separate columns. I would create a btree index on the type and id fields. Would this give me a generally maintenance-free lookup? I'm thinking it would also be faster, since the id could be stored as a true number. Currently it is stored as a string, since hstore values are always strings.

How does MongoDB order their docs in one collection? [duplicate]

This question already has answers here:
How does MongoDB sort records when no sort order is specified?
(2 answers)
Closed 7 years ago.
In my User collection, MongoDB usually returns docs in the same order I created them: the last one created is the last one in the collection. But I have found another collection where the last doc I created is in the 6th position among 27 docs.
Why is that?
What order do docs follow in a MongoDB collection?

It's called natural order:
natural order
The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
This confirms that in general you get them in the same order you inserted them, but that's not guaranteed, as you noticed.
Return in Natural Order
The $natural parameter returns items according to their natural order within the database. This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Index Use
Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index.
MMAPv1
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents.
Obviously, like the docs mentioned, you should not rely on this default order (This ordering is an internal implementation feature, and you should not rely on any particular structure within it.).
If you need the documents in a particular order, sort them explicitly.
Basically, the following two calls should return documents in the same order (since the default order is $natural):
db.mycollection.find().sort({ "$natural": 1 })
db.mycollection.find()
If you want to sort by another field (e.g. name) you can do that:
db.mycollection.find().sort({ "name": 1 })

For performance reasons, MongoDB never splits a document on the hard drive.
When you start with an empty collection and insert document after document into it, MongoDB will place them consecutively on the disk.
But what happens when you update a document and it now takes more space and doesn't fit into its old position anymore without overlapping the next? In that case MongoDB will delete it and re-append it as a new one at the end of the collection file.
Your collection file now has a hole of unused space. This is quite a waste, isn't it? That's why the next document which is inserted and small enough to fit into that hole will be inserted in that hole. That's likely what happened in the case of your second collection.
Bottom line: Never rely on documents being returned in insertion order. When you care about the order, always sort your results.
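The move-and-reuse behaviour described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the `ToyDataFile` class, its slot sizes and its first-fit hole reuse are hypothetical and are not MongoDB's actual allocator.

```javascript
// Toy model of a data file as a list of slots. All names here are
// hypothetical; this is not how MongoDB allocates space internally.
class ToyDataFile {
  constructor() {
    // Each slot has a fixed size; doc is null when the slot is a hole.
    this.slots = [];
  }

  insert(id, size) {
    // Reuse the first hole that is large enough, otherwise append.
    const hole = this.slots.find((s) => s.doc === null && s.size >= size);
    if (hole) {
      hole.doc = { id, size };
    } else {
      this.slots.push({ size, doc: { id, size } });
    }
  }

  update(id, newSize) {
    const slot = this.slots.find((s) => s.doc !== null && s.doc.id === id);
    if (!slot) throw new Error(`no document with id ${id}`);
    if (newSize <= slot.size) {
      slot.doc.size = newSize; // still fits, so it stays in place
      return;
    }
    // The document outgrew its slot: leave a hole and re-append at the end.
    slot.doc = null;
    this.slots.push({ size: newSize, doc: { id, size: newSize } });
  }

  naturalOrder() {
    // "Natural order" in this model is simply the on-disk slot order.
    return this.slots.filter((s) => s.doc !== null).map((s) => s.doc.id);
  }
}

const file = new ToyDataFile();
file.insert("A", 10);
file.insert("B", 10);
file.insert("C", 10);
file.update("A", 25); // A no longer fits and moves to the end
file.insert("D", 8); // D is small enough to fill the hole A left behind

const order = file.naturalOrder();
console.log(order);
```

In this model the most recently inserted document ends up first and the oldest one last, which is the kind of reordering the question describes.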

MongoDB does not "order" the documents at all, unless you ask it to.
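The basic insertion will create an ObjectId in the _id primary key value unless you tell it to do otherwise. This ObjectId value is a special value with "monotonic" or "ever increasing" properties: it begins with a timestamp, so values created later generally compare larger than earlier ones, although strict ordering is not guaranteed between ObjectIds generated by different clients within the same second.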
If you want "sorted" then do an explicit "sort":
db.collection.find().sort({ "_id": 1 })
Or use a "natural" sort, which means the order in which documents are stored on disk:
db.collection.find().sort({ "$natural": 1 })
This is pretty much the default unless you specify a sort or the query criteria select an index, in which case the index determines the order. You can also use $natural to "force" that order when the query criteria would otherwise select an index that sorts differently.
MongoDB documents "move" when they grow, so the order in which documents are retrieved is not always the same as the _id order.

I could find out more about it thanks to the link Return in Natural Order provided by Ionică Bizău.
"The $natural parameter returns items according to their natural order within the database. This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents."

Mongo _id Insert Uniqueness Check

I have a medium to large Mongo collection containing image metadata for >100k images. I am generating a UUID for each image generated and using it as the _id field in the imageMeta.insert() call.
I know for a fact that these _id's are unique, or at least as unique as I can expect from boost's UUID implementation, but as the collection grows larger, the time to insert a record has grown as well.
I feel like, to ensure uniqueness of the _id field, Mongo must be double-checking these against the other _ids in the database. How is this implemented, and how should I expect the insert time to grow with respect to the collection size?

The _id field in mongo is required to be unique and indexed. When an insert is performed, all indexes in the collection are updated, so it's expected to see insert time increase with the number of indexes and/or documents. Namely, all collections have at least one index (on the _id field), but you've likely created indexes on fields that you frequently query, and those indexes also get updated on every insert (adding to the latency).
One way to reduce perceived database latency is to specify a write concern to your driver. Note that the default write concern prior to November 2012 was 'unacknowledged', but it has since been changed to 'acknowledged'.

Mongodb store and select order

Basic question. Will the MongoDB find command always return documents in the order they were added to the collection? If not, how can I select docs in the right order?
Sort? But what if docs were added simultaneously and, say, the created date is the same, but there was still an order?

Well, yes and ... not exactly.
Documents are sorted by natural order by default. This is initially the order in which the documents are stored on disk, which is indeed the order in which the documents were added to the collection.
This order, however, is not deterministic, as documents may be moved on disk once they grow after update operations and can no longer fit into their current space. This way the initial (insert) order may change.
The way to guarantee a sort by insert order is to sort by {_id : 1}, as long as the _id is of type ObjectId. This will return your documents sorted in ascending order.
Write operations do not take place simultaneously: write locks are imposed at the database level (V 2.4 and on). The first four bytes of the _id are the insert timestamp, and the last three bytes are a counter (starting from a random value) used to distinguish (and sort) ObjectId instances with the same timestamp.
The _id field is indexed by default.
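As a rough illustration of the ObjectId layout mentioned above, the following sketch reads the leading timestamp and the trailing counter straight from the 24-character hex form of an ObjectId. The helper names and the example id are hypothetical, and the code works on plain strings rather than on a driver's ObjectId type.

```javascript
// Hypothetical helpers that read fields from an ObjectId's hex string.
// Layout assumed here: bytes 0-3 are a timestamp in seconds since the
// Unix epoch, and bytes 9-11 are a counter.
function assertObjectIdHex(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
}

function objectIdTimestampSeconds(hex) {
  assertObjectIdHex(hex);
  return parseInt(hex.slice(0, 8), 16); // first 4 bytes = 8 hex digits
}

function objectIdCounter(hex) {
  assertObjectIdHex(hex);
  return parseInt(hex.slice(18), 16); // last 3 bytes = 6 hex digits
}

const exampleId = "507f1f77bcf86cd799439011"; // illustrative value only
const seconds = objectIdTimestampSeconds(exampleId);
const counter = objectIdCounter(exampleId);

console.log(new Date(seconds * 1000).toISOString());
console.log(counter);
```

Because the timestamp comes first, comparing two ObjectIds compares their creation seconds before anything else, which is why sorting on _id approximates insertion order.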

The fastest way to show Documents with certain property first in MongoDB

I have collections with a huge number of Documents, on which I need to run custom searches with various different queries.
Each Document has a boolean property. Let's call it "isInTop".
I need to show Documents which have this property first in all queries.
Yes, I can easily sort on this field like:
.sort( { isInTop: -1 } );
And create a proper index with the field "isInTop" as the last field in it. But this will be slow, as indexes in Mongo work best with unique fields.
So is there a solution to show Documents with the field "isInTop" on top of each query?
I see two solutions here.
First: give the Documents which need to be on top an _id from the "future". As you know, ObjectId contains a timestamp. So I can create an ObjectId with a timestamp from the future and use natural order.
Second: create a separate collection for Documents which need to be on top, and run queries on it first.
Are there any other solutions for this problem? Which will work faster?
UPDATE
I solved this by sorting on a custom field that represents rank.

Using the _id field trick you mention has the problem that at some point in time you will reach the special time, and you can't change the _id field (without inserting a new document and removing the old one).
Creating a special collection which just holds the ones you care about is probably the best option. It gives you the ability to logically (and to some extent, physically) separate the documents.
MongoDB also recently added support for a "sparse" index, which may fulfill your needs as well. You could set the "isInTop" field only on documents that you want to be special, and then create a sparse index on it, which would not have the problems you would normally have with a single indexed boolean field (in btrees).
