MongoDB inserting duplicate document within upsert

My current situation is that I have several Writer objects dumping data into MongoDB via upserts. There are no unique indexes, so duplicates are allowed and are a possibility, but they shouldn't occur.
While checking the existing data in the DB, I found several documents in which the fields that should match during the upsert phase are duplicated across documents holding different counters:
{"date": "today", "k1": "sample", "count": 5}
{"date": "today", "k1": "sample", "count": 2}
That is a very simple example of my current situation. The count should be 7, and there shouldn't be two separate documents with the same keys I use to perform the upsert. This happens only rarely and affects a small fraction of the data, but I'm wondering what could be causing it.
Is there any situation where this can happen? An R/W lock?

For things such as counters, I would recommend using the $inc operator: https://docs.mongodb.com/manual/reference/operator/update/inc/
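A minimal sketch of such an upsert, assuming a collection named stats and the date/k1 fields from the example above:
db.stats.updateOne(
    { "date": "today", "k1": "sample" },   // the logical key being upserted on
    { "$inc": { "count": 5 } },            // atomically adds to the existing counter
    { "upsert": true }                     // inserts the document if no match exists
)
Note that an upsert's find-then-insert is only race-free when a unique index covers the match fields; without one, two concurrent upserts on the same key can each miss the lookup and both insert, which would produce exactly the duplicates described above.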

Related

MongoDB: How to reorder document Fields that got jumbled in the past

My DB has many documents whose field order is mostly random, as displayed in MongoDB Compass. The first field is always _id, but the rest of the fields can be in any order. This makes scanning records by eye very difficult.
I have read that this reordering due to upserts no longer happens as of MongoDB 4.2, and I have upgraded, but the problem remains.
Is there a way for me to reorder my fields so each document in a collection has the same field order, say _id first, then a-z?
You can use $replaceWith to do this.
https://mongoplayground.net/p/VBzpabZuJpy
db.YOURCOLLECTION.updateMany({}, [
    { $replaceWith: {
        $mergeObjects: [
            {
                "fieldA": "$fieldA",
                "fieldB": "$fieldB",
                "fieldC": "$fieldC",
                "fieldD": "$fieldD",
                "fieldE": "$fieldE"
            },
            "$$ROOT"
        ]
    }}
])
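This works because $mergeObjects orders fields by first appearance and lets later objects win on value: the literal object fixes the order, $$ROOT supplies the documents' actual values, and any fields not listed keep their relative order after the listed ones. Note that pipeline-style updateMany requires MongoDB 4.2+.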
You can also try reading each document in a language that preserves hash key order, reordering the fields as you see fit, then writing each document back.
Since BSON implements documents as ordered lists of key-value pairs, I would expect all standard tools to preserve the key order that currently exists in each individual document.
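A minimal sketch of that approach in the mongo shell, which preserves string-key insertion order in plain JavaScript objects (the collection name is a placeholder):
db.YOURCOLLECTION.find().forEach(function (doc) {
    var ordered = { _id: doc._id };                           // _id first
    Object.keys(doc)
        .filter(function (k) { return k !== "_id"; })
        .sort()                                               // remaining keys a-z
        .forEach(function (k) { ordered[k] = doc[k]; });
    db.YOURCOLLECTION.replaceOne({ _id: doc._id }, ordered);  // write back in the new order
});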

How to perform queries with a large $in data set in MongoDB

I have a simple query like this: {"field": {$nin: ["value1","value2","valueN"]}}.
The problem is the large number of unique values to exclude (using the $nin operator): about 50,000 unique values to filter, at about 1Kb of query length.
Question: is there an elegant and performant way to do such operations?
Example: the collection daily_stat holds 56M docs, and each day adds another 100K. An example document:
{
    "day": "2020-04-15",
    "username": "uniq_name",
    "total": 12345
}
I run the following query:
{
    "day": "2020-04-15",
    "username": {
        $nin: [
            "name1",
            "name2",
            "...",
            "name50000"
        ]
    }
}
MongoDB version: 3.6.12
I would say the big $nin array is the elegant solution. If there is an index on the field then it will also be performant, but only in terms of quickly excluding the docs that are not to be returned in the cursor. If you have, say, 10 million docs in a collection and you do a find() that excludes 50,000, you are still dragging 9,950,000 records out of the DB and across the wire; that is non-trivial.
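If you go this route, a sketch of the supporting index for the example query above (field names taken from the sample document):
db.daily_stat.createIndex({ "day": 1, "username": 1 })   // equality field first, then the $nin field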
If you can find a pattern in the values you pass in, you can try a regex instead. Example:
db.persons.find({'field': {$nin: [/san/i]}}, {_id: 0, "field": 1})
More details on regex:
https://docs.mongodb.com/manual/reference/operator/query/regex/

Index for queries with $ne and $or with arrays

Say I have a MongoDB collection with documents like this:
{ "_id": ObjectId("the_object_id"),
"type": "BLOG_POST",
"state": "IN_PROGRESS",
"createDate":ISODate("2017-02-15T01:01:01.000Z"),
"users": {
"posted": ["user1", "user2", "user3"],
"favorited": ["user1", "user4", "user5", "user6"],
"other_fields": "other data",
},
"many_more_fields": "a bunch of other data"
}
I have a query like this:
db.collection.find({ "$and": [
    { "type": "BLOG_POST" },
    { "$or": [ { "users.posted": "userX" }, { "users.favorited": "userX" } ] },
    { "state": { "$ne": "COMPLETED" } }
]}).sort({ "createDate": 1 })
The collection currently only has indexes on the _id field and some fields not included in this query or example.
As far as cardinality goes: type = BLOG_POST matches approximately 75% of the collection, state $ne "COMPLETED" matches approximately 50% of the collection, and any given user appears in users.posted or users.favorited in at most 2% of documents.
What would the best index or set of indexes be for this use case?
It is my understanding that we cannot index both users.posted and users.favorited in the same index because they are both arrays (a compound index can include at most one array-valued field). In the future we may be able to add a new array, users.userswhocare, holding the union of both fields, but assume we can't make that change in the short term.
I also thought that the $ne on state means an index on state will generally not be used. Is the query planner able to use the state field at the end of an index to handle the $ne condition?
I had the idea of an index {"type":1, "createDate":1, "state":1}, so that the query would hit on the type, use createDate for the sort, and handle the $ne with the last part of the index. It would still have to examine 35%-40% of the documents to test for the users; not good, but an improvement over the current collection scan.
Alternatively, I could create two indexes: {"users.posted":1, "type":1, "createDate":1, "state":1} and {"users.favorited":1, "type":1, "createDate":1, "state":1}.
Will the query planner use the intersection of these two indexes to more quickly find the documents of interest?
We are currently using MongoDB 3.2.5. If there are differences in the answer between MongoDB 3.2 and 3.4, I would love to know them.
After some analysis, I found that creating multiple indexes, with users.posted and users.favorited as the first key in the respective indexes, both performed better and was chosen by the MongoDB query planner.
I created indexes like:
db.collection.createIndex({"users.posted":1, "type":1, "createDate":1, "state":1})
db.collection.createIndex({"users.favorited":1, "type":1, "createDate":1, "state":1})
Due to the cardinality of users.posted and users.favorited being high (either one will encompass no more than 2% of the collection, most of the time less than 0.5%), the MongoDB query planner used both with index intersection.
I tested this against an index like:
db.collection.createIndex({"type":1, "createDate":1, "state":1}).
Reviewing the explain plans for both approaches using both explain() and explain("executionStats"), the query planner used index scans for the {"$or": [ {"users.posted":"userX"}, {"users.favorited":"userX"} ] } part of the query as the first stage. This led to the fewest totalKeysExamined and totalDocsExamined.
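For reference, a sketch of the shell invocation used for that comparison:
db.collection.find({ "$and": [
    { "type": "BLOG_POST" },
    { "$or": [ { "users.posted": "userX" }, { "users.favorited": "userX" } ] },
    { "state": { "$ne": "COMPLETED" } }
]}).sort({ "createDate": 1 }).explain("executionStats")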

How to make a distinct operation faster in MongoDB

There are 30,000,000 records in one collection.
When I run a distinct command on this collection from Java, it takes about 4 minutes, and the result contains about 40,000 values.
Is MongoDB's distinct operation really so inefficient?
And how can I make it more efficient?
Is MongoDB's distinct operation really so inefficient?
At 30M records? I would say 4 minutes is actually quite good; I think that's about as fast as, maybe a little faster than, SQL would do it.
I would probably test this in other databases before calling it inefficient.
However, one way of looking at performance is to check whether the field is indexed first, and whether that index fits in RAM or can be loaded without page thrashing. distinct() can use an index, as long as the field has one.
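A minimal sketch, with a placeholder field name:
db.collection.createIndex({ "field": 1 })   // lets distinct() run as an index scan
db.collection.distinct("field")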
And how can I make it more efficient?
You could use a couple of methods:
Incremental map-reduce that distincts the main collection into a separate unique collection every, say, 5 minutes; or
Pre-aggregating the unique values on save, by writing to two collections, one detail and one unique (see the sketch after this list).
Those are the two most viable methods of getting around this performantly.
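A minimal sketch of the pre-aggregation approach (collection and field names are placeholders):
// On every save: write the detail record, then upsert the value into a small
// collection holding one document per distinct value.
var value = "some-distinct-value";                    // placeholder
db.detail.insertOne({ "field": value, "ts": new Date() });
db.unique_values.updateOne(
    { "_id": value },                                 // the distinct value itself is the _id
    { "$setOnInsert": { "firstSeen": new Date() } },  // illustrative field, written only on insert
    { "upsert": true }
);
Reading the distinct values then becomes a cheap scan of the small unique_values collection.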
Edit
distinct() is not outdated, and if it fits your needs it is actually more performant than $group, since it can use an index.
The .distinct() operation is an old one, as is .group(). In general these have been superseded by .aggregate(), which should generally be used in preference to them:
db.collection.aggregate([
    { "$group": {
        "_id": "$field",
        "count": { "$sum": 1 }
    }}
])
Substitute "$field" with whatever field you wish to get the distinct values of; the $ prefix references that field's value.
Look at the documentation and especially $group for more information.

How to build an index in MongoDB in this situation

I have a MongoDB collection whose documents have the following fields:
{"word": "ipad", "date": 20140113, "docid": 324, "score": 98}
It is a reverse index over a log of docs (about 120 million).
There are two kinds of queries in my system.
One of them is:
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
This query fetches the word "ipad" on date 20140113 and sorts all matching docs by score.
The other query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
To speed up these two kinds of queries, what indexes should I build?
Should I build two indexes like this?
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1})
But I think building the two indexes would use too much hard disk space.
So do you have any good ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score field so it can support the sort:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good for speeding up that query, but you still might want to confirm that by running the query in the mongo shell followed by .explain().
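For example, this should show the index being used for both the match and the sort:
db.index.find({"word": "ipad", "date": 20140113}).sort({"score": -1}).explain()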
Indexes are always a tradeoff of space and write performance for read performance. When you can't afford the space, you can't have the index and will have to deal with it. But usually write performance is the larger concern, because drive space is cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes: keep in mind that every collection must have a unique index on the _id field, which is created implicitly when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId; it can be anything you want. When you have another index with a uniqueness constraint and no other use for the _id field, you can move that unique constraint onto the _id field to save an index. Your documents would then look like this:
{
    "_id": {
        "word": "ipad",
        "date": 20140113,
        "docid": 324
    },
    "score": 98
}
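The unique lookup then becomes an exact match on the embedded _id document, e.g. (note that embedded-document matches are sensitive to field order):
db.index.find({ "_id": { "word": "ipad", "date": 20140113, "docid": 324 } })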