I have a unique index, but some documents are not covered by it, which has resulted in duplicate data. I want to find those documents, i.e. query the data that is not in the index.
Something like this:
MongoDB shell version v3.6.8
MongoDB server version: 4.0.12
# there is no not_hint func
db.col.find().not_hint("md5_1_domain_1_ip_1_uri_1")
# hint not allowed to use $ne
db.col.find()._addSpecial("$hint", {"$ne": {"md5" : 1, "domain" : 1, "ip" : 1, "uri" : 1}})
The indexes on the collection (the second one is the unique index):
{
    "v" : 2,
    "key" : {
        "md5" : "hashed"
    },
    "name" : "md5_hashed",
    "ns" : "mdm.col"
},
{
    "v" : 2,
    "unique" : true,
    "key" : {
        "md5" : 1,
        "domain" : 1,
        "ip" : 1,
        "uri" : 1
    },
    "name" : "md5_1_domain_1_ip_1_uri_1",
    "background" : true,
    "ns" : "mdm.col"
}
Here is the data (I have masked some sensitive values, but I am sure the two documents are identical). These documents cannot be found through the unique index; they can only be reached via _id or the other index.
mongos> db.col.find({ "_id" : ObjectId("5fb2df3b32b0f42dced04ea7")})
{ "_id" : ObjectId("5fb2df3b32b0f42dced04ea7"), "domain" : null, "ip" : 1, "md5" : BinData(5,"anQTYWNGHKoj4xx+KTjNxQ=="), "uri" : "x * 1025", "count" : 6, "fseen" : ISODate("2019-08-03T13:56:38Z"), "lseen" : ISODate("2019-08-03T13:56:38Z"), "sha1" : null, "sha256" : null, "src" : [ "xx2", "xx3" ] }
mongos> db.col.find({'_id': ObjectId('5fb2df3d32b0f42dced0721d')})
{ "_id" : ObjectId("5fb2df3d32b0f42dced0721d"), "domain" : null, "ip" : 1, "md5" : BinData(5,"anQTYWNGHKoj4xx+KTjNxQ=="), "uri" : "x * 1025", "count" : 6, "fseen" : ISODate("2019-08-03T13:56:38Z"), "lseen" : ISODate("2019-08-03T13:56:38Z"), "sha1" : null, "sha256" : null, "src" : [ "xx2", "xx3" ] }
mongos> db.col.find({"md5": BinData(5,"anQTYWNGHKoj4xx+KTjNxQ=="), "uri": "x * 1025", "ip": 1}
mongos> # it is None
And these counts:
mongos> db.col.find().count()
5549020886
mongos> db.col.find().hint("md5_1_domain_1_ip_1_uri_1").count()
5521037206
The uri length exceeds 1024 bytes, so those documents are not indexed. I want to find those 27983680 (5549020886 - 5521037206) documents and repair them.
Thanks
Strange that this can happen. Anyway, you can find the duplicate data with this aggregation pipeline:
db.col.aggregate([
  {
    $group: {
      _id: {
        md5: "$md5",
        domain: "$domain",
        ip: "$ip",
        uri: "$uri"
      },
      count: { $sum: 1 },
      ids: { $push: "$_id" }
    }
  },
  { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true })
Each result document has an ids field containing the array of _id values of the duplicates.
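If the goal is then to remove the extras, here is a minimal sketch, assuming you want to keep the first _id of each group (this deletes data, so test it on a small subset first; on 5.5 billion documents it will also take a long time):

// Sketch: for each duplicate group, keep the first _id and delete the rest.
// Uses the same pipeline as above; deleteOne is available in the 3.6 shell.
var dupes = db.col.aggregate([
  {
    $group: {
      _id: { md5: "$md5", domain: "$domain", ip: "$ip", uri: "$uri" },
      count: { $sum: 1 },
      ids: { $push: "$_id" }
    }
  },
  { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true });

dupes.forEach(function (group) {
  group.ids.slice(1).forEach(function (id) {
    db.col.deleteOne({ _id: id });   // remove every duplicate except the first
  });
});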
I found the reason. The uri length is over 1024 bytes, so those documents are not indexed: the DBA had turned off failIndexKeyTooLong. But I still can't find this part of the data.
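With failIndexKeyTooLong turned off, documents whose index key exceeds the 1024-byte limit are inserted but simply left out of the index, which is why they cannot be reached through it. One way to locate them is to look for the over-long uri values directly; a minimal sketch using $expr with $strLenBytes ($expr needs MongoDB 3.6+, which the 4.0 server here has; it assumes uri is the field that blows the limit and is a string when present, and it is a full collection scan, so it will be slow on 5.5 billion documents):

// Sketch: find documents whose uri is longer than 1024 bytes and therefore
// likely missing from the unique index.
db.col.find({
  uri: { $type: "string" },
  $expr: { $gt: [ { $strLenBytes: "$uri" }, 1024 ] }
})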
Related
How can I find all unique indexes in MongoDB?
The db.collection.getIndexes() function doesn't give any information about uniqueness.
getIndexes() should work:
db.collection.createIndex({key: 1}, {unique: true})
db.collection.getIndexes()
[
{
"v" : 2,
"key" : { "_id" : 1 },
"name" : "_id_"
},
{
"v" : 2,
"key" : { "key" : 1 },
"name" : "key_1",
"unique" : true
}
]
If the index is not unique then "unique": true is simply missing.
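If you only want the unique ones, getIndexes() returns a plain array in the shell, so it can be filtered; a small sketch:

// Keep only indexes that carry the unique flag. Note that the _id index is
// implicitly unique even though the flag is not set on it.
db.collection.getIndexes().filter(function (ix) { return ix.unique === true; })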
I probably would not have asked if I had not seen this:
> db.requests.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"unique" : true,
"key" : {
"name" : 1,
},
"name" : "name_1"
}
]
Here the _id index does not have unique: true. Does that mean the _id index is somehow not truly unique? Can it behave differently (non-unique) if _id is populated with non-ObjectId values, i.e. some other fundamental types?
I am using a (small, 256 MB) MongoDB 3.2.9 service instance through Swisscom CloudFoundry. As long as our entire DB fits into the available RAM, we see somewhat acceptable query performance.
However, we are experiencing very long query times on aggregation operations when our DB does not fit into RAM. We have created indexes for the accessed fields, but as far as I can tell it doesn't help.
Example document entry:
_id: 5a31...
description: Object
    location: "XYZ"
    name: "ABC"
    status: "A"
    m_nr: null
    k_nr: null
    city: "QWE"
    high_value: 17
    right_value: 71
more_data: Object
    number: 101
    interval: 1
    next_date: "2016-01-16T00:00:00Z"
    last_date: null
    status: null
classification: Object
    priority_value: "?"
    redundancy_value: "?"
    active_value: "0"
Example Query:
db.getCollection('a').aggregate(
[{ $sort:
{"description.location": 1}
},
{ $group:
{_id: "$description.location"}
}],
{ explain: true }
)
This query takes 25 seconds on a DB that only has 20k documents and produces about 1k output documents (groups).
The explain info for this query:
db.getCollection('a').aggregate([{ $group: {_id: "$description.location"} }], { explain: true }):
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {},
"fields" : {
"description.location" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "Z.a",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : []
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : []
},
"direction" : "forward"
},
"rejectedPlans" : []
}
}
},
{
"$group" : {
"_id" : "$description.location"
}
}
],
"ok" : 1.0
}
[UPDATE] Output of db.a.getIndexes():
/* 1 */
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "db.a"
},
{
"v" : 1,
"key" : {
"description.location" : 1.0
},
"name" : "description.location_1",
"ns" : "db.a"
}
]
Looks like it's doing a collection scan; have you tried adding an index on description.location?
db.a.createIndex({"description.location" : 1});
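Note that the explain output above is for the pipeline without the leading $sort; $group by itself cannot use an index, but a leading $sort on an indexed field can. A quick way to check whether the existing description.location_1 index is actually picked up is to explain the sorted pipeline and look for IXSCAN instead of COLLSCAN in the winning plan, e.g.:

// Sketch: explain the original sort + group pipeline to see whether the
// description.location_1 index is used for the $sort stage.
db.getCollection('a').aggregate(
    [
        { $sort:  { "description.location": 1 } },
        { $group: { _id: "$description.location" } }
    ],
    { explain: true }
)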
{
"_id" : ObjectId("53692eb238ed04c824679f18"),
"firstUserId" : 1,
"secondUserId" : 17,
"messages" : [
{
"_id" : ObjectId("5369338997b964b81d579fc6"),
"read" : true,
"dateTime" : 1399403401,
"message" : "d",
"userId" : 1
},
{
"_id" : ObjectId("536933c797b964b81d579fc7"),
"read" : false,
"dateTime" : 1399403463,
"message" : "asdf",
"userId" : 17
}
]
}
I'm trying to select all documents that have firstUserId = 1 and also have subdocuments
whose userId is different ($ne) from 1 and read = false.
I tried:
db.usermessages.find({firstUserId: 1, "messages.userId": {$ne: 1}, "messages.read": false})
But it returns nothing, because messages contains both userId 1 and 17.
Also, how can I count the subdocuments that match this condition?
Are you trying to get the count of all the documents which are returned after your match criteria? If yes, then you might consider using the aggregation framework: http://docs.mongodb.org/manual/aggregation/
Something like below could be done to get you the counts:
db.usermessages.aggregate([
    { "$unwind": "$messages" },
    { "$match": {
        "firstUserId": 1,
        "messages.userId": { "$ne": 1 },
        "messages.read": false
    } },
    { "$group": { "_id": null, "count": { "$sum": 1 } } }
])
Hope this helps.
PS: I have not tried this on my system.
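For the first part of the question (selecting the documents rather than counting them), the original find() probably returns nothing because "messages.userId": { $ne: 1 } only matches documents where no array element has userId 1. $elemMatch instead asks for at least one element that satisfies both conditions at once; a sketch (also untested here):

// Sketch: documents with firstUserId = 1 that contain at least one unread
// message from a user other than 1.
db.usermessages.find({
    firstUserId: 1,
    messages: { $elemMatch: { userId: { $ne: 1 }, read: false } }
})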
Please tell me there is a way to query multiple fields in arrays (e.g. $elemMatch) AND have a sort applied; the sort is being ignored if I query more than one field in l.
The documents are not returned in ts: -1 sort order, although it works fine if I query just one array field.
Skip and limit do not affect the sort, as expected.
Sample query
db.transactions.find({ 'l.id': '5612087d70634d009dd919e5bb07fdad', 'l.t': 'organization' }).sort({ ts: -1 }).skip(0).limit(10).toArray()
db.transactions.find({ l: { $elemMatch: { id: '5612087d70634d009dd919e5bb07fdad', t: 'organization' } } }).sort({ ts: -1 }).skip(0).limit(10).toArray()
This aggregate works, but I lose the rest of the contents of l
db.transactions.aggregate({$unwind:"$l"},{$match:{"l.t":"organization", "l.id":"5612087d70634d009dd919e5bb07fdad"}}, {$sort:{"ts":-1}})
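One way to keep the whole l array is to skip the $unwind and let $match do the element matching itself, then sort and page; a sketch along those lines, using the same filter and sort as above:

// Sketch: match on array elements without unwinding, so the full "l" array is
// preserved in the output, then sort by ts and page.
db.transactions.aggregate([
    { $match: { l: { $elemMatch: { id: "5612087d70634d009dd919e5bb07fdad", t: "organization" } } } },
    { $sort: { ts: -1 } },
    { $skip: 0 },
    { $limit: 10 }
])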
Sample data - ignore the duplicated ids, not a factor for this question.
[
{
"t" : "organization.save",
"d" : {
"new" : false,
"id" : "5612087d70634d009dd919e5bb07fdad"
},
"ts" : ISODate("2013-08-20T02:21:39.955Z"),
"l" : [
{
"t" : "organization",
"id" : "5612087d70634d009dd919e5bb07fdad"
},
{
"t" : "account",
"id" : "5612087d70634d009dd919e5bb07fdad"
}
],
"_id" : "95c6cd5310aa485582312319f74775a4",
"__v" : 0
},
{
"t" : "organization.save",
"d" : {
"new" : false,
"id" : "5612087d70634d009dd919e5bb07fdad"
},
"ts" : ISODate("2013-08-20T02:21:43.121Z"),
"l" : [
{
"t" : "organization",
"id" : "5612087d70634d009dd919e5bb07fdad"
},
{
"t" : "account",
"id" : "5612087d70634d009dd919e5bb07fdad"
}
],
"_id" : "d6434c3e9a1743afaa6c0961b5a69f70",
"__v" : 0
}
]
Edit: For the purposes of getting up and running, I created a compound field containing t + ':' + id.
This has solved my sorting issue for now, but it is not ideal because it duplicates data.