Checking data integrity with counts - MongoDB

I have a field with a dictionary in it, mapping people to numbers 0-9.
{peopleDict : {bob: 3, les: 3, meg: 8, sara: 6}}
I also have another field with another dictionary in it, which is supposed to count the number of people assigned to each number.
{countDict : {"3" : 2, "8" : 1, "6" : 1}}
So a document looks like
{
    peopleDict : {bob: 3, les: 3, meg: 8, sara: 6},
    countDict : {"3" : 2, "8" : 1, "6" : 1}
}
I am trying to write a query that tests whether countDict actually matches peopleDict for each document. I'm sure there must be a way to do this with aggregate but I'm not quite sure how.

As far as I know, you can't join data from different collections. So if you have them in separate collections, you need to analyze the data at the application level, or redesign the data structure to put all of it into a single collection.
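That said, since both dictionaries here live in the same document, the check can be expressed as an aggregation. A minimal sketch, assuming MongoDB 4.0+ (for $objectToArray and $toString) and a placeholder collection name coll; it recounts peopleDict and keeps only the documents whose countDict disagrees:
db.coll.aggregate([
    { $addFields: {
        people: { $objectToArray: "$peopleDict" },   // [{k: "bob", v: 3}, ...]
        counts: { $objectToArray: "$countDict" }     // [{k: "3", v: 2}, ...]
    }},
    { $addFields: {
        // entries of countDict whose value disagrees with a recount of peopleDict
        mismatches: { $filter: {
            input: "$counts",
            as: "c",
            cond: { $ne: [
                "$$c.v",
                { $size: { $filter: {
                    input: "$people",
                    cond: { $eq: [ { $toString: "$$this.v" }, "$$c.k" ] }
                } } }
            ] }
        } },
        // catches numbers present in peopleDict but missing from countDict
        totalsMatch: { $eq: [ { $size: "$people" }, { $sum: "$counts.v" } ] }
    }},
    // keep only documents that fail the integrity check
    { $match: { $or: [ { "mismatches.0": { $exists: true } }, { totalsMatch: false } ] } }
])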

MongoDB query to count all elements in an array with a time condition

I am new to MongoDB and am trying to count the number of "Delta" values in "DELTA_GROUP". In the example below there are three "Delta" values under "DELTA_GROUP", so the count for this object is 3. I need to satisfy two conditions, though.
First, I need to count only data collected within a specific time range (e.g. between a start point and an end point, using ISODate with $gte, $lte, etc.).
Second, within that time range, I want to count the number of "Delta" values for every object, and of course there are a handful of objects within the specified range. So if each object has three "Delta" values (as in the example) and there are 10 objects in total, the count result should be 30. How can I write a query that satisfies both conditions?
{
    "_id" : ObjectId("5f68a088135c701658c24d62"),
    "DELTA_GROUP" : [
        { "Delta" : 105 },
        { "Delta" : 108 },
        { "Delta" : 103 }
    ],
    "YEAR" : 2020,
    "MONTH" : 9,
    "DAY" : 21,
    "RECEIVE_TIME" : ISODate("2020-09-21T21:46:00.323Z")
}
What I have tried so far is shown below. This lists the count for each object, but I still need to get the combined total for the specified range of dates.
db.DELTA_DATA.aggregate([
    { $match: {
        "RECEIVE_TIME": {
            $gte: ISODate("2020-09-10T00:00:00"),
            $lte: ISODate("2020-10-15T23:59:59")
        }
    }},
    { $project: { "total": { count: { "$size": "$DELTA_GROUP" } } } }
])
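To collapse those per-document counts into one total, one option is to keep the same $match and replace the $project with a $group that sums the array sizes; a minimal sketch:
db.DELTA_DATA.aggregate([
    { $match: {
        "RECEIVE_TIME": {
            $gte: ISODate("2020-09-10T00:00:00"),
            $lte: ISODate("2020-10-15T23:59:59")
        }
    }},
    // _id: null groups all matching documents into a single bucket
    { $group: { _id: null, total: { $sum: { $size: "$DELTA_GROUP" } } } }
])
With 10 matching documents of three deltas each, this returns { "_id" : null, "total" : 30 }.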

Mongo one attribute different than the other attribute [duplicate]

How do I do a MongoDB find comparing two attribute of the same document?
Like, if I have the collection "test", with this structure:
{a : 3, b : 4}
{a : 5, b : 5}
{a : 6, b : 6}
and I want to find all documents where attribute 'a' differs from attribute 'b', which here would be the entry
{a : 3, b : 4}
I thought this could be accomplished by:
db.test.find({a : { $ne : b}})
but it didn't work. It gives me
Fri Aug 1 13:54:47 ReferenceError: b is not defined (shell):1
If this is an ad-hoc query and you don't want to keep track of different attributes (as mentioned in the entry posted by Marc B.), then you can simply go with:
db.test.find("this.a != this.b");
This is going to be slow, depending on how many entries you have.
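On MongoDB 3.6 or newer, the $expr operator can compare two fields of the same document inside a plain find, without JavaScript evaluation, which is much faster:
// $ne here is the aggregation comparison, applied per document
db.test.find({ $expr: { $ne: ["$a", "$b"] } })
// returns { a : 3, b : 4 }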

Using an indexed field for selecting a set of random items from a collection (MongoDB)

I am using MongoDB 2.4.10, and I have a collection of four million records. A query creates a subset of no more than 50000 documents, even for our power users, and I need to select 30 random items from this subset. Given the potential performance issues with skip and limit (especially when doing it 30 times with random skip amounts from 1-50000), I stumbled across the following solution:
Create a field for each record which is a completely random number
Create an index over this field
Sort by the field, and use skip(X).limit(30) to get a page of 30 items that, while consecutive in terms of the random field, actually bear no relation to each other. To the user, they seem random.
My index looks like this:
{a: 1, b: 1, c: 1, d: 1}
I also have a separate index:
{d : 1}
'd' is the randomised field.
My query looks like this:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.sort({d : 1}).skip(X).limit(30)
When the collection is small, this works perfectly. However, on our performance and live systems, this query fails, because instead of using the a, b, c, d index, it uses this index only:
{d : 1}
As a result, the query ends up scanning more records than it needs to (by a factor of 25). So, I introduced hint:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.hint({a : 1, b : 1, c : 1, d : 1}).sort({d : 1}).skip(X).limit(30)
This now works great with all values of X up to 11000, and explain() shows the correct index in use. But, when the skip amount exceeds 11000, I get:
{
    "$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
    "code" : 10128
}
Presumably, the risk of hitting this error is why the query (without the hint) wasn't using this index earlier. So:
Why does Mongo think that the sort has no index to use, when I've forced it to use an index that explicitly includes the sorting field at the end?
Is there a better way of doing this?
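For completeness: on MongoDB 3.2 or newer (not 2.4), the aggregation framework can produce the random page natively with $sample, which avoids the randomised-field workaround entirely. A sketch of that later approach:
db.content.aggregate([
    { $match: { a: { $in: ["xyz", "abc"] }, b: "ok", c: "Image" } },
    // $sample draws 30 documents pseudo-randomly from the matched set
    { $sample: { size: 30 } }
])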

Is there a limit on the number of columns you can specify in a MongoDB compound index?

index_dict = {
    "column_7" : 1,
    "column_6" : 1,
    "column_5" : 1,
    "column_4" : 1,
    "column_3" : 1,
    "column_2" : 1,
    "column_1" : 1,
    "column_9" : 1,
    "column_8" : 1,
    "column_11" : 1,
    "column_10" : 1
}
db.dataCustom.ensureIndex(index_dict, {unique: true, dropDups: true})
How many columns can index_dict contain? I need to know this for my implementation, but I am not able to find the answer online.
I must be honest, I know of no limitation on the number of columns, though there is a logical limitation.
An index is a very heavy thing; putting an immense index, or one close to the full width of the collection, on a collection will create performance problems, most specifically because the index becomes so large in space and values.
This, you must be aware of. However, to answer your question: I know of no limit. There is a size limit per field within an index (1024 bytes, including the name), but I know of none on the number of columns in an index.
Edit
It appears I wrote this answer a little fast; there is a limit of 31 fields: http://docs.mongodb.org/manual/reference/limits/#Number%20of%20Indexed%20Fields%20in%20a%20Compound%20Index I guess I have just never reached that number :/
A collection may have no more than 64 indexes; see the MongoDB limits documentation for more details.
But I was wondering why you need so many indexes?
EDIT
There can be no more than 31 fields in a compound index.
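A quick shell sketch that checks a prospective spec against both documented limits before creating the index (the variable name and messages are just for illustration):
// at most 31 fields in a compound index, at most 64 indexes per collection
var spec = { "column_1" : 1, "column_2" : 1 /* ... */ };
if (Object.keys(spec).length > 31) {
    print("too many fields for a compound index");
} else if (db.dataCustom.getIndexes().length >= 64) {
    print("collection already has the maximum number of indexes");
} else {
    db.dataCustom.ensureIndex(spec, { unique: true, dropDups: true });
}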

How do BSON arrays compare (in MongoDB/pymongo)?

I would like to store some very large integers in MongoDB exactly (several thousand decimal digits). This will not work, of course, with the standard types supported by BSON, and I am trying to think of the most elegant workaround, considering that I would like to perform range searches and similar things. This requirement excludes storing the integers as strings, as that makes the range searches impractical.
One way I can think of is to encode the 2^32-expansion using (variable-length) arrays of standard ints, and add to this array a first entry for the length of the array itself. That way lexicographical ordering on these arrays corresponds to the usual ordering of arbitrarily large integers.
For instance, in a collection I could have the 5 documents
{"name": "me", "fortune": [1,1000]}
{"name": "scrooge mcduck", "fortune": [11,1,0,0,0,0,0,0,0,0,0,0]}
{"name": "bruce wayne","fortune": [2, 10,0]}
{"name": "bill gates", "fortune": [2,1,1000]}
{"name": "francis", "fortune": [0]}
Thus Bruce Wayne's net worth is 10*2^32, Bill Gates' 2^32+1000 and Scrooge McDuck's 2^320.
I can then do a sort using {"fortune":1} and on my machine (with pymongo) it returns them in the order francis < me < bill < bruce < scrooge, as expected.
However, I am making assumptions that I haven't seen documented anywhere about the way BSON arrays compare, and the range searches don't seem to work the way I think (for instance,
find({"fortune":{$gte:[2,5,0]}})
returns no document, but I would expect bruce and scrooge).
Can anyone help me? Thanks
You can instead store left-padded strings representing the exact integer value of the fortune.
e.g. "1000000" = 1 million
"0010000" = 10 thousand
"2000000" = 2 million
"0200000" = 2 hundred thousand
Left-padding with zeroes ensures that lexicographical comparison of these strings corresponds directly to their comparison as numeric values. You will have to assume a safe maximum possible value of the fortune here, say a 20-digit number, and pad with 0s accordingly (a helper for this is sketched after the sample documents below).
So sample documents would be:
{"name": "scrooge mcduck", "fortune": "00001100000000000000" }
{"name": "bruce wayne", "fortune": "00000200000000000000" }
querying:
> db.test123.find()
{ "_id" : ObjectId("4f87e142f1573cffecd0f65e"), "name" : "bruce wayne", "fortune" : "00000200000000000000" }
{ "_id" : ObjectId("4f87e150f1573cffecd0f65f"), "name" : "donald", "fortune" : "00000150000000000000" }
{ "_id" : ObjectId("4f87e160f1573cffecd0f660"), "name" : "mickey", "fortune" : "00000000000000100000" }
> db.test123.find({ "fortune" : {$gte: "00000200000000000000"}});
{ "_id" : ObjectId("4f87e142f1573cffecd0f65e"), "name" : "bruce wayne", "fortune" : "00000200000000000000" }
> db.test123.find({ "fortune" : {$lt: "00000200000000000000"}});
{ "_id" : ObjectId("4f87e150f1573cffecd0f65f"), "name" : "donald", "fortune" : "00000150000000000000" }
{ "_id" : ObjectId("4f87e160f1573cffecd0f660"), "name" : "mickey", "fortune" : "00000000000000100000" }
The querying / sorting will work naturally, as MongoDB compares strings lexicographically.
However, to do other numeric operations on your data, you will have to write custom logic in your data-processing script (PHP, Python, Ruby, etc.).
For querying and data storage, this string version should do fine.
Unfortunately, your assumption about array comparison is incorrect. Range queries that, for example, ask for all array values smaller than 3 ({array:{$lt:3}}) will return all arrays in which at least one element is less than three, regardless of that element's position. As such, your approach will not work.
What does work, but is a bit less obvious, is using binary blobs for your very large integers, since those are compared byte by byte (after their length and subtype, which is why every value must use the same fixed width). That means you must set an upper bit limit for your integers, but that should be fairly straightforward. You can test it in the shell using the BinData(subType, base64) notation:
db.col.find({fortune:{$gt:BinData(0, "e8MEnzZoFyMmD7WSHdNrFJyEk8M=")}})
So all you'd have to do is create methods to convert your big integers from, say, strings to fixed-width two's-complement binary and you're set. Good luck.
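A minimal sketch of that conversion for non-negative integers, assuming mongosh (which runs on Node.js, so BigInt and Buffer are available; the helper name is just for illustration):
function bigIntToBinData(big, widthBytes) {
    var hex = big.toString(16);
    // left-pad to a fixed width so all values compare byte by byte
    hex = "0".repeat(widthBytes * 2 - hex.length) + hex;
    return BinData(0, Buffer.from(hex, "hex").toString("base64"));
}
db.col.insertOne({ "name" : "bruce wayne", "fortune" : bigIntToBinData(10n * 2n ** 32n, 20) });
// range search: everyone worth more than 2^33
db.col.find({ "fortune" : { $gt: bigIntToBinData(2n ** 33n, 20) } });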