Here is an example of my MongoDB database (screenshot):
I'm trying to write a query that counts the number of actors per owner.login in my movies collection.
The result should look like this for all the objects in my "movies" collection:
henry.tom total_actors : 1
tercik.comip total_actors : 9
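Assuming each movie document has an owner.login field and an actors array (the exact field names come from the screenshot, so treat them as assumptions), a minimal sketch of the kind of query I'm after would look something like this in pymongo:

from pymongo import MongoClient

db = MongoClient().my_database  # database name is a placeholder

# Group movies by owner.login and sum the size of each movie's actors array.
pipeline = [
    {"$group": {
        "_id": "$owner.login",
        "total_actors": {"$sum": {"$size": {"$ifNull": ["$actors", []]}}},
    }},
]
for doc in db.movies.aggregate(pipeline):
    print(doc["_id"], "total_actors :", doc["total_actors"])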
I am running a MongoDB aggregation query to group the data in a collection, get the sum of the values in a field, and insert the result into another collection.
ex: collection1: [
    { name: "foo", group_id: 1, marks: 10 },
    { name: "bar", group_id: 1, marks: 20 },
    { name: "Hello World", group_id: 2, marks: 40 }
]
So the group-by query should insert the following data into another collection, e.g. collection2:
collection2: [
    { group_id: 1, marks: 30 },
    { group_id: 2, marks: 40 }
]
I need to do these two operations:
Group the data and get the aggregate
Create a new collection with that data
Now comes the interesting part: the data being grouped consists of 5 billion rows, so the query to get the aggregate of the marks will be very slow to execute.
Writing a Node script that fetches the data group by group and then inserts it into another collection would therefore not be very optimal. The other approach I was considering was to limit the data to batches of x (e.g. 1000), group each batch of 1000, insert that into collection2, then for the next 1000 update collection2, and so on.
So, here are my questions. Is aggregating the data with a limit and then iterating over it faster?
ex:
step 1: group and get the sum of the marks of 1000 rows
step 2: insert/update collection2 with this data
step 3: go to step 1
Is this method more useful than just getting the aggregate by grouping all 5 billion records and then inserting the result into collection2? Assuming there is a Node API doing the above task, how do I determine/calculate the limit size for the fastest operation? And how do I use whenMatched to update/insert into collection2 with the marks?
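For the whenMatched part, here is a minimal sketch assuming MongoDB 4.2+ (where $merge and whenMatched are available) and pymongo; it groups collection1 by group_id, sums the marks, and merges the partial sums into collection2, adding to any existing document rather than replacing it:

from pymongo import MongoClient

db = MongoClient().my_database  # database name is a placeholder

pipeline = [
    # Optionally put a $match / $limit stage here to process one batch at a time.
    {"$group": {"_id": "$group_id", "marks": {"$sum": "$marks"}}},
    # Merge into collection2, keyed on _id (the group_id):
    # insert new groups, add the incoming sum to existing ones.
    {"$merge": {
        "into": "collection2",
        "on": "_id",
        "whenMatched": [{"$set": {"marks": {"$add": ["$marks", "$$new.marks"]}}}],
        "whenNotMatched": "insert",
    }},
]
db.collection1.aggregate(pipeline, allowDiskUse=True)

When whenMatched is given as a pipeline, $$new refers to the incoming (just-aggregated) document, so repeated batch runs keep accumulating into the existing marks instead of overwriting them.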
I have data in a collection 'Test' with entries like the following:
{
    "_id" : ObjectId("588b65f1e9a1e01dfc55a5ff"),
    "SALES_AMOUNT" : 4500
},
{
    "_id" : ObjectId("588b65f1e9a1e01dfct5a5ff"),
    "SALES_AMOUNT" : 500
}
and so on.
I want to split them equally into 10 buckets.
E.g., if there are 50 entries in total, it should be:
first_bucket : first 5 entries from the Test collection
second_bucket : next 5 entries from the Test collection
...
tenth_bucket : last 5 entries from the Test collection
Suppose instead that the total entry count is 101; then it should be:
first_bucket : first 10 entries from the Test collection
second_bucket : next 10 entries from the Test collection
...
tenth_bucket : last 11 entries from the Test collection (because there is 1 additional entry).
$bucket and $bucketAuto exist in MongoDB 3.4, but I am using MongoDB 3.2.
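Since 3.2 has neither $bucket nor $bucketAuto, a minimal client-side sketch (pymongo, sorting by _id, and assuming at least 10 documents) that follows the sizing rule above, i.e. nine buckets of total // 10 entries and the remainder in the tenth, could be:

from pymongo import MongoClient, ASCENDING

db = MongoClient().my_database  # database name is a placeholder

total = db.Test.count_documents({})  # use count() on very old driver versions
base = total // 10                   # size of buckets 1-9; bucket 10 absorbs the remainder

buckets = [[] for _ in range(10)]
for i, doc in enumerate(db.Test.find().sort('_id', ASCENDING)):
    # Any entry beyond the first 9 * base falls into the last bucket.
    buckets[min(i // base, 9)].append(doc)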
I have a MongoDB collection of 10 billion documents, some of which contain false information and require updating. The documents look like:
{
"_id" : ObjectId("5567c71e2cdc06be25dbf7a0"),
"index1" : "stringIndex1",
"field" : "stringField",
"index2" : "stringIndex2",
"value" : 100,
"unique_id" : "b21fc73e-55a0-4e15-8db0-fa94e4ebcc0b",
"t" : ISODate("2015-05-29T01:55:39.092Z")
}
I want to update the value field for all documents that match criteria on index1, index2 and field. I want to do this across many combinations of the 3 fields.
In an ideal world we could create a second collection and compare the two before replacing the original, in order to guarantee that we haven't lost any documents. But the size of the collection means that this isn't possible. Any suggestions for how to update this large amount of data without risking damaging it?
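For illustration, here is roughly what one batch of targeted updates would look like in pymongo (the collection name and the corrections list are placeholders): each (index1, index2, field) combination becomes one update_many sent via bulk_write, so only the matching documents are rewritten in place and nothing is copied or deleted:

from pymongo import MongoClient, UpdateMany

db = MongoClient().my_database  # database name is a placeholder

# Hypothetical (criteria, corrected value) pairs; one per combination of the 3 fields.
corrections = [
    ({'index1': 'stringIndex1', 'index2': 'stringIndex2', 'field': 'stringField'}, 200),
    # ...
]

requests = [UpdateMany(criteria, {'$set': {'value': new_value}})
            for criteria, new_value in corrections]

# ordered=False lets the remaining updates run even if one of them fails.
result = db.my_collection.bulk_write(requests, ordered=False)
print(result.modified_count, 'documents updated')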
I have similar data (documents like the following) in 5 collections in MongoDB:
{
    "_id" : ObjectId("53490030cf3b942d63cfbc7b"),
    "uNr" : "abdc123abcd"
}
I want to iterate through each collection and check whether a given uNr appears in any other collection. If it does, add that uNr to a new collection and increment its count by 1 for each collection it appears in. For example, if there is a match in 3 collections, the result should be {"uNr" : "abcd123", "count": 3}.
If your total number of uNr values is small enough to fit in memory (at most a few million of them), you can total them client-side with a Counter and store them in a MongoDB collection:
from collections import Counter
from pymongo import MongoClient, InsertOne

db = MongoClient().my_database
counts = Counter()
for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    for doc in collection.find():
        counts[doc['uNr']] += 1

# Empty the target collection.
db.counts.delete_many({})
inserts = [InsertOne({'_id': uNr, 'n': cnt}) for uNr, cnt in counts.items()]
db.counts.bulk_write(inserts)
Otherwise, query a thousand uNr values at a time and update counts in a separate collection:
from pymongo import MongoClient, UpdateOne, ASCENDING

db = MongoClient().my_database

# Empty the target collection.
db.counts.delete_many({})
db.counts.create_index([('uNr', ASCENDING)])

for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    cursor = collection.find(no_cursor_timeout=True)
    # "with" statement helps ensure the cursor is closed, since the server will
    # never auto-close it.
    with cursor:
        updates = []
        for doc in cursor:
            updates.append(UpdateOne({'_id': doc['uNr']},
                                     {'$inc': {'n': 1}},
                                     upsert=True))
            if len(updates) == 1000:
                db.counts.bulk_write(updates)
                updates = []
        if updates:
            # Last batch.
            db.counts.bulk_write(updates)
Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can, however, do a bTreeCursor query on a few indexes that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
...but it seems that I cannot do this either. Is there an alternative way to do what I want?
The reason I am asking is that this second query will differ a lot, and there is no way I can index all the fields. But I do want to make that first query using a bTreeCursor; otherwise querying the whole collection with a basicCursor will take forever.
Update
Also, through user input the subselection of 500 documents will be queried in different ways during a session, with an unpredictable basicCursor query using multiple $in, $eq, $gt and $lt operators. But during all this, the bTreeCursor subselection remains the same. Should I just keep doing both queries for every user query, or is there a more efficient way to keep a reference to this subselection?
In practice, you rarely need to run second queries on a cursor. You especially don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you whether an index was used, n is the number of documents returned by the query, and nscannedObjects is the number of objects scanned during the query. In this case, MongoDB was able to scan exactly the right number of objects.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.
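So for the question above, the practical takeaway is to combine both filters into a single find. A pymongo sketch, where bTreeCursorMatch and basicCursorMatch stand in for the two filter documents from the question (the field names here are placeholders):

from pymongo import MongoClient

collection = MongoClient().my_database.my_collection  # placeholder names

bTreeCursorMatch = {'indexedField': {'$gte': 1, '$lt': 10}}  # narrowed via the index
basicCursorMatch = {'unindexedField': 'some value'}          # filtered by scanning the rest

# MongoDB uses the index for the indexable part, then scans the remaining ~500 documents.
combined = {**bTreeCursorMatch, **basicCursorMatch}
for doc in collection.find(combined):
    print(doc)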