PyMongo (3.6.1) - db.collection.aggregate(pipeline) NOT WORKING - mongodb

I'm having trouble with PyMongo; I have been googling for hours but found no solution.
I'm working with some Python scripts just to practice with MongoDB, which is running on my local machine. I have populated my MongoDB instance with one database, "moviesDB", which contains 3 different collections:
1. Movies collection; here is an example of a document from this collection:
{'_id': 1,
 'title': 'Toy Story (1995)',
 'genres': ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
 'averageRating': 3.87,
 'numberOfRatings': 247,
 'tags': [
   {'_id': ObjectId('5b04970b87977f1fec0fb6e9'),
    'userId': 501,
    'tag': 'Pixar',
    'timestamp': 1292956344}
 ]
}
2. Ratings collection, which looks like this:
{'_id': ObjectId('5b04970c87977f1fec0fb923'),
 'userId': 1,
 'movieId': 31,
 'rating': 2.5,
 'timestamp': 1260759144}
3. Tags collection, which I won't use here, so it's not important.
Now, what I'm trying to do is: given a user (in this example, user 1), find all the genres of the movies they rated, and for each genre list all the movieIds belonging to that genre.
Here's my code:
"""
This query basically retrieves movieIds,
so from the result list of several documents like this:
{
ObjectId('5b04970c87977f1fec0fb923'),
'userId': 1,
'movieId': 31,
'rating': 2.5,
'timestamp': 1260759144},
retrieves only an array of integers, where each number represent a movie
that the user 1 rated."""
movies_rated_by_user = list(db.ratings.distinct(movieId, {userId: 1}))
pipeline = [
    {"$match": {"_id ": {"$in": movies_rated_by_user}}},
    {"$unwind": "$genres"},
    {"$group": {"_id": "$genres", "movies": {"$addToSet": "$_id"}}}]
try:
    """HERE IS THE PROBLEM, since db.movies.aggregate() RETURNS NOTHING,
    so the cursor is empty."""
    cursor = db.movies.aggregate(pipeline, cursor={})
except OperationFailure as of:
    with open("operations_log.txt", "a") as log:
        print("Something went wrong", file=log)
        print(of.details, file=log)
    sys.exit(1)
aggregate_genre = []
for c in cursor:
    aggregate_genre.append(c)
print(aggregate_genre)
The point is that the aggregate call on the movies collection retrieves NOTHING, whereas it really should, since I tried this query in the mongo shell and it worked just fine. Here's what the MongoDB shell query looks like:
db.movies.aggregate(
[
{$match:{_id : {$in: ids}}},
{$unwind : "$genres"},
{$group :
{
_id : "$genres",
movies: { $addToSet : "$_id" }}}
]
);
where the 'ids' variable is defined like this, just like the movies_rated_by_user variable in the code:
ids= db.ratings.distinct("movieId", {userId : 1});
The result of the aggregate method looks like this (this is what the aggregate_genre variable in the code should contain):
{ "_id" : "Western", "movies" : [ 3671 ] }
{ "_id" : "Crime", "movies" : [ 1953, 1405 ] }
{ "_id" : "Fantasy", "movies" : [ 2968, 2294, 2193, 1339 ] }
{ "_id" : "Comedy", "movies" : [ 3671, 2294, 2968, 2150, 1405 ] }
{ "_id" : "Sci-Fi", "movies" : [ 2455, 2968, 1129, 1371, 2105 ] }
{ "_id" : "Adventure", "movies" : [ 2193, 2150, 1405, 1287, 2105, 2294,
2968, 1371, 1129 ] }
Now, the problem is the aggregate method: is there any error in the pipeline?
Please help!
Thank you

I think you need to cast the cursor to a list before iterating through it. Hope this helps.
cursor = list(db.movies.aggregate(pipeline, cursor={}))
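For reference, here is a minimal, self-contained sketch of the whole flow in PyMongo (the connection details are assumptions for illustration). One thing worth double-checking in the question's code: the field names passed to distinct and $match must be quoted strings with no stray whitespace (e.g. "_id " with a trailing space would silently match nothing):

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod on the default port
db = client["moviesDB"]

# distinct returns a list of the movieIds rated by user 1
movie_ids = db.ratings.distinct("movieId", {"userId": 1})

pipeline = [
    {"$match": {"_id": {"$in": movie_ids}}},  # "_id", with no trailing space
    {"$unwind": "$genres"},
    {"$group": {"_id": "$genres", "movies": {"$addToSet": "$_id"}}},
]
aggregate_genre = list(db.movies.aggregate(pipeline))
print(aggregate_genre)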

Related

MGO Query nested array of objects

I am having difficulty converting a MongoDB query to mgo bson. The Mongo record schema is shown below. I want to find records that have topics with the labels "Education" and "Students".
db.questions.insert
(
{
"_id" : ObjectId("5cb4048478163fa3c9726fdf"),
"questionText" : "why?",
"createdOn" : new Date(),
"createdBy": user1,
"topics" : [
{
"label": "Education",
},
{
"label": "Life and Living",
},
{
"label": "Students"
}
]
}
)
Using Robo 3T, the query looks like this:
db.questions.find({$and : [
{"topics": {"label": "Students"}},
{"topics": {"label": "Education"}}
]})
I am having trouble modeling this with mgo. Currently, I have tried this:
map[$and:[
map[topics:map[label:students]]
map[topics:map[label:life and living]]
]]
and this
map[topics:map[$and:[
map[label:students]
map[label:life and living]
]]]
If you want to match a value inside a nested array, use the $elemMatch operator.
db.questions.find(
{$and:
[
{topics: {$elemMatch: {label: 'Students'}}},
{topics: {$elemMatch: {label: 'Education'}}}
]
}
)
The bson model for the above query is as follows:
query = getAndFilters(
bson.M{"topics": bson.M{"$elemMatch": bson.M{"label": "Students"}}},
bson.M{"topics": bson.M{"$elemMatch": bson.M{"label": "Education"}}})

Add a new field with large number of rows to existing collection in Mongodb

I have an existing collection with close to 1 million docs, and now I'd like to append new field data to this collection. (I'm using PyMongo.)
For example, my existing collection db.actions looks like:
...
{'_id':12345, 'A': 'apple', 'B': 'milk'}
{'_id':12346, 'A': 'pear', 'B': 'juice'}
...
Now I want to append new field data to this existing collection:
...
{'_id':12345, 'C': 'beef'}
{'_id':12346, 'C': 'chicken'}
...
such that the resulting collection should look like this:
...
{'_id':12345, 'A': 'apple', 'B': 'milk', 'C': 'beef'}
{'_id':12346, 'A': 'pear', 'B': 'juice', 'C': 'chicken'}
...
I know we can do this with update_one in a for loop, e.g.:
for doc in values:
    collection.update_one({'_id': doc['_id']},
                          {'$set': {k: doc[k] for k in fields}},
                          upsert=True)
where values is a list of dictionaries, each containing two items: the _id key-value pair and the new field's key-value pair. fields contains all the new fields I'd like to add.
However, the issue is that I have a million docs to update, and anything with a for loop is way too slow. Is there a way to append this new field faster? Something similar to insert_many, except it appends to an existing collection?
===============================================
Update1:
So this is what I have for now:
bulk = self.get_collection().initialize_unordered_bulk_op()
for doc in values:
    bulk.find({'_id': doc['_id']}).update_one({'$set': {k: doc[k] for k in fields}})
bulk.execute()
I first wrote a sample dataframe into the db with insert_many; the performance:
Time spent in insert_many: total: 0.0457min
then I used update_one with a bulk operation to add two extra fields to the collection, and got:
Time spent: for loop: 0.0283min | execute: 0.0713min | total: 0.0996min
Update2:
I added an extra column, dateTime, to both the existing collection and the new field data, so that a left join can be used to solve this. With a left join you can ignore the _id field.
For example, my existing collection db.actions looks like:
...
{'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'}
{'A': 'pear', 'B': 'juice', 'dateTime': '2017-12-15 06:10:50'}
{'A': 'orange', 'B': 'pop', 'dateTime': '2017-12-15 16:09:10'}
...
Now I want to append the new field data to this existing collection:
...
{'C': 'beef', 'dateTime': '2017-10-12 09:08:20'}
{'C': 'chicken', 'dateTime': '2017-12-15 22:40:00'}
...
such that the resulting collection should look like this:
...
{'A': 'apple', 'B': 'milk', 'C': 'beef', 'dateTime': '2017-10-12'}
{'A': 'pear', 'B': 'juice', 'C': 'chicken', 'dateTime': '2017-12-15'}
{'A': 'orange', 'B': 'pop', 'C': 'chicken', 'dateTime': '2017-12-15'}
...
If your updates are really unique per document, there is nothing faster than the bulk write API. Neither MongoDB nor the driver can guess what you want to update, so you will need to loop through your update definitions and then batch your bulk changes, which is pretty much described here:
Bulk update in Pymongo using multiple ObjectId
The "unordered" bulk writes can be slightly faster (although in my tests they weren't) but I'd still vote for the ordered approach for error handling reasons mainly).
If, however, you can group your changes into specific recurring patterns, then you're certainly better off defining a bunch of update queries (effectively one update per unique value in your dictionary) and then issuing those, each targeting a number of documents. My Python is too poor at this point to write all that code for you, but here's a pseudocode example of what I mean:
Let's say you've got the following update dictionary:
{
key: "doc1",
value:
[
{ "field1", "value1" },
{ "field2", "value2" },
]
}, {
key: "doc2",
value:
[
// same fields again as for "doc1"
{ "field1", "value1" },
{ "field2", "value2" },
]
}, {
key: "doc3",
value:
[
{ "someotherfield", "someothervalue" },
]
}
Then, instead of updating the three documents separately, you would send one update for the first two documents (since they require identical changes) and then one update for "doc3". The more knowledge you have upfront about the structure of your update patterns, the more you can optimize by grouping updates of subsets of fields, but that's probably getting a little complicated at some point...
UPDATE:
As per your request below, let's give it a shot.
fields = ['C']
values = [
    {'_id': 'doc1a', 'C': 'v1'},
    {'_id': 'doc1b', 'C': 'v1'},
    {'_id': 'doc2a', 'C': 'v2'},
    {'_id': 'doc2b', 'C': 'v2'}
]
print('before transformation:')
for doc in values:
    print('_id ' + doc['_id'])
    for k in fields:
        print(doc[k])
# group the _ids by the value they should receive
transposed_values = {}
for doc in values:
    transposed_values.setdefault(doc['C'], []).append(doc['_id'])
print('after transformation:')
for k, v in transposed_values.items():
    print(k, v)
# one update_many per distinct value instead of one update per document
for k, v in transposed_values.items():
    collection.update_many({'_id': {'$in': v}}, {'$set': {'C': k}})
Since your join collection has fewer documents, you can first convert the dateTime to a date:
db.new.find().forEach(function(d){
    d.date = d.dateTime.substring(0,10);
    db.new.update({_id : d._id}, d);
})
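If you'd rather do this preparation step from Python, a rough PyMongo equivalent might look like this (the "new" collection name is taken from the answer above):

# Derive a date field (YYYY-MM-DD) from each document's dateTime string
for d in db["new"].find({}, {"dateTime": 1}):
    db["new"].update_one({"_id": d["_id"]},
                         {"$set": {"date": d["dateTime"][:10]}})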
Then do a multiple-field lookup based on the date (substring of dateTime) and _id, and write the output to a new collection (enhanced):
db.old.aggregate(
[
{$lookup: {
from : "new",
let : {id : "$_id", date : {$substr : ["$dateTime", 0, 10]}},
pipeline : [
{$match : {
$expr : {
$and : [
{$eq : ["$$id", "$_id"]},
{$eq : ["$$date", "$date"]}
]
}
}},
{$project : {_id : 0, C : "$C"}}
],
as : "newFields"
}
},
{$project : {
_id : 1,
A : 1,
B : 1,
C : {$arrayElemAt : ["$newFields.C", 0]},
date : {$substr : ["$dateTime", 0, 10]}
}},
{$out : "enhanced"}
]
).pretty()
Result:
> db.enhanced.find()
{ "_id" : 12345, "A" : "apple", "B" : "milk", "C" : "beef", "date" : "2017-10-12" }
{ "_id" : 12346, "A" : "pear", "B" : "juice", "C" : "chicken", "date" : "2017-12-15" }
{ "_id" : 12347, "A" : "orange", "B" : "pop", "date" : "2017-12-15" }
>
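The same pipeline can also be issued from PyMongo by translating it into Python dicts; a sketch, assuming MongoDB 3.6+ (required for $lookup with let and pipeline):

pipeline = [
    {"$lookup": {
        "from": "new",
        "let": {"id": "$_id", "date": {"$substr": ["$dateTime", 0, 10]}},
        "pipeline": [
            {"$match": {"$expr": {"$and": [
                {"$eq": ["$$id", "$_id"]},
                {"$eq": ["$$date", "$date"]},
            ]}}},
            {"$project": {"_id": 0, "C": "$C"}},
        ],
        "as": "newFields",
    }},
    {"$project": {
        "_id": 1, "A": 1, "B": 1,
        "C": {"$arrayElemAt": ["$newFields.C", 0]},
        "date": {"$substr": ["$dateTime", 0, 10]},
    }},
    {"$out": "enhanced"},
]
db.old.aggregate(pipeline)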

Query for field in subdocument

An example of the schema I have:
{ "_id" : 1234,
“dealershipName”: “Eric’s Mongo Cars”,
“cars”: [
{“year”: 2013,
“make”: “10gen”,
“model”: “MongoCar”,
“vin”: 3928056,
“mechanicNotes”: “Runs great!”},
{“year”: 1985,
“make”: “DeLorean”,
“model”: “DMC-12”,
“vin”: 8056309,
“mechanicNotes”: “Great Scott!”}
]
}
I wish to query and return only the "vin" values in the document with "_id": 1234. Any suggestion is much appreciated.
You can use the field selection parameter with dot notation to constrain the output to just the desired field:
db.test.find({_id: 1234}, {_id: 0, 'cars.vin': 1})
Output:
{
"cars" : [
{
"vin" : 3928056
},
{
"vin" : 8056309
}
]
}
Or if you just want an array of vin values you can use aggregate:
db.test.aggregate([
// Find the matching doc
{$match: {_id: 1234}},
// Duplicate it, once per cars element
{$unwind: '$cars'},
// Group it back together, adding each cars.vin value as an element of a vin array
{$group: {_id: '$_id', vin: {$push: '$cars.vin'}}},
// Only include the vin field in the output
{$project: {_id: 0, vin: 1}}
])
Output:
{
"result" : [
{
"vin" : [
3928056,
8056309
]
}
],
"ok" : 1
}
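Both approaches translate directly to PyMongo if you need them from Python (a sketch; the collection name follows the answer above):

# Projection form: returns the cars array reduced to its vin fields
doc = db.test.find_one({"_id": 1234}, {"_id": 0, "cars.vin": 1})

# Aggregation form: returns a flat array of vin values
result = list(db.test.aggregate([
    {"$match": {"_id": 1234}},
    {"$unwind": "$cars"},
    {"$group": {"_id": "$_id", "vin": {"$push": "$cars.vin"}}},
    {"$project": {"_id": 0, "vin": 1}},
]))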
If by "query" you mean getting the values of the vins in JavaScript, you could read the JSON into a string called theString (or any other name) and do something like:
var f = [], obj = JSON.parse(theString);
obj.cars.forEach(function(item) { f.push(item.vin) });
If your JSON above is part of a larger collection, then you'd need an outer loop.

Meteor upsert - using addToSet and each

I have a Groups collection that contains an eventId and an array of guest ids. I am trying to use upsert to check whether an eventId and the current guest id already exist; if so, update the collection with the new guest id pushed to the guests array, otherwise insert a new document with both the current guest and the new guest. Here's my upsert code:
var groupId = Groups.upsert({
$and:[
{eventId:groupAttributes.eventId}, // find all the Groups associated w/ this event
{ "guests": {$in: [guest]}}, // and include this guest
]
},{$set: group,
$addToSet: {guests: {$each: [ guest, groupAttributes.guest ]} }
});
What actually happens is that a new Group is created every time, and it's because the $each modifier isn't allowing $addToSet to add multiple values to an array as described here: http://docs.mongodb.org/manual/reference/operator/update/addToSet/#up._S_addToSet
Instead, the guests array ends up with an "$each" object holding an array of size 2:
[{"$each":["yQZfEXfs7J9E4Nbqf","s2rk5KAxq4dYtDFNG"]}]
Two questions: 1) Is there a better way to do what I am trying to do, and if not, 2) What am I doing incorrectly? Thanks.
Edit: trying the following code in meteor mongo:
db.groups.update({
$and:[
{eventId:"wGMhaP7t4nsiTNHt5"},
{ "guests": {$in: ["yQZfEXfs7J9E4Nbqf"]}},
]
},{$set: { userId: 'gBuR448nsJMcpwjsT',
author: 'Martin W',
weddingId: 'rz9xjtDm3bFAeiCxM',
eventId: 'wGMhaP7t4nsiTNHt5' },
$addToSet: {guests: {$each: [ "yQZfEXfs7J9E4Nbqf","s2rk5KAxq4dYtDFNG" ]} }
}, {upsert: true})
The result is the following:
{ "_id" : ObjectId("528fa31a098141f336688d96"), "author" : "Martin W", "eventId"
: "wGMhaP7t4nsiTNHt5", "guests" : [ "yQZfEXfs7J9E4Nbqf", "s2rk5KAxq4dYtDFNG"
], "userId" : "gBuR448nsJMcpwjsT", "weddingId" : "rz9xjtDm3bFAeiCxM" }
Now I try the same code in Meteor:
var groupId = Groups.update({
$and:[
{eventId:"wGMhaP7t4nsiTNHt5"},
{ "guests": {$in: ["yQZfEXfs7J9E4Nbqf"]}},
]
},{$set: { userId: 'gBuR448nsJMcpwjsT',
author: 'Martin W',
weddingId: 'rz9xjtDm3bFAeiCxM',
eventId: 'wGMhaP7t4nsiTNHt5' },
$addToSet: {guests: {$each: [ "yQZfEXfs7J9E4Nbqf","s2rk5KAxq4dYtDFNG" ]} }
}, {upsert: true})
And the result is different (and undesirable):
{ "_id" : "4XgaF6pBGEohQR6pa", "userId" : "gBuR448nsJMcpwjsT", "author" : "Marti
n W", "weddingId" : "rz9xjtDm3bFAeiCxM", "eventId" : "wGMhaP7t4nsiTNHt5", "guest
s" : [ { "$each" : [ "yQZfEXfs7J9E4Nbqf", "s2rk5KAxq4dYtDFNG" ] } ] }
I am a Mongo newb, so I'm not sure what to make of it, but I notice that the _id is an ObjectId in the first result and not in the second, and that the guests array is what I would expect when using meteor mongo, but contains "$each" as a value in the second result set. Any ideas?

Struggling to get ordered results from the last retrieved article, given array of elements to search in

I have a collection of objects with a structure like this:
{
"_id" : ObjectId("5233a700bc7b9f31580a9de0"),
"id" : "3df7ce4cc2586c37607a8266093617da",
"published_at" : ISODate("2013-09-13T23:59:59Z"),
...
"topic_id" : [
284,
9741
],
...
"date" : NumberLong("1379116800055")
}
I'm trying to use the following query:
db.collection.find({"topic_id": { $in: [ 9723, 9953, 9558, 9982, 9833, 301, ... 9356, 9990, 9497, 9724] }, "date": { $gte: 1378944001000, $lte: 1378954799000 }, "_id": { $gt: ObjectId('523104ddbc7b9f023700193c') }}).sort({ "_id": 1 }).limit(1000)
The above query uses the topic_id, date index, but it does not keep the returned results in order.
Forcing it to use hint({_id: 1}) makes the results ordered, but then nscanned is 1 million documents even though limit(1000) is specified.
What am I missing?