Mongoid: search in an array inside an array of hashes - MongoDB

Say Object embeds_many searched_items
Here is the document:
{"_id": { "$oid" : "5320028b6d756e1981460000" },
"searched_items": [
{
"_id": { "$oid" : "5320028b6d756e1981470000" },
"hotel_id": 127,
"room_info": [
{
"price": 10,
"amenity_ids": [
1,
2
]
},
{
"price": 160,
"amenity_ids": null
}
]
},
{
"_id": { "$oid" : "5320028b6d756e1981480000" },
"hotel_id": 161,
"room_info": [
{
"price": 400,
"amenity_ids": [4,5]
}
]
}
]
}
I want to find the "searched_items" having room_info.amenity_ids IN [2,3].
I've tried
object.searched_items.where('room_info.amenity_ids' => [2, 3])
object.searched_items.where('room_info.amenity_ids' => {'$in' => [2,3]})
with no luck

Mongoid provides the elem_match method for searching within arrays of objects,
e.g.
class A
include Mongoid::Document
field :some_field, type: Array
end
A.create(some_field: [{id: 'a', name: 'b'}, {id: 'c', name: 'd'}])
A.elem_match(some_field: { :id.in => ["a", "c"] }) # returns the object
Let me know if you have any other questions.
Update:
class SearchedHotel
include Mongoid::Document
field :hotel_id, type: String
field :room_info, type: Array
end
SearchedHotel.create(hotel_id: "1", room_info: [{id: 1, amenity_ids: [1,2], price: 600},{id: 2, amenity_ids: [1,2,3], price: 1000}])
SearchedHotel.create(hotel_id: "2", room_info: [{id: 3, amenity_ids: [1,2], price: 600}])
SearchedHotel.elem_match(room_info: {:amenity_ids.in => [1,2]})
Mongoid::Criteria
selector: {"room_info"=>{"$elemMatch"=>{"amenity_ids"=>{"$in"=>[1, 2]}}}}
options: {}
class: SearchedHotel
embedded: false
And it returns both records. Am I missing something from your question/requirement? If yes, do let me know.

It's important to distinguish between top-level queries sent to the MongoDB server and
client-side operations on embedded documents that are implemented by Mongoid.
This is the underlying confusion between the original question and the answer from @sandeep-kumar and the associated comments.
The original question is all about where clauses on embedded documents after the query result has already been fetched.
The answer from @sandeep-kumar and the comments are all about top-level queries.
The following test covers both, showing how the answers from @sandeep-kumar do work on the examples in your comments,
and also what does and does not work on your original question.
To summarize, Sandeep's answers do work for top-level queries.
Please review your code; if problems remain, please post the exact Ruby code that reproduces them.
For your original question, please note that "object" has already been fetched from MongoDB;
you can verify this by looking at the log/test.log file.
The subsequent "where" operations are all executed client-side by Mongoid.
Simple "where" clauses do work at the embedded-document level.
Complex "where" clauses involving nested array values don't seem to work -
I didn't really expect Mongoid to reimplement '$in' on the client side.
Knowing that "object" already holds the query result,
and that the association "searched_items" gives you convenient access to the embedded documents,
you can write Ruby code to select what you want, as in the following test.
Hope that this helps.
test/unit/my_object_test.rb
require 'test_helper'
require 'pp'
class MyObjectTest < ActiveSupport::TestCase
def setup
MyObject.delete_all
A.delete_all
SearchedHotel.delete_all
end
test "original question with client-side where operation on embedded documents" do
doc = {"_id"=>{"$oid"=>"5320028b6d756e1981460000"}, "searched_items"=>[{"_id"=>{"$oid"=>"5320028b6d756e1981470000"}, "hotel_id"=>127, "room_info"=>[{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]}, {"_id"=>{"$oid"=>"5320028b6d756e1981480000"}, "hotel_id"=>161, "room_info"=>[{"price"=>400, "amenity_ids"=>[4, 5]}]}]}
MyObject.create(doc)
puts
object = MyObject.first
<<-EOT.split("\n").each{|line| puts "#{line}:"; eval "pp #{line}"}
object.searched_items.where('hotel_id' => 127).to_a
object.searched_items.where(:hotel_id.in => [127,128]).to_a
object.searched_items.where('room_info.amenity_ids' => {'$in' => [2,3]}).to_a
object.searched_items.where('room_info.amenity_ids'.to_sym.in => [2,3]).to_a
object.searched_items.select{|searched_item| searched_item.room_info.any?{|room_info| room_info['amenity_ids'] && !(room_info['amenity_ids'] & [2,3]).empty?}}.to_a
EOT
end
test "A comment - top-level queries" do
A.create(some_field: [{id: 'a', name: 'b', tag_ids: [6,7,8]}, {id: 'c', name: 'd'}, tag_ids: [5,6,7]])
A.create(some_field: [{id: 'a', name: 'b', tag_ids: [1,2,3]}, {id: 'c', name: 'd'}, tag_ids: [2,3,4]])
puts
pp A.where('some_field.tag_ids'.to_sym.in => [2,3]).to_a
pp A.elem_match(some_field: { :tag_ids.in => [2,3,4] }).to_a
end
test "SearchedHotel comment - top-level query" do
s = <<-EOT
[#<SearchedHotel _id: 53253c246d756e49a7030000, hotel_id: \"1\", room_info: [{\"id\"=>1, \"amenity_ids\"=>[1, 2], \"price\"=>600}, {\"id\"=>2, \"amenity_ids\"=>[1, 2, 3], \"price\"=>1000}]>, #<SearchedHotel _id: 53253c246d756e49a7040000, hotel_id: \"2\", room_info: [{\"id\"=>3, \"amenity_ids\"=>[1, 2], \"price\"=>600}]>]
EOT
a = eval(s.gsub('#<SearchedHotel ', '{').gsub(/>,/, '},').gsub(/>\]/, '}]').gsub(/_id: \h+, /, ''))
SearchedHotel.create(a)
puts
<<-EOT.split("\n").each{|line| puts "#{line}:"; eval "pp #{line}"}
SearchedHotel.elem_match(room_info: {:amenity_ids.in => [1,2]}).to_a
EOT
end
end
$ ruby -Ilib -Itest test/unit/my_object_test.rb
Run options:
# Running tests:
[1/3] MyObjectTest#test_A_comment_-_top-level_queries
[#<A _id: 5359329d7f11ba034b000002, some_field: [{"id"=>"a", "name"=>"b", "tag_ids"=>[1, 2, 3]}, {"id"=>"c", "name"=>"d"}, {"tag_ids"=>[2, 3, 4]}]>]
[#<A _id: 5359329d7f11ba034b000002, some_field: [{"id"=>"a", "name"=>"b", "tag_ids"=>[1, 2, 3]}, {"id"=>"c", "name"=>"d"}, {"tag_ids"=>[2, 3, 4]}]>]
[2/3] MyObjectTest#test_SearchedHotel_comment_-_top-level_query
SearchedHotel.elem_match(room_info: {:amenity_ids.in => [1,2]}).to_a:
[#<SearchedHotel _id: 5359329d7f11ba034b000003, hotel_id: "1", room_info: [{"id"=>1, "amenity_ids"=>[1, 2], "price"=>600}, {"id"=>2, "amenity_ids"=>[1, 2, 3], "price"=>1000}]>,
#<SearchedHotel _id: 5359329d7f11ba034b000004, hotel_id: "2", room_info: [{"id"=>3, "amenity_ids"=>[1, 2], "price"=>600}]>]
[3/3] MyObjectTest#test_original_question_with_client-side_where_operation_on_embedded_documents
object.searched_items.where('hotel_id' => 127).to_a:
[#<SearchedItem _id: 5359329d7f11ba034b000006, hotel_id: 127, room_info: [{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]>]
object.searched_items.where(:hotel_id.in => [127,128]).to_a:
[#<SearchedItem _id: 5359329d7f11ba034b000006, hotel_id: 127, room_info: [{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]>]
object.searched_items.where('room_info.amenity_ids' => {'$in' => [2,3]}).to_a:
[]
object.searched_items.where('room_info.amenity_ids'.to_sym.in => [2,3]).to_a:
[]
object.searched_items.select{|searched_item| searched_item.room_info.any?{|room_info| room_info['amenity_ids'] && !(room_info['amenity_ids'] & [2,3]).empty?}}.to_a:
[#<SearchedItem _id: 5359329d7f11ba034b000006, hotel_id: 127, room_info: [{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]>]
Finished tests in 0.089544s, 33.5031 tests/s, 0.0000 assertions/s.
3 tests, 0 assertions, 0 failures, 0 errors, 0 skips

Related

How to transform deeply nested data in mongodb aggregation framework?

I'm a newbie with MongoDB aggregation and I'm struggling a bit trying to get my data to look the way I want. I'm a student completing a bootcamp,
and we are doing a project where we seed a database of our choice with millions of lines of CSV extracted from a SQL database, though I'm not sure which one.
For context, the data is questions and answers from a mock retail application we built.
I was given three files: one with questions, one with answers, and one with photos that were uploaded to answers. I successfully used the $lookup and $out operators
to join these files on the appropriate index and export to a new collection. So now I just have a collection of questions and a collection of ansPhotos.
The issue is that the data needs to be structurally transformed for different cases.
Suppose I want all the questions and answers for a particular product. Below shows how the question data is structured, giving me all questions for a product_id of 1:
db.questions.find({ product_id: 1 })
[
{
_id: ObjectId('61731a1cae4ca5aef1836b04'),
question_id: 4,
product_id: 1,
question_body: 'How long does it last?',
question_date: Long('1594341317010'),
asker_name: 'funnygirl',
asker_email: 'first.last#gmail.com',
reported: 0,
helpful: 6,
},
{
_id: ObjectId('61731a1cae4ca5aef1836b05'),
question_id: 3,
product_id: 1,
question_body: 'Does this product run big or small?',
question_date: Long('1608535907083'),
asker_name: 'jbilas',
asker_email: 'first.last#gmail.com',
reported: 0,
helpful: 8,
},
{
_id: ObjectId('61731a1cae4ca5aef1836b06'),
question_id: 6,
product_id: 1,
question_body: 'Is it noise cancelling?',
question_date: Long('1608855284662'),
asker_name: 'coolkid',
asker_email: 'first.last#gmail.com',
reported: 1,
helpful: 19,
},
{
_id: ObjectId('61731a1cae4ca5aef1836b08'),
question_id: 1,
product_id: 1,
question_body: 'What fabric is the top made of?',
question_date: Long('1595884714409'),
asker_name: 'yankeelover',
asker_email: 'first.last#gmail.com',
reported: 0,
helpful: 1,
},
{
_id: ObjectId('61731a1cae4ca5aef1836b0d'),
question_id: 5,
product_id: 1,
question_body: 'Can I wash it?',
question_date: Long('1608855284662'),
asker_name: 'cleopatra',
asker_email: 'first.last#gmail.com',
reported: 0,
helpful: 7,
},
{
_id: ObjectId('61731a1cae4ca5aef1836b13'),
question_id: 2,
product_id: 1,
question_body: 'HEY THIS IS A WEIRD QUESTION!!!!?',
question_date: Long('1613888219613'),
asker_name: 'jbilas',
asker_email: 'first.last#gmail.com',
reported: 1,
helpful: 4,
}
]
I now want to get all the answers for all these questions. For brevity, and because I'll be pasting a lot of context/examples, here's what a couple of answer documents from ansPhotos look like:
db.ansPhotos.find({question_id:4})
[
{
_id: ObjectId("61731c9c39b2df95b4573b3c"),
id: 65,
question_id: 4,
body: 'It runs small',
date: Long("1605784307205"),
answerer_name: 'dschulman',
answerer_email: 'first.last#gmail.com',
reported: 0,
helpful: 1,
photos: [
{
_id: ObjectId("61731edbbac3ef59b2a59b04"),
id: 15,
answer_id: 65,
url: 'https://images.unsplash.com/photo-1536922645426-5d658ab49b81?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80'
},
{
_id: ObjectId("61731edbbac3ef59b2a59b0a"),
id: 14,
answer_id: 65,
url: 'https://images.unsplash.com/photo-1470116892389-0de5d9770b2c?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1567&q=80'
}
]
},
{
_id: ObjectId("61731c9c39b2df95b4573b54"),
id: 89,
question_id: 4,
body: 'Showing no wear after a few months!',
date: Long("1599089609530"),
answerer_name: 'sillyguy',
answerer_email: 'first.last#gmail.com',
reported: 0,
helpful: 8,
photos: []
}
]
Now for the part I'm struggling with.
The data needs to look different for different API calls. I basically need to nest every answer, with its photos, inside its question.
Here are the key challenges I'm facing and the transformations I have to make. There are other transformations that I am not discussing because they are easy to do, such as not returning the ObjectId for answers,
transforming the date, etc.
Each question has an answers object that is stored as key-value pairs, with each answer's "id" as the key and the answer object as the value.
Each answer must have only the photo URLs in an array, instead of an array of objects that each have a url property, as you can see above for the answers related to question_id 4.
Some questions do not have any answers. The question with question_id: 3 below is one such question. I am still expected to return an empty object at the "answers" key if there are no answers for it. Here is the desired output:
[
{
"question_id": 4,
"question_body": "How long does it last?",
"question_date": "2020-07-10T00:35:17.010Z",
"asker_name": "funnygirl",
"reported": false,
"question_helpfullness": 6,
"answers": {
"65": {
_id: ObjectId("61731c9c39b2df95b4573b3c"),
"id": 65,
"question_id": 4,
"body": "It runs small",
"date": 1605784307205,
"answerer_name": "dschulman",
"answerer_email": "first.last#gmail.com",
"helpful": 1,
"photos": ["https://images.unsplash.com/photo-1536922645426-5d658ab49b81?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80",
"https://images.unsplash.com/photo-1470116892389-0de5d9770b2c?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1567&q=80"]
},
"89": {
"_id": "61731c9c39b2df95b4573b54",
"id": 89,
"question_id": 4,
"body": "Showing no wear after a few months!",
"date": 1599089609530,
"answerer_name": "sillyguy",
"answerer_email": "first.last#gmail.com",
"reported": 0,
"helpful": 8,
"photos": []
}
}
},
{
"question_id": 5,
"question_body": "Can I wash it?",
"question_date": "2020-12-25T00:14:44.662Z",
"asker_name": "cleopatra",
"reported": false,
"question_helpfullness": 7,
"answers": {
"46": {
"_id": "61731c9c39b2df95b4573b27",
"id": 46,
"question_id": 5,
"body": "I've thrown it in the wash and it seems fine",
"date": 1606022843272,
"answerer_name": "marcanthony",
"answerer_email": "first.last#gmail.com",
"reported": 0,
"photos": []
},
"64": {
"_id": "61731c9c39b2df95b4573b3b",
"id": 64,
"question_id": 5,
"body": "It says not to",
"date": 1588644950162,
"answerer_name": "ceasar",
"answerer_email": "first.last#gmail.com",
"helpful": 0,
"photos": []
}
}
},
{
"question_id": 3,
"question_body": "Does this product run big or small?",
"question_date": "2020-12-21T07:31:47.083Z",
"asker_name": "jbilas",
"reported": false,
"question_helpfullness": 8,
"answers": {}
},
//etc..
]
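To make the target shape concrete, the transformation for one question and its joined answers can be sketched in plain Python (client-side, not in the pipeline; the date transformation is omitted and field names are taken from the samples above):

```python
def shape_question(question, answers):
    # Build one output document: answers keyed by their id (as a string),
    # each answer's photos reduced to a plain list of URLs.
    shaped_answers = {}
    for ans in answers:
        ans = dict(ans)  # don't mutate the input
        ans["photos"] = [photo["url"] for photo in ans.get("photos", [])]
        shaped_answers[str(ans["id"])] = ans
    return {
        "question_id": question["question_id"],
        "question_body": question["question_body"],
        "asker_name": question["asker_name"],
        "reported": bool(question["reported"]),
        "question_helpfullness": question["helpful"],
        "answers": shaped_answers,  # {} when there are no answers
    }

question = {"question_id": 3, "question_body": "Does this product run big or small?",
            "asker_name": "jbilas", "reported": 0, "helpful": 8}
print(shape_question(question, []))  # "answers" comes back as {}
```

The pipeline stages below are trying to express exactly this per-question reshaping on the server.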
What I've tried in the pipeline:
Calling db.questions.aggregate([]) with the following stages.
Stage 1:
Get all questions that have a product_id of 1 and are not reported:
{
'$match': {
'product_id': 1,
'reported': 0
}
}
Stage 2:
Join all question documents with their respective answers in an array called "answers":
{
'$lookup': {
'from': 'ansPhotos',
'localField': 'question_id',
'foreignField': 'question_id',
'as': 'answers'
}
}
Sample output:
questions_answers> db.questions.aggregate([{$match:{product_id:1,reported:0}},{$lookup:{from:'ansPhotos',localField:'question_id',foreignField:'question_id',as:'answers'}}])
[
{
_id: ObjectId("61731a1cae4ca5aef1836b04"),
question_id: 4,
product_id: 1,
question_body: 'How long does it last?',
question_date: Long("1594341317010"),
asker_name: 'funnygirl',
asker_email: 'first.last#gmail.com',
reported: 0,
helpful: 6,
answers: [
{
_id: ObjectId("61731c9c39b2df95b4573b3c"),
id: 65,
question_id: 4,
body: 'It runs small',
date: Long("1605784307205"),
answerer_name: 'dschulman',
answerer_email: 'first.last#gmail.com',
reported: 0,
helpful: 1,
photos: [
{
_id: ObjectId("61731edbbac3ef59b2a59b04"),
id: 15,
answer_id: 65,
url: 'https://images.unsplash.com/photo-1536922645426-5d658ab49b81?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80'
},
{
_id: ObjectId("61731edbbac3ef59b2a59b0a"),
id: 14,
answer_id: 65,
url: 'https://images.unsplash.com/photo-1470116892389-0de5d9770b2c?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1567&q=80'
}
]
},
{
_id: ObjectId("61731c9c39b2df95b4573b54"),
id: 89,
question_id: 4,
body: 'Showing no wear after a few months!',
date: Long("1599089609530"),
answerer_name: 'sillyguy',
answerer_email: 'first.last#gmail.com',
reported: 0,
helpful: 8,
photos: []
}
]
},
{
_id: ObjectId("61731a1cae4ca5aef1836b05"),
question_id: 3,
product_id: 1,
question_body: 'Does this product run big or small?',
question_date: Long("1608535907083"),
asker_name: 'jbilas',
asker_email: 'first.last#gmail.com',
reported: 0,
helpful: 8,
answers: []
},
//etc…
]
Stage 3:
Unwind each answers array, preserving null and empty arrays, because I still need to return questions without answers:
{
'$unwind': {
'path': '$answers',
'preserveNullAndEmptyArrays': true
}
}
I then have one document per answer and can manipulate the "answers.photos" object. Each answers field is now a single answer object.
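What this $unwind stage does can be mimicked in a few lines of plain Python, which also shows why unanswered questions come out of it without an answers field at all (a rough analogue, not MongoDB's exact null handling):

```python
def unwind(docs, path="answers"):
    # Rough analogue of {'$unwind': {'path': '$answers',
    # 'preserveNullAndEmptyArrays': True}}: one output document per array
    # element; documents whose array is missing, null, or empty pass through
    # once (MongoDB drops the field for an empty array; this sketch drops it
    # in all three cases).
    out = []
    for doc in docs:
        arr = doc.get(path)
        if isinstance(arr, list) and arr:
            out.extend({**doc, path: item} for item in arr)
        else:
            out.append({k: v for k, v in doc.items() if k != path})
    return out
```

That pass-through document with no answers field is what later trips up $arrayToObject.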
Stage 4:
Things become muddy here.
For example, I've tried to use $addFields, $set and $project to just get the photos.url property for each answer and put it in an array. I've had some success doing this, but…
Stage 5:
Then I try to $group them back into arrays of objects, and have had some success with it. Note the $ifNull is my feeble attempt to give the next stage what it wants, but it is not working:
{
'$group': {
'_id': '$_id',
'question_id': {'$first': '$question_id'},
'question_body': {'$first': '$question_body'},
'question_date': {'$first': '$question_date'},
'asker_name': {'$first': '$asker_name'},
'reported': {'$first': '$reported'},
'question_helpfullness': {'$first': '$helpful'},
'answers': {'$push': {'$ifNull': ['$answers', {'_id': '$_id', 'id': 'noanswers', 'question_id': '$question_id'}]}}
}
}
But I also need to do this at some point:
Stage 6 or later:
{
'$addFields': {
'answers': {
'$arrayToObject': {
'$map': {
'input': '$answers',
'in': {
'k': {
'$toString': '$$this.id'
},
'v': '$$this'
}
}
}
}
}
}
To give me the appropriate key-value pairs as seen in the desired output.
This is where things get muddy. I have tried a TON of configurations over the last 5 days.
In most cases, if I directly manipulate the answers array after Stage 4, I get this error when I then try to use $addFields:
PlanExecutor error during aggregation :: caused by :: $arrayToObject requires an object with keys 'k' and 'v', where the value of 'k' must be of type string. Found type: null
This is because the question with question_id 3 has no answers, and I've inadvertently assigned it an empty object using any of the methods mentioned in Stage 4.
I've tried some $ifNull operations, as you can see, to give this question the key-value pairs the stage expects, but I am only successful sometimes, and usually there are other weird side effects.
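For what it's worth, the failing stage is doing the equivalent of a dict comprehension keyed by str(id), and the null 'k' appears exactly when the unwound answers value is missing or null. A pure-Python analogue of the guard (coercing the input to an empty array first, as $ifNull with an empty-array default would):

```python
def answers_to_object(answers):
    # Analogue of {'$arrayToObject': {'$map': {'input': {'$ifNull': ['$answers', []]},
    # 'in': {'k': {'$toString': '$$this.id'}, 'v': '$$this'}}}}:
    # coercing a missing/null value to [] first means no entry ever gets a null key.
    return {str(a["id"]): a for a in (answers or [])}
```

In the pipeline itself, the same idea would be to wrap the $map input as {'$ifNull': ['$answers', []]}, so that unanswered questions yield {} instead of tripping $arrayToObject.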
To summarize: is there a way for me to get only the url property out of the "answers.photos" array, account for the edge case of a question that has no answers, and still structure the answers in key-value pairs as illustrated?
Apologies if this is too long or difficult to read. If there's more formatting I can do to make it better, please let me know. Any help is very much appreciated.
Joe

MongoDB - how to perform consecutive queries?

I have a schema where one field is an array of values. The collection might look something like:
{
_id: 1,
tags: ['a', 'b']
},
{
_id: 2,
tags: ['b', 'a']
},
{
_id: 3,
tags: ['a', 'c']
},
{
_id: 4,
tags: ['c', 'd']
},
{
_id: 5,
tags: ['b', 'e']
}
The user should be able to perform consecutive filter operations. For example:
Filtering for 'a' will return _id: 1, _id: 2, and _id: 3;
A consecutive filter for 'b' will return _id: 1 and _id: 2 (presumably filtering the results of step 1 above).
There might be any number of consecutive filter operations.
What is the best way to structure this filter with MongoDB?
Many thanks for your help.
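One way to think about n consecutive filters is that they collapse into a single query for documents whose tags contain all of the selected values, which is what MongoDB's $all operator expresses: db.collection.find({tags: {$all: ['a', 'b']}}). A small pure-Python check of that equivalence on the sample data above:

```python
docs = [
    {"_id": 1, "tags": ["a", "b"]},
    {"_id": 2, "tags": ["b", "a"]},
    {"_id": 3, "tags": ["a", "c"]},
    {"_id": 4, "tags": ["c", "d"]},
    {"_id": 5, "tags": ["b", "e"]},
]

def filter_all(docs, wanted):
    # Mimic {tags: {$all: wanted}}: every wanted tag must be present.
    return [d["_id"] for d in docs if all(t in d["tags"] for t in wanted)]

print(filter_all(docs, ["a"]))       # [1, 2, 3]
print(filter_all(docs, ["a", "b"]))  # [1, 2]
```

So each new filter step just appends one value to the $all list and re-issues the query, matching the step-by-step results described above.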

Get items of array by index in MongoDB

So I have a data structure in a Mongo collection (v. 4.0.18) that looks something like this…
{
"_id": ObjectId("242kl4j2lk23423"),
"name": "Doug",
"kids": [
{
"name": "Alice",
"age": 15,
},
{
"name": "James",
"age": 13,
},
{
"name": "Michael",
"age": 10,
},
{
"name": "Sharon",
"age": 8,
}
]
}
In Mongo, how would I get back a projection of this object with only the first two kids? I want the output to look like this:
{
"_id": ObjectId("242kl4j2lk23423"),
"name": "Doug",
"kids": [
{
"name": "Alice",
"age": 15,
},
{
"name": "James",
"age": 13,
}
]
}
It seems like I should easily be able to get them by index, but I'm not seeing anything in the docs about how to do that. The real-world problem I'm trying to solve has nothing to do with kids, and the array could be quite lengthy. I'm trying to break it up and process it in batches without having to load the whole thing into memory in my application.
EDIT (non-sequential indexes):
I noticed that since I asked about item 1 & 2 that $slice would suffice…however, what if I wanted items 1 & 3? Is there a way I can specify specific array indexes to return?
Any ideas or pointers for how to accomplish that?
Thanks!
You are looking for the $slice projection operator if the desired elements are adjacent to each other.
https://docs.mongodb.com/manual/reference/operator/projection/slice/
This would return the first two:
client.db.collection.find({"name":"Doug"}, { "kids": { "$slice": 2 } })
returns
{'_id': ObjectId('5f85f682a45e15af3a907f51'), 'name': 'Doug', 'kids': [{'name': 'Alice', 'age': 15}, {'name': 'James', 'age': 13}]}
This would skip the first kid and return the next two (the second and third):
client.db.collection.find({"name":"Doug"}, { "kids": { "$slice": [1, 2] } })
returns
{'_id': ObjectId('5f85f682a45e15af3a907f51'), 'name': 'Doug', 'kids': [{'name': 'James', 'age': 13}, {'name': 'Michael', 'age': 10}]}
Edit:
Arbitrary selections like 1 and 3 probably need to go through an aggregation pipeline rather than a simple query. The performance shouldn't be much different, assuming you have an index on the $match field.
The steps of your pipeline should be pretty obvious, and you should be able to take it from here.
I hate to point to RTFM, but it's going to be super helpful here to at least be acquainted with the pipeline operations:
https://docs.mongodb.com/manual/reference/operator/aggregation/
Your pipeline should:
$match on your desired query
$set a new field kid_selection to element 1 (the second element) and element 3 (the fourth element), since counting starts at 0. Notice the $ prefix on the "kids" key name in the kid_selection setter: when referencing a key of the document you're working on, you need to prefix it with $
$project the whole document, minus the original kids field that we've selected from
client.db.collection.aggregate([
{"$match":{"name":"Doug"}},
{"$set": {"kid_selection": [
{ "$arrayElemAt": [ "$kids", 1 ] },
{ "$arrayElemAt": [ "$kids", 3 ] }
]}},
{ "$project": { "kids": 0 } }
])
returns
{
'_id': ObjectId('5f86038635649a988cdd2ade'),
'name': 'Doug',
'kid_selection': [
{'name': 'James', 'age': 13},
{'name': 'Sharon', 'age': 8}
]
}
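For comparison, the same arbitrary-index selection done client-side (once the document is already fetched) is just list indexing, which is what the two $arrayElemAt expressions compute on the server:

```python
doc = {"name": "Doug", "kids": [
    {"name": "Alice", "age": 15}, {"name": "James", "age": 13},
    {"name": "Michael", "age": 10}, {"name": "Sharon", "age": 8},
]}
# Pick elements 1 and 3 (zero-based), like the $arrayElemAt pair above.
kid_selection = [doc["kids"][i] for i in (1, 3)]
print([k["name"] for k in kid_selection])  # ['James', 'Sharon']
```

The point of the pipeline version is to avoid transferring the whole (possibly lengthy) array before slicing.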

Fetch a certain amount of array using Mongoose and Mongo [duplicate]

Let's say I've got a collection People in mongo database:
[{
id: 1,
name: "Tom",
animals: ["cat", "dog", "fish", "bear"]
},
{
id: 2,
name: "Rob",
animals: ["shark", "snake", "fish", "bear", "panda"]
},
{
id: 3,
name: "Matt",
animals: ["cat", "fish", "bear"]
}]
For the purposes of a REST API I need to create a pagination system for viewing people's animals and return only 3 per request. So, for example, if you go to /people/2 the API should return this array:
["shark", "snake", "fish"]
I'm trying to get this result using Mongo methods. Here's my attempt:
db.getCollection('people').find({id: 2}, {animals: 1, _id:0}, {limit: 3})
Unfortunately it doesn't work like that and returns the whole object. Can anybody tell me how to do it?
For your problem you need the $slice projection operator instead of limit. The latter limits the number of documents returned as a result of the query; the $slice operator, by contrast, is intended for exactly what you need.
Here is an example how to use it in your use case:
> db.getCollection('people').find({id: 2}, {_id: 0, animals: {$slice: [0, 3]}})
{
"id" : 2,
"name" : "Rob",
"animals" : [
"shark",
"snake",
"fish"
]
}
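The [skip, limit] form of $slice maps directly onto Python list slicing, so a pagination endpoint only needs to compute the skip from the requested page. A small sketch of that mapping (hypothetical helper, 1-based pages of 3):

```python
def page_slice(page, per_page=3):
    # Return the [skip, limit] pair for a $slice projection on page N (1-based).
    return [(page - 1) * per_page, per_page]

animals = ["shark", "snake", "fish", "bear", "panda"]
skip, limit = page_slice(1)
print(animals[skip:skip + limit])  # ['shark', 'snake', 'fish']
```

Page 2 would yield the pair [3, 3], i.e. {animals: {$slice: [3, 3]}}, returning the remaining ["bear", "panda"].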

Add a new field with large number of rows to existing collection in Mongodb

I have an existing collection with close to 1 million docs, and now I'd like to append new field data to this collection. (I'm using PyMongo.)
For example, my existing collection db.actions looks like:
...
{'_id':12345, 'A': 'apple', 'B': 'milk'}
{'_id':12346, 'A': 'pear', 'B': 'juice'}
...
Now I want to append new field data to this existing collection:
...
{'_id':12345, 'C': 'beef'}
{'_id':12346, 'C': 'chicken'}
...
such that the resulting collection should look like this:
...
{'_id':12345, 'A': 'apple', 'B': 'milk', 'C': 'beef'}
{'_id':12346, 'A': 'pear', 'B': 'juice', 'C': 'chicken'}
...
I know we can do this with update_one in a for loop, e.g.:
for doc in values:
collection.update_one({'_id': doc['_id']},
{'$set': {k: doc[k] for k in fields}},
upsert=True
)
where values is a list of dictionaries, each containing two items: the _id key-value pair and the new field key-value pair. fields contains all the new fields I'd like to add.
However, the issue is that I have a million docs to update, and anything with a for loop is way too slow. Is there a way to append this new field faster? Something similar to insert_many, except appending to an existing collection?
===============================================
Update 1:
So this is what I have for now:
bulk = self.get_collection().initialize_unordered_bulk_op()
for doc in values:
bulk.find({'_id': doc['_id']}).update_one({'$set': {k: doc[k] for k in fields} })
bulk.execute()
I first wrote a sample dataframe into the db with insert_many; the performance:
Time spent in insert_many: total: 0.0457min
Then I used update_one with a bulk operation to add two extra fields to the collection, and got:
Time spent: for loop: 0.0283min | execute: 0.0713min | total: 0.0996min
Update 2:
I added an extra column to both the existing collection and the new column data, for the purpose of using left join to solve this. If you use left join you can ignore the _id field.
For example, my existing collection db.actions looks like:
...
{'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'}
{'A': 'pear', 'B': 'juice', 'dateTime': '2017-12-15 06:10:50'}
{'A': 'orange', 'B': 'pop', 'dateTime': '2017-12-15 16:09:10'}
...
Now I want to append a new column field data to this existing collection:
...
{'C': 'beef', 'dateTime': '2017-10-12 09:08:20'}
{'C': 'chicken', 'dateTime': '2017-12-15 22:40:00'}
...
such that the resulting collection should look like this:
...
{'A': 'apple', 'B': 'milk', 'C': 'beef', 'dateTime': '2017-10-12'}
{'A': 'pear', 'B': 'juice', 'C': 'chicken', 'dateTime': '2017-12-15'}
{'A': 'orange', 'B': 'pop', 'C': 'chicken', 'dateTime': '2017-12-15'}
...
If your updates are really unique per document, there is nothing faster than the bulk write API. Neither MongoDB nor the driver can guess what you want to update, so you will need to loop through your update definitions and then batch your bulk changes, which is pretty much described here:
Bulk update in Pymongo using multiple ObjectId
The "unordered" bulk writes can be slightly faster (although in my tests they weren't), but I'd still vote for the ordered approach, mainly for error-handling reasons.
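In current PyMongo the deprecated initialize_unordered_bulk_op() pattern maps onto collection.bulk_write() with a list of UpdateOne operations. The batching logic itself can be sketched without a live server (helper name is made up; the commented lines show the equivalent bulk_write call):

```python
def build_set_updates(values, fields):
    # One (filter, update) pair per document, ready to wrap in
    # pymongo.UpdateOne(...) and send in a single collection.bulk_write(ops) call.
    return [({"_id": doc["_id"]}, {"$set": {k: doc[k] for k in fields}})
            for doc in values]

values = [{"_id": 12345, "C": "beef"}, {"_id": 12346, "C": "chicken"}]
ops = build_set_updates(values, ["C"])
# With a live connection (assumed collection name):
#   from pymongo import UpdateOne
#   collection.bulk_write([UpdateOne(f, u) for f, u in ops], ordered=True)
print(ops[0])  # ({'_id': 12345}, {'$set': {'C': 'beef'}})
```

The win over the plain for loop is one round trip (per batch) instead of one per document; the server still applies one update per document.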
If, however, you can group your changes into specific recurring patterns, then you're certainly better off defining a bunch of update queries (effectively one update per unique value in your dictionary) and then issuing each of those against a number of documents. My Python is too poor at this point to write that entire code for you, but here's a pseudocode example of what I mean:
Let's say you've got the following update dictionary:
{
key: "doc1",
value:
[
{ "field1", "value1" },
{ "field2", "value2" },
]
}, {
key: "doc2",
value:
[
// same fields again as for "doc1"
{ "field1", "value1" },
{ "field2", "value2" },
]
}, {
key: "doc3",
value:
[
{ "someotherfield", "someothervalue" },
]
}
then instead of updating the three documents separately, you would send one update for the first two documents (since they require identical changes) and then one update for "doc3". The more knowledge you have upfront about the structure of your update patterns, the more you can optimize, even by grouping updates of subsets of fields, but that's probably getting a little complicated at some point...
UPDATE:
As per your request below, let's give it a shot.
fields = ['C']
values = [
{'_id': 'doc1a', 'C': 'v1'},
{'_id': 'doc1b', 'C': 'v1'},
{'_id': 'doc2a', 'C': 'v2'},
{'_id': 'doc2b', 'C': 'v2'}
]
print('before transformation:')
for doc in values:
    print('_id ' + doc['_id'])
    for k in fields:
        print(doc[k])
# group the _ids by their new value for C, so each group needs only one update
transposed_values = {}
for doc in values:
    transposed_values.setdefault(doc['C'], []).append(doc['_id'])
print('after transformation:')
for k, v in transposed_values.items():
    print(k, v)
# one update_many per distinct value, instead of one update_one per document
for k, v in transposed_values.items():
    collection.update_many({'_id': {'$in': v}}, {'$set': {'C': k}})
Since your join collection has fewer documents, you can convert the dateTime to a date:
db.new.find().forEach(function(d){
d.date = d.dateTime.substring(0,10);
db.new.update({_id : d._id}, d);
})
and do a multiple-field lookup based on date (a substring of dateTime) and _id,
and $out to a new collection (enhanced):
db.old.aggregate(
[
{$lookup: {
from : "new",
let : {id : "$_id", date : {$substr : ["$dateTime", 0, 10]}},
pipeline : [
{$match : {
$expr : {
$and : [
{$eq : ["$$id", "$_id"]},
{$eq : ["$$date", "$date"]}
]
}
}},
{$project : {_id : 0, C : "$C"}}
],
as : "newFields"
}
},
{$project : {
_id : 1,
A : 1,
B : 1,
C : {$arrayElemAt : ["$newFields.C", 0]},
date : {$substr : ["$dateTime", 0, 10]}
}},
{$out : "enhanced"}
]
).pretty()
Result:
> db.enhanced.find()
{ "_id" : 12345, "A" : "apple", "B" : "milk", "C" : "beef", "date" : "2017-10-12" }
{ "_id" : 12346, "A" : "pear", "B" : "juice", "C" : "chicken", "date" : "2017-12-15" }
{ "_id" : 12347, "A" : "orange", "B" : "pop", "date" : "2017-12-15" }
>