I have an existing collection with close to 1 million docs, and I'd like to append a new field to each of them. (I'm using PyMongo.)
For example, my existing collection db.actions looks like:
...
{'_id':12345, 'A': 'apple', 'B': 'milk'}
{'_id':12346, 'A': 'pear', 'B': 'juice'}
...
Now I want to append new field data to this existing collection:
...
{'_id':12345, 'C': 'beef'}
{'_id':12346, 'C': 'chicken'}
...
such that the resulting collection should look like this:
...
{'_id':12345, 'A': 'apple', 'B': 'milk', 'C': 'beef'}
{'_id':12346, 'A': 'pear', 'B': 'juice', 'C': 'chicken'}
...
I know we can do this with update_one in a for loop, e.g.
for doc in values:
    collection.update_one({'_id': doc['_id']},
                          {'$set': {k: doc[k] for k in fields}},
                          upsert=True)
where values is a list of dictionaries, each containing two items: the _id key-value pair and the new field's key-value pair. fields contains the names of all the new fields I'd like to add.
However, I have a million docs to update, and a for loop that issues one update per document is far too slow. Is there a faster way to append this new field, something similar to insert_many except that it adds to existing documents?
===============================================
Update1:
So this is what I have for now,
bulk = self.get_collection().initialize_unordered_bulk_op()
for doc in values:
    bulk.find({'_id': doc['_id']}).update_one({'$set': {k: doc[k] for k in fields}})
bulk.execute()
I first wrote a sample dataframe into the db with insert_many. The performance:
Time spent in insert_many: total: 0.0457min
Then I used update_one with a bulk operation to add two extra fields to the collection, and I got:
Time spent: for loop: 0.0283min | execute: 0.0713min | total: 0.0996min
Update2:
I added an extra column to both the existing collection and the new column data, for the purpose of using left join to solve this. If you use left join you can ignore the _id field.
For example, my existing collection db.actions looks like:
...
{'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'}
{'A': 'pear', 'B': 'juice', 'dateTime': '2017-12-15 06:10:50'}
{'A': 'orange', 'B': 'pop', 'dateTime': '2017-12-15 16:09:10'}
...
Now I want to append new field data to this existing collection:
...
{'C': 'beef', 'dateTime': '2017-10-12 09:08:20'}
{'C': 'chicken', 'dateTime': '2017-12-15 22:40:00'}
...
such that the resulting collection should look like this:
...
{'A': 'apple', 'B': 'milk', 'C': 'beef', 'dateTime': '2017-10-12'}
{'A': 'pear', 'B': 'juice', 'C': 'chicken', 'dateTime': '2017-12-15'}
{'A': 'orange', 'B': 'pop', 'C': 'chicken', 'dateTime': '2017-12-15'}
...
If your updates are really unique per document, there is nothing faster than the bulk write API. Neither MongoDB nor the driver can guess what you want to update, so you will need to loop through your update definitions and batch them into bulk writes, which is pretty much what is described here:
Bulk update in Pymongo using multiple ObjectId
"Unordered" bulk writes can be slightly faster (although in my tests they weren't), but I'd still vote for the ordered approach, mainly for error handling reasons.
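The batching described above could look roughly like the following. This is a minimal sketch rather than a tuned implementation: build_update_specs, chunks and apply_in_batches are hypothetical helper names, the batch size of 1000 is an arbitrary assumption, and collection is assumed to be a PyMongo Collection. It relies on Collection.bulk_write with UpdateOne requests, the interface PyMongo offers in place of the older initialize_*_bulk_op builders.

```python
def build_update_specs(values, fields):
    """Turn [{'_id': ..., 'C': ...}, ...] into (filter, update) pairs."""
    return [
        ({'_id': doc['_id']}, {'$set': {k: doc[k] for k in fields}})
        for doc in values
    ]


def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def apply_in_batches(collection, values, fields, batch_size=1000, ordered=True):
    """Send the updates as a series of bulk writes (batch_size is an assumption)."""
    from pymongo import UpdateOne  # imported lazily; needs the driver installed
    for batch in chunks(build_update_specs(values, fields), batch_size):
        requests = [UpdateOne(flt, update) for flt, update in batch]
        collection.bulk_write(requests, ordered=ordered)


# The pure-Python part can be inspected without a database:
values = [{'_id': 12345, 'C': 'beef'}, {'_id': 12346, 'C': 'chicken'}]
specs = build_update_specs(values, ['C'])
print(specs)
```

Passing ordered=False corresponds to the "unordered" variant; whether it is actually faster is worth measuring on your own data.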
If, however, you can group your changes into specific recurring patterns then you're certainly better off defining a bunch of update queries (effectively one update per unique value in your dictionary) and then issue those each targeting a number of documents. My Python is too poor at this point to write that entire code for you but here's a pseudocode example of what I mean:
Let's say you've got the following update dictionary:
{
    key: "doc1",
    value: [
        { "field1": "value1" },
        { "field2": "value2" }
    ]
}, {
    key: "doc2",
    value: [
        // same fields again as for "doc1"
        { "field1": "value1" },
        { "field2": "value2" }
    ]
}, {
    key: "doc3",
    value: [
        { "someotherfield": "someothervalue" }
    ]
}
then instead of updating the three documents separately you would send one update to update the first two documents (since they require the identical changes) and then one update to update "doc3". The more knowledge you have upfront about the structure of your update patterns the more you can optimize that even by grouping updates of subsets of fields but that's probably getting a little complicated at some point...
UPDATE:
As per your request below, let's give it a shot.
fields = ['C']
values = [
    {'_id': 'doc1a', 'C': 'v1'},
    {'_id': 'doc1b', 'C': 'v1'},
    {'_id': 'doc2a', 'C': 'v2'},
    {'_id': 'doc2b', 'C': 'v2'}
]
print('before transformation:')
for doc in values:
    print('_id ' + doc['_id'])
    for k in fields:
        print(doc[k])
# group the _ids by the value they should receive for 'C'
transposed_values = {}
for doc in values:
    transposed_values.setdefault(doc['C'], []).append(doc['_id'])
print('after transformation:')
for k, v in transposed_values.items():
    print(k, v)
# one update_many per distinct value instead of one update per document
for k, v in transposed_values.items():
    collection.update_many({'_id': {'$in': v}}, {'$set': {'C': k}})
Since your join collection has fewer documents, you can convert its dateTime to a date
db.new.find().forEach(function(d){
    d.date = d.dateTime.substring(0,10);
    db.new.update({_id : d._id}, d);
})
and then do a multiple-field lookup based on date (a substring of dateTime) and _id,
writing the output to a new collection (enhanced):
db.old.aggregate(
    [
        {$lookup: {
            from : "new",
            let : {id : "$_id", date : {$substr : ["$dateTime", 0, 10]}},
            pipeline : [
                {$match : {
                    $expr : {
                        $and : [
                            {$eq : ["$$id", "$_id"]},
                            {$eq : ["$$date", "$date"]}
                        ]
                    }
                }},
                {$project : {_id : 0, C : "$C"}}
            ],
            as : "newFields"
        }
        },
        {$project : {
            _id : 1,
            A : 1,
            B : 1,
            C : {$arrayElemAt : ["$newFields.C", 0]},
            date : {$substr : ["$dateTime", 0, 10]}
        }},
        {$out : "enhanced"}
    ]
).pretty()
result
> db.enhanced.find()
{ "_id" : 12345, "A" : "apple", "B" : "milk", "C" : "beef", "date" : "2017-10-12" }
{ "_id" : 12346, "A" : "pear", "B" : "juice", "C" : "chicken", "date" : "2017-12-15" }
{ "_id" : 12347, "A" : "orange", "B" : "pop", "date" : "2017-12-15" }
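For comparison, the join asked for in the question's Update2 can be sketched on the client side. This is only an illustration in Python, assuming (as the question's expected output suggests) that rows are matched on the date part of dateTime alone; left_join_on_date is a hypothetical helper, not a PyMongo function. The pipeline above additionally requires equal _id values, which appears to be why 'orange' ends up without C in its result, whereas a date-only match gives it 'chicken'.

```python
def left_join_on_date(existing, new_rows, field):
    """Attach `field` from new_rows to each existing doc that shares its date."""
    # Index the new rows by the date part (first 10 characters) of dateTime.
    by_date = {row['dateTime'][:10]: row[field] for row in new_rows}
    joined = []
    for doc in existing:
        date = doc['dateTime'][:10]
        merged = {k: v for k, v in doc.items() if k != 'dateTime'}
        merged['date'] = date
        if date in by_date:  # left join: docs without a match are kept as they are
            merged[field] = by_date[date]
        joined.append(merged)
    return joined


existing = [
    {'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'},
    {'A': 'pear', 'B': 'juice', 'dateTime': '2017-12-15 06:10:50'},
    {'A': 'orange', 'B': 'pop', 'dateTime': '2017-12-15 16:09:10'},
]
new_rows = [
    {'C': 'beef', 'dateTime': '2017-10-12 09:08:20'},
    {'C': 'chicken', 'dateTime': '2017-12-15 22:40:00'},
]
joined = left_join_on_date(existing, new_rows, 'C')
for doc in joined:
    print(doc)
```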
>
Say that I have a document:
{ _id: 1, item: "ABC", supplier: "XYZ", price: 10, available: 23 }
and then I run something like
db.products.update(
{ _id: 1, supplier: "XYZ" },
{ stock_value: {$mul: ["price", "available", 0.8] }}
)
to get a document
{ _id: 1, item: "ABC", supplier: "XYZ", price: 10, available: 23, stock_value: 184 }
I'd like to do this without loading everything into the client. And I need to be able to specify a different constant (e.g. the 0.8) for each supplier.
I'm thinking I should just use an aggregation with an $out to the same collection, to overwrite the whole thing when the update is done, but I can't do a separate aggregate() call for each supplier, since each call overwrites the collection and all other suppliers would be dropped. Is there some sort of "in place" aggregation, or a way to append with $out?
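One possible direction, sketched here in Python under a version assumption: servers from MongoDB 4.2 onward accept an aggregation pipeline as the update argument, which lets the new field be computed from existing fields without a round trip through the client. stock_value_update and apply_updates are hypothetical helper names, the factor for supplier "ABC" is made up for illustration, and collection is assumed to be a PyMongo Collection.

```python
def stock_value_update(supplier, factor):
    """Build (filter, update) for one supplier; the update is a pipeline."""
    flt = {'supplier': supplier}
    update = [
        {'$set': {'stock_value': {'$multiply': ['$price', '$available', factor]}}}
    ]
    return flt, update


# Hypothetical per-supplier constants; only 0.8 for "XYZ" comes from the question.
factors = {'XYZ': 0.8, 'ABC': 0.5}
updates = [stock_value_update(s, f) for s, f in factors.items()]


def apply_updates(collection, updates):
    """Run one update_many per supplier (assumes pipeline updates are supported)."""
    for flt, update in updates:
        collection.update_many(flt, update)


print(updates[0])
```

Each call touches only the documents of one supplier, so nothing is overwritten for the others.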
Say Object embeds_many searched_items
Here is the document:
{"_id": { "$oid" : "5320028b6d756e1981460000" },
 "searched_items": [
   {
     "_id": { "$oid" : "5320028b6d756e1981470000" },
     "hotel_id": 127,
     "room_info": [
       {
         "price": 10,
         "amenity_ids": [
           1,
           2
         ]
       },
       {
         "price": 160,
         "amenity_ids": null
       }
     ]
   },
   {
     "_id": { "$oid" : "5320028b6d756e1981480000" },
     "hotel_id": 161,
     "room_info": [
       {
         "price": 400,
         "amenity_ids": [4,5]
       }
     ]
   }
 ]
}
I want to find the "searched_items" having room_info.amenity_ids IN [2,3].
I've tried
object.searched_items.where('room_info.amenity_ids' => [2, 3])
object.searched_items.where('room_info.amenity_ids' => {'$in' => [2,3]})
with no luck.
Mongoid provides an elem_match method for searching within the elements of an Array field,
e.g.
class A
  include Mongoid::Document
  field :some_field, type: Array
end
A.create(some_field: [{id: 'a', name: 'b'}, {id: 'c', name: 'd'}])
A.elem_match(some_field: { :id.in => ["a", "c"] }) # => will return the object
Let me know if you have any other doubts.
Update:
class SearchedHotel
  include Mongoid::Document
  field :hotel_id, type: String
  field :room_info, type: Array
end
SearchedHotel.create(hotel_id: "1", room_info: [{id: 1, amenity_ids: [1,2], price: 600},{id: 2, amenity_ids: [1,2,3], price: 1000}])
SearchedHotel.create(hotel_id: "2", room_info: [{id: 3, amenity_ids: [1,2], price: 600}])
SearchedHotel.elem_match(room_info: {:amenity_ids.in => [1,2]})
Mongoid::Criteria
  selector: {"room_info"=>{"$elemMatch"=>{"amenity_ids"=>{"$in"=>[1, 2]}}}}
  options: {}
  class: SearchedHotel
  embedded: false
And it returns both records. Am I missing something from your question/requirement? If so, do let me know.
It's important to distinguish between top-level queries sent to the MongoDB server and
client-side operations on embedded-documents that are implemented by Mongoid.
This is the underlying confusion between the original question and the answer from @sandeep-kumar and its associated comments.
The original question is all about the where clause on embedded documents after the query result has already been fetched.
The answer from @sandeep-kumar and its comments are all about top-level queries.
The following test covers both, showing how the answers from @sandeep-kumar do work on the examples in your comments,
and also what does and does not work on your original question.
To summarize, Sandeep's answers do work for top-level queries.
Please review your code; if problems remain, please post the exact Ruby code that reproduces the problem.
For your original question, please note that "object" has already been fetched from MongoDB,
and that you can verify this by looking at the log/test.log file.
The subsequent "where" operations are all client-side execution by Mongoid.
Simple "where" clauses do work at the embedded document level.
Complex "where" clauses involving nested array values don't seem to work -
I didn't really expect Mongoid to reimplement '$in' on the client-side.
Knowing that the "object" already has the query result,
and that the association "searched_items" gives you convenient access to the embedded documents,
you can write Ruby code to select what you want as in the following test.
Hope that this helps.
test/unit/my_object_test.rb
require 'test_helper'
require 'pp'
class MyObjectTest < ActiveSupport::TestCase
  def setup
    MyObject.delete_all
    A.delete_all
    SearchedHotel.delete_all
  end
  test "original question with client-side where operation on embedded documents" do
    doc = {"_id"=>{"$oid"=>"5320028b6d756e1981460000"}, "searched_items"=>[{"_id"=>{"$oid"=>"5320028b6d756e1981470000"}, "hotel_id"=>127, "room_info"=>[{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]}, {"_id"=>{"$oid"=>"5320028b6d756e1981480000"}, "hotel_id"=>161, "room_info"=>[{"price"=>400, "amenity_ids"=>[4, 5]}]}]}
    MyObject.create(doc)
    puts
    object = MyObject.first
    <<-EOT.split("\n").each{|line| puts "#{line}:"; eval "pp #{line}"}
object.searched_items.where('hotel_id' => 127).to_a
object.searched_items.where(:hotel_id.in => [127,128]).to_a
object.searched_items.where('room_info.amenity_ids' => {'$in' => [2,3]}).to_a
object.searched_items.where('room_info.amenity_ids'.to_sym.in => [2,3]).to_a
object.searched_items.select{|searched_item| searched_item.room_info.any?{|room_info| room_info['amenity_ids'] && !(room_info['amenity_ids'] & [2,3]).empty?}}.to_a
    EOT
  end
  test "A comment - top-level queries" do
    A.create(some_field: [{id: 'a', name: 'b', tag_ids: [6,7,8]}, {id: 'c', name: 'd'}, tag_ids: [5,6,7]])
    A.create(some_field: [{id: 'a', name: 'b', tag_ids: [1,2,3]}, {id: 'c', name: 'd'}, tag_ids: [2,3,4]])
    puts
    pp A.where('some_field.tag_ids'.to_sym.in => [2,3]).to_a
    pp A.elem_match(some_field: { :tag_ids.in => [2,3,4] }).to_a
  end
  test "SearchedHotel comment - top-level query" do
    s = <<-EOT
[#<SearchedHotel _id: 53253c246d756e49a7030000, hotel_id: \"1\", room_info: [{\"id\"=>1, \"amenity_ids\"=>[1, 2], \"price\"=>600}, {\"id\"=>2, \"amenity_ids\"=>[1, 2, 3], \"price\"=>1000}]>, #<SearchedHotel _id: 53253c246d756e49a7040000, hotel_id: \"2\", room_info: [{\"id\"=>3, \"amenity_ids\"=>[1, 2], \"price\"=>600}]>]
    EOT
    a = eval(s.gsub('#<SearchedHotel ', '{').gsub(/>,/, '},').gsub(/>\]/, '}]').gsub(/_id: \h+, /, ''))
    SearchedHotel.create(a)
    puts
    <<-EOT.split("\n").each{|line| puts "#{line}:"; eval "pp #{line}"}
SearchedHotel.elem_match(room_info: {:amenity_ids.in => [1,2]}).to_a
    EOT
  end
end
$ ruby -Ilib -Itest test/unit/my_object_test.rb
Run options:
# Running tests:
[1/3] MyObjectTest#test_A_comment_-_top-level_queries
[#<A _id: 5359329d7f11ba034b000002, some_field: [{"id"=>"a", "name"=>"b", "tag_ids"=>[1, 2, 3]}, {"id"=>"c", "name"=>"d"}, {"tag_ids"=>[2, 3, 4]}]>]
[#<A _id: 5359329d7f11ba034b000002, some_field: [{"id"=>"a", "name"=>"b", "tag_ids"=>[1, 2, 3]}, {"id"=>"c", "name"=>"d"}, {"tag_ids"=>[2, 3, 4]}]>]
[2/3] MyObjectTest#test_SearchedHotel_comment_-_top-level_query
SearchedHotel.elem_match(room_info: {:amenity_ids.in => [1,2]}).to_a:
[#<SearchedHotel _id: 5359329d7f11ba034b000003, hotel_id: "1", room_info: [{"id"=>1, "amenity_ids"=>[1, 2], "price"=>600}, {"id"=>2, "amenity_ids"=>[1, 2, 3], "price"=>1000}]>,
#<SearchedHotel _id: 5359329d7f11ba034b000004, hotel_id: "2", room_info: [{"id"=>3, "amenity_ids"=>[1, 2], "price"=>600}]>]
[3/3] MyObjectTest#test_original_question_with_client-side_where_operation_on_embedded_documents
object.searched_items.where('hotel_id' => 127).to_a:
[#<SearchedItem _id: 5359329d7f11ba034b000006, hotel_id: 127, room_info: [{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]>]
object.searched_items.where(:hotel_id.in => [127,128]).to_a:
[#<SearchedItem _id: 5359329d7f11ba034b000006, hotel_id: 127, room_info: [{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]>]
object.searched_items.where('room_info.amenity_ids' => {'$in' => [2,3]}).to_a:
[]
object.searched_items.where('room_info.amenity_ids'.to_sym.in => [2,3]).to_a:
[]
object.searched_items.select{|searched_item| searched_item.room_info.any?{|room_info| room_info['amenity_ids'] && !(room_info['amenity_ids'] & [2,3]).empty?}}.to_a:
[#<SearchedItem _id: 5359329d7f11ba034b000006, hotel_id: 127, room_info: [{"price"=>10, "amenity_ids"=>[1, 2]}, {"price"=>160, "amenity_ids"=>nil}]>]
Finished tests in 0.089544s, 33.5031 tests/s, 0.0000 assertions/s.
3 tests, 0 assertions, 0 failures, 0 errors, 0 skips
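The client-side selection used in the last line of the first test (the Ruby select with any? and array intersection) boils down to the following logic, sketched here in Python on plain dictionaries. items_with_amenities is a hypothetical helper name, and the data mirrors the document from the original question with the _id fields left out.

```python
def items_with_amenities(searched_items, wanted):
    """Keep items where any room's amenity_ids shares a value with `wanted`."""
    wanted = set(wanted)
    return [
        item for item in searched_items
        # `or []` treats a null/None amenity_ids as an empty list
        if any(wanted & set(room.get('amenity_ids') or [])
               for room in item.get('room_info', []))
    ]


searched_items = [
    {'hotel_id': 127, 'room_info': [{'price': 10, 'amenity_ids': [1, 2]},
                                    {'price': 160, 'amenity_ids': None}]},
    {'hotel_id': 161, 'room_info': [{'price': 400, 'amenity_ids': [4, 5]}]},
]
matches = items_with_amenities(searched_items, [2, 3])
print([item['hotel_id'] for item in matches])
```

As in the Ruby version, only hotel 127 is selected, because one of its rooms lists amenity 2.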
Suppose I have several documents like so
{
    title: 'blah',
    value: {
        "A": {property: "foo"},
        "B": {property: "bar"},
        "C": {property: "foo"},
        "D": {property: "foo"}
    }
}
{
    title: 'blah2',
    value: {
        "A": {property: "bar"},
        "B": {property: "bar"},
        "C": {property: "bar"},
        "D": {property: "foo"}
    }
}
What MongoDB query will get me all of the documents / hash keys that have {property: "foo"}?
(I know this can be done using js after the query, but can it be done within the query itself?)
The trouble is that there's no wildcard for object keys (see https://jira.mongodb.org/browse/SERVER-267), so you wouldn't be able to do this without listing all of the keys in your "value". That might be an option if you know what all of the keys are, but I imagine you don't.
If you converted "value" to an array rather than an object, you could do a query easily (which would return the documents, not the hash keys).
As the first answer says, there is nothing in the MongoDB query language that would allow you to do this type of query.
You might want to consider altering your schema to make value an array like this:
value: [
    { name : "A", property : "bar" },
    { name : "B", property : "bar" },
    { name : "C", property : "bar" },
    { name : "D", property : "foo" }
]
Then you could index on value.property and run a query on value.property = "foo".
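The schema change suggested above can be sketched in Python as follows. This is an illustration only: value_to_array and names_with_property are hypothetical helpers, and the documents are the two from the question. After the conversion, the filter {'value.property': 'foo'} is what one would pass to find(); picking out the matching names still happens on the client.

```python
def value_to_array(doc):
    """Return a copy of doc with the `value` sub-document turned into a list."""
    converted = dict(doc)
    converted['value'] = [
        {'name': key, **sub} for key, sub in doc['value'].items()
    ]
    return converted


def names_with_property(doc, wanted):
    """List the names (the former hash keys) whose property equals `wanted`."""
    return [item['name'] for item in doc['value'] if item.get('property') == wanted]


docs = [
    {'title': 'blah', 'value': {'A': {'property': 'foo'}, 'B': {'property': 'bar'},
                                'C': {'property': 'foo'}, 'D': {'property': 'foo'}}},
    {'title': 'blah2', 'value': {'A': {'property': 'bar'}, 'B': {'property': 'bar'},
                                 'C': {'property': 'bar'}, 'D': {'property': 'foo'}}},
]
converted = [value_to_array(d) for d in docs]
query = {'value.property': 'foo'}  # filter for find() once value is an array
summary = [(d['title'], names_with_property(d, 'foo')) for d in converted]
print(summary)
```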
I know this question has been asked before, but that's a different scenario.
I'd like to have a collection like this:
{
    "_id" : ObjectId("4c28f62cbf8544c60506f11d"),
    "pk": 1,
    "forums": [{
        "pk": 1,
        "thread_count": 10,
        "post_count": 20,
    }, {
        "pk": 2,
        "thread_count": 5,
        "post_count": 24,
    }]
}
What I want to do is to upsert a "forum" item, incrementing counters or adding an item if it does not exist.
For example to do something like this (I hope it makes sense):
db.mycollection.update({
    "pk": 3,
    "forums.pk": 2
}, {
    "$inc": {"forums.$.thread_count": 1, "forums.$.post_count": 1}
}, true)
and have:
{
    "_id" : ObjectId("4c28f62cbf8544c60506f11d"),
    "pk": 1,
    "forums": [{
        "pk": 1,
        "thread_count": 10,
        "post_count": 20,
    }, {
        "pk": 2,
        "thread_count": 5,
        "post_count": 24,
    }]
},
{
    "_id" : ObjectId("4c28f62cbf8544c60506f11e"),
    "pk": 3,
    "forums": [{
        "pk": 2,
        "thread_count": 1,
        "post_count": 1,
    }]
}
I can surely do it in three steps:
1. Upsert the whole document with a new item
2. addToSet the forum item to the list
3. Increment the forum item counters with the positional operator
That's to say:
db.mycollection.update({pk:3}, {pk:3}, true)
db.mycollection.update({pk:3}, {$addToSet: {forums: {pk:2}}})
db.mycollection.update({pk:3, 'forums.pk': 2}, {$inc: {'forums.$.thread_count': 1, 'forums.$.post_count': 1}})
Are you aware of a more efficient way to do it?
TIA, Germano
As you may have discovered, the positional operator cannot be used in upserts:
The positional operator cannot be combined with an upsert since it requires a matching array element. If your update results in an insert then the "$" will literally be used as the field name.
So you won't be able to achieve the desired result in a single query.
You have to separate the creation of the document from the counter update. Your own solution is on the right track. It can be condensed into the following two queries:
// optionally create the document, including the array
db.mycollection.update({pk:3}, {$addToSet: {forums: {pk:2}}}, true)
// update the counters in the array item
db.mycollection.update({pk:3, 'forums.pk': 2}, {$inc: {'forums.$.thread_count': 1, 'forums.$.post_count': 1}})
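A two-step variant of this idea in PyMongo terms might look like the sketch below. It is not the exact pair of queries above: it tries the positional $inc first and only falls back to a $push with upsert when nothing matched. One reason for that order is that $addToSet compares whole array elements, so a bare {pk: 2} is not considered present once the element carries counters. bump_forum_counters is a hypothetical helper, collection is assumed to be a PyMongo Collection, and the two calls are not atomic, so concurrent writers would need extra care.

```python
def bump_forum_counters(collection, pk, forum_pk):
    """Increment the counters of one forum entry, creating it when missing."""
    # Step 1: positional $inc, which only matches when the array element exists.
    result = collection.update_one(
        {'pk': pk, 'forums.pk': forum_pk},
        {'$inc': {'forums.$.thread_count': 1, 'forums.$.post_count': 1}},
    )
    if result.matched_count:
        return 'incremented'
    # Step 2: no such element yet, so push it with its initial counts,
    # upserting the parent document if it does not exist either.
    collection.update_one(
        {'pk': pk, 'forums.pk': {'$ne': forum_pk}},
        {'$push': {'forums': {'pk': forum_pk, 'thread_count': 1, 'post_count': 1}}},
        upsert=True,
    )
    return 'created'

# Usage, assuming `db` is a connected pymongo Database:
#     bump_forum_counters(db.mycollection, 3, 2)
```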