MongoDB: Aggregate and Match duplicated values in two fields

I have a collection with the following data in it:
{_id: 1, a : 12345, b : [2,3]},
{_id: 2, a : 666},
{_id: 3, a : 6, b : [15, 20]},
{_id: 4, a : 1111, b : [12345, 156]}
I need to find values that appear in both a and b. For the data above, the aggregation should return 12345.
I use the following code to get all the values from b:
db.collection.aggregate([
    {$unwind: "$b"},
    {$project: {_id: "$b"}}
])
But I am stuck on what to do next and how to compare the results with the a field.
Thanks for your help.
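One possible continuation (a sketch, not from the original thread; the collection name collection is assumed): join the collection against itself with $lookup, matching each document's a against the array elements of b in other documents, and keep only the values that found a match.
db.collection.aggregate([
    {$lookup: {
        from: "collection",        // same collection, joined against itself
        localField: "a",           // value to look for
        foreignField: "b",         // equality match against the array elements of b
        as: "matches"
    }},
    {$match: {matches: {$ne: []}}},        // keep only values of a that occur in some b
    {$project: {_id: 0, value: "$a"}}
])
For the sample data this should emit {value: 12345}, since 12345 appears as a in _id: 1 and inside b in _id: 4.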

Related

How to change elements of an array field to values of a dict with a single attribute in MongoDB

I need a way to change the elements of an array field to values of a dict with a single attribute. I don't have to write the result back into my table. I just have to read it that way.
My table has rows like this:
{a: 1, b:[ {...}, ..., {...} ], c: 2}
I need a query that returns each row rewritten like this:
{a: 1, b: [ {foo: { ... }}, ..., {foo: {...}} ], c: 2}
In other words, each element of b becomes a dict with a single attribute, foo.
This feels like a job for $project or $replaceRoot or $set.
I'm using MongoDB 4.2.2 and PyMongo 3.10.1 on Ubuntu 18.04.
You can do that using aggregation:
db.collection.aggregate([
    {$addFields: {b: {$map: {input: '$b', in: {foo: '$$this'}}}}}
])
Test: MongoDB-Playground
Refs: $addFields, $map
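Since the question mentions PyMongo, the same pipeline can also be run from Python. A minimal sketch (the connection, database, and collection names are assumptions, not from the original answer):
from pymongo import MongoClient

db = MongoClient().mydb          # assumed connection and database name

# Same $addFields / $map pipeline as the shell version above, written as Python dicts.
pipeline = [
    {"$addFields": {"b": {"$map": {"input": "$b", "in": {"foo": "$$this"}}}}}
]
for doc in db.collection.aggregate(pipeline):
    print(doc)                   # each doc now has b rewritten as [{'foo': {...}}, ...]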

Add a new field with a large number of rows to an existing collection in MongoDB

I have an existing collection with close to 1 million docs, and now I'd like to append new field data to this collection. (I'm using PyMongo.)
For example, my existing collection db.actions looks like:
...
{'_id':12345, 'A': 'apple', 'B': 'milk'}
{'_id':12346, 'A': 'pear', 'B': 'juice'}
...
Now I want to append a new field of data to this existing collection:
...
{'_id':12345, 'C': 'beef'}
{'_id':12346, 'C': 'chicken'}
...
such that the resulting collection should look like this:
...
{'_id':12345, 'A': 'apple', 'B': 'milk', 'C': 'beef'}
{'_id':12346, 'A': 'pear', 'B': 'juice', 'C': 'chicken'}
...
I know we can do this with update_one in a for loop, e.g.:
for doc in values:
    collection.update_one({'_id': doc['_id']},
                          {'$set': {k: doc[k] for k in fields}},
                          upsert=True)
where values is a list of dictionaries, each containing two items: the _id key-value pair and the new field's key-value pair. fields contains all the new fields I'd like to add.
However, the issue is that I have a million docs to update, and anything with a for loop is way too slow. Is there a way to append this new field faster? Something similar to insert_many, except it appends to an existing collection?
===============================================
Update1:
So this is what I have for now,
bulk = self.get_collection().initialize_unordered_bulk_op()
for doc in values:
    bulk.find({'_id': doc['_id']}).update_one({'$set': {k: doc[k] for k in fields}})
bulk.execute()
I first wrote a sample dataframe into the db with insert_many; the performance:
Time spent in insert_many: total: 0.0457min
Then I used update_one with a bulk operation to add two extra fields to the collection, and I got:
Time spent: for loop: 0.0283min | execute: 0.0713min | total: 0.0996min
Update2:
I added an extra column to both the existing collection and the new column data, for the purpose of using a left join to solve this. If you use a left join, you can ignore the _id field.
For example, my existing collection db.actions looks like:
...
{'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'}
{'A': 'pear', 'B': 'juice', 'dateTime': '2017-12-15 06:10:50'}
{'A': 'orange', 'B': 'pop', 'dateTime': '2017-12-15 16:09:10'}
...
Now I want to append a new field of data to this existing collection:
...
{'C': 'beef', 'dateTime': '2017-10-12 09:08:20'}
{'C': 'chicken', 'dateTime': '2017-12-15 22:40:00'}
...
such that the resulting collection should look like this:
...
{'A': 'apple', 'B': 'milk', 'C': 'beef', 'dateTime': '2017-10-12'}
{'A': 'pear', 'B': 'juice', 'C': 'chicken', 'dateTime': '2017-12-15'}
{'A': 'orange', 'B': 'pop', 'C': 'chicken', 'dateTime': '2017-12-15'}
...
If your updates are really unique per document, there is nothing faster than the bulk write API. Neither MongoDB nor the driver can guess what you want to update, so you will need to loop through your update definitions and then batch your bulk changes, which is pretty much described here:
Bulk update in Pymongo using multiple ObjectId
The "unordered" bulk writes can be slightly faster (although in my tests they weren't) but I'd still vote for the ordered approach for error handling reasons mainly).
If, however, you can group your changes into specific recurring patterns, then you're certainly better off defining a bunch of update queries (effectively one update per unique value in your dictionary) and then issuing those, each targeting a number of documents. My Python is too poor at this point to write that entire code for you, but here's a pseudocode example of what I mean:
Let's say you've got the following update dictionary:
{
    key: "doc1",
    value: [
        { "field1": "value1" },
        { "field2": "value2" }
    ]
},
{
    key: "doc2",
    value: [
        // same fields again as for "doc1"
        { "field1": "value1" },
        { "field2": "value2" }
    ]
},
{
    key: "doc3",
    value: [
        { "someotherfield": "someothervalue" }
    ]
}
Then, instead of updating the three documents separately, you would send one update for the first two documents (since they require identical changes) and then one update for "doc3". The more knowledge you have upfront about the structure of your update patterns, the more you can optimize this, even by grouping updates of subsets of fields, but that's probably getting a little complicated at some point...
UPDATE:
As per your request below, let's give it a shot.
fields = ['C']
values = [
    {'_id': 'doc1a', 'C': 'v1'},
    {'_id': 'doc1b', 'C': 'v1'},
    {'_id': 'doc2a', 'C': 'v2'},
    {'_id': 'doc2b', 'C': 'v2'}
]

print('before transformation:')
for doc in values:
    print('_id ' + doc['_id'])
    for k in fields:
        print(doc[k])

# Invert the mapping: collect all _ids that should receive the same value of C,
# so each distinct value needs only one update_many instead of many update_ones.
transposed_values = {}
for doc in values:
    transposed_values[doc['C']] = transposed_values.get(doc['C'], [])
    transposed_values[doc['C']].append(doc['_id'])

print('after transformation:')
for k, v in transposed_values.items():
    print(k, v)

for k, v in transposed_values.items():
    collection.update_many({'_id': {'$in': v}}, {'$set': {'C': k}})
Since your join collection has fewer documents, you can convert the dateTime to a date:
db.new.find().forEach(function(d){
    d.date = d.dateTime.substring(0, 10);
    db.new.update({_id: d._id}, d);
})
and then do a multiple-field lookup based on date (the substring of dateTime) and _id, and write the result out to a new collection (enhanced):
db.old.aggregate([
    {$lookup: {
        from: "new",
        let: {id: "$_id", date: {$substr: ["$dateTime", 0, 10]}},
        pipeline: [
            {$match: {
                $expr: {
                    $and: [
                        {$eq: ["$$id", "$_id"]},
                        {$eq: ["$$date", "$date"]}
                    ]
                }
            }},
            {$project: {_id: 0, C: "$C"}}
        ],
        as: "newFields"
    }},
    {$project: {
        _id: 1,
        A: 1,
        B: 1,
        C: {$arrayElemAt: ["$newFields.C", 0]},
        date: {$substr: ["$dateTime", 0, 10]}
    }},
    {$out: "enhanced"}
])
Result:
> db.enhanced.find()
{ "_id" : 12345, "A" : "apple", "B" : "milk", "C" : "beef", "date" : "2017-10-12" }
{ "_id" : 12346, "A" : "pear", "B" : "juice", "C" : "chicken", "date" : "2017-12-15" }
{ "_id" : 12347, "A" : "orange", "B" : "pop", "date" : "2017-12-15" }
>

MongoDB Composite Key query with one field unknown

I have a database that has a unique combination of two fields (x and i) for every entry, so I have set the _id field to be {_id: {a: x, b: i}}. Now I want to retrieve all entries that have a certain value for x but any value for i.
Example:
{_id: {a: 1, b: 5}},
{_id: {a: 1, b: 3}},
{_id: {a: 2, b: 5}}
{_id: {a: 3, b: 3}}
Now I want to do something like db.find({_id: {a: 1, b: {$exists: true}}}) or, even easier, db.find({_id: {a: 1}}), which should return:
{_id: {a: 1, b: 5}},
{_id: {a: 1, b: 3}}
Is there any way I could achieve this? Or, in other words, can you query in any way on this composite primary key? Currently I have added the fields to the object itself, but this is not really an optimal solution as my data set gets really large.
Edit:
db.someCollection.find({"_id.a": 1, "_id.b": { $exists: true}})
This seems to be a solution; however, it is just as slow as adding a as a field (not in the key) to the object. Is there a faster method?
Have you tried this?
db.someCollection.find({"_id.a": 1, "_id.b": { $exists: true}})
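A possibly faster variant, not from the original answers (a sketch; it assumes the embedded _id documents always store a before b, and it should be verified with explain()): because the _id index orders embedded documents field by field, the "a equals 1, any b" condition can be expressed as a range on _id, which can be answered from the index instead of inspecting every document.
db.someCollection.find({
    _id: {
        $gte: {a: 1},              // smallest embedded document starting with a: 1
        $lte: {a: 1, b: MaxKey}    // MaxKey sorts after every possible b (MaxKey() in mongosh)
    }
})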

Most frequent word in MongoDB collection

I have a MongoDB collection where each entry has a product field containing a string array. What I would like to do is find the most frequent word in the whole collection. Any ideas on how to do that?
Here is a sample object:
{
    "_id" : ObjectId("55e02d333b88f425f84191af"),
    "product" : [
        " x bla y "
    ],
    "hash_key" : "ecfe355b2f45dfbaf361cff4d314d4cc",
    "price" : [
        "z"
    ],
    "image" : "image_url"
}
Looking at the sample object, what I would like to do is count "x", "bla" and "y" individually.
I recently had to do something similar. I had a collection of objects and each object had a list of keywords. To count the frequency of each keyword, I used the following aggregation pipeline, which uses the MongoDB version 4.4 $accumulator group operation.
db.collectionname.aggregate([
    {$match: {available: true}},          // Some criteria to filter the documents
    {$project: {_id: 0, keywords: 1}},    // Only keep keywords
    {$group:                              // Accumulate all keywords into one array
        {_id: null, keywords:
            {$accumulator: {
                init: function(){ return new Array() },
                accumulate: function(state, value){ return state.concat(value) },
                accumulateArgs: ["$keywords"],
                merge: function(state1, state2){ return state1.concat(state2) },
                lang: "js"}}}},
    {$unwind: "$keywords"},                           // One document per keyword
    {$group: {_id: "$keywords", freq: {$sum: 1}}},    // Group keywords and count frequencies
    {$sort: {freq: -1}},                              // Sort by frequency, descending
    {$limit: 5}                                       // Take the first five
])
I have no idea if this is the most efficient solution. However, it solved the problem for me.
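Note that in the sample document above each product entry is a single whitespace-separated string (" x bla y "), so the strings need to be split into words first. A sketch of that variant, not from the original answer (requires MongoDB 3.4+ for $split; the collection name is an assumption):
db.collectionname.aggregate([
    {$unwind: "$product"},                              // one document per string in the array
    {$project: {words: {$split: ["$product", " "]}}},   // split each string on spaces
    {$unwind: "$words"},                                // one document per word
    {$match: {words: {$ne: ""}}},                       // drop empty strings from extra spaces
    {$group: {_id: "$words", count: {$sum: 1}}},        // count each distinct word
    {$sort: {count: -1}},
    {$limit: 1}                                         // the most frequent word
])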

Query for field in subdocument

An example of the schema I have:
{ "_id" : 1234,
“dealershipName”: “Eric’s Mongo Cars”,
“cars”: [
{“year”: 2013,
“make”: “10gen”,
“model”: “MongoCar”,
“vin”: 3928056,
“mechanicNotes”: “Runs great!”},
{“year”: 1985,
“make”: “DeLorean”,
“model”: “DMC-12”,
“vin”: 8056309,
“mechanicNotes”: “Great Scott!”}
]
}
I wish to query and return only the vin values for _id: 1234. Any suggestion is much appreciated.
You can use the field selection parameter with dot notation to constrain the output to just the desired field:
db.test.find({_id: 1234}, {_id: 0, 'cars.vin': 1})
Output:
{
    "cars" : [
        {
            "vin" : 3928056
        },
        {
            "vin" : 8056309
        }
    ]
}
Or if you just want an array of vin values you can use aggregate:
db.test.aggregate([
    // Find the matching doc
    {$match: {_id: 1234}},
    // Duplicate it, once per cars element
    {$unwind: '$cars'},
    // Group it back together, adding each cars.vin value as an element of a vin array
    {$group: {_id: '$_id', vin: {$push: '$cars.vin'}}},
    // Only include the vin field in the output
    {$project: {_id: 0, vin: 1}}
])
Output:
{
    "result" : [
        {
            "vin" : [
                3928056,
                8056309
            ]
        }
    ],
    "ok" : 1
}
If by query you mean getting the values of the vins in JavaScript, you could read the JSON into a string called theString (or any other name) and do something like:
var f = [], obj = JSON.parse(theString);
obj.cars.forEach(function(item) { f.push(item.vin) });
If your JSON above is part of a larger collection, then you'd need an outer loop.