Most frequent word in MongoDB collection - mongodb

I have a MongoDB collection where each entry has a product field containing an array of strings. I would like to find the most frequent word in the whole collection. Any ideas on how to do that?
Here is a sample object:
{
"_id" : ObjectId("55e02d333b88f425f84191af"),
"product" : [
" x bla y "
],
"hash_key" : "ecfe355b2f45dfbaf361cff4d314d4cc",
"price" : [
"z"
],
"image" : "image_url"
}
Looking at the sample object, I would like to count "x", "bla" and "y" individually.

I recently had to do something similar. I had a collection of objects and each object had a list of keywords. To count the frequency of each keyword, I used the following aggregation pipeline, which relies on the $accumulator group operator introduced in MongoDB 4.4.
db.collectionname.aggregate([
  {$match: {available: true}}, // Some criteria to filter the documents
  {$project: {_id: 0, keywords: 1}}, // Only keep keywords
  {$group: {
    _id: null,
    keywords: { // Accumulate all keywords into one array
      $accumulator: {
        init: function() { return []; },
        accumulate: function(state, value) { return state.concat(value); },
        accumulateArgs: ["$keywords"],
        merge: function(state1, state2) { return state1.concat(state2); },
        lang: "js"
      }
    }
  }},
  {$unwind: "$keywords"}, // Emit one document per keyword
  {$group: {_id: "$keywords", freq: {$sum: 1}}}, // Group keywords and count frequencies
  {$sort: {freq: -1}}, // Sort by frequency, descending
  {$limit: 5} // Take the first five
])
I have no idea if this is the most efficient solution. However, it solved the problem for me.
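The pipeline above counts keywords that are already separate array elements, while the question stores whole strings such as " x bla y " that still need splitting into words. The following is only a minimal client-side sketch of that counting logic in plain JavaScript, run on documents that have already been fetched. The helper name topWords and the second sample document are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: counts words across the `product` arrays of plain document objects.
function topWords(docs, limit) {
  const freq = new Map();
  for (const doc of docs) {
    for (const text of doc.product || []) {
      // Split on runs of whitespace and drop the empty strings left by leading/trailing padding.
      for (const word of text.split(/\s+/).filter(Boolean)) {
        freq.set(word, (freq.get(word) || 0) + 1);
      }
    }
  }
  return [...freq.entries()]
    .map(([word, count]) => ({ _id: word, freq: count }))
    .sort((a, b) => b.freq - a.freq) // most frequent first
    .slice(0, limit);
}

// The first document mirrors the question's sample; the second is made up for the demo.
const sampleDocs = [
  { product: [" x bla y "] },
  { product: ["bla z"] },
];
console.log(topWords(sampleDocs, 1));
```

On the server side, a similar effect can probably be reached by applying $split (available since MongoDB 3.4) to each string before the final $unwind and $group, keeping in mind that splitting " x bla y " on a single space also yields empty strings that would have to be filtered out.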

Related

MongoDB Aggregate a sub field of an array (Array is main field)

I have a database in MongoDB. There are three main fields: _id, Reviews, HotelInfo. Reviews and HotelInfo are arrays. Reviews has a field called Author. I would like to print out every author name (once) and the number of times it appears within the dataset.
I tried:
db.reviews.aggregate( {$group : { _id : '$Reviews.Author', count : {$sum : 1}}} ).pretty()
A part of the results were:
"_id" : [
"VijGuy",
"Stephtastic",
"dakota431",
"luce_sociator",
"ArkansasMomOf3",
"ccslilqt6969",
"RJeanM",
"MissDusty",
"sammymd",
"A TripAdvisor Member",
"A TripAdvisor Member"
],
"count" : 1
How it should be:
{ "_id" : "VijGuy", "count" : 1 }, { "_id" : "Stephtastic", "count" : 1 }
I posted the JSON format below.
Any idea on how to do this would be appreciated
JSON Format
Let's assume that this is our collection.
[{
_id: 1,
Reviews: [{Author: 'elad' , txt: 'good'}, {Author: 'chen', txt: 'bad'}]
},
{
_id: 2,
Reviews: [{Author: 'elad', txt : 'nice'}]
}]
To get the data in the shape you want, first use the $unwind stage and then the $group stage.
[{ $unwind: '$Reviews' }, {$group : { _id : '$Reviews.Author', count : {$sum : 1}}}]
After the $unwind stage on the Reviews field, the data in the pipeline will look like this:
{_id:1, Reviews: {Author: 'elad' , txt: 'good'}},
{_id:1, Reviews: {Author: 'chen' , txt: 'bad'}},
{_id:2, Reviews: {Author: 'elad', txt : 'nice'}}
$unwind creates one document for each element of the Reviews array, containing the element itself together with the fields of its parent document. Now it is easy to group: we can use the same $group stage that you wrote and get the results.
After the $group stage our data will look like this:
[{_id: 'elad', count: 2}, {_id: 'chen', count: 1}]
$unwind is a very important pipeline stage in the aggregation framework. It helps us transform complex, nested documents into flat, simple ones, which lets us query the data in different ways.
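To make the two stages concrete, here is a rough plain-JavaScript imitation of what $unwind followed by $group does to the sample collection. The function names unwind and countByAuthor are hypothetical and exist only in this sketch; the real work would of course be done by the server.

```javascript
// Stand-in for {$unwind: '$Reviews'}: one output object per array element,
// each carrying the other fields of its parent document.
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(element => ({ ...doc, [field]: element }))
  );
}

// Stand-in for {$group: {_id: '$Reviews.Author', count: {$sum: 1}}}.
function countByAuthor(unwound) {
  const counts = new Map();
  for (const doc of unwound) {
    const author = doc.Reviews.Author;
    counts.set(author, (counts.get(author) || 0) + 1);
  }
  return [...counts].map(([author, count]) => ({ _id: author, count }));
}

// Same sample collection as in the answer.
const collection = [
  { _id: 1, Reviews: [{ Author: 'elad', txt: 'good' }, { Author: 'chen', txt: 'bad' }] },
  { _id: 2, Reviews: [{ Author: 'elad', txt: 'nice' }] },
];

const unwound = unwind(collection, 'Reviews');
console.log(unwound);
console.log(countByAuthor(unwound));
```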
What's the $unwind operator in MongoDB?

MongoDB find closest match

I'm wondering if it is possible to access a document in MongoDB via closest match.
e.g. my search query always contains:
name
country
city
Following rules are in place:
1. name always has to match
2. if either country or city is present, country has a higher priority
3. if country or city does not match, only consider the document if that field has the default value (e.g. for String: "")
example Query:
name = "Test"
country = "USA"
city = "Seattle"
Documents:
db.stuff.insert([
{
name:"Test",
country:"",
city:"Seattle"
},{
name:"Test3",
country:"USA",
city:"Seattle"
},{
name:"Test",
country:"USA",
city:""
},{
name:"Test",
country:"Germany",
city:"Seattle"
},{
name:"Test",
country:"USA",
city:"Washington"
}
])
It should return the 3rd document
thanks!
Given the uncertain requirements and contradictory updates, this answer is more of a guideline addressing the "is it possible at all" part.
The example should be adjusted to match the actual requirements.
db.stuff.aggregate([
{$match: {name: "Test"}}, // <== the fields that should always match
{$facet: {
matchedBoth: [
{$match: {country: "USA", city: "Seattle"}}, // <== bull's-eye
{$addFields: {weight: 10}} // <== 10 stones
],
matchedCity: [
{$match: {country: "", city: "Seattle"}}, // <== the $match may need to be improved, see below
{$addFields: {weight: 5}}
],
matchedCountry: [
{$match: {country: "USA", city: ""}},
{$addFields: {weight: 0}} // <== weightless, yet still a match
]
// add more rules here, if needed
}},
// get them together. Should list all rules from above
{$project: {doc: {$concatArrays: ["$matchedBoth", "$matchedCity", "$matchedCountry"]}}},
{$unwind: "$doc"}, // <== split them apart
{$sort: {"doc.weight": -1}}, // <== and order by weight, desc
// reshape to retrieve documents in its original format
{$project: {_id: "$doc._id", name: "$doc.name", country: "$doc.country", city: "$doc.city"}}
]);
The least explained part of the question affects how we build the facets, e.g.
{$match: {country: "", city: "Seattle"}}
matches all documents where country is explicitly present and is an empty string.
It very well might be
{$match: {country: {$ne: "USA"}, city: "Seattle"}}
to get all documents with matching name and city and any country/no country, or even
{$match: {$and: [{$or: [{country: null}, {country: ""}]}, {city: "Seattle"}]}}
etc.
Here is a query
db.collection.aggregate([
{$match: {name:"Test"}},
{$project: {
name:"$name",
country: "$country",
city:"$city",
countryMatch: {$cond: [{$eq:["$country", "USA"]}, true, false]},
cityMatch: {$cond:[{$eq:["$city", "Seattle"]}, true, false]}
}},
{$match: {$and: [
{$or:[{countryMatch:true},{country:""}]},
{$or:[{cityMatch:true},{city:""}]}
]}},
{$sort: {countryMatch:-1, cityMatch:-1}},
{$project: {name:"$name", country:"$country", city:"$city"}}
])
Explanation:
The first match filters out docs whose name does not match (rule #1: name always has to match).
The next projection selects the doc fields plus some information about country and city matches. We need it to filter and sort the documents further.
The second match filters out documents where country or city neither matches nor holds the default value (rule #3).
Sorting then moves country matches before city matches, as rule #2 states. Finally, the last projection selects the required fields.
Output:
{
_id: 3,
name : "Test",
country : "USA",
city : ""
},
{
_id: 1,
name : "Test",
country : "",
city : "Seattle"
}
You can limit the query results (for example with a final {$limit: 1} stage) to get only the closest match.
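For readers who want to check the rules without a database, the same filter-then-sort idea can be sketched in plain JavaScript on the sample documents from the question. The helper closestMatch is hypothetical, and it assumes that the empty string is the only default value.

```javascript
// Hypothetical helper mirroring the three rules; "" is assumed to be the default value.
function closestMatch(docs, query) {
  const acceptable = (value, wanted) => value === wanted || value === "";
  const ranked = docs
    .filter(d => d.name === query.name) // rule 1: name always has to match
    .filter(d => acceptable(d.country, query.country) &&
                 acceptable(d.city, query.city)) // rule 3: mismatch only allowed on defaults
    .map(d => ({
      doc: d,
      countryMatch: d.country === query.country,
      cityMatch: d.city === query.city,
    }))
    // rule 2: country matches sort ahead of city matches
    .sort((a, b) => (b.countryMatch - a.countryMatch) || (b.cityMatch - a.cityMatch));
  return ranked.length > 0 ? ranked[0].doc : null;
}

// Sample documents from the question.
const stuff = [
  { name: "Test", country: "", city: "Seattle" },
  { name: "Test3", country: "USA", city: "Seattle" },
  { name: "Test", country: "USA", city: "" },
  { name: "Test", country: "Germany", city: "Seattle" },
  { name: "Test", country: "USA", city: "Washington" },
];

console.log(closestMatch(stuff, { name: "Test", country: "USA", city: "Seattle" }));
```

Taking only the first element of the sorted list plays the same role as limiting the aggregation to one result.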

MongoDB aggregation/map-reduce

I'm new to MongoDB and I need to do an aggregation which seems to me quite difficult. A document looks something like this
{
"_id" : ObjectId("568192aef8bd6b0cd0f649c6"),
"conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
"prism:aggregationType" : "Conference Proceeding",
"children-id" : [
"SCOPUS_ID:84948148564",
"SCOPUS_ID:84927603733",
"SCOPUS_ID:84943521758",
"SCOPUS_ID:84905234683",
"SCOPUS_ID:84876113709"
],
"dc:identifier" : "SCOPUS_ID:84867598678"
}
The example contains just the fields I need in the aggregation. prism:aggregationType can have 5 different values (conference proceeding, book, journal etc.). children-id says that this document is cited by an array of other documents (SCOPUS_ID is a unique ID for each document).
What I want to do is group first by conference, and then for each conference find out how many citing documents there are for each prism:aggregationType (only counts greater than 0).
For example, let's say there are 100 documents that have the conference from above. These 100 documents are cited by 250 documents. I want to know how many of these 250 documents have "prism:aggregationType" : "Conference Proceeding", "prism:aggregationType" : "Journal" etc.
An output could look like this:
{
"conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
"aggregationTypes" : [{"Conference Proceeding" : 50} , {"Journal" : 200}]
}
It is not important if it is done with aggregation pipeline or map-reduce.
EDIT
Is there any way to combine these 2 into one aggregation:
db.articles.aggregate([
{ $match:{
conference : {$ne : null}
}},
{$unwind:'$children-id'},
{$group: {
_id: {conference: '$conference'},
'cited-by':{$push:{'dc:identifier':"$children-id"}}
}}
]);
db.articles.find( { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733'] } }, {'prism:aggregationType':1} );
In the query I want to replace the array from $in with the array created with $push
Please try this one through aggregation
db.collections.aggregate([
// 1. get the size of `children-id` array through $project
{$project: {
conference: 1,
IEEE1: 1,
'prism:aggregationType': 1,
'children-id': {$size: '$children-id'}
}},
// 2. group by `conference` and `prism:aggregationType` and sum the size of `children-id`
{$group: {
_id: {
conference:'$conference',
aggregationType: '$prism:aggregationType'
},
ids: {$sum: '$children-id'}
}},
// 3. group by `conference` and collect one entry per aggregation type with its summed ids size
// ($push must be the top-level accumulator here; $group does not accept $cond in that position)
{$group: {
_id: '$_id.conference',
aggregationTypes: {
$push: {aggregationType: '$_id.aggregationType', count: '$ids'}
}
}}
]);
As we discussed in chat, using $lookup in the aggregation pipeline is unfortunately tied to MongoDB 3.2, which is not an option here: the R driver can only use MongoDB 2.6, and the source documents are in more than one collection.
The code I wrote in the EDIT section is also the final result I came up with (slightly modified):
db.articles.aggregate([
{ $match:{
conference : {$ne : null}
}},
{$unwind:'$children-id'},
{$group: {
_id: '$conference',
'cited-by':{$push:"$children-id"}
}}
]);
db.articles.find( { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733'] } }, {'prism:aggregationType':1} );
The result will look like this for each conference:
{
"_id" : "Annual Conference on Privacy, Security and Trust",
"cited-by" : [
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84928151617",
"SCOPUS_ID:84939229259",
"SCOPUS_ID:84946407175",
"SCOPUS_ID:84933039513",
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84942607254",
"SCOPUS_ID:84948165954",
"SCOPUS_ID:84926379258",
"SCOPUS_ID:84946771354",
"SCOPUS_ID:84944223683",
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84939169499",
"SCOPUS_ID:84947104346",
"SCOPUS_ID:84948764343",
"SCOPUS_ID:84938075139",
"SCOPUS_ID:84946196118",
"SCOPUS_ID:84930820238",
"SCOPUS_ID:84947785321",
"SCOPUS_ID:84933496680",
"SCOPUS_ID:84942789431"
]
}
I iterate through all the documents I get (around 250) and then use each cited-by array inside $in. I have an index on dc:identifier, so it works instantly.
$lookup could be an alternative to get this done within the aggregation pipeline, but the packages in R do not support versions above 2.6.
Thank you for your time anyway :)
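The second step described above (resolving each cited-by array and counting the types of the citing documents) could look roughly like the following. This is only a sketch in plain JavaScript on in-memory arrays: the function countCitingTypes and the identifiers in the demo data are made up for illustration, the articles array stands in for the collection that find() would query, and duplicates in cited-by are collapsed because a find() with $in returns each matching document only once.

```javascript
// Hypothetical helper: `grouped` mimics the aggregate output ({_id, 'cited-by'}),
// `articles` mimics the documents that find() with $in would search.
function countCitingTypes(grouped, articles) {
  // Lookup table by dc:identifier, playing the role of the database index.
  const typeById = new Map(
    articles.map(a => [a['dc:identifier'], a['prism:aggregationType']])
  );
  return grouped.map(group => {
    const counts = {};
    // A Set drops repeated IDs, like $in matching each document once.
    for (const id of new Set(group['cited-by'])) {
      const type = typeById.get(id);
      if (type === undefined) continue; // citing document is not in the collection
      counts[type] = (counts[type] || 0) + 1;
    }
    return { conference: group._id, aggregationTypes: counts };
  });
}

// Made-up demo data, not taken from the real dataset.
const demoGrouped = [
  { _id: 'Demo Conference', 'cited-by': ['ID:1', 'ID:2', 'ID:2', 'ID:3', 'ID:9'] },
];
const demoArticles = [
  { 'dc:identifier': 'ID:1', 'prism:aggregationType': 'Journal' },
  { 'dc:identifier': 'ID:2', 'prism:aggregationType': 'Conference Proceeding' },
  { 'dc:identifier': 'ID:3', 'prism:aggregationType': 'Journal' },
];

console.log(countCitingTypes(demoGrouped, demoArticles));
```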

MongoDB: calculate average value for the document & then do the same thing across entire collection

I have collection of documents (Offers) with subdocuments (Salary) like this:
{
_id: ObjectId("zzz"),
sphere: ObjectId("xxx"),
region: ObjectId("yyy"),
salary: {
start: 10000,
end: 50000
}
}
I want to calculate the average salary for some region and sphere across the entire collection. I wrote a query for this and it works, but it only takes the salary start value into account.
db.offer.aggregate(
[
{$match:
{$and: [
{"salary.start": {$gt: 0}},
{region: ObjectId("xxx")},
{sphere: ObjectId("yyy")}
]}
},
{$group: {_id: null, avg: {$avg: "$salary.start"}}}
]
)
But first I want to calculate the average salary (of start and end) for each offer. How can I do this?
Update.
If the value for "salary.end" may be missing in your data, you need to add one more "$project" stage that replaces a missing "salary.end" with the existing "salary.start". Otherwise the average will be calculated incorrectly, because documents lacking a "salary.end" value are ignored.
db.offer.aggregate([
{$match:
{$and: [
{"salary.start": {$gt: 0}},
{"region": ObjectId("xxx")},
{"sphere": ObjectId("yyy")}
]}
},
{$project:{"_id":1,
"sphere":1,
"region":1,
"salary.start":1,
"salary.end": {$ifNull: ["$salary.end", "$salary.start"]}
}
},
{$project:{"_id":1,
"sphere":1,
"region":1,
"avg_salary":{$divide:[
{$add:["$salary.start","$salary.end"]}
,2
]}}},
{$group:{"_id":{"sphere":"$sphere","region":"$region"},
"avg":{$avg:"$avg_salary"}}}
])
The way you aggregate has to be modified:
1. Match the required region and sphere, where salary.start > 0.
2. Project an extra field for each offer that holds the average of start and end.
3. Group the records with the same region and sphere, and apply the $avg aggregation operator to the avg_salary of each offer in that group to get the average salary.
The Code:
db.offer.aggregate([
{$match:
{$and: [
{"salary.start": {$gt: 0}},
{"region": ObjectId("xxx")},
{"sphere": ObjectId("yyy")}
]}
},
{$project:{"_id":1,
"sphere":1,
"region":1,
"avg_salary":{$divide:[
{$add:["$salary.start","$salary.end"]}
,2
]}}},
{$group:{"_id":{"sphere":"$sphere","region":"$region"},
"avg":{$avg:"$avg_salary"}}}
])
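The arithmetic behind the pipeline (a per-offer midpoint first, then the mean of those midpoints, with start used as a fallback when end is missing) can be checked with a small plain-JavaScript sketch. The function averageSalary and the demo offers are assumptions for illustration only; region and sphere filtering is left out to keep it short.

```javascript
// Hypothetical helper: mean of per-offer midpoints, ignoring offers without a positive start.
function averageSalary(offers) {
  const midpoints = offers
    .filter(o => o.salary && o.salary.start > 0) // same condition as the $match stage
    .map(o => {
      // Fall back to start when end is missing, like the $ifNull projection.
      const end = o.salary.end ?? o.salary.start;
      return (o.salary.start + end) / 2;
    });
  if (midpoints.length === 0) return null; // no qualifying offers
  return midpoints.reduce((sum, m) => sum + m, 0) / midpoints.length;
}

// Made-up demo offers: midpoints are 30000 and 20000, the third offer is filtered out.
const demoOffers = [
  { salary: { start: 10000, end: 50000 } },
  { salary: { start: 20000 } },
  { salary: { start: 0, end: 99999 } },
];

console.log(averageSalary(demoOffers));
```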

Query for field in subdocument

An example of the schema I have:
{ "_id" : 1234,
"dealershipName": "Eric's Mongo Cars",
"cars": [
{"year": 2013,
"make": "10gen",
"model": "MongoCar",
"vin": 3928056,
"mechanicNotes": "Runs great!"},
{"year": 1985,
"make": "DeLorean",
"model": "DMC-12",
"vin": 8056309,
"mechanicNotes": "Great Scott!"}
]
}
I wish to query and return only the "vin" values of the document with "_id : 1234". Any suggestion is much appreciated.
You can use the field selection parameter with dot notation to constrain the output to just the desired field:
db.test.find({_id: 1234}, {_id: 0, 'cars.vin': 1})
Output:
{
"cars" : [
{
"vin" : 3928056
},
{
"vin" : 8056309
}
]
}
Or if you just want an array of vin values you can use aggregate:
db.test.aggregate([
// Find the matching doc
{$match: {_id: 1234}},
// Duplicate it, once per cars element
{$unwind: '$cars'},
// Group it back together, adding each cars.vin value as an element of a vin array
{$group: {_id: '$_id', vin: {$push: '$cars.vin'}}},
// Only include the vin field in the output
{$project: {_id: 0, vin: 1}}
])
Output:
{
"result" : [
{
"vin" : [
3928056,
8056309
]
}
],
"ok" : 1
}
If by "query" you mean getting the values of the vins in JavaScript, you could read the JSON into a string called theString (or any other name) and do something like:
var f = [], obj = JSON.parse(theString);
obj.cars.forEach(function(item) { f.push(item.vin) });
If your JSON above is part of a larger collection, then you'd need an outer loop.
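A possible shape for that outer loop, assuming the larger collection arrives as a JSON array of dealership documents, is sketched below. The function collectVins and the second dealership in the demo string are made up for illustration.

```javascript
// Hypothetical helper: gathers every vin from a JSON array of dealership documents.
function collectVins(jsonText) {
  const dealerships = JSON.parse(jsonText);
  const vins = [];
  dealerships.forEach(function (dealership) {        // outer loop over documents
    (dealership.cars || []).forEach(function (car) { // inner loop over each document's cars
      vins.push(car.vin);
    });
  });
  return vins;
}

// First entry follows the question's sample; the second is invented for the demo.
const demoJson = JSON.stringify([
  { _id: 1234, cars: [{ vin: 3928056 }, { vin: 8056309 }] },
  { _id: 5678, cars: [{ vin: 1111111 }] },
]);

console.log(collectVins(demoJson));
```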