Exporting nested data from MongoDB

I am trying to export nested fields from MongoDB to a CSV.
From the document below, I would like to extract the scale name fields (e.g. "Security" and "Power") and the raw_score fields (e.g. 2 and 3, respectively). These four values would be stored in four columns of a CSV, one column per extracted field.
"results" : {
"scales" : [
{
"scale" : {
"name" : "Security",
"code" : "SEC",
"multiplier" : 1
},
"raw_score" : 2
},
{
"scale" : {
"name" : "Power",
"code" : "POW",
"multiplier" : -1
},
"raw_score" : 3
}
],
In the past I have been successful using dot notation to extract nested fields (a working example from a previous extraction is below), yet I am unsure how to extract fields that share the same name.
mongoexport -d production_hoganx_collector_061817 -c records --type=csv -o col_liwc_summary_061817.csv -f user_id,post_analysis.liwc_scores.tone
How can I extract the name and raw_score fields using the mongoexport command? I have tried exporting the database to a JSON file and then extracting the data via R; however, this method takes too long to complete.
If mongoexport is not suitable, I am open to hearing alternatives!
Many thanks,

I'm assuming this is a one-time thing, so I suggest using an aggregation to build a new collection with the scales array unwound.
$unwind fans out a document into n documents, where n is the number of elements in the specified array field. So, for example, if you had a document like this one:
{
    name: "Some name",
    email: ["somename@somedomain.com", "name@someotherdomain.com"]
}
An unwind on the email field would result in two documents:
{
    name: "Some name",
    email: "somename@somedomain.com"
},
{
    name: "Some name",
    email: "name@someotherdomain.com"
}
So in your case I think you should use that to unwind your results.scales field like this:
db.collection.aggregate([
    {$match: yourCondition},
    {$unwind: "$results.scales"},
    {$project: {
        _id: false,
        "results.scales": true,
        ... other fields ...
    }},
    {$out: "unwindedcollection"}
]);
At this point you should be able to use mongoexport from the new collection generated (unwindedcollection), using the dot notation you used before.
Be sure to exclude _id (set it to false), otherwise you'll end up with duplicate _id errors: after the unwind, every resulting document still carries the original _id. Leaving it out lets MongoDB generate new ids when the aggregation results are written to the new collection.
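For example, the export from the new collection might look something like this (a sketch only: the database name is reused from your earlier command, user_id is assumed to be among the projected fields, and the field paths must match whatever you keep in the $project stage):
mongoexport -d production_hoganx_collector_061817 -c unwindedcollection --type=csv -o col_scales.csv -f user_id,results.scales.scale.name,results.scales.raw_score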
I'll leave the links to the docs of the concepts I used for this:
aggregate: https://docs.mongodb.com/manual/reference/method/db.collection.aggregate/
$project: https://docs.mongodb.com/manual/reference/operator/aggregation/project/
$unwind: https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/
$out: https://docs.mongodb.com/manual/reference/operator/aggregation/out/

Related

Sort mongodb query on multiple fields

I would like to sort a mongodb query that search for bloggers.
Here the document structure (simplified) of a Blogger :
{
    posts : {
        hashtags : [{
            hashtag : String,
            weight : Number
        }]
    },
    globalMark : Number
}
People can search for bloggers via a text input. E.g. they can write "fashion travel" and click the search button.
As a result, I would like to show the bloggers whose posts contain hashtags matching /fashion/i and /travel/i, sorted by relevancy. The relevancy depends on the globalMark and the hashtag weight.
I know how to list them while ignoring the hashtag weight, but I don't know how to include this weight in my query.
Here is my current query:
Blogger.find({
    "$and" : [{
        "posts.hashtags.hashtag" : {$regex: /fashion/i}
    }, {
        "posts.hashtags.hashtag" : {$regex: /travel/i}
    }]
})
.sort("-globalMark")
How can I handle this weight ?
BIG THANKS !
First thing: MongoDB is not built for LIKE-style searches with regular expressions (if your data is big you will see some latency).
Second thing: the hashtags value needs to be an object, or objects in an array.
Third: you can use
db.collectionName.find({
    $and: [
        {"posts.hashtags.hashtag" : {$regex: /fashion/i}},
        {"posts.hashtags.hashtag" : {$regex: /travel/i}}
    ]
}).sort({
    globalMark: 1, "posts.hashtags.weight": 1
})
I made it work using MongoDB aggregation and the $project pipeline stage. Basically, $project lets you reshape the result document, so I used it to build a score from the different fields and then sorted the aggregation on this computed score.
Here is the doc:
https://docs.mongodb.com/manual/reference/operator/aggregation/project/
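A minimal sketch of that approach (the collection name, the match filter, and the relevancy formula, here simply globalMark plus the summed weights of the matching hashtags, are assumptions rather than the original code):
db.bloggers.aggregate([
    // keep only bloggers with at least one matching hashtag
    { $match: { "posts.hashtags.hashtag": { $in: [/fashion/i, /travel/i] } } },
    { $unwind: "$posts.hashtags" },
    // keep only the matching hashtags themselves
    { $match: { "posts.hashtags.hashtag": { $in: [/fashion/i, /travel/i] } } },
    { $group: {
        _id: "$_id",
        globalMark: { $first: "$globalMark" },
        hashtagWeight: { $sum: "$posts.hashtags.weight" }
    } },
    // build a relevancy score with $project and sort on it
    { $project: { relevancy: { $add: ["$globalMark", "$hashtagWeight"] } } },
    { $sort: { relevancy: -1 } }
]);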

What is MongoDB's equivalent to RethinkDB's r.row?

The project I am working on uses MongoDB, but I usually use RethinkDB. I am trying to write an operation that will take the current value of a document's property and subtract 1 from it. In reQL this would be r.db('db').table('table').get('documentId').update({propToUpdate: r.row('propToUpdate').sub(1)}). How can I perform the same operation in MongoDB?
I think you can use the $inc operator:
db.products.update(
{ sku: "abc123" },
{ $inc: { quantity: -1} }
);
This example is adapted from the MongoDB documentation.
Here "db" refers to the current database (e.g. after use myDatabase), "products" is the name of the collection you want to query, "sku" is the field you are querying on, and "quantity" is the field you want to increment or decrement.

Mongo query - Array of Objects - get field from element 0 only

Trying to query Mongo and get one field of element 0 inside a document, namely emails[0].address.
Here's a sample (truncated) document structure:
{
    "_id" : "dfadgfe266reh",
    "emails" : [
        {
            "address" : "email@domain.com",
            "verified" : false
        }
    ]
}
And my query (truncated) is like this:
{
fields: {
'emails.0.address': {
address: 1
}
}
}
However, when I run this, I get an empty object array, namely emails:[{}]
But if I change the selector to 'emails.address' it will give me the actual email address -- the problem is I only want emails[0].address
What am I doing wrong?
To get the required document you need a projection document with two attributes. The first, as @Veeram mentioned, slices the array, and the second specifies an attribute from the embedded documents in the array. See the code:
db.collection.find( {}, { "emails": { $slice: 1 } , "emails.address": 1} );
You can use the positional $ operator in the projection:
.find({"emails.address":{$exists:true}}, {"emails.$.address":1})

Inline/combine other collection into one collection

I want to combine two mongodb collections.
Basically I have a collection containing documents that reference one document from another collection. Now I want to have this as an inline / nested field instead of a separate document.
So just to provide an example:
Collection A:
[{
"_id":"90A26C2A-4976-4EDD-850D-2ED8BEA46F9E",
"someValue": "foo"
},
{
"_id":"5F0BB248-E628-4B8F-A2F6-FECD79B78354",
"someValue": "bar"
}]
Collection B:
[{
"_id":"169099A4-5EB9-4D55-8118-53D30B8A2E1A",
"collectionAID":"90A26C2A-4976-4EDD-850D-2ED8BEA46F9E",
"some":"foo",
"andOther":"stuff"
},
{
"_id":"83B14A8B-86A8-49FF-8394-0A7F9E709C13",
"collectionAID":"90A26C2A-4976-4EDD-850D-2ED8BEA46F9E",
"some":"bar",
"andOther":"random"
}]
This should result in Collection A looking like this:
[{
"_id":"90A26C2A-4976-4EDD-850D-2ED8BEA46F9E",
"someValue": "foo",
"collectionB":[{
"some":"foo",
"andOther":"stuff"
},{
"some":"bar",
"andOther":"random"
}]
},
{
"_id":"5F0BB248-E628-4B8F-A2F6-FECD79B78354",
"someValue": "bar"
}]
I'd suggest something simple like this from the console:
db.collB.find().forEach(function(doc) {
var aid = doc.collectionAID;
if (typeof aid === 'undefined') { return; } // nothing
delete doc["_id"]; // remove property
delete doc["collectionAID"]; // remove property
db.collA.update({_id: aid}, /* match the ID from B */
{ $push : { collectionB : doc }});
});
It loops through each document in collectionB and, if the field collectionAID is defined, removes the unnecessary properties (_id and collectionAID). Finally, it updates the matching document in collectionA by using the $push operator to add the document from B to the field collectionB. If that field doesn't exist, it is automatically created as an array containing the newly inserted document; if it already exists as an array, the document is appended. (If it exists but isn't an array, the update will fail.) Because the update call isn't using upsert, nothing will happen if the collectionAID value doesn't match any _id in collectionA.
You can extend it to delete other fields as necessary or possibly add more robust error handling if for example a document from B doesn't match anything in A.
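For example, a rough sketch of that kind of error handling (nMatched comes from the WriteResult the legacy shell returns for update):
db.collB.find().forEach(function(doc) {
    var aid = doc.collectionAID;
    if (typeof aid === 'undefined') { return; }
    delete doc["_id"];
    delete doc["collectionAID"];
    var res = db.collA.update({_id: aid}, { $push : { collectionB : doc }});
    if (res.nMatched === 0) {
        print("No document in collA matches " + aid); // B document has no counterpart in A
    }
});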
Running the code above on your data produces this:
{ "_id" : "5F0BB248-E628-4B8F-A2F6-FECD79B78354", "someValue" : "bar" }
{ "_id" : "90A26C2A-4976-4EDD-850D-2ED8BEA46F9E",
"collectionB" : [
{
"some" : "foo",
"andOther" : "stuff"
},
{
"some" : "bar",
"andOther" : "random"
}
],
"someValue" : "foo"
}
Sadly mapreduce can't produce full documents.
https://jira.mongodb.org/browse/SERVER-2517
No idea why, despite all the attention, whining and upvotes, they haven't changed it. So you'll have to do this manually in the language of your choice.
Hopefully you've indexed 'collectionAID', which should improve the speed of your queries. Just write something that goes through your A collection one document at a time, loading the _id and then adding the array from Collection B.
There is a much faster way than https://stackoverflow.com/a/22676205/1578508
You can do it the other way round and run through the collection you want to insert your documents into. (Far fewer executions!)
db.collA.find().forEach(function (x) {
    var collBs = db.collB.find({"collectionAID": x._id}, {"_id": 0, "collectionAID": 0});
    x.collectionB = collBs.toArray();
    db.collA.save(x);
});

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
    'id': ObjectId("50ad8d451d41c8fc58000003"),
    'name': 'Sample Log 1',
    'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1', 'TAG-2'],
        'group_y': ['XYZ']
    }
},
{
    'id': ObjectId("50ad8d451d41c8fc58000004"),
    'name': 'Sample Log 2',
    'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1'],
        'group_y': ['XYZ']
    }
}
> db.cases.findOne()
{
    'id': ObjectId("50ad8d451d41c8fc58000099"),
    'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group pipelines, but as much as possible I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
{$match: {...}}, // some match filters here
{$project: {tag:'$tag_doc.group_x', case:'$case_id', latest:{uploaded_at:1}}},
{$unwind: '$tag'},
{$group: {_id:{tag:'$tag', case:'$case'}, latest: {$max:'$latest'}}},
{$group: {_id:'$_id.tag', total:{$sum:1}}}
)
As I mentioned, what I want can be done with multiple $group pipelines, but this proves to be costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution so I am thinking if the document structure itself is not optimized for my use-case. Do I have to update the fields to support what I want to achieve? Suggestions very much appreciated.
Edit:
I am actually looking for an implementation in mongodb similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? except it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging to matching tags or within a range of dates.
Edit:
Due to the complexity of my use-case I tried to use a simple analogy but this proves to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. It's not possible with $match, though, only with a single $group pipeline. The trick is to use a compound (multi-key) index with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
{ user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id and address, and I want the message with the latest date, we need to create an index key like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to perform the aggregation without a sort stage, which is much faster and will work on shards with replicas. If you don't have an index with the correct sort order, you can add a $sort pipeline stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
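For example, that index could be created like this (ensureIndex was the shell helper at the time; createIndex is the newer name):
db.user_messages.ensureIndex({ user_id: 1, address: 1, date_sent: -1 });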
db.user_messages.aggregate(
    { $match: { user_id: 1 } },
    { $group: {
        _id: "$address",
        count: { $sum : 1 },
        date_sent: { $max : "$date_sent" },
        message: { $first : "$message" }
    } }
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
    { 'author_id': '50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2"] },
    { $set: { 'Referenceid': ObjectId("5152bc79e8bf3bc79a5a1dd8") } }, // or embed your blog post here
    { upsert: true }
)
Hmmm, there is no good way of doing this optimally such that you only need to pick out the latest document for each author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said, this is not optimal; however, it is all I have come up with.
To be honest, if you need to perform this query often, it might actually be better to pre-aggregate another collection that already contains the information you need, in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfil an $in query for what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
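A rough sketch of that upsert step (the collection name post_tags and the shape of the newPost variable are illustrative assumptions):
var newPost = {
    _id: ObjectId(),
    author_id: ObjectId("50ad8d451d41c8fc58000099"),
    tag_doc: { tags: ["TAG-1", "TAG-2"] },
    created_at: new Date()
};
// maintain one pre-aggregated document per author/tag combination
newPost.tag_doc.tags.forEach(function (tag) {
    db.post_tags.update(
        { author: newPost.author_id, tag: tag },
        { $set: { created_at: newPost.created_at, post_id: newPost._id } },
        { upsert: true }
    );
});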
Here you go:
db.logs.aggregate(
    { "$sort" : { "uploaded_at" : -1 } },
    { "$match" : { ... } },
    { "$unwind" : "$tag_doc.group_x" },
    { "$group" : {
        "_id" : { "case" : "$case_id", "tag" : "$tag_doc.group_x" },
        "latest" : { "$first" : "$uploaded_at" },
        "name" : { "$first" : "$name" },
        "tag_doc" : { "$first" : "$tag_doc" }
    } }
);
You want to avoid $max when you can $sort and take $first instead, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "name" and "tag_doc".
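For instance, the index supporting that initial $sort might be created like this (a single-field sketch; a compound index tailored to your $match criteria could serve even better):
db.logs.ensureIndex({ uploaded_at: -1 });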