Mongoexport - modify large array fields to their counts - mongodb

I have a large collection that I'd like to export to CSV, but I'd like to do some trimming to some of the fields. (e.g. I just need to know the number of elements in some, and just to know if others exist or not in the doc)
I would like to do the equivalent to a map function on the fields, so that fields that contain a list will be exported to the list size, and some fields that sometimes exist and sometimes do not, I would like to have them exported as boolean flags.
e.g. if my documents look like this
{_id:"id1", listField:[1,2,3], optionalField: "...", ... }
{_id:"id2", listField:[1,2,3,4], ... }
I'd like to run a mongoexport to CSV that will result in this
_id, listField.length, optionalField.exists
"id1", 3, true
"id2", 4, false
Is that possible using mongoexport? (assume MongoDB version 3.0)
If not, is there another way to do that?

The mongoexport utility itself is pretty spartan and just a basic tool bundled in the suite. You can add "query" filters, but pretty much just like .find() queries in general, the intention is to return documents "as is" rather than "manipulate" the content.
Just as with other query operations, the .aggregate() method is something useful for document manipulation. So in order to "manipulate" the output to something different from the original document source, you would do:
db.collection.aggregate([
{ "$project": {
"listField": { "$size": "$listField" },
"optionalField": {
"$cond": [
{ "$ifNull": [ "$optionalField", false ] },
true,
false
]
}
}}
])
The $size operator returns the "size" of the array, and the $ifNull tests for the presence, either returning the field value or the alternate. Pass that result into $cond to get a true/false return rather than the field value. "_id" is always implicit, unless you specifically ask to omit it.
That would give you the "reduced" output, but in order to go to CSV then you would have to code that export yourself, as mongoexport does not run aggregation pipeline queries.
But the code to do so should be quite trivial ( pick a library for your language ), and the aggregation statement is also trivial as you can see here.
For the "really basic" approach, then just send a script to the mongo shell, as a very rudimentary form of programming:
db.collection.aggregate([
{ "$project": {
"listField": { "$size": "$listField" },
"optionalField": {
"$cond": [
{ "$ifNull": [ "$optionalField", false ] },
true,
false
]
}
}}
]).forEach(function(doc) {
print(Object.keys(doc).map(function(key) {
return doc[key]
}).join(","));
});
Which would output:
id1,3,true
id2,4,false
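If you would rather code the export in a driver language, a minimal Python sketch could look like the following. It assumes the aggregation results arrive as an iterable of dicts (for instance from PyMongo's `collection.aggregate(PIPELINE)`, which is deliberately not called here so the example runs without a server). The helper name `rows_to_csv` and the sample rows are made up for illustration.

```python
import csv
import io

# The same $project stage as above, written as a Python structure.
PIPELINE = [
    {"$project": {
        "listField": {"$size": "$listField"},
        "optionalField": {
            "$cond": [{"$ifNull": ["$optionalField", False]}, True, False]
        },
    }}
]


def rows_to_csv(rows, fields):
    """Render an iterable of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        # Lower-case booleans so the text matches the shell's true/false.
        writer.writerow([
            str(row[f]).lower() if isinstance(row[f], bool) else row[f]
            for f in fields
        ])
    return buffer.getvalue()


# Hypothetical stand-in for list(db.collection.aggregate(PIPELINE)).
sample_rows = [
    {"_id": "id1", "listField": 3, "optionalField": True},
    {"_id": "id2", "listField": 4, "optionalField": False},
]
print(rows_to_csv(sample_rows, ["_id", "listField", "optionalField"]))
```

Passing the field list explicitly keeps the column order stable even if a document comes back with its keys in a different order.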

Related

Find dictionary keys in mongodb with dot infront of their key values

I have some data in a mongodb database collection that looks like this
{
"_id": {
"$oid": "63737b4b654d9b6a0c3a2006"
},
"tag": {
"tagName": 0.10534846782684326
}
}
and I want to check if a dictionary with a specific tagName exists. To do so, we can apply this
mycollection.find({f"tag.{'tagName'}": {"$exists": True}})
However, some tagNames have a dot . in front, e.g.,
{
"_id": {
"$oid": "63737b4b654d9b6a0c3a2006"
},
"tag": {
".tagName": 0.10534846782684326
}
}
So when I run the query
mycollection.find({f"tag.{'.tagName'}": {"$exists": True}})
the query returns nothing, i.e. the dictionary whose key name is .tagName is not found. This is because of the double dot in f"tag.{'.tagName'}". Can we write the query in a way that avoids this situation?
Mongodb version:
db version v4.4.13
Build Info: {
"version": "4.4.13",
"gitVersion": "df25c71b8674a78e17468f48bcda5285decb9246",
"openSSLVersion": "OpenSSL 1.1.1f 31 Mar 2020",
"modules": [],
"allocator": "tcmalloc",
"environment": {
"distmod": "ubuntu2004",
"distarch": "x86_64",
"target_arch": "x86_64"
}
}
The first syntax looks a little odd to me. I don't think it should have the curly brackets. You can see in this playground example that it doesn't find the first document. So you may be looking to remove the curly brackets from the query in both situations, and here is an example where doing so correctly returns the first document.
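If the snippets in the question are Python (the `True` literal suggests PyMongo, though that is an assumption), the inner braces are f-string placeholders and never reach the server. A quick check of what the two keys evaluate to:

```python
# In Python the inner braces are f-string placeholders, not part of the key.
plain_key = f"tag.{'tagName'}"
dotted_key = f"tag.{'.tagName'}"

print(plain_key)   # tag.tagName
print(dotted_key)  # tag..tagName
```

The second key contains two consecutive dots, so it is interpreted as a nested path rather than as a single field named `.tagName`.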
Now regarding the . character in the name, one approach would be to use the $getField operator. That operator helps retrieve field names that are otherwise ambiguous or contain special characters. An example (that would only retrieve the second document) might look like this:
db.collection.find({
$expr: {
$ifNull: [
{
$getField: {
field: ".tagName",
input: "$tag"
}
},
false
]
}
})
You may combine the two conditions with a $or to return both documents, playground example here.
I would recommend updating your data to remove the extra . character. Its presence is going to make working with the data more difficult and probably cause some performance issues since many of the operations won't be able to effectively use indexes.
Version 4.4 and earlier
As noted in the comments, the $getField operator is new in version 5.0. To accomplish something similar prior to that you could use the $objectToArray operator.
Effectively what you will do here is convert $tag to an array of k, v pairs where k contains the field name. You can then filter directly against that name (k) looking for the value(s) of interest.
The verbose, but arguably more readable, approach to doing so looks like this:
db.collection.aggregate([
{
"$addFields": {
"tagNames": {
"$objectToArray": "$tag"
}
}
},
{
$match: {
"tagNames.k": {
$in: [
"tagName",
".tagName"
]
}
}
},
{
$project: {
tagNames: 0
}
}
])
You could probably collapse it down and do it directly in find() (via $expr usage), as demonstrated here. But doing so requires a little more knowledge about your schema and the structure of the tag field. Overall though, working with field names that contain dots is even more difficult prior to 5.0, which further strengthens the suggestion to correct the underlying data.
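To make the k/v conversion concrete, here is a small pure-Python sketch of what the pipeline does to each document. The helper names `object_to_array` and `has_any_key` are made up for illustration, and the third sample document is invented to show a non-match.

```python
def object_to_array(obj):
    """Mimic $objectToArray: {"a": 1} -> [{"k": "a", "v": 1}]."""
    return [{"k": key, "v": value} for key, value in obj.items()]


def has_any_key(doc, wanted, field="tag"):
    """True if the sub-document under `field` has a key listed in `wanted`."""
    pairs = object_to_array(doc.get(field, {}))
    return any(pair["k"] in wanted for pair in pairs)


docs = [
    {"_id": 1, "tag": {"tagName": 0.10534846782684326}},
    {"_id": 2, "tag": {".tagName": 0.10534846782684326}},
    {"_id": 3, "tag": {"other": 1.0}},  # invented document, should not match
]
matches = [d["_id"] for d in docs if has_any_key(d, {"tagName", ".tagName"})]
print(matches)
```

Because the key is compared as an ordinary string value, the leading dot no longer has any special meaning.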

Remove complete document or element from array based on condition

My collection documents are:
{
"_id" : 1,
"fruits" : [ {"name":"pears"},
{"name":"grapes"},
{"name":"bananas"} ],
}
{
"_id" : 2,
"fruits" : [ {"name":"bananas"} ],
}
I need to remove the whole document when the fruits contains only "bananas" or only remove the fruit "bananas" when there are more than one fruit in the fruits array.
My final collection after running the required query should be:
{
"_id" : 1,
"fruits" : [ {"name":"pears"},
{"name":"grapes"}],
}
I am currently using two queries to get this done:
db.collection.remove({'fruits':{$size:1, $elemMatch:{'name': 'bananas'} }}) [this will remove the document when only one fruit present]
and
db.collection.update({},{$pull:{'fruits':{'name':'bananas'}}},{multi: true}) [this will remove the entry 'bananas' from the array]
Is there any way to combine these into one query?
EDIT: Final take
-- I guess there is no "one query" to perform the above tasks since the intents are very different of both the actions.
-- The best that can be done is to club the actions into a bulk_write call, which saves on network I/O (as suggested in the answer by Neil). This, I believe, is more beneficial when you have multiple such actions being fired. Also, bulk_write can provide a form of locking, in the sense that the "ordered" mode of bulk_write makes the actions sequential, halting execution in case of error.
Hence bulk_write is more beneficial when the actions performed need to be sequential. Somewhat like "chaining" in JS. There is also the option to perform un-ordered bulk_writes.
Also, the actions specified in the bulk write, operate on the collection level as individual actions.
You basically want bulk_write() here to do them both. Also use $exists to test whether the array has only one element:
from pymongo import UpdateMany, DeleteMany
db.collection.bulk_write(
[
UpdateMany(
{ "fruits.1": { "$exists": True }, "fruits.name": "bananas" },
{ "$pull":{
'fruits': { 'name':'bananas' }
}}
),
DeleteMany(
{ "fruits.1": { "$exists": False }, "fruits.name": "bananas" }
)
],
ordered=False
)
You don't really need $elemMatch for "one" condition and you should be using update_many() and in this case UpdateMany() instead of { "multi": true }. And that option is different in "pymongo" anyway. Then of course there is delete_many() or DeleteMany() for the "bulk" context.
Bulk operations send one request with one response, which is better than sending multiple requests. Also "update" and "delete" are two different things, but the single request can combine just like this.
The $size operator is valid but $exists can apply to a "range" where $size cannot, so it's generally a bit more flexible.
For example, here is an $exists "range":
# Array between 2 and 4 elements
db.collection.update_many(
{
"fruits.1": { "$exists": True },
"fruits.4": { "$exists": False },
"fruits.name": "bananas"
},
{ "$pull":{
'fruits': { 'name':'bananas' }
}}
)
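The range trick generalizes: element N-1 exists only when the array holds at least N elements. A small sketch of a filter builder, where the function name `array_length_between` is hypothetical and not part of PyMongo:

```python
def array_length_between(field, min_len, max_len):
    """Build a filter matching arrays with min_len..max_len elements.

    Element N-1 exists only if the array has at least N elements, so
    "<field>.<min_len - 1>" must exist and "<field>.<max_len>" must not.
    """
    if min_len < 1 or max_len < min_len:
        raise ValueError("need 1 <= min_len <= max_len")
    return {
        f"{field}.{min_len - 1}": {"$exists": True},
        f"{field}.{max_len}": {"$exists": False},
    }


# Same filter as the update_many() example: 2 to 4 elements, one of them bananas.
query = array_length_between("fruits", 2, 4)
query["fruits.name"] = "bananas"
print(query)
```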
And of course in the context here you actually want to know the difference between other possible things in the array and those with "only" a single "bananas".
The ordered=False here actually refers to one of two different ways that "bulk write" requests can be handled:
Ordered - Where True ( which is the "default" ) then the operations are executed in "serial order" as they appear in the array of operations sent with the "bulk write". If any error occurs here then the batch stops execution at the point of the error and returns an exception.
UnOrdered - Where False the operations are executed in "parallel" within reasonable constraints on the server. If any error occurs there is still an exception raised, however this does not stop other operations within the "bulk write" from completing. Any errors are returned with the "array index" from the list provided to the command of which operation caused the error.
This option can be used to "tune" the desired behavior, in particular for error reporting and continuation, and it also allows a degree of "parallelism" in the execution where "serial" is not actually required of the operations. Since these two statements do not depend on each other and will in fact select different documents anyway, ordered=False is probably the better option in terms of efficiency here.
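The difference in error handling can be pictured with a toy model. This is only a sketch of the stop-versus-continue behavior: it runs everything sequentially, does not model any server-side parallelism, and `run_bulk` is a made-up function rather than anything from PyMongo.

```python
def run_bulk(operations, ordered=True):
    """Toy model of bulk-write error handling; not the PyMongo API.

    Each operation is a zero-argument callable. Returns (completed, errors),
    where errors is a list of (index, exception) pairs.
    """
    completed, errors = [], []
    for index, operation in enumerate(operations):
        try:
            operation()
            completed.append(index)
        except Exception as exc:
            errors.append((index, exc))
            if ordered:
                break  # ordered mode stops at the first failure
    return completed, errors


def ok():
    return None


def fail():
    raise RuntimeError("simulated write error")


print(run_bulk([ok, fail, ok], ordered=True)[0])   # indexes that completed
print(run_bulk([ok, fail, ok], ordered=False)[0])
```

In both modes the failing operation is reported with its index in the list, which matches the description of how errors are returned.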
db.users.aggregate(
[{
$project: {
data: {
$filter: {
input: "$fruits",
as: "filterData",
cond: { $ne: [ "$$filterData.name", 'bananas' ] }
}
}
}
},
{
$unwind: {
path : "$data",
preserveNullAndEmptyArrays : false
}
},
{
$group: {
_id:"$_id",
data: { $addToSet: "$data" }
}
},
])
I think the above query would give you the expected results.

Get distinct records with specified fields that match a value, paginated

I'm trying to get all documents in my MongoDB collection
by distinct customer ids (custID)
where status code == 200
paginated (skipped and limit)
return specified fields
var Order = mongoose.model('Order', orderSchema());
My original thought was to use mongoose db query, but you can't use distinct with skip and limit as Distinct is a method that returns an "array", and therefore you cannot modify something that is not a "Cursor":
Order
.distinct('request.headers.custID')
.where('response.status.code').equals(200)
.limit(limit)
.skip(skip)
.exec(function (err, orders) {
callback({
data: orders
});
});
So then I thought to use Aggregate, using $group to get distinct customerID records, $match to return all unique customerID records that have status code of 200, and $project to include the fields that I want:
Order.aggregate(
[
{
"$project" :
{
'request.headers.custID' : 1,
//other fields to include
}
},
{
"$match" :
{
"response.status.code" : 200
}
},
{
"$group": {
"_id": "$request.headers.custID"
}
},
{
"$skip": skip
},
{
"$limit": limit
}
],
function (err, order) {}
);
This returns an empty array though. If I remove project, only $request.headers.custID field is returned when in fact I need more.
Any thoughts?
The thing you need to understand about aggregation pipelines is that the word "pipeline" means each stage only receives the input that is emitted by the preceding stage, in order of execution. The best analogy here is the "unix pipe" |, where the output of one command is "piped" to the next:
ps aux | grep mongo | tee out.txt
So aggregation pipelines work in much the same way as that, where the other main thing to consider is both $project and $group stages operate on only emitting those fields you ask for, and no others. This takes a little getting used to compared to declarative approaches like SQL, but with a little practice it becomes second nature.
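A pure-Python sketch of why the original pipeline came back empty: once a projection has dropped `response.status.code`, a later match has nothing to test. The helpers `get_path`, `project` and `match` are simplified stand-ins written for this example (a real $project also keeps `_id`, which is ignored here), and the two orders are invented.

```python
def get_path(doc, path):
    """Follow a dotted path through nested dicts; None if any part is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def project(docs, paths):
    """Keep only the listed dotted paths, roughly like an inclusive $project."""
    out = []
    for doc in docs:
        kept = {}
        for path in paths:
            value = get_path(doc, path)
            if value is None:
                continue
            parts = path.split(".")
            target = kept
            for part in parts[:-1]:
                target = target.setdefault(part, {})
            target[parts[-1]] = value
        out.append(kept)
    return out


def match(docs, path, value):
    """Keep documents whose value at `path` equals `value`."""
    return [d for d in docs if get_path(d, path) == value]


orders = [
    {"request": {"headers": {"custID": "A"}}, "response": {"status": {"code": 200}}},
    {"request": {"headers": {"custID": "B"}}, "response": {"status": {"code": 500}}},
]
CUST = "request.headers.custID"
CODE = "response.status.code"

project_first = match(project(orders, [CUST]), CODE, 200)  # status already gone
match_first = project(match(orders, CODE, 200), [CUST])
print(project_first)
print(match_first)
```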
Another thing to get used to is that stages like $match are more important to place at the beginning of a pipeline than field selection. The primary reason for this is possible index selection and usage, which speeds things up immensely. Also, field selection with $project followed by $group is somewhat redundant, as both essentially select fields, so the two are usually best combined where appropriate.
Hence most optimally you do:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"otherField": { "$first": "$otherField" },
// and so on for each field to select
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
Where the main thing here to remember about $group is that all fields other than _id ( which is the grouping key ) require the use of an accumulator to select, since there are potentially multiple occurrences of the values for the grouping key.
In this case we are using $first as an accumulator, which will take the first occurrence from the grouping boundary. Commonly this is used following a $sort, but does not need to be so, just as long as you understand the behavior of what is selected.
Other accumulators like $max simply take the largest value of the field from within the values inside the grouping key, and are therefore independent of the "current record/document" unlike $first or $last. So it all depends on your needs.
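The difference between a positional accumulator and a value-based one can be seen in a few lines of plain Python. This is only an illustration of the semantics: `group_first_and_max` is a made-up helper and the flat sample documents are invented.

```python
def group_first_and_max(docs, key, field):
    """Group by `key`, keeping the first-seen and the largest value of `field`."""
    groups = {}
    for doc in docs:
        value = doc[field]
        # setdefault only stores the initial entry the first time a key is seen,
        # so "first" depends on input order while "max" does not.
        entry = groups.setdefault(doc[key], {"first": value, "max": value})
        entry["max"] = max(entry["max"], value)
    return groups


orders = [
    {"custID": "A", "total": 10},
    {"custID": "A", "total": 30},
    {"custID": "B", "total": 7},
    {"custID": "A", "total": 20},
]
print(group_first_and_max(orders, "custID", "total"))
```

Reordering the input changes the "first" values but never the "max" values, which is why $first is usually paired with a preceding $sort.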
Of course you can shortcut the selection in MongoDB 2.6 and later releases with the $$ROOT variable:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"document": { "$first": "$$ROOT" }
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
Which would take a copy of all fields in the document and place them under the named key ( which is "document" in this case ). It's a shorter way to notate, but of course the resulting document has a different structure, being now all under the one key as sub-fields.
But as long as you understand the basic principles of a "pipeline" and don't exclude data you want to use in later stages by previous stages, then you generally should be okay.

Project fields whose value is not X

I have a Mongo collection where the schema is
{
_id:...
user1: ...
user2: ...
date: ...
}
I'd like to create a query that, given a user, would return a list of all the other users this user has met. So far what I've got is
db.collections.find({$or:[{user1: XXX}, {user2: XXX}]}, <projection>);
What I'd like to do is to have the projection look something like
{user1: $user1==XXX, user2: $user2==XXX}
Where $user1/2 is the value of user 1 or 2 in that document, but I don't know how to translate that into a mongo query. Perhaps this can be done using $where, but I can't find the syntax for $where in projections.
Not really completely clear what you mean here but $where is not an option as it is only a JavaScript evaluation of conditions for a "query" and has nothing to do with projection.
You seem to be asking to "identify" which of the fields actually matched your condition or at least something like that. Standard query projection does not alter values present in a document in any way, but the .aggregate() method has $project which can alter document content.
So to identify which field matched then you could do:
db.collections.aggregate([
{ "$match": {
"$or": [
{ "user1": "XXX" },
{ "user2": "XXX" }
]
}},
{ "$project": {
"matched": {
"$cond": [
{ "$ne": [ "$user1", "XXX" ] },
"$user2",
"$user1"
]
}
}}
])
Where the $cond operator provides a ternary ( if/then/else ) condition that evaluates a logical comparison operator as the first argument and returns either the second argument where true or the third where false.
So in this case, given that $or can match either field then the returned value will be the field that had the value present for the match, being logical that if it was not "user1" then it must be "user2" since those are the initial query conditions.
Whatever your case, this sort of logical evaluation is what you want whenever the result should be a value that is not stored as such in the document, but is instead derived by testing existing values against conditions.
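As a plain-Python sketch of the same ternary logic: `matched_value` mirrors the $project shown above, while `other_user` swaps the two result branches, which may be closer to "all the other users this user has met" if that is what is wanted. The helper names and the sample meetings are made up for illustration.

```python
def cond(test, if_true, if_false):
    """Plain-Python stand-in for the three-argument form of $cond."""
    return if_true if test else if_false


def matched_value(doc, user):
    # Mirrors the $project above: the result is the field that matched.
    return cond(doc["user1"] != user, doc["user2"], doc["user1"])


def other_user(doc, user):
    # Branches swapped: the result is the participant who is not `user`.
    return cond(doc["user1"] != user, doc["user1"], doc["user2"])


# Invented documents that would both pass the $or match for "XXX".
meetings = [
    {"user1": "XXX", "user2": "alice"},
    {"user1": "bob", "user2": "XXX"},
]
print([matched_value(m, "XXX") for m in meetings])
print([other_user(m, "XXX") for m in meetings])
```

In pipeline terms, the swap corresponds to exchanging "$user2" and "$user1" in the $cond array.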

How to check order of Array element in Mongodb?

In MongoDB, is there any easy way to check Order of element in Array? For example I have a document like this:
{
_id: 1,
tags: ["mongodb", "rethinkdb", "couchbase", "others"]
}
I would like to check in the tags field whether mongodb comes before rethinkdb or not (looking at the array indexes, mongodb=0 and rethinkdb=1, so mongodb comes first and our case matches).
But if there is another document (like below) where rethinkdb comes before mongodb, the case does not match.
{
_id: 2,
tags: ["rethinkdb", "mongodb", "couchbase"]
}
Here mongodb(1) comes after rethinkdb(0) so our case does not match.
Your question is not really as clear as you think it is, and thus why there are several ways to answer it:
If you are looking just to find out if a document has "mongodb" as the first element of the array then you just issue a query like this:
db.collection.find({ "tags.0": "mongodb" })
And that will return only the documents that match the given value at the specified index position using "dot notation".
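If you actually expect to match if an array is in an "expected order" then you can get some help from the aggregation pipeline and set operators that are available and other features in MongoDB 2.6. Note that $setEquals compares its arguments as sets, so it confirms the same distinct members rather than their positions: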
db.collection.aggregate([
{ "$project": {
"_id": "$$ROOT",
"matched": { "$setEquals": [
"$tags",
["mongodb", "rethinkdb", "couchbase", "others"]
]}
}},
{ "$match": { "matched": true }}
])
Or if what you want is to make sure that the "mongodb" value comes before the "rethinkdb" value, then you will need to evaluate in JavaScript with mapReduce, or something equally not nice like the $where operator:
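db.collection.find({
"$where": function() {
var first = this.tags.indexOf("mongodb");
// -1 means "mongodb" is absent, which must not count as "comes before"
return first !== -1 && first < this.tags.indexOf("rethinkdb");
}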
})
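If the check can be done client side instead, the same test is a few lines of Python. This sketch uses the two documents from the question; `comes_before` is a made-up helper name. It also contrasts a set comparison with a list comparison, since only the latter is sensitive to order.

```python
def comes_before(tags, first, second):
    """True only if both values are present and `first` has the lower index."""
    if first not in tags or second not in tags:
        return False
    return tags.index(first) < tags.index(second)


doc1 = {"_id": 1, "tags": ["mongodb", "rethinkdb", "couchbase", "others"]}
doc2 = {"_id": 2, "tags": ["rethinkdb", "mongodb", "couchbase"]}

print(comes_before(doc1["tags"], "mongodb", "rethinkdb"))
print(comes_before(doc2["tags"], "mongodb", "rethinkdb"))

# A set comparison ignores order, a list comparison does not.
expected = ["mongodb", "rethinkdb", "couchbase"]
same_members = set(doc2["tags"]) == set(expected)
same_order = doc2["tags"] == expected
print(same_members, same_order)
```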