MongoDB: Delete item inside multiple objects - mongodb

Im trying to delete an item inside an object categorized inside multiple keys.
for example, deleting ObjectId("c") from every items section
This is the structure:
{
"somefield" : "value",
"somefield2" : "value",
"objects" : {
"/" : {
"color" : "#112233",
"items" : [
ObjectId("c"),
ObjectId("b")
]
},
"/folder1" : {
"color" : "#112233",
"items" : [
ObjectId("c"),
ObjectId("d")
]
},
"/folder2" : {
"color" : "112233",
"items" : []
},
"/testing" : {
"color" : "112233",
"items" : [
ObjectId("c"),
ObjectId("f")
]
}
}
}
I tried with pull and unset like:
db.getCollection('col').update(
{},
{ $unset: { 'objects.$.items': ObjectId("c") } },
{ multi: true }
)
and
db.getCollection('col').update(
{},
{ "objects": {"items": { $pull: [ObjectId("c")] } } },
{ multi: true }
)
Any idea? thanks!

The problem here is largely with the current structure of your document. MongoDB cannot "traverse paths" in an efficient way, and your structure currently has an "Object" ( 'objects' ) which has named "keys". What this means is that accessing "items" within each "key" needs the explicit path to each key to be able to see that element. There are no wildcards here:
db.getCollection("coll").find({ "objects./.items": Object("c") })
And that is the basic principle to "match" something as you cannot do it "across all keys" without resulting to JavaScript code, which is really bad.
Change the structure. Rather than "object keys", use "arrays" instead, like this:
{
"somefield" : "value",
"somefield2" : "value",
"objects" : [
{
"path": "/",
"color" : "#112233",
"items" : [
"c",
"b"
]
},
{
"path": "/folder1",
"color" : "#112233",
"items" : [
"c",
"d"
]
},
{
"path": "/folder2",
"color" : "112233",
"items" : []
},
{
"path": "/testing",
"color" : "112233",
"items" : [
"c",
"f"
]
}
]
}
It's much more flexible in the long run, and also allows you to "index" fields like "path" for use in query matching.
However, it's not going to help you much here, as even with a consistent query path, i.e:
db.getCollection("coll").find({ "objects.items": Object("c") })
Which is better, but the problem still persists that is it not possible to $pull from multiple sources ( whether object or array ) in the same singular operation. And that is augmented with "never" across multiple documents.
So the best you will ever get here is basically "trying" the "muti-update" concept until the options are exhausted and there is nothing left to "update". With the "modified" structure presented then you can do this:
var bulk = db.getCollection("coll").initializeOrderedBulkOp(),
count = 0,
modified = 1;
while ( modified != 0 ) {
bulk.find({ "objects.items": "c"}).update({
"$pull": { "objects.$.items": "c" }
});
count++;
var result = bulk.execute();
bulk = db.getCollection("coll").initializeOrderedBulkOp();
modified = result.nModified;
}
print("iterated: " + count);
That uses the "Bulk" operations API ( actually all shell methods now use it anyway ) to basically get a "better write response" that gives you useful information about what actually happened on the "update" attempt.
The point is that is basically "loops" and tries to match a document based on the "query" portion of the update and then tries to $pull from the matched array index an item from the "inner array" that matches the conditions given to $pull ( which acts as "query" in itself, just upon the array items ).
On each iteration you basically get the "nModified" value from the response, and when this is finally 0, then the operation is complete.
On the sample ( restructured ) given then this will take 4 iterations, being one for each "outer" array member. The updates are "multi" as implied by bulk .update() ( as opposed to .updateOne() ) already, and therefore the "maximum" iterations is determined by the "maximum" array elements present in the "outer" array across the whole collection. So if there is "one" document out of "one thousand" that has 20 entries then the iterations will be 20, and just because that document still has something that can be matched and modified.
The alternate case under your current structure does not bear mentioning. It is just plain "impossible" without:
Retrieving the document individually
Extracting the present keys
Running an individual $pull for the array under that key
Get next document, rinse and repeat
So "multi" is "right out" as an option and cannot be done, without some some possible "foreknowledge" of the possible "keys" under the "object" key in the document.
So please "change your structure" and be aware of the general limitations available.
You cannot possibly do this in "one" update, but at least if the maximum "array entries" your document has was "4", then it is better to do "four" updates over a "thousand" documents than the "four thousand" that would be required otherwise.
Also. Please do not "obfuscate" the ObjectId value in posts. People like to "copy/paste" code and data to test for themselves. Using something like ObjectId("c") which is not a valid ObjectId value would clearly cause errors, and therefore is not practical for people to use.
Do what "I did" in the listing, and if you want to abstract/obfuscate, then do it with "plain values" just as I have shown.

One approach that you could take is using JavaScript native methods like reduce to create the documents that will be used in the update.
You essentially need an operation like the following:
var itemId = ObjectId("55ba3a983857192828978fec");
db.col.find().forEach(function(doc) {
var update = {
"object./.items": itemId,
"object./folder1.items": itemId,
"object./folder2.items": itemId,
"object./testing.items": itemId
};
db.col.update(
{ "_id": doc._id },
{
"$pull": update
}
);
})
Thus to create the update object would require the reduce method that converts an array into an object:
var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
o["objects." + v + ".items"] = itemId;
return o;
}, {});
Overall, you would need to use the Bulk operations to achieve the above update:
var bulk = db.col.initializeUnorderedBulkOp(),
itemId = ObjectId("55ba3a983857192828978fec"),
count = 0;
db.col.find().forEach(function(doc) {
var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
o["objects." + v + ".items"] = itemId;
return o;
}, {});
bulk.find({ "_id": doc._id }).updateOne({
"$pull": update
})
count++;
if (count % 1000 == 0) {
bulk.execute();
bulk = db.col.initializeUnorderedBulkOp();
}
})
if (count % 1000 != 0) { bulk.execute(); }

Related

MongoDB Calculate Values from Two Arrays, Sort and Limit

I have a MongoDB database storing float arrays. Assume a collection of documents in the following format:
{
"id" : 0,
"vals" : [ 0.8, 0.2, 0.5 ]
}
Having a query array, e.g., with values [ 0.1, 0.3, 0.4 ], I would like to compute for all elements in the collection a distance (e.g., sum of differences; for the given document and query it would be computed by abs(0.8 - 0.1) + abs(0.2 - 0.3) + abs(0.5 - 0.4) = 0.9).
I tried to use the aggregation function of MongoDB to achieve this, but I can't work out how to iterate over the array. (I am not using the built-in geo operations of MongoDB, as the arrays can be rather long)
I also need to sort the results and limit to the top 100, so calculation after reading the data is not desired.
Current Processing is mapReduce
If you need to execute this on the server and sort the top results and just keep the top 100, then you could use mapReduce for this like so:
db.test.mapReduce(
function() {
var input = [0.1,0.3,0.4];
var value = Array.sum(this.vals.map(function(el,idx) {
return Math.abs( el - input[idx] )
}));
emit(null,{ "output": [{ "_id": this._id, "value": value }]});
},
function(key,values) {
var output = [];
values.forEach(function(value) {
value.output.forEach(function(item) {
output.push(item);
});
});
output.sort(function(a,b) {
return a.value < b.value;
});
return { "output": output.slice(0,100) };
},
{ "out": { "inline": 1 } }
)
So the mapper function does the calculation and output's everything under the same key so all results are sent to the reducer. The end output is going to be contained in an array in a single output document, so it is both important that all results are emitted with the same key value and that the output of each emit is itself an array so mapReduce can work properly.
The sorting and reduction is done in the reducer itself, as each emitted document is inspected the elements are put into a single tempory array, sorted, and the top results are returned.
That is important, and just the reason why the emitter produces this as an array even if a single element at first. MapReduce works by processing results in "chunks", so even if all emitted documents have the same key, they are not all processed at once. Rather the reducer puts it's results back into the queue of emitted results to be reduced until there is only a single document left for that particular key.
I'm restricting the "slice" output here to 10 for brevity of listing, and including the stats to make a point, as the 100 reduce cycles called on this 10000 sample can be seen:
{
"results" : [
{
"_id" : null,
"value" : {
"output" : [
{
"_id" : ObjectId("56558d93138303848b496cd4"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d96138303848b49906e"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d93138303848b496d9a"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d93138303848b496ef2"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497861"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497b58"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497ba5"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497c43"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d95138303848b49842b"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d96138303848b498db4"),
"value" : 2.1
}
]
}
}
],
"timeMillis" : 1758,
"counts" : {
"input" : 10000,
"emit" : 10000,
"reduce" : 100,
"output" : 1
},
"ok" : 1
}
So this is a single document output, in the specific mapReduce format, where the "value" contains an element which is an array of the sorted and limitted result.
Future Processing is Aggregate
As of writing, the current latest stable release of MongoDB is 3.0, and this lacks the functionality to make your operation possible. But the upcoming 3.2 release introduces new operators that make this possible:
db.test.aggregate([
{ "$unwind": { "path": "$vals", "includeArrayIndex": "index" }},
{ "$group": {
"_id": "$_id",
"result": {
"$sum": {
"$abs": {
"$subtract": [
"$vals",
{ "$arrayElemAt": [ { "$literal": [0.1,0.3,0.4] }, "$index" ] }
]
}
}
}
}},
{ "$sort": { "result": -1 } },
{ "$limit": 100 }
])
Also limitting to the same 10 results for brevity, you get output like this:
{ "_id" : ObjectId("56558d96138303848b49906e"), "result" : 2.2 }
{ "_id" : ObjectId("56558d93138303848b496cd4"), "result" : 2.2 }
{ "_id" : ObjectId("56558d96138303848b498e31"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497c43"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497861"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499037"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b498db4"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496ef2"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496d9a"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499182"), "result" : 2.1 }
This is made possible largely due to $unwind being modified to project a field in results that contains the array index, and also due to $arrayElemAt which is a new operator that can extract an array element as a singular value from a provided index.
This allows the "look-up" of values by index position from your input array in order to apply the math to each element. The input array is facilitated by the existing $literal operator so $arrayElemAt does not complain and recongizes it as an array, ( seems to be a small bug at present, as other array functions don't have the problem with direct input ) and gets the appropriate matching index value by using the "index" field produced by $unwind for comparison.
The math is done by $subtract and of course another new operator in $abs to meet your functionality. Also since it was necessary to unwind the array in the first place, all of this is done inside a $group stage accumulating all array members per document and applying the addition of entries via the $sum accumulator.
Finally all result documents are processed with $sort and then the $limit is applied to just return the top results.
Summary
Even with the new functionallity about to be availble to the aggregation framework for MongoDB it is debatable which approach is actually more efficient for results. This is largely due to there still being a need to $unwind the array content, which effectively produces a copy of each document per array member in the pipeline to be processed, and that generally causes an overhead.
So whilst mapReduce is the only present way to do this until a new release, it may actually outperform the aggregation statement depending on the amount of data to be processed, and despite the fact that the aggregation framework works on native coded operators rather than translated JavaScript operations.
As with all things, testing is always recommended to see which case suits your purposes better and which gives the best performance for your expected processing.
Sample
Of course the expected result for the sample document provided in the question is 0.9 by the math applied. But just for my testing purposes, here is a short listing used to generate some sample data that I wanted to at least verify the mapReduce code was working as it should:
var bulk = db.test.initializeUnorderedBulkOp();
var x = 10000;
while ( x-- ) {
var vals = [0,0,0];
vals = vals.map(function(val) {
return Math.round((Math.random()*10),1)/10;
});
bulk.insert({ "vals": vals });
if ( x % 1000 == 0) {
bulk.execute();
bulk = db.test.initializeUnorderedBulkOp();
}
}
The arrays are totally random single decimal point values, so there is not a lot of distribution in the listed results I gave as sample output.

MongoDB: Update a field of an item in array with matching another field of that item

I have a data structure like this:
We have some centers. A center has some switches. A switch has some ports.
{
"_id" : ObjectId("561ad881755a021904c00fb5"),
"Name" : "center1",
"Switches" : [
{
"Ports" : [
{
"PortNumber" : 2,
"Status" : "Empty"
},
{
"PortNumber" : 5,
"Status" : "Used"
},
{
"PortNumber" : 7,
"Status" : "Used"
}
]
}
]
}
All I want is to write an Update query to change the Status of the port that it's PortNumber is 5 to "Empty".
I can update it when I know the array index of the port (here array index is 1) with this query:
db.colection.update(
// query
{
_id: ObjectId("561ad881755a021904c00fb5")
},
// update
{
$set : { "Switches.0.Ports.1.Status" : "Empty" }
}
);
But I don't know the array index of that Port.
Thanks for help.
You would normally do this using the positional operator $, as described in the answer to this question:
Update field in exact element array in MongoDB
Unfortunately, right now the positional operator only supports one array level deep of matching.
There is a JIRA ticket for the sort of behavior that you want: https://jira.mongodb.org/browse/SERVER-831
In case you can make Switches into an object instead, you could do something like this:
db.colection.update(
{
_id: ObjectId("561ad881755a021904c00fb5"),
"Switch.Ports.PortNumber": 5
},
{
$set: {
"Switch.Ports.$.Status": "Empty"
}
}
)
Since you don't know the array index of the Port, I would suggest you dynamically create the $set conditions on the fly i.e. something which would help you get the indexes for the objects and then modify accordingly, then consider using MapReduce.
Currently this seems to be not possible using the aggregation framework. There is an unresolved open JIRA issue linked to it. However, a workaround is possible with MapReduce. The basic idea with MapReduce is that it uses JavaScript as its query language but this tends to be fairly slower than the aggregation framework and should not be used for real-time data analysis.
In your MapReduce operation, you need to define a couple of steps i.e. the mapping step (which maps an operation into every document in the collection, and the operation can either do nothing or emit some object with keys and projected values) and reducing step (which takes the list of emitted values and reduces it to a single element).
For the map step, you ideally would want to get for every document in the collection, the index for each Switches and Ports array fields and another key that contains the $set keys.
Your reduce step would be a function (which does nothing) simply defined as var reduce = function() {};
The final step in your MapReduce operation will then create a separate collection Switches that contains the emitted Switches array object along with a field with the $set conditions. This collection can be updated periodically when you run the MapReduce operation on the original collection.
Altogether, this MapReduce method would look like:
var map = function(){
for(var i = 0; i < this.Switches.length; i++){
for(var j = 0; j < this.Switches[i].Ports.length; j++){
emit(
{
"_id": this._id,
"switch_index": i,
"port_index": j
},
{
"index": j,
"Switches": this.Switches[i],
"Port": this.Switches[i].Ports[j],
"update": {
"PortNumber": "Switches." + i.toString() + ".Ports." + j.toString() + ".PortNumber",
"Status": "Switches." + i.toString() + ".Ports." + j.toString() + ".Status"
}
}
);
}
}
};
var reduce = function(){};
db.centers.mapReduce(
map,
reduce,
{
"out": {
"replace": "switches"
}
}
);
Querying the output collection Switches from the MapReduce operation will typically give you the result:
db.switches.findOne()
Sample Output:
{
"_id" : {
"_id" : ObjectId("561ad881755a021904c00fb5"),
"switch_index" : 0,
"port_index" : 1
},
"value" : {
"index" : 1,
"Switches" : {
"Ports" : [
{
"PortNumber" : 2,
"Status" : "Empty"
},
{
"PortNumber" : 5,
"Status" : "Used"
},
{
"PortNumber" : 7,
"Status" : "Used"
}
]
},
"Port" : {
"PortNumber" : 5,
"Status" : "Used"
},
"update" : {
"PortNumber" : "Switches.0.Ports.1.PortNumber",
"Status" : "Switches.0.Ports.1.Status"
}
}
}
You can then use the cursor from the db.switches.find() method to iterate over and update your collection accordingly:
var newStatus = "Empty";
var cur = db.switches.find({ "value.Port.PortNumber": 5 });
// Iterate through results and update using the update query object set dynamically by using the array-index syntax.
while (cur.hasNext()) {
var doc = cur.next();
var update = { "$set": {} };
// set the update query object
update["$set"][doc.value.update.Status] = newStatus;
db.centers.update(
{
"_id": doc._id._id,
"Switches.Ports.PortNumber": 5
},
update
);
};

Append a string to the end of an existing field in MongoDB

I have a document with a field containing a very long string. I need to concatenate another string to the end of the string already contained in the field.
The way I do it now is that, from Java, I fetch the document, extract the string in the field, append the string to the end and finally update the document with the new string.
The problem: The string contained in the field is very long, which means that it takes time and resources to retrieve and work with this string in Java. Furthermore, this is an operation that is done several times per second.
My question: Is there a way to concatenate a string to an existing field, without having to fetch (db.<doc>.find()) the contents of the field first? In reality all I want is (field.contents += new_string).
I already made this work using Javascript and eval, but as I found out, MongoDB locks the database when it executes javascript, which makes the overall application even slower.
Starting Mongo 4.2, db.collection.updateMany() can accept an aggregation pipeline, finally allowing the update of a field based on its current value:
// { a: "Hello" }
db.collection.updateMany(
{},
[{ $set: { a: { $concat: [ "$a", "World" ] } } }]
)
// { a: "HelloWorld" }
The first part {} is the match query, filtering which documents to update (in this case all documents).
The second part [{ $set: { a: { $concat: [ "$a", "World" ] } } }] is the update aggregation pipeline (note the squared brackets signifying the use of an aggregation pipeline). $set (alias of $addFields) is a new aggregation operator which in this case replaces the field's value (by concatenating a itself with the suffix "World"). Note how a is modified directly based on its own value ($a).
For example (it's append to the start, the same story ):
before
{ "_id" : ObjectId("56993251e843bb7e0447829d"), "name" : "London
City", "city" : "London" }
db.airports
.find( { $text: { $search: "City" } })
.forEach(
function(e, i){
e.name='Big ' + e.name;
db.airports.save(e);
}
)
after:
{ "_id" : ObjectId("56993251e843bb7e0447829d"), "name" : "Big London
City", "city" : "London" }
Old topic but i had the same problem.
Since mongo 2.4, you can use $concat from aggregation framework.
Example
Consider these documents :
{
"_id" : ObjectId("5941003d5e785b5c0b2ac78d"),
"title" : "cov"
}
{
"_id" : ObjectId("594109b45e785b5c0b2ac97d"),
"title" : "fefe"
}
Append fefe to title field :
db.getCollection('test_append_string').aggregate(
[
{ $project: { title: { $concat: [ "$title", "fefe"] } } }
]
)
The result of aggregation will be :
{
"_id" : ObjectId("5941003d5e785b5c0b2ac78d"),
"title" : "covfefe"
}
{
"_id" : ObjectId("594109b45e785b5c0b2ac97d"),
"title" : "fefefefe"
}
You can then save the results with a bulk, see this answer for that.
this is a sample of one document i have :
{
"_id" : 1,
"s" : 1,
"ser" : 2,
"p" : "9919871172",
"d" : ISODate("2018-05-30T05:00:38.057Z"),
"per" : "10"
}
to append a string to any feild you can run a forEach loop throught all documents and then update desired field:
db.getCollection('jafar').find({}).forEach(function(el){
db.getCollection('jafar').update(
{p:el.p},
{$set:{p:'98'+el.p}})
})
This would not be possible.
One optimization you can do is create batches of updates.
i.e. fetch 10K documents, append relevant strings to each of their keys,
and then save them as single batch.
Most mongodb drivers support batch operations.
db.getCollection('<collection>').update(
// query
{},
// update
{
$set: {<field>:this.<field>+"<new string>"}
},
// options
{
"multi" : true, // update only one document
"upsert" : false // insert a new document, if no existing document match the query
});

How do I query a hash sub-object that is dynamic in mongodb?

I currently have a Question object and am not sure how to query for it?
{ "title" : "Do you eat fast food?"
"answers" : [
{
"_id" : "506b422ff42c95000e00000d",
"title" : "Yes",
"trait_score_modifiers" : {
"hungry" : 1
}
},
{
"_id" : "506b422ff42c95000e00000e",
"title" : "No",
"trait_score_modifiers" : {
"not-hungry" : -1
}
}]
}
I am trying to find questions where the trait_score_modifieres is queried (sometimes it exists, sometimes not)
I have the following but it is not dynamic:
db.questions.find({"answers.trait_score_modifiers.not-hungry":{$exists: true}})
How could i do something like this?
db.questions.find({"answers.trait_score_modifiers.{}.size":{$gt: 0}})
You should modify the schema so you have consistent key names to query on. I ran into a similar problem using the aggregation framework, see question: Total values from all keys in subdocument
Something like this should work (not tested):
{
"title" : "Do you eat fast food?"
"answers" : [
{
"title" : "Yes",
"trait_score_modifiers" : [
{"dimension": "hungry", "value": 1}
]
},
{
"title" : "No",
"trait_score_modifiers" : [
{"dimension": "not-hungry", "value": -1}
]
}]
}
You can return all questions that have a dynamic dimension (e.g. "my new dimension") with:
db.questions.find("answers.trait_score_modifiers.dimension": "my new dimension")
Or limit the returned set to questions that have a specific value on that dimension (e.g. > 0):
db.questions.find(
"answers.trait_score_modifiers": {
"$elemMatch": {
"dimension": "my new dimension",
"value": {"$gt": 0}
}
}
)
Querying nested arrays can be a bit tricky, be sure to read up on the documentation In this case, $elemMatch is needed because otherwise you return a document that has some trait_score_modifier my new dimension but the matching value is in the dimension key of a different array element.
You need $elemMatch criteria in your query.
Refer to: http://docs.mongodb.org/manual/reference/projection/elemMatch/
Let me know if you need the query.

Indexing embedded mongoDB documents (in an array) with Solr

Is there any way, how I can make Solr index embedded mongoDB documents? We already can index top-level values of keys in a mongo document via mongo-connector, pushes the data to Solr.
However, in situations like in this structure which represents a post:
{
author: "someone",
post_text : "some really long text which is already indexed by solr",
comments : [
{
author:"someone else"
comment_text:"some quite long comment, which I do not
know how to index in Solr"
},
{
author:"me"
comment_text:"another quite long comment, which I do not
know how to index in Solr"
}
]
}
This is just an example structure. In our project, we handle more complicated structures, and sometimes, the text we want to index is nested on a second or third level (depth, or what is the formal name for it).
I believe that there is a community of mongoDB + Solr users and so that this issue must have been adressed before, but I was unable to find good materials, that would cover this problem, if there is a nice way, how to handle this or whether there is no solution and workarounds have yet to be founded (and maybe you could provide me with one)
For a better understanding, one of our structures have at top level key that has for its value an array of some several analysis results, where one of them has an array of singular values, that are parts of the result. We need to index these values. E.g. (this is not the actual data structure, we use):
{...
Analysis_performed: [
{
User_tags:
[
{
tag_name: "awesome",
tag_score: 180
},
{
tag_name: "boring",
tag_score: 10
}
]
}
]
}
In this case we would need to index on the tag names. There is a possibility of us having a bad structure for storing the data, we want to store, but we thought hard about it and we think it's quite good. However, even if we switch to less nested information, we will most likely come across at least one situation where we will have to index information stored in embedded documents that are in an array and this is the question's main focus. Can we index such data with SOLR somehow?
I had a question like this a couple months ago. My solution is to use doc_manager.
You can use solr_doc_manager (upsert method), to modify document posted into solr. For example, if you have
ACL: {
Read: [ id1, id2 ... ]
}
you can handle it something like
def upsert(self, doc):
if ("ACL" in doc) and ("Read" in doc["ACL"]):
doc["ACL.Read"] = []
for item in doc["ACL"]["Read"]:
if not isinstance(item, dict):
id = ObjectId(item)
doc["ACL.Read"].append(str(id))
self.solr.add([doc], commit=False)
It adds new field - ACL.Read. This field is multivalued and stores list of ids from ACL : { Read: [ ... ] }
If you do not want to write you own handlers for nested documents, you can try another mongo connector. Github project page https://github.com/SelfishInc/solr-mongo-connector. It supports nested documents out of the box.
Official 10gen mongo connector now supports flattening of arrays and indexing subdocuments.
See https://github.com/10gen-labs/mongo-connector
However for arrays it does something unpleasant like this. It would transform this document:
{
"hashtagEntities" : [
{
"start" : "66",
"end" : "81",
"text" : "startupweekend"
},
{
"start" : "82",
"end" : "90",
"text" : "startup"
},
{
"start" : "91",
"end" : "100",
"text" : "startups"
},
{
"start" : "101",
"end" : "108",
"text" : "london"
}
]
}
into this:
{
"hashtagEntities.0.start" : "66",
"hashtagEntities.0.end" : "81",
"hashtagEntities.0.text" : "startupweekend",
"hashtagEntities.1.start" : "82",
"hashtagEntities.1.end" : "90",
"hashtagEntities.1.text" : "startup",
....
}
The above is very difficult to index in Solr - even more if you have no stable schema for your documents. We wanted something more like this:
{
"hashtagEntities.xArray.start": [
"66",
"82",
"91",
"101"
],
"hashtagEntities.xArray.text": [
"startupweekend",
"startup",
"startups",
"london"
],
"hashtagEntities.xArray.end": [
"81",
"90",
"100",
"108"
],
}
I have implemented an alternative solr_doc_manager.py
If you want to use this, just edit the flatten_doc function in your doc_manager to this, to support such functionality:
def flattened(doc):
return dict(flattened_kernel(doc, []))
def flattened_kernel(doc, path):
for k, v in doc.items():
path.append(k)
if isinstance(v, dict):
for inner_k, inner_v in flattened_kernel(v, path):
yield inner_k, inner_v
elif isinstance(v, list):
for inner_k, inner_v in flattened_list(v, path).items():
yield inner_k, inner_v
path.pop()
else:
yield ".".join(path), v
path.pop()
def flattened_list(v, path):
tem = dict()
#path2 = list()
path.append(str("xArray"))
for li, lv in enumerate(v):
if isinstance(lv, dict):
for dk, dv in flattened_kernel(lv, path):
got = tem.get(dk, list())
if isinstance(dv, list):
got.extend(dv)
else:
got.append(dv)
tem[dk] = got
else:
got = tem.get(".".join(path)+".ROOT", list())
if isinstance(lv, list):
got.extend(lv)
else:
got.append(lv)
tem[".".join(path)+".ROOT"] = got
return tem
In case you do not want to lose data from array, which are not sub-documents, this implementation will place the data into a "array.ROOT" attribute. See here:
{
"array" : [
{
"innerArray" : [
{
"c" : 1,
"d" : 2
},
{
"ahah" : "asdf"
},
42,
43
]
},
1,
2
],
}
into:
{
"array.xArray.ROOT": [
"1.0",
"2.0"
],
"array.xArray.innerArray.xArray.ROOT": [
"42.0",
"43.0"
],
"array.xArray.innerArray.xArray.c": [
"1.0"
],
"array.xArray.innerArray.xArray.d": [
"2.0"
],
"array.xArray.innerArray.xArray.ahah": [
"asdf"
]
}
I had the same issue, I want to index/store in Solr complicated documents. My approach was to modify the the JsonLoader to accept complicated json documents with arrays/objects as values.
It stores the object/array and then flatten it and indexes the fields.
e.g basic example document
{
"titles_json":{"FR":"This is the FR title" , "EN":"This is the EN title"} ,
"id": 1000003,
"guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
}
It will store
titles_json:{
"FR":"This is the FR title" ,
"EN":"This is the EN title"
}
and then index fields
titles.FR:"This is the FR title"
titles.EN:"This is the EN title"
Not only you will be able to index the child documents, but also when you perform a search on solr you will receive the original complicated structure of the document that you indexed.
If you want to check the source code, installation and integration details with your existing solr, check
http://www.solrfromscratch.com/2014/08/20/embedded-documents-in-solr/
please note that I have tested this for solr 4.9.0
M.