Indexing embedded mongoDB documents (in an array) with Solr - mongodb

Is there any way to make Solr index embedded mongoDB documents? We can already index top-level keys of a mongo document via mongo-connector, which pushes the data to Solr.
However, consider this structure, which represents a post:
{
    author: "someone",
    post_text: "some really long text which is already indexed by solr",
    comments: [
        {
            author: "someone else",
            comment_text: "some quite long comment, which I do not know how to index in Solr"
        },
        {
            author: "me",
            comment_text: "another quite long comment, which I do not know how to index in Solr"
        }
    ]
}
This is just an example structure. In our project we handle more complicated structures, and sometimes the text we want to index is nested two or three levels deep.
I believe there is a community of mongoDB + Solr users, so this issue must have been addressed before, but I was unable to find good material covering this problem: whether there is a nice way to handle it, or whether there is no solution and workarounds have yet to be found (and maybe you could provide me with one).
For a better understanding: one of our structures has a top-level key whose value is an array of analysis results, one of which in turn holds an array of singular values that are parts of the result. We need to index these values. E.g. (this is not the actual data structure we use):
{ ...
    Analysis_performed: [
        {
            User_tags: [
                {
                    tag_name: "awesome",
                    tag_score: 180
                },
                {
                    tag_name: "boring",
                    tag_score: 10
                }
            ]
        }
    ]
}
In this case we would need to index the tag names. It is possible that we have a bad structure for storing the data, but we thought hard about it and we think it is quite good. However, even if we switch to less nested information, we will most likely come across at least one situation where we have to index information stored in embedded documents that are in an array, and that is this question's main focus. Can we index such data with Solr somehow?

I had a question like this a couple of months ago. My solution was to use a custom doc_manager.
You can use solr_doc_manager (its upsert method) to modify documents before they are posted to Solr. For example, if you have
ACL: {
Read: [ id1, id2 ... ]
}
you can handle it with something like this:
from bson.objectid import ObjectId

def upsert(self, doc):
    if "ACL" in doc and "Read" in doc["ACL"]:
        doc["ACL.Read"] = []
        for item in doc["ACL"]["Read"]:
            if not isinstance(item, dict):
                # store each id as a plain string in the flat field
                doc["ACL.Read"].append(str(ObjectId(item)))
    self.solr.add([doc], commit=False)
This adds a new field, ACL.Read. The field is multivalued and stores the list of ids from ACL: { Read: [ ... ] }.
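Stripped of the Solr and ObjectId plumbing, the transformation the hook performs can be sketched in plain Python (flatten_acl is a hypothetical name for illustration; real entries would be ObjectId-compatible strings):

```python
def flatten_acl(doc):
    # Copy the ids under ACL.Read into a flat, multivalued "ACL.Read" field,
    # skipping any entries that are themselves sub-documents.
    if "ACL" in doc and "Read" in doc["ACL"]:
        doc["ACL.Read"] = [str(item) for item in doc["ACL"]["Read"]
                           if not isinstance(item, dict)]
    return doc

doc = flatten_acl({"ACL": {"Read": ["id1", "id2"]}})
print(doc["ACL.Read"])  # ['id1', 'id2']
```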
If you do not want to write your own handlers for nested documents, you can try another mongo connector: https://github.com/SelfishInc/solr-mongo-connector. It supports nested documents out of the box.

Official 10gen mongo connector now supports flattening of arrays and indexing subdocuments.
See https://github.com/10gen-labs/mongo-connector
However, for arrays it does something unpleasant. It would transform this document:
{
    "hashtagEntities" : [
        {
            "start" : "66",
            "end" : "81",
            "text" : "startupweekend"
        },
        {
            "start" : "82",
            "end" : "90",
            "text" : "startup"
        },
        {
            "start" : "91",
            "end" : "100",
            "text" : "startups"
        },
        {
            "start" : "101",
            "end" : "108",
            "text" : "london"
        }
    ]
}
into this:
{
    "hashtagEntities.0.start" : "66",
    "hashtagEntities.0.end" : "81",
    "hashtagEntities.0.text" : "startupweekend",
    "hashtagEntities.1.start" : "82",
    "hashtagEntities.1.end" : "90",
    "hashtagEntities.1.text" : "startup",
    ....
}
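The index-suffixed flattening can be illustrated with a small standalone sketch (illustrative only, not the connector's actual code):

```python
def flatten_by_index(doc, prefix=""):
    # Flatten nested dicts into dotted paths; array elements get their
    # numeric position baked into the key, as described above.
    out = {}
    for k, v in doc.items():
        key = prefix + k
        if isinstance(v, dict):
            out.update(flatten_by_index(v, key + "."))
        elif isinstance(v, list):
            for i, item in enumerate(v):
                if isinstance(item, dict):
                    out.update(flatten_by_index(item, "%s.%d." % (key, i)))
                else:
                    out["%s.%d" % (key, i)] = item
        else:
            out[key] = v
    return out

doc = {"hashtagEntities": [{"start": "66", "text": "startupweekend"},
                           {"start": "82", "text": "startup"}]}
print(flatten_by_index(doc))
```

Note how every array position produces its own distinct field name, which is why a stable Solr schema is hard to define for such output.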
The above is very difficult to index in Solr, even more so if you have no stable schema for your documents. We wanted something more like this:
{
    "hashtagEntities.xArray.start": [ "66", "82", "91", "101" ],
    "hashtagEntities.xArray.text": [ "startupweekend", "startup", "startups", "london" ],
    "hashtagEntities.xArray.end": [ "81", "90", "100", "108" ]
}
I have implemented an alternative solr_doc_manager.py. If you want this functionality, just edit the document-flattening function in your doc_manager to the following:
def flattened(doc):
    return dict(flattened_kernel(doc, []))

def flattened_kernel(doc, path):
    for k, v in doc.items():
        path.append(k)
        if isinstance(v, dict):
            for inner_k, inner_v in flattened_kernel(v, path):
                yield inner_k, inner_v
        elif isinstance(v, list):
            for inner_k, inner_v in flattened_list(v, path).items():
                yield inner_k, inner_v
            path.pop()  # pop the "xArray" marker pushed by flattened_list
        else:
            yield ".".join(path), v
        path.pop()  # pop k

def flattened_list(v, path):
    tem = dict()
    path.append("xArray")  # marker that collapses all array indexes into one key
    for lv in v:
        if isinstance(lv, dict):
            for dk, dv in flattened_kernel(lv, path):
                got = tem.get(dk, list())
                if isinstance(dv, list):
                    got.extend(dv)
                else:
                    got.append(dv)
                tem[dk] = got
        else:
            # plain (non-document) array values are kept under an extra ".ROOT" key
            got = tem.get(".".join(path) + ".ROOT", list())
            if isinstance(lv, list):
                got.extend(lv)
            else:
                got.append(lv)
            tem[".".join(path) + ".ROOT"] = got
    return tem
In case you do not want to lose data from array entries that are not sub-documents, this implementation will place such data into an "array.ROOT" attribute. For example, it turns this:
{
    "array" : [
        {
            "innerArray" : [
                {
                    "c" : 1,
                    "d" : 2
                },
                {
                    "ahah" : "asdf"
                },
                42,
                43
            ]
        },
        1,
        2
    ]
}
into:
{
    "array.xArray.ROOT": [ "1.0", "2.0" ],
    "array.xArray.innerArray.xArray.ROOT": [ "42.0", "43.0" ],
    "array.xArray.innerArray.xArray.c": [ "1.0" ],
    "array.xArray.innerArray.xArray.d": [ "2.0" ],
    "array.xArray.innerArray.xArray.ahah": [ "asdf" ]
}

I had the same issue: I wanted to index/store complicated documents in Solr. My approach was to modify the JsonLoader to accept complicated json documents with arrays/objects as values.
It stores the object/array, then flattens it and indexes the fields.
E.g., a basic example document:
{
"titles_json":{"FR":"This is the FR title" , "EN":"This is the EN title"} ,
"id": 1000003,
"guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
}
It will store
titles_json:{
"FR":"This is the FR title" ,
"EN":"This is the EN title"
}
and then index fields
titles.FR:"This is the FR title"
titles.EN:"This is the EN title"
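The flattening step behind this can be sketched like so (an illustrative standalone version, not the modified JsonLoader's actual code):

```python
def flatten_fields(obj, prefix=""):
    # Walk a nested JSON object and emit dotted field names for the leaf values.
    fields = {}
    for k, v in obj.items():
        key = k if not prefix else prefix + "." + k
        if isinstance(v, dict):
            fields.update(flatten_fields(v, key))
        else:
            fields[key] = v
    return fields

doc = {"titles_json": {"FR": "This is the FR title", "EN": "This is the EN title"},
       "id": 1000003}
print(flatten_fields(doc))
```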
Not only will you be able to index the child documents, but when you perform a search in Solr you will also receive the original complicated structure of the document that you indexed.
If you want to check the source code, installation and integration details with your existing solr, check
http://www.solrfromscratch.com/2014/08/20/embedded-documents-in-solr/
Please note that I have tested this with Solr 4.9.0.
M.

Related

How to find nodes with an object that contains a string value

I'm struggling to create a find query that finds nodes that contain "Item1".
{
"_id" : ObjectId("589274f49bd4d562f0a15e07"),
"Value" : [["Item1", {
"Name" : "John",
"Age" : 45
}], ["Item2", {
"Address" : "123 Main St.",
"City" : "Hometown",
"State" : "ZZ"
}]]
}
In this example, "Item1" is not a key/value pair, but rather just a string that is part of an array that is part of a larger array. This is a legacy format so I can't adjust it unfortunately.
I've tried something like { Value: { $elemMatch: { $elemMatch: { "Item1" } } } }, but that is not returning any matches. Similarly, $regex is not working, since it only seems to match on string objects (and the overall object is not a string, but a string in an array in an array).
It seems like you should use the $in or $eq operator to match the value.
So try this:
db.collection.find({'Value':{$elemMatch:{$elemMatch:{$in:['Item1']}}}})
Or run this to get the specific Item
db.collection.find({},{'Value':{$elemMatch:{$elemMatch:{$in:['Item1']}}}})
Hope this helps.
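For intuition, the nested $elemMatch with $in behaves roughly like this pure-Python check (a hypothetical helper, not pymongo):

```python
def matches_nested_in(doc, target):
    # Mimics {'Value': {'$elemMatch': {'$elemMatch': {'$in': [target]}}}}:
    # some inner array of the outer 'Value' array contains target.
    return any(isinstance(inner, list) and target in inner
               for inner in doc.get("Value", []))

doc = {"Value": [["Item1", {"Name": "John", "Age": 45}],
                 ["Item2", {"Address": "123 Main St."}]]}
print(matches_nested_in(doc, "Item1"))  # True
```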
var data = {
    "_id": "ObjectId('589274f49bd4d562f0a15e07')",
    "Value": [
        [
            "Item1",
            { "Name": "John", "Age": 45 }
        ],
        [
            "Item2",
            { "Address": "123 Main St.", "City": "Hometown", "State": "ZZ" }
        ]
    ]
}
data.Value[0][0] // 'Item1'
Copy and paste it into a REPL; it works.
There was an error in the structure of your data.

MongoDB: Delete item inside multiple objects

I'm trying to delete an item inside an object categorized under multiple keys.
for example, deleting ObjectId("c") from every items section
This is the structure:
{
    "somefield" : "value",
    "somefield2" : "value",
    "objects" : {
        "/" : {
            "color" : "#112233",
            "items" : [
                ObjectId("c"),
                ObjectId("b")
            ]
        },
        "/folder1" : {
            "color" : "#112233",
            "items" : [
                ObjectId("c"),
                ObjectId("d")
            ]
        },
        "/folder2" : {
            "color" : "112233",
            "items" : []
        },
        "/testing" : {
            "color" : "112233",
            "items" : [
                ObjectId("c"),
                ObjectId("f")
            ]
        }
    }
}
I tried with pull and unset like:
db.getCollection('col').update(
{},
{ $unset: { 'objects.$.items': ObjectId("c") } },
{ multi: true }
)
and
db.getCollection('col').update(
{},
{ "objects": {"items": { $pull: [ObjectId("c")] } } },
{ multi: true }
)
Any idea? thanks!
The problem here is largely with the current structure of your document. MongoDB cannot "traverse paths" in an efficient way, and your structure currently has an "Object" ( 'objects' ) which has named "keys". What this means is that accessing "items" within each "key" needs the explicit path to each key to be able to see that element. There are no wildcards here:
db.getCollection("coll").find({ "objects./.items": ObjectId("c") })
And that is the basic principle of "matching" something: you cannot do it "across all keys" without resorting to JavaScript code, which is really bad.
Change the structure. Rather than "object keys", use "arrays" instead, like this:
{
    "somefield" : "value",
    "somefield2" : "value",
    "objects" : [
        {
            "path": "/",
            "color" : "#112233",
            "items" : [
                "c",
                "b"
            ]
        },
        {
            "path": "/folder1",
            "color" : "#112233",
            "items" : [
                "c",
                "d"
            ]
        },
        {
            "path": "/folder2",
            "color" : "112233",
            "items" : []
        },
        {
            "path": "/testing",
            "color" : "112233",
            "items" : [
                "c",
                "f"
            ]
        }
    ]
}
It's much more flexible in the long run, and also allows you to "index" fields like "path" for use in query matching.
However, it's not going to help you much here, as even with a consistent query path, i.e.:
db.getCollection("coll").find({ "objects.items": "c" })
which is better, the problem still persists: it is not possible to $pull from multiple sources (whether object or array) in the same singular operation. And that is compounded by it "never" being possible across multiple documents.
So the best you will ever get here is basically "trying" the "multi-update" concept until the options are exhausted and there is nothing left to "update". With the "modified" structure presented, you can do this:
var bulk = db.getCollection("coll").initializeOrderedBulkOp(),
    count = 0,
    modified = 1;

while ( modified != 0 ) {
    bulk.find({ "objects.items": "c" }).update({
        "$pull": { "objects.$.items": "c" }
    });
    count++;
    var result = bulk.execute();
    bulk = db.getCollection("coll").initializeOrderedBulkOp();
    modified = result.nModified;
}

print("iterated: " + count);
That uses the "Bulk" operations API ( actually all shell methods now use it anyway ) to basically get a "better write response" that gives you useful information about what actually happened on the "update" attempt.
The point is that it basically "loops", trying to match a document based on the "query" portion of the update, and then tries to $pull from the matched array index an item from the "inner array" that matches the conditions given to $pull (which acts as a "query" in itself, just upon the array items).
On each iteration you basically get the "nModified" value from the response, and when this is finally 0, then the operation is complete.
On the sample (restructured) given, this will take 4 iterations: one for each "outer" array member that contains a match, plus a final pass that modifies nothing. The updates are "multi" as implied by bulk .update() (as opposed to .updateOne()) already, and therefore the "maximum" number of iterations is determined by the "maximum" number of matching array elements present in the "outer" array across the whole collection. So if there is "one" document out of "one thousand" that has 20 matching entries, then there will be 20 iterations, just because that document still has something that can be matched and modified.
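The loop's behaviour, and where the iteration count comes from, can be simulated in plain Python (a sketch with ids abstracted to strings; pull_until_done is a hypothetical name):

```python
def pull_until_done(docs, target):
    # Each pass mimics one multi-update: the positional $ operator removes
    # the target from only the FIRST matching array element of each document.
    passes = 0
    while True:
        passes += 1
        modified = 0
        for doc in docs:
            for obj in doc["objects"]:
                if target in obj["items"]:
                    obj["items"] = [i for i in obj["items"] if i != target]
                    modified += 1
                    break  # positional $ touches one element per document per pass
        if modified == 0:
            return passes

docs = [{"objects": [{"path": "/", "items": ["c", "b"]},
                     {"path": "/folder1", "items": ["c", "d"]},
                     {"path": "/folder2", "items": []},
                     {"path": "/testing", "items": ["c", "f"]}]}]
print(pull_until_done(docs, "c"))  # 4
```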
The alternate case under your current structure does not bear mentioning. It is just plain "impossible" without:
Retrieving the document individually
Extracting the present keys
Running an individual $pull for the array under that key
Get next document, rinse and repeat
So "multi" is "right out" as an option and cannot be done without some possible "foreknowledge" of the keys under the "objects" key in the document.
So please "change your structure" and be aware of the general limitations available.
You cannot possibly do this in "one" update, but at least if the maximum "array entries" your document has was "4", then it is better to do "four" updates over a "thousand" documents than the "four thousand" that would be required otherwise.
Also. Please do not "obfuscate" the ObjectId value in posts. People like to "copy/paste" code and data to test for themselves. Using something like ObjectId("c") which is not a valid ObjectId value would clearly cause errors, and therefore is not practical for people to use.
Do what "I did" in the listing, and if you want to abstract/obfuscate, then do it with "plain values" just as I have shown.
One approach that you could take is using JavaScript native methods like reduce to create the documents that will be used in the update.
You essentially need an operation like the following:
var itemId = ObjectId("55ba3a983857192828978fec");
db.col.find().forEach(function(doc) {
    var update = {
        "objects./.items": itemId,
        "objects./folder1.items": itemId,
        "objects./folder2.items": itemId,
        "objects./testing.items": itemId
    };
    db.col.update(
        { "_id": doc._id },
        { "$pull": update }
    );
})
Thus, to create the update object dynamically you would use the reduce method, which converts an array into an object:
var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
    o["objects." + v + ".items"] = itemId;
    return o;
}, {});
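If you are scripting this from application code instead of the shell (e.g. with pymongo), the same $pull document can be built with a dict comprehension playing the role of reduce (build_pull_update is a hypothetical helper name):

```python
def build_pull_update(doc, item_id):
    # One "objects.<key>.items" entry per key present in this document,
    # all pulling the same id.
    return {"objects.%s.items" % key: item_id for key in doc["objects"]}

doc = {"objects": {"/": {"items": ["c", "b"]}, "/folder1": {"items": ["c", "d"]}}}
print(build_pull_update(doc, "c"))
```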
Overall, you would need to use the Bulk operations to achieve the above update:
var bulk = db.col.initializeUnorderedBulkOp(),
    itemId = ObjectId("55ba3a983857192828978fec"),
    count = 0;

db.col.find().forEach(function(doc) {
    var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
        o["objects." + v + ".items"] = itemId;
        return o;
    }, {});
    bulk.find({ "_id": doc._id }).updateOne({
        "$pull": update
    });
    count++;
    if (count % 1000 == 0) {
        bulk.execute();
        bulk = db.col.initializeUnorderedBulkOp();
    }
})

if (count % 1000 != 0) { bulk.execute(); }

dynamic size of subdocument mongodb

I'm using mongodb and mongoose for my web application. The web app is used for registration for swimming competitions and each competition can have X number of races. My data structure as of now:
{
    "_id": "1",
    "name": "Utmanaren",
    "location": "town",
    "startdate": "20150627",
    "enddate": "20150627",
    "race" : {
        "gender" : "m",
        "style" : "freestyle",
        "length" : "100"
    }
}
Doing this, I need to determine and define the number of races for every competition up front. A solution I tried is having a separate document with an id referencing the competition a race belongs to, like below.
{
    "belongsTOId" : "1",
    "gender" : "m",
    "style" : "freestyle",
    "length" : "100"
}
{
    "belongsTOId" : "1",
    "gender" : "f",
    "style" : "butterfly",
    "length" : "50"
}
Is there a way of creating and defining dynamic number of races as a subdocument while using Mongodb?
Thanks!
You basically have two approaches to modelling your data structure: you can either reference or embed the races documents.
Let's consider the following example that maps swimming competition and multiple races relationships. This demonstrates the advantage of embedding over referencing if you need to view many data entities in context of another. In this one-to-many relationship between competition and race data, the competition has multiple races entities:
// db.competition schema
{
    "_id": 1,
    "name": "Utmanaren",
    "location": "town",
    "startdate": "20150627",
    "enddate": "20150627",
    "races": [
        {
            "gender" : "m",
            "style" : "freestyle",
            "length" : "100"
        },
        {
            "gender" : "f",
            "style" : "butterfly",
            "length" : "50"
        }
    ]
}
With the embedded data model, your application can retrieve the complete swimming competition information with just one query. This design has other merits as well, one of them being data locality. Since MongoDB stores data contiguously on disk, putting all the data you need in one document ensures that the spinning disks will take less time to seek to a particular location on the disk. The other advantage with embedded documents is the atomicity and isolation in writing data. To illustrate this, say you want to remove a competition which has a race "style" property with value "butterfly", this can be done with one single (atomic) operation:
db.competition.remove({"races.style": "butterfly"});
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, specifically Model One-to-Many Relationships with Embedded Documents
The other design option is referencing: the documents follow a normalized schema where each race document contains a reference to its competition document:
// db.race schema
{
"_id": 1,
"competition_id": 1,
"gender": "m",
"style": "freestyle",
"length": "100"
},
{
"_id": 2,
"competition_id": 1,
"gender": "f",
"style": "butterfly",
"length": "50"
}
The above approach gives increased flexibility in performing queries. For instance, retrieving all child race documents whose parent competition has id 1 is straightforward; simply query the race collection:
db.race.find({ "competition_id": 1 });
The above normalized schema using the document-reference approach also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of race documents per competition, embedding has serious setbacks as far as space constraints are concerned, because the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
If your application frequently retrieves the race data with the competition information, then your application needs to issue multiple queries to resolve the references.
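In application code, resolving such a reference client-side looks roughly like this (attach_races is a hypothetical helper; the field names follow the schema above):

```python
def attach_races(competition, race_docs):
    # Client-side join: pair the competition with the race documents
    # whose competition_id points back at it.
    joined = dict(competition)
    joined["races"] = [r for r in race_docs
                       if r["competition_id"] == competition["_id"]]
    return joined

competition = {"_id": 1, "name": "Utmanaren"}
races = [{"_id": 1, "competition_id": 1, "style": "freestyle"},
         {"_id": 2, "competition_id": 1, "style": "butterfly"},
         {"_id": 3, "competition_id": 2, "style": "medley"}]
print(len(attach_races(competition, races)["races"]))  # 2
```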
The general rule of thumb is that if your application's query pattern is well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model is appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland
You basically want to update the data, so you should upsert it, which is basically an update on the subdocument key:
Keep an array of keys in the main document.
Insert the sub-document and add its key to the list, or update the list.
To push a single item into the field:
db.yourcollection.update( {}, { $push: { "races": { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"} } } );
To push multiple items into the field (this allows duplicates in the field):
db.yourcollection.update( {}, { $push: { "races": { $each: [ { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"}, { "belongsTOId" : "2" , "gender" : "m" , "style" : "horse" , "length" : "70"} ] } } } );
To push multiple items without duplicates:
db.yourcollection.update( {}, { $addToSet: { "races": { $each: [ { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"}, { "belongsTOId" : "2" , "gender" : "m" , "style" : "horse" , "length" : "70"} ] } } } );
$pushAll has been deprecated since version 2.4, so we use $each with $push instead of $pushAll.
While using $push you will also be able to sort and slice the items; check the mongodb manual.
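The difference between $push with $each and $addToSet with $each boils down to the following (a plain-Python sketch of the two behaviours, not MongoDB code):

```python
def push_each(arr, items):
    # $push + $each: always appends; duplicates are allowed.
    arr.extend(items)
    return arr

def add_to_set_each(arr, items):
    # $addToSet + $each: appends only items not already present.
    for item in items:
        if item not in arr:
            arr.append(item)
    return arr

races = [{"style": "butterfly"}]
print(len(push_each(list(races), [{"style": "butterfly"}])))        # 2
print(len(add_to_set_each(list(races), [{"style": "butterfly"}])))  # 1
```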

MongoDB: How can I order by distance considering multiple fields?

I have a collection that stores information about doctors. Each doctor can work in private practices and/or in hospitals.
The collection has the following relevant fields (there are geospatial indexes on both privatePractices.address.loc and hospitals.address.loc):
{
    "name" : "myName",
    "privatePractices" : [{
        "_id": 1,
        "address" : {
            "loc" : {
                "lng" : 2.1608502864837646,
                "lat" : 41.3943977355957
            }
        }
    },
    ...
    ],
    "hospitals" : [{
        "_id": 5,
        "address" : {
            "loc" : {
                "lng" : 2.8192520141601562,
                "lat" : 41.97784423828125
            }
        }
    },
    ...
    ]
}
I am trying to query that collection to get a list of doctors ordered by distance from a given point. This is where I am stuck:
The following queries return a list of doctors ordered by distance to the point defined in $nearSphere, considering only one of the two location types:
{ "hospitals.address.loc" : { "$nearSphere" : [2.1933, 41.4008] } }
{ "privatePractices.address.loc" : { "$nearSphere" : [2.1933, 41.4008] } }
What I want is to get the doctors ordered by the nearest hospital OR private practice, whatever is the nearest. Is it possible to do it on a single Mongo query?
Plan B is to use the queries above and then manually order the results outside Mongo (eg. using Linq). To do this, my two queries should return the distance of each hospital or private practice to the $nearSphere point. Is it possible to do that in Mongo?
EDIT - APPLIED SOLUTION (MongoDB 2.6):
I took my own approach inspired by what Neil Lunn suggests in his answer: I added a field in the Doctor document for sorting purposes, containing an array with all the locations of the doctor.
I tried this approach in MongoDB 2.4 and MongoDB 2.6, and the results are different.
Queries on 2.4 returned duplicate doctors for those that had more than one location, even if the _id was included in the query filter. Queries on 2.6 returned valid results.
I would have been hoping for a little more information here, but the basics still apply. The general problem you have stumbled on is trying to have "two" location fields on what appear to be your doctor documents.
There is another problem with the approach: you have the "locations" within arrays in your document. This would not give you an error when creating the index, but it is also not going to work like you expect. The big problem here is that, being within an array, you might find the document that "contains" the nearest location, but then the question is "which one", as nothing is done to affect the array content.
The core problem, though, is that you cannot have more than one geo-spatial index per query. But to really get what you want, turn the problem on its head and essentially attach the doctors to the locations, which is the other way around.
For example here, a "practices" collection or such:
{
    "type": "Hospital",
    "address" : {
        "loc" : {
            "lng" : 2.8192520141601562,
            "lat" : 41.97784423828125
        }
    },
    "doctors": [
        { "_id": 1, "name": "doc1", "specialty": "bones" },
        { "_id": 2, "name": "doc2", "specialty": "heart" }
    ]
}
{
    "type": "Private",
    "address" : {
        "loc" : {
            "lng" : 2.1608502864837646,
            "lat" : 41.3943977355957
        }
    },
    "doctors": [
        { "_id": 1, "name": "doc1", "specialty": "bones" },
        { "_id": 3, "name": "doc3", "specialty": "brain" }
    ]
}
The advantage here is that, with a single collection and everything in the same index, you can simply get both "types", correctly ordered by distance or within bounds or whatever your geo-queries need. This avoids the problems of the other modelling form.
As for the "doctors" information, of course you actually keep a separate collection for the full doctor information themselves, and possibly even keep an array of the _id values for the location documents there as well. But the main point here is that you can generally live with "embedding" some of the useful search information in a collection here that will help you.
That seems to be the better option here. Matching a doctor to criteria from inside the location is something that can be done, whereas finding or sorting on the nearest entry inside an array is not going to be supported by MongoDB itself and would result in you applying the math yourself while processing the results.
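If you do fall back to the asker's "Plan B" and order results client-side, the standard haversine formula gives the distances (a sketch; the field names follow the question's schema, and the helper names are hypothetical):

```python
import math

def haversine_km(lng1, lat1, lng2, lat2):
    # Great-circle distance between two (lng, lat) points, in kilometres.
    rlng1, rlat1, rlng2, rlat2 = map(math.radians, (lng1, lat1, lng2, lat2))
    a = (math.sin((rlat2 - rlat1) / 2) ** 2
         + math.cos(rlat1) * math.cos(rlat2) * math.sin((rlng2 - rlng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest_location_km(doctor, lng, lat):
    # Distance from the query point to the doctor's closest practice or hospital.
    locs = [p["address"]["loc"]
            for p in doctor.get("privatePractices", []) + doctor.get("hospitals", [])]
    return min(haversine_km(lng, lat, loc["lng"], loc["lat"]) for loc in locs)

doctor = {"name": "myName",
          "privatePractices": [{"address": {"loc": {"lng": 2.1608, "lat": 41.3943}}}],
          "hospitals": [{"address": {"loc": {"lng": 2.8192, "lat": 41.9778}}}]}
print(round(nearest_location_km(doctor, 2.1933, 41.4008), 1))
```

Sorting a list of doctor documents by `nearest_location_km` then reproduces the "ordered by nearest hospital OR private practice" behaviour outside Mongo.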

How do I query a hash sub-object that is dynamic in mongodb?

I currently have a Question object and am not sure how to query for it:
{
    "title" : "Do you eat fast food?",
    "answers" : [
        {
            "_id" : "506b422ff42c95000e00000d",
            "title" : "Yes",
            "trait_score_modifiers" : {
                "hungry" : 1
            }
        },
        {
            "_id" : "506b422ff42c95000e00000e",
            "title" : "No",
            "trait_score_modifiers" : {
                "not-hungry" : -1
            }
        }
    ]
}
I am trying to find questions by querying on the trait_score_modifiers keys (sometimes a given key exists, sometimes not).
I have the following but it is not dynamic:
db.questions.find({"answers.trait_score_modifiers.not-hungry":{$exists: true}})
How could i do something like this?
db.questions.find({"answers.trait_score_modifiers.{}.size":{$gt: 0}})
You should modify the schema so you have consistent key names to query on. I ran into a similar problem using the aggregation framework; see the question Total values from all keys in subdocument.
Something like this should work (not tested):
{
    "title" : "Do you eat fast food?",
    "answers" : [
        {
            "title" : "Yes",
            "trait_score_modifiers" : [
                {"dimension": "hungry", "value": 1}
            ]
        },
        {
            "title" : "No",
            "trait_score_modifiers" : [
                {"dimension": "not-hungry", "value": -1}
            ]
        }
    ]
}
You can return all questions that have a dynamic dimension (e.g. "my new dimension") with:
db.questions.find({"answers.trait_score_modifiers.dimension": "my new dimension"})
Or limit the returned set to questions that have a specific value on that dimension (e.g. > 0):
db.questions.find({
    "answers.trait_score_modifiers": {
        "$elemMatch": {
            "dimension": "my new dimension",
            "value": {"$gt": 0}
        }
    }
})
Querying nested arrays can be a bit tricky, so be sure to read up on the documentation. In this case $elemMatch is needed because otherwise you can return a document where "my new dimension" and the matching "value" occur in different array elements.
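Why $elemMatch matters can be seen in a plain-Python rendering of the two query shapes (hypothetical helper names, operating on the restructured schema above):

```python
def matches_elem_match(question, dim, min_value):
    # $elemMatch: dimension and value must match within the SAME modifier element.
    return any(m["dimension"] == dim and m["value"] > min_value
               for a in question["answers"]
               for m in a["trait_score_modifiers"])

def matches_separately(question, dim, min_value):
    # Without $elemMatch: each condition may be satisfied by a different element.
    mods = [m for a in question["answers"] for m in a["trait_score_modifiers"]]
    return (any(m["dimension"] == dim for m in mods)
            and any(m["value"] > min_value for m in mods))

q = {"answers": [
    {"title": "Yes", "trait_score_modifiers": [{"dimension": "hungry", "value": 1}]},
    {"title": "No", "trait_score_modifiers": [{"dimension": "not-hungry", "value": -1}]},
]}
print(matches_elem_match(q, "not-hungry", 0))  # False: same-element match required
print(matches_separately(q, "not-hungry", 0))  # True: conditions met by different elements
```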
You need $elemMatch criteria in your query.
Refer to: http://docs.mongodb.org/manual/reference/projection/elemMatch/
Let me know if you need the query.