How to query a relative element using MongoDB

I have a document like this:
{
"whoKnows" : {
"name" : "Jeff",
"phone" : "123-123-1234"
},
"anotherElement" : {
"name" : "Jeff",
"phone" : "321-321-3211"
}
}
How can any instance of "name" be queried? For example, using a wildcard may look something like,
db.collection.find( { "*.name" : "Jeff" } )
Or if regex were supported in the element place, it might look like,
db.collection.find( { /.*\.name/ : "Jeff" } )
Is it possible to accomplish this using MongoDB?
Side note: I'm not looking for a solution like,
db.collection.find({
"$or": [
{ "whoKnows.name" : "Jeff" },
{ "anotherElement.name" : "Jeff" }
]
})
I need a truly relative path solution as I do not know what the parent element will be (unless there is a way to generate the name of every element - then I could dynamically generate the $or clause at runtime).
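The "generate the $or clause at runtime" idea from the question could be sketched roughly as below. This is only an illustration: buildOrClause is a hypothetical helper name, and it only sees the top-level keys of the one document passed in, so covering a whole collection would mean collecting keys from every document first. MongoDB also rejects an empty $or array, so the result needs checking before use.

```javascript
// Hypothetical helper: builds an $or filter with one "<parent>.<field>" clause
// for every top-level key whose value is a plain sub-document.
function buildOrClause(sampleDoc, field, value) {
  const clauses = [];
  for (const key of Object.keys(sampleDoc)) {
    const child = sampleDoc[key];
    const isPlainObject =
      Object.prototype.toString.call(child) === "[object Object]";
    if (key !== "_id" && isPlainObject) {
      clauses.push({ [key + "." + field]: value });
    }
  }
  return { $or: clauses };
}

const sample = {
  whoKnows: { name: "Jeff", phone: "123-123-1234" },
  anotherElement: { name: "Jeff", phone: "321-321-3211" }
};
const filter = buildOrClause(sample, "name", "Jeff");
// In the mongo shell, filter could then be passed to db.collection.find(filter).
```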

Everything about this structure is fairly horrible: you cannot possibly index something like the "name" values, and the "path" to each attribute is going to vary from document to document. So this is really bad for queries.
I notice you mention "nested" structures, and you still could accommodate this with a similar proposal and some additional tagging, but I want you to consider this "phone book" type example:
{
"phones": [
{
"type": "Home",
"name" : "Jeff",
"phone" : "123-123-1234"
},
{
"type": "Work",
"name" : "Jeff",
"phone" : "123-123-1234"
}
]
}
Since these are actually sub-documents within an array, fields like "name" always share the same path, so not only can you index them (which is going to be good for performance) but the query is very basic:
db.collection.find({ "phones.name": "Jeff" })
That does exactly what you need by finding "Jeff" in any "name" entry. If you need a hierarchy, then add some fields in those sub-documents to indicate the parent/child relationship, which you can use in post-processing, or even a materialized path, which could aid your queries.
It really is the better approach.
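As a rough sketch of how an existing document could be reshaped into that array form: toPhoneBook is a hypothetical helper name, and reusing the old parent key as the "type" value is my own assumption rather than anything from the question.

```javascript
// Hypothetical helper: moves every top-level sub-document into a "phones"
// array and keeps the old parent key in a "type" field.
function toPhoneBook(doc) {
  const phones = [];
  for (const key of Object.keys(doc)) {
    const child = doc[key];
    const isPlainObject =
      Object.prototype.toString.call(child) === "[object Object]";
    if (key !== "_id" && isPlainObject) {
      phones.push(Object.assign({ type: key }, child));
    }
  }
  return { phones: phones };
}

const reshaped = toPhoneBook({
  whoKnows: { name: "Jeff", phone: "123-123-1234" },
  anotherElement: { name: "Jeff", phone: "321-321-3211" }
});
// After storing documents in this shape, an index such as
// db.collection.createIndex({ "phones.name": 1 }) can serve the query.
```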
If you really must keep this kind of structure then at least do something like this with the JavaScript that will bail out on the first match at depth:
db.collection.find(
function () {
var found = false;
var finder = function( obj, field, value ) {
if ( obj.hasOwnProperty(field) && obj[field] == value )
found = true;
if (found) return true;
for( var n in obj ) {
if ( Object.prototype.toString.call(obj[n]) === "[object Object]" ) {
finder( obj[n], field, value );
if (found) return true;
}
}
};
finder( this, "name", "Jeff" );
return found;
}
)
The format there is shorthand notation for the $where operator, which is pretty bad news for performance, but your structure isn't offering much other choice. At any rate, the function should recurse into each nested document until the "field" with the "value" is found.
For anything of production scale, really look at changing the structure to something that can be indexed and accessed quickly. The first example should give you a starting point. Relying on arbitrary JavaScript for queries as your present structure constrains you to is bad news.

If these are similar instances, what stops you from putting them in an array? That would be easier to query.
In its current form, this is as good as writing your own $where condition to parse the whole document structure, which is not an efficient operation!
Although it is highly inefficient and I wouldn't suggest using it in a production environment, the following is one of the simplest ways (with its own catches) to query:
db.query.find({$where: function() { var x = tojsononeline(this); return x.indexOf('"name" : "Jeff",') >= 0; } })
Please note that this will cause a full collection scan; if you have a pre-condition, you may want to specify it before the $where clause in the query.

Related

search phrase or words in document with timestamped words

I've been trying to do this for some days, I guess it's time to ask for a little help.
I'm using Elasticsearch 6.6 (I believe it could be upgraded if needed) and NEST for C# on .NET 5.
The task is to create an index where the documents are the result of speech-to-text recognition, and every recognized word has a timestamp (so that the timestamp can be used to find where the word is spoken in the original file). There are 1000+ texts from media files, and every file is 4 hours long (that usually means 5000~15000 words).
Main idea was to split every text in 3 sec long segments, creating a document with the words in that time segment, and index it so that it can be searched.
I thought that it would not work that well, so the next idea was to create a document for every window of 10~12 words, scanning the text and jumping by 2 words at a time, so that the search could at least match a decent phrase and highlight the hits too.
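To make the window idea concrete, here is a minimal sketch in plain JavaScript, independent of Elasticsearch and of the C# client. buildWindows and the Text field are hypothetical names; Word and StartMillisec are borrowed from the mapping shown later in the question, and step is assumed to be greater than zero.

```javascript
// words: array of { Word, StartMillisec } in spoken order.
// Returns one document per window; each keeps the timestamp of its first word.
function buildWindows(words, windowSize, step) {
  const windows = [];
  for (let start = 0; start < words.length; start += step) {
    const slice = words.slice(start, start + windowSize);
    windows.push({
      StartMillisec: slice[0].StartMillisec,
      Text: slice.map(function (w) { return w.Word; }).join(" ")
    });
    // Stop once a window has reached the end of the text.
    if (start + windowSize >= words.length) break;
  }
  return windows;
}

const spoken = [
  { Word: "a", StartMillisec: 0 },
  { Word: "bunch", StartMillisec: 400 },
  { Word: "of", StartMillisec: 800 },
  { Word: "things", StartMillisec: 1200 },
  { Word: "here", StartMillisec: 1600 }
];
const windows = buildWindows(spoken, 3, 2);
```

Each window would then be indexed as its own document, with Text as an ordinary text field so that phrase queries and highlighting work on it.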
Since that is still far from perfect, I thought it would be nice to index every whole text as a single document to maintain its coherency; the problem is the timestamp associated with every word. To keep this relationship I tried to use nested objects in the document:
PUT index-tapes-nested
{
"mappings" : {
"_doc" : {
"properties" : {
"$type" : { "type" : "text" },
"ContentId" : { "type" : "long" },
"Inserted" : { "type" : "date" },
"TrackId" : { "type" : "long" },
"Words" : {
"type" : "nested",
"properties" : {
"StartMillisec" : { "type" : "integer" },
"Word": { "type" : "text" }
}
}
}
}
}
}
This kinda works, but I don't know exactly how to write the query to search in the index.
A very basic query could be for example:
GET index-tapes-nested/_search
{
"query":{
"nested":{
"path":"Words",
"score_mode":"avg",
"query":{
"match":{
"Words.Word": "a bunch of things"
}
},
"inner_hits": {}
}
}
}
but something like that, especially with the avg scoring, gives low quality results; there could be the right document in the hits, but it doesn't get the word order, so it's not certain and it's not clear.
As far as I understand it, span_near should come in handy in these situations, but I get no results:
GET index-tapes-nested/_search
{
"query": {
"nested":{
"path":"Words",
"score_mode": "avg",
"query": {
"span_near": {
"clauses": [
{ "span_term": { "Words.Word": "bunch" }},
{ "span_term": { "Words.Word": "of" }},
{ "span_term": { "Words.Word": "things" }}
],
"slop": 2,
"in_order": true
}
}
}
}
}
I don't know much about Elasticsearch. Maybe I should change approach and change the model, or maybe rewriting the query is enough; this is pretty time consuming, so any help is really appreciated (is this a fairly common task?). For the sake of brevity I'm cutting some details and ideas; I can provide data or other examples if needed.
I also had problems managing the nested index with the C# NEST client, but that is another story.
This could be interpreted in a few ways, I guess: having something like an "alternative stream" for a field, or metadata for every word, and so on. What I needed was this: https://github.com/elastic/elasticsearch/issues/5736 but it's not done yet, so for now I think I'll go with the annotated_text plugin or the 10-word window.
I have no idea if in the case of indexing single words there can be a query that 'restores' the integrity of the original text (which means 1. grouping them by an id 2. ordering them) so that elasticsearch can give the desired results.
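For what it's worth, the "group by an id, then order" step can at least be done in application code once the single-word hits are fetched. The sketch below is plain JavaScript, not an Elasticsearch query; restoreTexts and findPhrase are hypothetical helper names, and the hits are assumed to carry TrackId, StartMillisec and Word.

```javascript
// hits: flat array of { TrackId, StartMillisec, Word }, one per indexed word.
// Groups the words by TrackId and sorts each group by timestamp.
function restoreTexts(hits) {
  const byTrack = {};
  hits.forEach(function (hit) {
    (byTrack[hit.TrackId] = byTrack[hit.TrackId] || []).push(hit);
  });
  Object.keys(byTrack).forEach(function (id) {
    byTrack[id].sort(function (a, b) { return a.StartMillisec - b.StartMillisec; });
  });
  return byTrack;
}

// Returns the StartMillisec of the first word of the first exact,
// case-insensitive occurrence of the phrase, or -1 when there is none.
function findPhrase(orderedWords, phrase) {
  const terms = phrase.trim().toLowerCase().split(/\s+/);
  for (let i = 0; i + terms.length <= orderedWords.length; i++) {
    const matches = terms.every(function (term, j) {
      return orderedWords[i + j].Word.toLowerCase() === term;
    });
    if (matches) return orderedWords[i].StartMillisec;
  }
  return -1;
}

const texts = restoreTexts([
  { TrackId: 7, StartMillisec: 1200, Word: "things" },
  { TrackId: 7, StartMillisec: 0, Word: "a" },
  { TrackId: 9, StartMillisec: 50, Word: "other" },
  { TrackId: 7, StartMillisec: 800, Word: "of" },
  { TrackId: 7, StartMillisec: 400, Word: "Bunch" }
]);
const phraseStart = findPhrase(texts[7], "bunch of things");
```

This does not give relevance scoring or highlighting, so it is a fallback rather than a replacement for a proper query.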
I'll keep searching the docs for something interesting, or for something I can hack to get what I need (like require_field_match or the intervals query).

How do I update values in a nested array?

I would like to preface this by saying that English is not my mother tongue; if any of my explanations are vague or don't make sense, please let me know and I will attempt to make them clearer.
I have a document containing some nested data. Currently product and customer are arrays, I would prefer to have them as straight up ObjectIDs.
{
"_id" : ObjectId("5bab713622c97440f287f2bf"),
"created_at" : ISODate("2018-09-26T13:44:54.431Z"),
"prod_line" : ObjectId("5b878e4c22c9745f1090de66"),
"entries" : [
{
"order_number" : "123",
"product" : [
ObjectId("5ba8a0e822c974290b2ea18d")
],
"customer" : [
ObjectId("5b86a20922c9745f1a6408d4")
],
"quantity" : "14"
},
{
"order_number" : "456",
"product" : [
ObjectId("5b878ed322c9745f1090de6c")
],
"customer" : [
ObjectId("5b86a20922c9745f1a6408d5")
],
"quantity" : "12"
}
]
}
I tried using the following query to update it, however that proved unsuccessful as Mongo didn't behave quite as I had expected.
db.Document.find().forEach(function(doc){
doc.entries.forEach(function(entry){
var entry_id = entry.product[0]
db.Document.update({_id: doc._id}, {$set:{'product': entry_id}});
print(entry_id)
})
})
With this query it sets product in the root of the object, not quite what I had hoped for. What I was hoping to do was to iterate through entries and change each individual product and customer to be only their ObjectId and not an array. Is it possible to do this via the mongo shell or do I have to look for another way to accomplish this? Thanks!
In order to accomplish your specified behavior, you just need to modify your query structure a bit. Take a look here for the specific MongoDB documentation on how to accomplish this. I will also propose an update to your code below:
db.Document.find().forEach(function(doc) {
doc.entries.forEach(function(entry, index) {
var productElementKey = 'entries.' + index + '.product';
var productSetObject = {};
productSetObject[productElementKey] = entry.product[0];
db.Document.update({_id: doc._id}, {$set: productSetObject});
print(entry.product[0])
})
})
The problem that you were having is that you were not updating the specific element within the entries array, but rather adding a new key named product to the top level of the document. Generally, in order to set the value of an inner document within an array, you need to specify the array key first (entries in this case) and the inner document key second (product in this case). Since you are trying to set specific elements within the entries array, you also need to specify the index in the update key, as I have done above.
In order to update the customer key in the inner documents, simply switch out the product for customer in my above code.
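Rather than running the loop twice, both keys could go into one $set object per document. A minimal sketch, where buildUnwrapSet is a hypothetical helper name and plain strings stand in for the ObjectIds so that it runs outside the shell:

```javascript
// Hypothetical helper: builds one $set object that replaces every
// entries.<index>.product and entries.<index>.customer array with its first item.
function buildUnwrapSet(entries) {
  const setObject = {};
  entries.forEach(function (entry, index) {
    ["product", "customer"].forEach(function (key) {
      // Skip values that are already unwrapped or are empty arrays.
      if (Array.isArray(entry[key]) && entry[key].length > 0) {
        setObject["entries." + index + "." + key] = entry[key][0];
      }
    });
  });
  return setObject;
}

const exampleSet = buildUnwrapSet([
  { order_number: "123", product: ["p1"], customer: ["c1"], quantity: "14" },
  { order_number: "456", product: ["p2"], customer: "c2", quantity: "12" }
]);

// Possible use in the mongo shell (one update per document):
// db.Document.find().forEach(function (doc) {
//   var setObject = buildUnwrapSet(doc.entries);
//   if (Object.keys(setObject).length > 0) {
//     db.Document.update({ _id: doc._id }, { $set: setObject });
//   }
// });
```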
You're trying to add a property 'product' directly into your document with this line
db.Document.update({_id: doc._id}, {$set:{'product': entry_id}});
Try to modify all your entries first, then update your document with this new array of entries.
db.Document.find().forEach(function(doc){
let updatedEntries = [];
doc.entries.forEach(function(entry){
let newEntry = {};
newEntry["order_number"] = entry.order_number;
newEntry["product"] = entry.product[0];
newEntry["customer"] = entry.customer[0];
newEntry["quantity"] = entry.quantity;
updatedEntries.push(newEntry);
})
db.Document.update({_id: doc._id}, {$set:{'entries': updatedEntries}});
})
You'll need to enumerate all the documents and then update them one at a time with the value stored in the first item of the product and customer arrays of each entry:
db.documents.find().snapshot().forEach(function (elem) {
elem.entries.forEach(function(entry){
db.documents.update({
_id: elem._id,
"entries.order_number": entry.order_number
}, {
$set: {
"entries.$.product" : entry.product[0],
"entries.$.customer" : entry.customer[0]
}
}
);
});
});
Instead of issuing one update per entry, you could possibly use the filtered positional operator to update all array items within one update query.
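One caveat: the classic update operators, including the positional ones, can only write a fixed value; they cannot copy each element's own first array item. A different technique that can do it in a single statement is an update with an aggregation pipeline, which assumes MongoDB 4.2 or later. The sketch below only builds the pipeline (unwrapPipeline is a hypothetical name) and assumes every product and customer is still an array, since $arrayElemAt fails on values that were already unwrapped.

```javascript
// Pipeline-style update: rebuilds the entries array, replacing product and
// customer in every element with the first item of the respective array.
const unwrapPipeline = [
  {
    $set: {
      entries: {
        $map: {
          input: "$entries",
          as: "entry",
          in: {
            $mergeObjects: [
              "$$entry",
              {
                product: { $arrayElemAt: ["$$entry.product", 0] },
                customer: { $arrayElemAt: ["$$entry.customer", 0] }
              }
            ]
          }
        }
      }
    }
  }
];

// In the mongo shell (4.2+), this would be applied to all documents with:
// db.documents.updateMany({}, unwrapPipeline);
```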

Searching with dynamic field name in MongoDB

I have a situation where records in MongoDB look like this:
{
"_id" : "xxxx",
"_class" : "xxxx",
"orgId" : xxx,
"targetKeyToOrgIdMap" : {
"46784_56139542ecaa34c13ba9e314" : 46784,
"47530_562f1bc5fc1c1831d38d1900" : 47530,
"700004280_56c18369fc1cde1e2a017afc" : 700004280
},
}
I have to find the records where the child nodes of targetKeyToOrgIdMap have a particular value. That is, I know what the value is going to be in the "46784_56139542ecaa34c13ba9e314" : 46784 part of the record, but the field name is variable: it's a combination of the value and some random string.
In the above example, I have 46784, and I need to find all the records which have 46784 in that respective field.
Is there any way to use a regex, or any other means, to get the records which have the value I need in the child nodes of the field targetKeyToOrgIdMap?
Thanks in advance
You could use MongoDB's $where like this:
db.myCollection.find( { $where: function() {
    for (var key in obj.targetKeyToOrgIdMap) {
        if (obj.targetKeyToOrgIdMap[key] == 46784) {
            return true;
        }
    }
    return false;
}}).forEach(function (doc) {
    printjson(doc);
});
But be aware that this will require a full table scan where the function is executed for each document. See documentation.
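An alternative that avoids server-side JavaScript, assuming MongoDB 3.4.4 or later, is to turn the map into an array of key/value pairs with $objectToArray and match on the values. This is a sketch only: targetPairs is a made-up temporary field name, and the pipeline still reads every document unless an earlier $match narrows the input.

```javascript
const targetValue = 46784;

// $objectToArray turns { k1: v1, k2: v2 } into [{ k: "k1", v: v1 }, { k: "k2", v: v2 }],
// so the variable field names no longer matter for the match.
const pipeline = [
  { $addFields: { targetPairs: { $objectToArray: "$targetKeyToOrgIdMap" } } },
  { $match: { "targetPairs.v": targetValue } },
  { $project: { targetPairs: 0 } } // drop the temporary field from the output
];

// In the mongo shell:
// db.myCollection.aggregate(pipeline).forEach(function (doc) { printjson(doc); });
```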

Add new field to all documents in a nested array

I have a database of person documents. Each has a field named photos, which is an array of photo documents. I would like to add a new 'reviewed' flag to each of the photo documents and initialize it to false.
This is the query I am trying to use:
db.person.update({ "_id" : { $exists : true } }, {$set : {photos.reviewed : false} }, false, true)
However I get the following error:
SyntaxError: missing : after property id (shell):1
Is this possible, and if so, what am I doing wrong in my update?
Here is a full example of the 'person' document:
{
"_class" : "com.foo.Person",
"_id" : "2894",
"name" : "Pixel Spacebag",
"photos" : [
{
"_id" : null,
"thumbUrl" : "http://site.com/a_s.jpg",
"fullUrl" : "http://site.com/a.jpg"
},
{
"_id" : null,
"thumbUrl" : "http://site.com/b_s.jpg",
"fullUrl" : "http://site.com/b.jpg"
}]
}
Bonus karma for anyone who can tell me a cleaner way to update "all documents" without using the query { "_id" : { $exists : true } }
For those who are still looking for the answer: this is possible from MongoDB 3.6 onwards with the all positional operator $[] (see the docs):
db.getCollection('person').update(
{},
{ $set: { "photos.$[].reviewed" : false } },
{ multi: true}
)
Is this possible, and if so, what am I doing wrong in my update?
No. In general MongoDB is only good at doing updates on top-level objects.
The exception here is the $ positional operator. From the docs: Use this to find an array member and then manipulate it.
However, in your case you want to modify all members in an array. So that is not what you need.
Bonus karma for anyone who can tell me a cleaner way to update "all documents"
Try db.coll.update(query, update, false, true), this will issue a "multi" update. That last true is what makes it a multi.
Is this possible,
You have two options here:
1. Write a for loop to perform the update. It will basically be a nested for loop, one to loop through the data, the other to loop through the sub-array. If you have a lot of data, you will want to write this in your driver of choice (and possibly multi-thread it).
2. Write your code to handle reviewed as nullable. Write the code such that if it comes across a photo with reviewed undefined, it treats it as false. Then you can set the field appropriately and commit it back to the DB.
Method #2 is something you should get used to. As your data grows and you add fields, it becomes difficult to "back-port" all of the old data. This is similar to the problem of issuing a schema change in SQL when you have 1B items in the DB.
Instead just make your code resistant against the null and learn to treat it as a default.
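A small sketch of what "treat the missing field as a default" could look like in application code. isReviewed and withReviewedDefault are hypothetical helper names; the loop over photos is also the inner loop that method #1 would need.

```javascript
// A photo counts as reviewed only when the flag is explicitly true;
// a missing, null or undefined flag is read as false.
function isReviewed(photo) {
  return photo.reviewed === true;
}

// Returns a copy of the person with the flag filled in on every photo,
// ready to be written back to the database.
function withReviewedDefault(person) {
  const photos = (person.photos || []).map(function (photo) {
    return Object.assign({}, photo, { reviewed: isReviewed(photo) });
  });
  return Object.assign({}, person, { photos: photos });
}

const normalized = withReviewedDefault({
  _id: "2894",
  name: "Pixel Spacebag",
  photos: [
    { thumbUrl: "http://site.com/a_s.jpg", fullUrl: "http://site.com/a.jpg" },
    { thumbUrl: "http://site.com/b_s.jpg", fullUrl: "http://site.com/b.jpg", reviewed: true }
  ]
});
```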
Again though, this is still not the solution you seek.
You can do this
(null, {$set : {"photos.reviewed" : false} }, false, true)
The first parameter is null : no specification = any item in the collection.
"photos.reviewed" should be declared as string to update subfield.
You can do like this:
db.person.update({}, { $set: { "name.surname": null } }, false, true);
Old topic now, but this just worked fine with Mongo 3.0.6:
db.users.update({ _id: ObjectId("55e8969119cee85d216211fb") },
{ $set: {"settings.pieces": "merida"} })
In my case user entity looks like
{ _id: 32, name: "foo", ..., settings: { ..., pieces: "merida", ...} }

Most efficient way to generate a list of Unigrams from a text field in MongoDB

I need to generate a vector of unigrams, i.e. a vector of all the unique words which appear in a specific text field that I have stored as part of a broader JSON object in MongoDB.
I'm not really sure what's the easiest and most efficient way to generate this vector. I was thinking of writing a simple Java app which could handle the tokenization (using something like OpenNLP), however I think that a better approach may be to try to tackle this using Mongo's Map-Reduce feature... However I'm not really sure how I could go about this.
Another option would be to use Apache Lucene indexing, but it would mean I'd still need to export this data one document at a time, which is really the same issue I would have with the custom Java or Ruby approach...
Map-Reduce sounds good; however, the Mongo data is growing by the day as more documents are inserted. This isn't really a one-off task, as there are new documents being added all the time. Updates are very rare. I really don't want to run a Map-Reduce over the millions of documents every time I want to update my unigram vector, as I fear this would be a very inefficient use of resources...
What would be the most efficient way to generate the unigram vector and then keep it updated?
Thanks!
Since you have not provided a sample document (object) format take this as a sample collection called 'stories'.
{ "_id" : ObjectId("4eafd693627b738f69f8f1e3"), "body" : "There was a king", "author" : "tom" }
{ "_id" : ObjectId("4eafd69c627b738f69f8f1e4"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd72c627b738f69f8f1e5"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd74e627b738f69f8f1e6"), "body" : "There was a jack", "author" : "tom" }
{ "_id" : ObjectId("4eafd785627b738f69f8f1e7"), "body" : "There was a humpty and dumpty . Humtpy was tall . Dumpty was short .", "author" : "jane" }
{ "_id" : ObjectId("4eafd7cc627b738f69f8f1e8"), "body" : "There was a cat called Mini . Mini was clever cat . ", "author" : "jane" }
For the given dataset, you can use the following javascript code to get to your solution. The collection "authors_unigrams" contains the result. All the code is supposed to be run using mongo console (http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell).
First, we need to mark all the new documents that have freshly arrived in the 'stories' collection. We do it using the following command. It adds a new attribute called "mr_status" to each document and assigns it the value "inprocess". Later, we will see that the map-reduce operation only takes into account those documents which have the value "inprocess" for the field "mr_status". This way, we avoid reprocessing documents that were already considered in a previous run, making the operation efficient as asked.
db.stories.update({mr_status:{$exists:false}},{$set:{mr_status:"inprocess"}},false,true);
Second, we define both the map() and reduce() functions.
var map = function() {
var uniqueWords = function (words){
var arrWords = words.split(" ");
var arrNewWords = [];
var seenWords = {};
for(var i=0;i<arrWords.length;i++) {
if (!seenWords[arrWords[i]]) {
seenWords[arrWords[i]]=true;
arrNewWords.push(arrWords[i]);
}
}
return arrNewWords;
}
var unigrams = uniqueWords(this.body) ;
emit(this.author, {unigrams:unigrams});
};
var reduce = function(key,values){
Array.prototype.uniqueMerge = function( a ) {
for ( var nonDuplicates = [], i = 0, l = a.length; i<l; ++i ) {
if ( this.indexOf( a[i] ) === -1 ) {
nonDuplicates.push( a[i] );
}
}
return this.concat( nonDuplicates )
};
var unigrams = [];
values.forEach(function(i){
unigrams = unigrams.uniqueMerge(i.unigrams);
});
return { unigrams:unigrams};
};
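Note that the map function above splits on single spaces only, so "." ends up as a word and "Mini" and "mini" would count as different unigrams. A stricter tokenizer could look roughly like the sketch below; lower-casing and keeping only ASCII letters, digits and apostrophes are my own assumptions, and the function would have to be inlined into map() to be used there.

```javascript
// Returns the distinct lower-cased words of a text, in order of first appearance.
function uniqueWords(text) {
  const tokens = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  const seen = {};
  const unique = [];
  tokens.forEach(function (token) {
    // hasOwnProperty avoids false hits on inherited names such as "constructor".
    if (!Object.prototype.hasOwnProperty.call(seen, token)) {
      seen[token] = true;
      unique.push(token);
    }
  });
  return unique;
}

const sampleUnigrams = uniqueWords("There was a cat called Mini . Mini was clever cat . ");
```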
Third, we actually run the map-reduce function.
var result = db.stories.mapReduce( map,
reduce,
{query:{author:{$exists:true},mr_status:"inprocess"},
out: {reduce:"authors_unigrams"}
});
Fourth, we mark all the records that have been considered for map-reduce in last run as processed by setting "mr_status" as "processed".
db.stories.update({mr_status:"inprocess"},{$set:{mr_status:"processed"}},false,true);
Optionally, you can see the result collection "authors_unigrams" by running the following command.
db.authors_unigrams.find();