MongoDB find fields which are substring of a query text - mongodb

I have been looking for a way to do this but couldn't find any.
I'd like to know if it is possible, given a query text, to return all documents whose field value is contained in that text.
For example my dataset is as follows:
{ "_id" : ObjectId("5d5c2b4cc1f74ace3a48a070"), "id" : 0, "term" : "shorts" }
{ "_id" : ObjectId("5d5c2b4cc1f74ace3a48a071"), "id" : 0, "term" : "jacket" }
{ "_id" : ObjectId("5d5c2b4cc1f74ace3a48a072"), "id" : 1, "term" : "yellow jacket" }
{ "_id" : ObjectId("5d5c2b56c1f74ace3a48a073"), "id" : 2, "term" : "blue jacket" }
{ "_id" : ObjectId("5d5c2b65c1f74ace3a48a074"), "id" : 3, "term" : "blue shorts" }
{ "_id" : ObjectId("5d5c2b71c1f74ace3a48a075"), "id" : 4, "term" : "red shorts" }
And now, given a text like: "I really love blue shorts", the return should be only:
{ "_id" : ObjectId("5d5c2b65c1f74ace3a48a074"), "id" : 3, "term" : "blue shorts" }
{ "_id" : ObjectId("5d5c2b4cc1f74ace3a48a070"), "id" : 0, "term" : "shorts" }
It's something like query.contains(field)

Using $where is generally discouraged in MongoDB because it executes JavaScript in the query system and can be slow.
You can try this out if the dataset is not very large. It's like doing a reverse regex match: checking whether the field value is contained in the query string.
db.collection.find({$where: "\"I really love blue shorts\".match(this.term)"});
Which outputs:
{ "_id" : ObjectId("5d5c32c1236f19364a8aad4d"), "id" : 0, "term" : "shorts"}
{ "_id" : ObjectId("5d5c32c1236f19364a8aad51"), "id" : 3, "term" : "blue shorts"}
NOTE: This assumes that term is defined in the documents; otherwise you can use a JavaScript function as the $where value to deal with edge cases such as undefined fields, etc.
{ $where: function() { return /* result after edge cases are dealt with */ } }
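A minimal sketch of such a guard function, assuming the hard-coded query string from the example above and plain substring containment as the predicate (the function name is hypothetical):

```javascript
// Hedged sketch: a $where-style predicate that guards against a missing
// or non-string `term` field before testing containment.
const whereFn = function () {
  // Inside a real $where, `this` is the document under evaluation.
  if (typeof this.term !== "string" || this.term.length === 0) {
    return false; // skip documents without a usable `term`
  }
  return "I really love blue shorts".indexOf(this.term) !== -1;
};
```

Passed as { $where: whereFn }, this returns false instead of throwing when term is absent.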

The following query can get us the expected output:
db.collection.aggregate([
  {
    $addFields: {
      "searchString": "I really love blue shorts"
    }
  },
  {
    $match: {
      $expr: {
        $gt: [
          { $indexOfBytes: ["$searchString", "$term"] },
          -1
        ]
      }
    }
  },
  {
    $project: {
      "searchString": 0
    }
  }
]).pretty()
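The $match stage above is just a reverse substring test. As a sanity check, the same predicate can be written in plain JavaScript (helper name is hypothetical; byte vs. code-point differences between $indexOfBytes and indexOf don't matter for ASCII input like this):

```javascript
// Hedged sketch: plain-JS equivalent of the pipeline's predicate,
// i.e. $indexOfBytes(searchString, term) > -1.
function termMatches(searchString, term) {
  return searchString.indexOf(term) > -1;
}
```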
Data set:
{
  "_id" : ObjectId("5d5c2b4cc1f74ace3a48a070"),
  "id" : 0,
  "term" : "shorts"
}
{
  "_id" : ObjectId("5d5c2b4cc1f74ace3a48a071"),
  "id" : 0,
  "term" : "jacket"
}
{
  "_id" : ObjectId("5d5c2b4cc1f74ace3a48a072"),
  "id" : 1,
  "term" : "yellow jacket"
}
{
  "_id" : ObjectId("5d5c2b56c1f74ace3a48a073"),
  "id" : 2,
  "term" : "blue jacket"
}
{
  "_id" : ObjectId("5d5c2b65c1f74ace3a48a074"),
  "id" : 3,
  "term" : "blue shorts"
}
{
  "_id" : ObjectId("5d5c2b71c1f74ace3a48a075"),
  "id" : 4,
  "term" : "red shorts"
}
Output:
{
  "_id" : ObjectId("5d5c2b4cc1f74ace3a48a070"),
  "id" : 0,
  "term" : "shorts"
}
{
  "_id" : ObjectId("5d5c2b65c1f74ace3a48a074"),
  "id" : 3,
  "term" : "blue shorts"
}

Related

MongoDB - how to optimise find query with regex search, with sort

I need to execute the following query:
db.S12_RU.find({"venue.raw":a,"title":/b|c|d|e/}).sort({"year":-1}).skip(X).limit(Y);
where X and Y are numbers.
The number of documents in my collection is:
208915369
Currently, this sort of query takes about 6 minutes to execute.
I have the following indexes:
[
  {
    "v" : 2,
    "key" : { "_id" : 1 },
    "name" : "_id_"
  },
  {
    "v" : 2,
    "key" : { "venue.raw" : 1 },
    "name" : "venue.raw_1"
  },
  {
    "v" : 2,
    "key" : {
      "venue.raw" : 1,
      "title" : 1,
      "year" : -1
    },
    "name" : "venue.raw_1_title_1_year_-1"
  }
]
A standard document looks like this:
{ "_id" : ObjectId("5fc25fc091e3146fb10484af"), "id" : "1967181478", "title" : "Quality of Life of Swedish Women with Fibromyalgia Syndrome, Rheumatoid Arthritis or Systemic Lupus Erythematosus", "authors" : [ { "name" : "Carol S. Burckhardt", "id" : "2052326732" }, { "name" : "Birgitha Archenholtz", "id" : "2800742121" }, { "name" : "Kaisa Mannerkorpi", "id" : "240289002" }, { "name" : "Anders Bjelle", "id" : "2419758571" } ], "venue" : { "raw" : "Journal of Musculoskeletal Pain", "id" : "49327845" }, "year" : 1993, "n_citation" : 31, "page_start" : "199", "page_end" : "207", "doc_type" : "Journal", "publisher" : "Taylor & Francis", "volume" : "1", "issue" : "", "doi" : "10.1300/J094v01n03_20" }
Is there any way to make this query execute in a few seconds?

Mongo: get the nth document by creation date

I have been looking for a solution all morning but could not really find anything I could use in a production environment.
The final goal is, given a mongo collection, to get the N oldest messages and apply an update to them. This is not easily doable; things like
db.coll.update({}, {$set: {status: 'UPDATED'}}).limit(N)
do not work.
So, keeping in mind that I want to update these N documents (or the whole collection if it has fewer than N documents), I was thinking about sorting the documents by creation date, getting the Nth one, and updating all documents with _id $lte _id(N) (in my beautiful pseudo-code).
Thing is, I can't seem to find an efficient way to do this. First I tried stuff like:
db.coll.find().sort({_id: 1}).limit(N).sort({_id: -1}).limit(1)
only to then realize that having two sorts in the same command was useless...
This works:
db.coll.find().sort({_id: 1}).limit(N).skip(N-1).next()
but has two big drawbacks:
I have to be sure beforehand that I have at least N documents (which is not that big an issue in my case, but...)
.skip() is known to be CPU-intensive because it actually goes through the whole cursor. Although in my case N should not be greater than 1M, it's still not a good thing when what I want is really the last document of the cursor.
So I guess my question is, assuming that using the creation date is the right way to do this, how can I either:
get the Nth document inserted in my collection (under some criteria), or
get the last record of my cursor (db.coll.find().limit(N))?
Thank you a lot!!
Alexis
PS: If that makes any difference, we are actually coding in Java for this.
Using map-reduce you can achieve almost the desired result.
The map and reduce functions are trivial:
map = function() {
  this.value.status = 'UPDATED';
  emit(this._id, this.value);
}
reduce = function(key, values) {
  // XXX should log an error if we reach that point
  return {unexpectedReduce: values};
}
The trick is to use the merge output action of mapReduce (as well as limit, sort and query to select only the required input documents):
db.test.mapReduce(map, reduce,
{ query: {"value.status": {$ne: 'UPDATED'}},
sort: { _id: 1 },
limit: 10,
out: {merge: 'test'},
}
)
But, there is a but: you have to store the documents in your collection as {_id: ..., value: {field1: ..., field2: ..., ...}}, as this is the only output format currently supported by mapReduce jobs.
Here is a sample test collection I used when writing this answer:
> for(i = 0; i < 100; ++i) {
db.test.insert({value:{field1: i, field2: "hello"+i}}); sleep(500);
}
(BTW, note that I use the ObjectId to identify older documents, as its 4 most significant bytes are seconds since the Unix epoch.)
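That embedded timestamp is easy to check: the first 8 hex characters of an ObjectId are a big-endian seconds-since-epoch value. A small helper (hypothetical name, plain JS) to extract it:

```javascript
// Hedged sketch: extract the creation timestamp from an ObjectId hex string.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch,
// which is why sorting on _id approximates insertion order.
function objectIdTimestamp(hexId) {
  return parseInt(hexId.substring(0, 8), 16);
}
```

For the sample ids in this answer, the extracted timestamps increase with insertion order, which is what the sort on _id relies on.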
Running the above map-reduce job will update the collection in batches of 10, oldest non-updated records first:
> db.test.mapReduce(map, reduce,
{ query: {"value.status": {$ne: 'UPDATED'}},
sort: { _id: 1 },
limit: 10,
out: {merge: 'test'},
}
)
> db.test.find()
{ "_id" : ObjectId("556cd4d00027c9fdf8af809f"), "value" : { "field1" : 96, "field2" : "hello96" } }
{ "_id" : ObjectId("556cd4d00027c9fdf8af80a0"), "value" : { "field1" : 97, "field2" : "hello97" } }
{ "_id" : ObjectId("556cd4d10027c9fdf8af80a1"), "value" : { "field1" : 98, "field2" : "hello98" } }
{ "_id" : ObjectId("556cd4d10027c9fdf8af80a2"), "value" : { "field1" : 99, "field2" : "hello99" } }
{ "_id" : ObjectId("556cd49f0027c9fdf8af803f"), "value" : { "field1" : 0, "field2" : "hello0", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a00027c9fdf8af8040"), "value" : { "field1" : 1, "field2" : "hello1", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a00027c9fdf8af8041"), "value" : { "field1" : 2, "field2" : "hello2", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a10027c9fdf8af8042"), "value" : { "field1" : 3, "field2" : "hello3", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a10027c9fdf8af8043"), "value" : { "field1" : 4, "field2" : "hello4", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a20027c9fdf8af8044"), "value" : { "field1" : 5, "field2" : "hello5", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a20027c9fdf8af8045"), "value" : { "field1" : 6, "field2" : "hello6", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a30027c9fdf8af8046"), "value" : { "field1" : 7, "field2" : "hello7", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a30027c9fdf8af8047"), "value" : { "field1" : 8, "field2" : "hello8", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a40027c9fdf8af8048"), "value" : { "field1" : 9, "field2" : "hello9", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a40027c9fdf8af8049"), "value" : { "field1" : 10, "field2" : "hello10" } }
...
Scroll right to see the updated status in the code blocks above and below.
And running the same mapReduce job again:
{ "_id" : ObjectId("556cd4d00027c9fdf8af809f"), "value" : { "field1" : 96, "field2" : "hello96" } }
{ "_id" : ObjectId("556cd4d00027c9fdf8af80a0"), "value" : { "field1" : 97, "field2" : "hello97" } }
{ "_id" : ObjectId("556cd4d10027c9fdf8af80a1"), "value" : { "field1" : 98, "field2" : "hello98" } }
{ "_id" : ObjectId("556cd4d10027c9fdf8af80a2"), "value" : { "field1" : 99, "field2" : "hello99" } }
{ "_id" : ObjectId("556cd49f0027c9fdf8af803f"), "value" : { "field1" : 0, "field2" : "hello0", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a00027c9fdf8af8040"), "value" : { "field1" : 1, "field2" : "hello1", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a00027c9fdf8af8041"), "value" : { "field1" : 2, "field2" : "hello2", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a10027c9fdf8af8042"), "value" : { "field1" : 3, "field2" : "hello3", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a10027c9fdf8af8043"), "value" : { "field1" : 4, "field2" : "hello4", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a20027c9fdf8af8044"), "value" : { "field1" : 5, "field2" : "hello5", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a20027c9fdf8af8045"), "value" : { "field1" : 6, "field2" : "hello6", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a30027c9fdf8af8046"), "value" : { "field1" : 7, "field2" : "hello7", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a30027c9fdf8af8047"), "value" : { "field1" : 8, "field2" : "hello8", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a40027c9fdf8af8048"), "value" : { "field1" : 9, "field2" : "hello9", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a40027c9fdf8af8049"), "value" : { "field1" : 10, "field2" : "hello10", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a50027c9fdf8af804a"), "value" : { "field1" : 11, "field2" : "hello11", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a50027c9fdf8af804b"), "value" : { "field1" : 12, "field2" : "hello12", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a60027c9fdf8af804c"), "value" : { "field1" : 13, "field2" : "hello13", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a60027c9fdf8af804d"), "value" : { "field1" : 14, "field2" : "hello14", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a70027c9fdf8af804e"), "value" : { "field1" : 15, "field2" : "hello15", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a70027c9fdf8af804f"), "value" : { "field1" : 16, "field2" : "hello16", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a80027c9fdf8af8050"), "value" : { "field1" : 17, "field2" : "hello17", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a80027c9fdf8af8051"), "value" : { "field1" : 18, "field2" : "hello18", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a90027c9fdf8af8052"), "value" : { "field1" : 19, "field2" : "hello19", "status" : "UPDATED" } }
{ "_id" : ObjectId("556cd4a90027c9fdf8af8053"), "value" : { "field1" : 20, "field2" : "hello20" } }
{ "_id" : ObjectId("556cd4aa0027c9fdf8af8054"), "value" : { "field1" : 21, "field2" : "hello21" } }
...

Compound GeoSpatial Index in MongoDB is not working as intended

the collection nodesWays has the following indexes:
> db.nodesWays.getIndexes()
[
  {
    "v" : 1,
    "key" : { "_id" : 1 },
    "name" : "_id_",
    "ns" : "h595.nodesWays"
  },
  {
    "v" : 1,
    "key" : {
      "amenity" : 1,
      "geo" : "2dsphere"
    },
    "name" : "amenity_1_geo_2dsphere",
    "ns" : "h595.nodesWays",
    "2dsphereIndexVersion" : 2
  },
  {
    "v" : 1,
    "key" : { "geo" : "2dsphere" },
    "name" : "geo_2dsphere",
    "ns" : "h595.nodesWays",
    "2dsphereIndexVersion" : 2
  }
]
Now the following two queries should return the same result, but they don't.
I want the nearest 10 restaurants to the specified point.
The first query works as it should; the second does not work as intended.
The only difference between these two queries is that the first one uses the geo_2dsphere-Index
and the second query the amenity_1_geo_2dsphere-Index.
> db.nodesWays.find(
  {
    geo: {
      $nearSphere: {
        $geometry: {
          type: "Point", coordinates: [9.7399777, 52.3715156]
        }
      }
    },
    "amenity": "restaurant",
    name: {$exists: true}
  }, {id: 1, name: 1}).hint("geo_2dsphere").limit(10)
{ "_id" : ObjectId("53884860e552e471be2b7192"), "id" : "321256694", "name" : "Masa" }
{ "_id" : ObjectId("53884860e552e471be2b7495"), "id" : "323101271", "name" : "Bavarium" }
{ "_id" : ObjectId("53884862e552e471be2ba605"), "id" : "442496282", "name" : "Naxos" }
{ "_id" : ObjectId("53884860e552e471be2b7488"), "id" : "323101189", "name" : "Block House" }
{ "_id" : ObjectId("53884878e552e471be2d1a41"), "id" : "2453236451", "name" : "Maestro" }
{ "_id" : ObjectId("53884870e552e471be2c8aab"), "id" : "1992166428", "name" : "Weinstube Leonardo Ristorante" }
{ "_id" : ObjectId("53884869e552e471be2c168b"), "id" : "1440320284", "name" : "Altdeutsche küche" }
{ "_id" : ObjectId("53884861e552e471be2b88f7"), "id" : "353119010", "name" : "Mövenpick" }
{ "_id" : ObjectId("5388485de552e471be2b2c86"), "id" : "265546900", "name" : "Miles" }
{ "_id" : ObjectId("53884863e552e471be2bb5d3"), "id" : "532304135", "name" : "Globetrotter" }
> db.nodesWays.find(
  {
    geo: {
      $nearSphere: {
        $geometry: {
          type: "Point", coordinates: [9.7399777, 52.3715156]
        }
      }
    },
    "amenity": "restaurant",
    name: {$exists: true}
  }, {id: 1, name: 1}).hint("amenity_1_geo_2dsphere").limit(10)
{ "_id" : ObjectId("53884875e552e471be2cf4a8"), "id" : "2110027373", "name" : "Schloßhof Salder" }
{ "_id" : ObjectId("5388485be552e471be2aff19"), "id" : "129985174", "name" : "Balkan Paradies" }
{ "_id" : ObjectId("5388485be552e471be2afeb4"), "id" : "129951134", "name" : "Asia Dragon" }
{ "_id" : ObjectId("53884863e552e471be2ba811"), "id" : "450130115", "name" : "Kings Palast" }
{ "_id" : ObjectId("53884863e552e471be2ba823"), "id" : "450130135", "name" : "Restaurant Montenegro" }
{ "_id" : ObjectId("53884877e552e471be2d053a"), "id" : "2298722569", "name" : "Pizzaria Da-Lucia" }
{ "_id" : ObjectId("53884869e552e471be2c152e"), "id" : "1420101752", "name" : "Napoli" }
{ "_id" : ObjectId("5388485be552e471be2b0028"), "id" : "136710095", "name" : "Europa" }
{ "_id" : ObjectId("53884862e552e471be2ba5bc"), "id" : "442136241", "name" : "Syrtaki" }
{ "_id" : ObjectId("53884863e552e471be2ba763"), "id" : "447972565", "name" : "Pamukkale" }
My goal with the second index is to:
select all restaurants
then use the $nearSphere operator to sort them by distance from the specified point
Goodbye
I think you should try putting the geolocation first in the index.
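If that suggestion is right, the fix would be a compound index with the 2dsphere key first (untested mongo-shell sketch; the index name MongoDB generates would be geo_2dsphere_amenity_1):

```javascript
// Hedged sketch (mongo shell): put the geo key first so $nearSphere can walk
// the index in distance order, with amenity as a secondary filter key.
db.nodesWays.createIndex({ "geo": "2dsphere", "amenity": 1 })
```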

Why does the mapper in a MongoDB map-reduce sometimes generate more documents than the original data?

I am performing a state-wise population count and getting extra documents in the output. Investigating the reason, I found that the mappers generate a lot more intermediate data than the original data in MongoDB. How can I resolve this? The total count of documents in the source collection is 29468.
Sample from the Dataset:
{ "city" : "SPLENDORA", "loc" : [ -95.199308, 30.232609 ], "pop" : 11287, "state" : "TX", "_id" : "77372" }
{ "city" : "SPRING", "loc" : [ -95.377329, 30.053241 ], "pop" : 33118, "state" : "TX", "_id" : "77373" }
{ "city" : "TOMBALL", "loc" : [ -95.62006, 30.073923 ], "pop" : 19801, "state" : "TX", "_id" : "77375" }
{ "city" : "WILLIS", "loc" : [ -95.497583, 30.432025 ], "pop" : 9988, "state" : "TX", "_id" : "77378" }
{ "city" : "KLEIN", "loc" : [ -95.528481, 30.023377 ], "pop" : 35275, "state" : "TX", "_id" : "77379" }
{ "city" : "CONROE", "loc" : [ -95.492392, 30.225725 ], "pop" : 1635, "state" : "TX", "_id" : "77384" }
map function:
var m=function(){ emit(this.city,this.pop);}
reduce function:
var r=function(c,p){ return p;}
MR output to a new collection :
{ "_id" : "81080", "value" : 172 }
{ "_id" : "81250", "value" : 467 }
{ "_id" : "82057", "value" : 60 }
{ "_id" : "95411", "value" : 133 }
{ "_id" : "95414", "value" : 226 }
{ "_id" : "95440", "value" : 2876 }
{ "_id" : "95455", "value" : 843 }
{ "_id" : "95467", "value" : 328 }
{ "_id" : "95489", "value" : 358 }
{ "_id" : "95495", "value" : 367 }
{ "_id" : "98791", "value" : 5345 }
{ "_id" : "PLEASANT GROVE", "value" : [ 8458, 15703, 80, 772, ... ] }
{ "_id" : "POINTBLANK", "value" : 2911 }
{ "_id" : "PORTER", "value" : [ 13541, 19024, 985, 425, 2705 ] }
{ "_id" : "SHEPHERD", "value" : [ 9604, 17397, 2078 ] }
{ "_id" : "SPLENDORA", "value" : 11287 }
{ "_id" : "SPRING", "value" : [ 33118, 8379, 21805, 8540 ] }
{ "_id" : "TOMBALL", "value" : 19801 }
{ "_id" : "WILLIS", "value" : [ 9988, 2769, 2574 ] }
{ "_id" : "KLEIN", "value" : 35275 }
Your output isn't as expected because your reduce function is incorrect. The prototype for a reduce function is function(key,values) {...}, where values is an array associated with the key.
Your reduce function is returning the values array rather than reducing it.
To sum up the values for a given key, your reduce() function should look like:
var r=function(key, values) {
return Array.sum(values);
}
If you want to calculate population by state, your map() function is also incorrect: you should be emitting the state & population instead of city & population:
var m=function() {
emit(this.state,this.pop);
}
Putting that together, your output should end up looking like:
{ "_id" : "AK", "value" : 550043 },
{ "_id" : "AL", "value" : 4040587 },
{ "_id" : "AR", "value" : 2350725 }
...
The MongoDB manual has further details on writing and testing your reduce function:
Requirements for the reduce function
Troubleshooting the reduce function
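To see that the corrected map/reduce pair behaves as intended, here is a plain-JS emulation of the semantics (outside MongoDB; Array.sum is mongo-shell-only, so a reduce() call stands in for it, and the sample documents and populations are illustrative only):

```javascript
// Hedged sketch: emulate the corrected mapReduce in plain JS.
// Sample documents are assumptions, not the question's real data.
const docs = [
  { city: "SPLENDORA", pop: 11287, state: "TX" },
  { city: "SPRING",    pop: 33118, state: "TX" },
  { city: "JUNEAU",    pop: 26751, state: "AK" },
];

// map phase: emit (state, pop) for each document
const emitted = {};
for (const doc of docs) {
  (emitted[doc.state] = emitted[doc.state] || []).push(doc.pop);
}

// reduce phase: sum the values array for each key (Array.sum equivalent)
const result = Object.entries(emitted).map(([key, values]) => ({
  _id: key,
  value: values.reduce((a, b) => a + b, 0),
}));
```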

MongoDB: show only some elements of a subdocument

I have the document of the following structure:
{
  "_id" : ObjectId("50b8f881065f90c025000014"),
  "education" : {
    "schoolCountry" : 4,
    "schoolTown" : -1,
    "uniCountry" : 4,
    "uniTown" : -1
  },
  "info" : {
    "ava" : "auto.jpg",
    "birthday" : ISODate("1942-04-01T21:00:00Z"),
    "email" : "mail#gmail.com",
    "name" : "name",
    "sex" : 1,
    "surname" : "surname"
  }
}
I am trying to output only surname and name
The only thing I was able to achieve is this:
db.COLL.find({ }, {
  "_id" : 0,
  "education" : 0,
  "info" : 1
})
My idea of showing only the elements I need from the subdocument failed:
db.COLL.find({ }, {
  "_id" : 0,
  "education" : 0,
  "info.surname" : 1,
  "info.name" : 1,
})
But hiding (info.email: 0) works. Is it possible to achieve my goal without hiding all the unneeded fields?
You can't mix including and excluding fields, aside from turning off _id (which is included by default).
So just request the info.surname and info.name fields:
db.coll.find({ }, {
  "_id" : 0,
  "info.surname" : 1,
  "info.name" : 1,
})
Sample output:
{ "info" : { "name" : "name", "surname" : "surname" } }