How to use Atlas Search to find a text containing a subtext - mongodb

I have a collection hosted on Atlas.
I have declared an Atlas Search index with the default configuration, but I am unable to use it to find documents that partially match a text.
For instance, I have the following documents:
[
{
_id: 'ABC123',
designation: 'ENPHASE IQ TERMINAL CABLE 3PH-1 UD',
supplierIdentifier: 205919
},
{
_id: 'DEF456',
designation: 'ENPHASE CABLE VERT IQ 60/72CELLS 400VAC',
supplierIdentifier: 205919
},
{
_id: 'GHI789',
designation: 'P/SOLAR PC ASTROENERGY 275W 60 CELULAS',
supplierIdentifier: 206382
}
]
If I use the text operator to search for "EN", nothing is returned:
[{ "$search" : { "index" : "default", "text" : { "query" : "EN", "path" : { "wildcard" : "*"}}, "count": {"type": "total"}}}]
No result
But if I use the regex operator, my documents are correctly returned:
db.testproducts.aggregate([{ "$search" : { "index" : "default", "regex" : { "query" : "(.*)EN(.*)", "allowAnalyzedField" : true, "path" : { "wildcard" : "*"}}, "count": {"type": "total"}}}])
[
{
_id: 'ABC123',
designation: 'ENPHASE IQ TERMINAL CABLE 3PH-1 UD',
supplierIdentifier: 205919
},
{
_id: 'DEF456',
designation: 'ENPHASE CABLE VERT IQ 60/72CELLS 400VAC',
supplierIdentifier: 205919
},
{
_id: 'GHI789',
designation: 'P/SOLAR PC ASTROENERGY 275W 60 CELULAS',
supplierIdentifier: 206382
}
]
As the regex operator is pretty slow, how can I achieve the same with the text search?

Gfhyser, you have a few options, and I'm not sure which one you will like best as they all have limitations.
Option 1: specify a path. As you can imagine, wildcard paths combined with leading and trailing regex wildcards can be expensive. If you know that the path you want to search is designation, performance will be better if you change your existing query to:
db.testproducts.aggregate([{ "$search" : { "index" : "default", "regex" : { "query" : "(.*)EN(.*)", "allowAnalyzedField" : true, "path" : "designation" }, "count": {"type": "total"}}}])
Option 2: refine your search. Ask yourself whether you are truly looking for Enphase and Energy wherever they appear in the same result.
Option 3: this final option is somewhat experimental for me because I need to spend more time on it, but it might be the best performing. It involves reversing your tokens, both when indexing and when querying, with a custom analyzer, because that can speed up leading wildcard queries. If you don't mind a bit of complexity, here is how it would look. Let me know if it works out, as I don't use regular expressions as much these days.
I created a custom analyzer on the sample_airbnb.listings_and_reviews dataset to search with leading wildcard characters. The index looks like:
{
"analyzer": "lucene.keyword",
"mappings": {
"dynamic": false,
"fields": {
"name": [
{
"dynamic": true,
"type": "document"
},
{
"type": "string"
}
],
"summary": {
"analyzer": "fastRegex",
"type": "string"
}
}
},
"analyzers": [
{
"charFilters": [],
"name": "fastRegex",
"tokenFilters": [
{
"type": "reverse"
}
],
"tokenizer": {
"type": "keyword"
}
}
]
}
And a query that exploits this speed and has the flexibility to potentially match both of your desired terms would look like this:
[
{
'$search': {
'index': 'reviews_search',
'compound': {
'should': [
{
'wildcard': {
'query': '*cated*',
'path': 'summary',
'allowAnalyzedField': true
}
}
]
}
}
}
]
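To make the idea behind Option 3 a bit more concrete, here is a minimal plain-JavaScript sketch of what the fastRegex analyzer above produces. This is not Atlas Search code: analyzeFastRegex and endsWithViaReversedToken are hypothetical helpers I made up for illustration. The keyword tokenizer keeps the whole field value as one token and the reverse filter flips its characters, so a "leading wildcard" check on the original text becomes a prefix check on the reversed token. How Atlas Search rewrites the wildcard query internally is not shown here.

```javascript
// Hypothetical helper: mimics a keyword tokenizer followed by a reverse token filter.
function analyzeFastRegex(fieldValue) {
  // keyword tokenizer: the whole value is emitted as a single token
  // reverse filter: the characters of that token are reversed
  return [Array.from(fieldValue).reverse().join("")];
}

// Hypothetical helper: "does the original value end with `suffix`?"
// answered as a prefix test on the reversed token.
function endsWithViaReversedToken(fieldValue, suffix) {
  const [token] = analyzeFastRegex(fieldValue);
  const reversedSuffix = Array.from(suffix).reverse().join("");
  return token.startsWith(reversedSuffix);
}

console.log(analyzeFastRegex("well located"));
console.log(endsWithViaReversedToken("well located", "cated"));
```

A prefix test is the kind of lookup a sorted term index can usually answer without scanning every term, which is presumably where the speed-up comes from.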

Related

Is there a way to include a Int32 field in a search index in MongoDB (with Atlas Search)?

I have a collection in a Mongo Atlas DB on which I have a search index including some specific string fields. What I want to do is include a Int32 field in this search index to be able to do a search on this number, along with the other fields. I tried to add the field (Number) as a new field in the search index, with the type number, but it doesn't work. I guess it's because it compares the query, a string, with an Int32, but is there a way to make it work ? Or do I have to copy the "Number" in another field "NumberString" to include in the search index ?
Here is an example of one of these documents :
{
"_id" : ObjectId("010000000000000000000003"),
"Description" : {
"fr-CA" : "Un lot de test",
"en-CA" : "A test item"
},
"Name" : {
"fr-CA" : "Lot de test",
"en-CA" : "Test item"
},
"Number" : 345,
"Partners" : [],
[...]
}
The index :
{
"mappings": {
"dynamic": false,
"fields": {
"Description": {
"fields": {
"en-CA": {
"analyzer": "lucene.english",
"searchAnalyzer": "lucene.english",
"type": "string"
},
"fr-CA": {
"analyzer": "lucene.french",
"searchAnalyzer": "lucene.french",
"type": "string"
}
},
"type": "document"
},
"Name": {
"fields": {
"en-CA": {
"analyzer": "lucene.english",
"searchAnalyzer": "lucene.english",
"type": "string"
},
"fr-CA": {
"analyzer": "lucene.french",
"searchAnalyzer": "lucene.french",
"type": "string"
}
},
"type": "document"
},
"Number":
{
"representation": "int64",
"type": "number"
},
"Partners": {
"fields": {
"Name": {
"type": "string"
}
},
"type": "document"
}}}}
And finally the query I try to do.
db.[myDB].aggregate([{ $search: { "index": "default", "text": { "query": "345", "path": ["Number", "Name.fr-CA", "Description.fr-CA", "Partners.Name"]}}}])
For this example, I want the query to be applied on Number, Name, Description and Partners and to return everything that matches. I would expect to have the item #345, but also any items with 345 in the name or description. Is it possible ?
Thanks !
With your current data types, you should be able to search for 345 in the text fields. However, I would structure the query like so, to support the numeric field as well:
db.[myDB].aggregate([
{
$search: {
"index": "default",
"compound": {
"should":[
{
"text": {
"query": "345",
"path": ["Name.fr-CA", "Description.fr-CA", "Partners.Name"]
}
},
{
"near": {
"origin": 345,
"path": "Number",
"pivot": 2
}
}
]
}
}
}
])
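If the search term comes from user input, the same compound query can be assembled programmatically. The following is only a sketch under my own assumptions: buildSearchStage is a hypothetical helper, and it adds the numeric near clause only when the input looks like a plain integer, so that non-numeric input is never sent to the Number field. As far as I understand, near ranks documents by how close Number is to origin rather than filtering for an exact value.

```javascript
// Hypothetical helper: builds a $search stage for free-form user input.
function buildSearchStage(input, textPaths, numberPath) {
  const term = input.trim();
  // The text clause always applies to the string fields.
  const should = [{ text: { query: term, path: textPaths } }];
  // Only add the numeric clause when the input is a plain integer.
  if (/^\d+$/.test(term)) {
    should.push({
      near: { origin: Number(term), path: numberPath, pivot: 2 },
    });
  }
  return { $search: { index: "default", compound: { should } } };
}

const stage = buildSearchStage(
  "345",
  ["Name.fr-CA", "Description.fr-CA", "Partners.Name"],
  "Number"
);
console.log(JSON.stringify(stage, null, 2));
```

The resulting object can be passed as the first stage of the aggregate() pipeline shown above.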

Forbid usage of the specific index for the query

I have a mongodb collection with the following schema:
{
"description": "some arbitrary text",
"type": "TYPE", # there are a lot of different types
"status": "STATUS" # there are a few different statuses
}
I also have two indexes: for "type" and for "status".
Now I run a query:
db.obj.count({
type: { $in: ["SOME_TYPE"] },
status: { $ne: "SOME_STATUS" },
description: { $regex: /.*/ }
})
MongoDB chooses to use an index for "status", while "type" would be much better:
"query": {
"count": "obj",
"query": {
"description": Regex('.*', 2),
"status": {
"$ne": "SOME_STATUS"
},
"type": {
"$in": [
"SOME_TYPE"
]
}
}
},
"planSummary": "IXSCAN { status: 1 }"
I know I can use hint to specify an index to use, but I have different queries (which should use different indexes) and I can't annotate every one of them.
As far as I can see, a possible solution would be to forbid usage of "status" index for all queries that contain status: { $ne: "SOME_STATUS" } condition.
Is there a way to do it? Or maybe I want something weird and there is a better way?

Mongoose how to return only the modified object in array of objects? [duplicate]

Given the following MongoDB collection:
{
"_id": ObjectId("56d6a7292c06e85687f44541"),
"name": "My ranking list",
"rankings": [
{
"_id": ObjectId("46d6a7292c06e85687f55542"),
"name": "Ranking 1",
"score": 1
},
{
"_id": ObjectId("46d6a7292c06e85687f55543"),
"name": "Ranking 2",
"score": 10
},
{
"_id": ObjectId("46d6a7292c06e85687f55544"),
"name": "Ranking 3",
"score": 15
},
]
}
Here is how I increase the score of a given ranking:
db.collection.update(
{ "_id": ObjectId("56d6a7292c06e85687f44541"), "rankings._id" : ObjectId("46d6a7292c06e85687f55543") },
{ $inc : { "rankings.$.score" : 1 } }
);
How do I get the new score value? In the previous query I increase the second ranking from 10 to 11... How do I get this new value back after the update?
If you are on MongoDB 3.0 or newer, you need to use .findOneAndUpdate() with the projection option to specify the subset of fields to return. You also need to set returnNewDocument to true. You need to use the $elemMatch projection operator here because you cannot use a positional projection and return the new document.
As someone pointed out:
You should be using .findOneAndUpdate() because .findAndModify() is highlighted as deprecated in every official language driver. The other thing is that the syntax and options are pretty consistent across drivers for .findOneAndUpdate(). With .findAndModify(), most drivers don't use the same single object with "query/update/fields" keys. So it's a bit less confusing when someone applies this to another language. Standardized API changes for .findOneAndUpdate() actually correspond to server release 3.x rather than 3.2.x. The full distinction is that the shell methods actually lagged behind the other drivers (for once!) in implementing the method. So most drivers actually had a major release bump corresponding with the 3.x release with such changes.
db.collection.findOneAndUpdate(
{
"_id": ObjectId("56d6a7292c06e85687f44541"),
"rankings._id" : ObjectId("46d6a7292c06e85687f55543")
},
{ $inc : { "rankings.$.score" : 1 } },
{
"projection": {
"rankings": {
"$elemMatch": { "_id" : ObjectId("46d6a7292c06e85687f55543") }
}
},
"returnNewDocument": true
}
)
Alternatively, you can use findAndModify with the fields option, setting new to true in order to return the new value.
db.collection.findAndModify({
query: {
"_id": ObjectId("56d6a7292c06e85687f44541"),
"rankings._id" : ObjectId("46d6a7292c06e85687f55543")
},
update: { $inc : { "rankings.$.score" : 1 } },
new: true,
fields: {
"rankings": {
"$elemMatch": { "_id" : ObjectId("46d6a7292c06e85687f55543") }
}
}
})
Both queries yield:
{
"_id" : ObjectId("56d6a7292c06e85687f44541"),
"rankings" : [
{
"_id" : ObjectId("46d6a7292c06e85687f55543"),
"name" : "Ranking 2",
"score" : 11
}
]
}
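Since the projection keeps only the matched element, the new score can be read from the first entry of rankings. Here is a minimal sketch assuming the result has exactly the shape shown above; updatedScore is a hypothetical helper, and the ObjectId values are written as plain strings so that the snippet runs without a driver.

```javascript
// Hypothetical helper: read the updated score from the projected result.
function updatedScore(result) {
  // result is null when no document matched the filter
  if (!result || !Array.isArray(result.rankings) || result.rankings.length === 0) {
    return null;
  }
  // $elemMatch projection leaves only the matching array element
  return result.rankings[0].score;
}

// Same shape as the output above (ObjectId values written as strings here).
const result = {
  _id: "56d6a7292c06e85687f44541",
  rankings: [
    { _id: "46d6a7292c06e85687f55543", name: "Ranking 2", score: 11 },
  ],
};
console.log(updatedScore(result)); // 11
```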

Waypoint matching query

We have a collection as follows. Each document represents a driver's trip: the loc property contains the waypoints, and the time property contains the times corresponding to those waypoints. For example, in Trip A, the driver would be at geolocation tripA.loc.coordinates[0] at time tripA.time[0].
{
tripId : "Trip A",
time : [
"2015-03-08T04:47:43.589Z",
"2015-03-08T04:48:43.589Z",
"2015-03-08T04:49:43.589Z",
"2015-03-08T04:50:43.589Z",
],
loc: {
type: "MultiPoint",
coordinates: [
[ -73.9580, 40.8003 ],
[ -73.9498, 40.7968 ],
[ -73.9737, 40.7648 ],
[ -73.9814, 40.7681 ]
]
}
}
{
tripId : "Trip B",
time : [
"2015-03-08T04:47:43.589Z",
"2015-03-08T04:48:43.589Z",
"2015-03-08T04:49:43.589Z",
"2015-03-08T04:50:43.589Z",
],
loc: {
type: "MultiPoint",
coordinates: [
[ -72.9580, 41.8003 ],
[ -72.9498, 41.7968 ],
[ -72.9737, 41.7648 ],
[ -72.9814, 41.7681 ]
]
}
}
We would like to query for trips which start near (within 1km of) location "[long1,lat1]" around time t (+-10 minutes) and end at [long2,lat2].
Is there a simple and efficient way to formulate the above query for MongoDB or Elasticsearch?
If so, could you please give the query to do so, either in MongoDB or Elasticsearch (MongoDB preferable)?
This did start as a comment but was clearly getting way too long, so here is a longer explanation of the limitations and the approach.
The bottom line of what you are asking for is effectively a "union query", generally defined as two separate queries whose end result is the "set intersection" of their results. In brief, the "trips" selected by your "origin" query must also appear in the results of your "destination" query.
In general database terms we refer to such a "union" operation as a "join", or at least a condition where the selection of one set of criteria "and" the selection of another must both agree on a common grouping identifier.
The basic point, in MongoDB terms and as I believe also applies to Elasticsearch indexes, is that neither datastore supports the notion of a "join" in any way from a single direct query.
There is another MongoDB principle to consider with your proposed or existing modelling: even with items specified as "arrays", there is no way to implement an "and" condition with a geospatial search on coordinates. And since you chose to model the waypoints as a GeoJSON "MultiPoint", the query cannot "choose" which element of that object to match as "nearest". Therefore "all points" would be considered when looking for the "nearest match".
Your explanation is quite clear in its intent. The "origin" is represented by the "first" element of what are essentially "two arrays" in your document structure, holding a "location" and a "time" for each successive "waypoint" of the "trip". The "destination" is naturally the last element of each array, given of course that the data points are "paired".
I see the logic in thinking that this is a good way to store things, but it does not follow the allowed query patterns of either of the storage solutions you mention here.
As I mentioned already, this is a "union" in intent, so while I see the thinking that led to the design, it would be better to store things like this:
{
"tripId" : "Trip A",
"time" : ISODate("2015-03-08T04:47:43.589Z"),
"loc": {
"type": "Point",
"coordinates": [ -73.9580, 40.8003 ]
},
"seq": 0
},
{
"tripId" : "Trip A",
"time" : ISODate("2015-03-08T04:48:43.589Z"),
"loc": {
"type": "Point",
"coordinates": [ -73.9498, 40.7968 ]
},
"seq": 1
},
{
"tripId" : "Trip A",
"time" : ISODate("2015-03-08T04:49:43.589Z"),
"loc": {
"type": "Point",
"coordinates": [ -73.9737, 40.7648 ]
},
"seq": 2
},
{
"tripId" : "Trip A",
"time" : ISODate("2015-03-08T04:50:43.589Z"),
"loc": {
"type": "Point",
"coordinates": [ -73.9814, 40.7681 ]
},
"seq": 3,
"isEnd": true
}
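If you already have data in the array-based layout, converting it is mechanical. The following is just a sketch of how one trip could be split into the per-waypoint documents listed above; explodeTrip is a hypothetical helper of mine, and it assumes the time and loc.coordinates arrays are "paired" and of equal length.

```javascript
// Hypothetical helper: split one array-based trip into per-waypoint documents.
function explodeTrip(trip) {
  const last = trip.loc.coordinates.length - 1;
  return trip.loc.coordinates.map((coordinates, seq) => {
    const doc = {
      tripId: trip.tripId,
      time: new Date(trip.time[seq]), // store a real date, not a string
      loc: { type: "Point", coordinates },
      seq,
    };
    if (seq === last) doc.isEnd = true; // mark the destination waypoint
    return doc;
  });
}

const tripA = {
  tripId: "Trip A",
  time: [
    "2015-03-08T04:47:43.589Z",
    "2015-03-08T04:48:43.589Z",
    "2015-03-08T04:49:43.589Z",
    "2015-03-08T04:50:43.589Z",
  ],
  loc: {
    type: "MultiPoint",
    coordinates: [
      [-73.958, 40.8003],
      [-73.9498, 40.7968],
      [-73.9737, 40.7648],
      [-73.9814, 40.7681],
    ],
  },
};

const waypoints = explodeTrip(tripA);
console.log(JSON.stringify(waypoints, null, 2));
```

The resulting array could then be inserted into the new collection in one batch.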
In the example, I'm just inserting those documents into a collection called "geojunk", and then creating a 2dsphere index on the "loc" field:
db.geojunk.createIndex({ "loc": "2dsphere" })
The processing of this is then done with "two" .aggregate() queries. The reason for .aggregate() is because you want to match the "first" document "per trip" in each case. This represents the nearest waypoint for each trip found by the queries. Then basically you want to "merge" these results into some kind of "hash" structure keyed by "tripId".
The end logic says that if both an "origin" and a "destination" matched your query conditions for a given "trip", then that is a valid result for your overall query.
The code I give here is an arbitrary nodejs implementation, mostly because it's a good base to represent issuing the queries in "parallel" for best performance, and also because I'm choosing to use nedb as an example of the "hash" with a little more "Mongolike" syntax:
var async = require('async'),
MongoClient = require("mongodb").MongoClient,
DataStore = require('nedb');
// Common stream upsert handler
function streamProcess(stream,store,callback) {
stream.on("data",function(data) {
// Clean "_id" to keep nedb happy
data.trip = data._id;
delete data._id;
// Upsert to store
store.update(
{ "trip": data.trip },
{
"$push": {
"time": data.time,
"loc": data.loc
}
},
{ "upsert": true },
function(err,num) {
if (err) callback(err);
}
);
});
stream.on("error",callback);
stream.on("end",callback);
}
MongoClient.connect('mongodb://localhost/test',function(err,db) {
if (err) throw err;
db.collection('geojunk',function(err,collection) {
if (err) throw err;
var store = new DataStore();
// Parallel execution
async.parallel(
[
// Match origin trips
function(callback) {
var stream = collection.aggregate(
[
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": [ -73.9580, 40.8003 ],
},
"query": {
"time": {
"$gte": new Date("2015-03-08T04:40:00.000Z"),
"$lte": new Date("2015-03-08T04:50:00.000Z")
},
"seq": 0
},
"maxDistance": 1000,
"distanceField": "distance",
"spherical": true
}},
{ "$group": {
"_id": "$tripId",
"time": { "$first": "$time" },
"loc": { "$first": "$loc" }
}}
],
{ "cursor": { "batchSize": 1 } }
);
streamProcess(stream,store,callback);
},
// Match destination trips
function(callback) {
var stream = collection.aggregate(
[
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": [ -73.9814, 40.7681 ]
},
"query": { "isEnd": true },
"maxDistance": 1000,
"distanceField": "distance",
"spherical": true
}},
{ "$group": {
"_id": "$tripId",
"time": { "$first": "$time" },
"loc": { "$first": "$loc" }
}}
],
{ "cursor": { "batchSize": 25 } }
);
streamProcess(stream,store,callback);
}
],
function(err) {
if (err) throw err;
// Just documents that matched origin and destination
store.find({ "loc": { "$size": 2 }},{ "_id": 0 },function(err,result) {
if (err) throw err;
console.log( JSON.stringify( result, undefined, 2 ) );
db.close();
});
}
);
});
});
On the sample data as I listed it this will return:
[
{
"trip": "Trip A",
"time": [
"2015-03-08T04:47:43.589Z",
"2015-03-08T04:50:43.589Z"
],
"loc": [
{
"type": "Point",
"coordinates": [
-73.958,
40.8003
]
},
{
"type": "Point",
"coordinates": [
-73.9814,
40.7681
]
}
]
}
]
So it found the origin and destination that was nearest to the queried locations, also being an "origin" within the required time and something that is defined as a destination, i.e. "isEnd".
So the $geoNear operation does the matching, with the returned results being the documents nearest to the point that also meet the other conditions. The $group stage is required because other documents in the same trip could "possibly" match the conditions, so it's just a way of making sure. The $first operator makes sure that the already "sorted" results will contain only one result per "trip". If you are really "sure" that will not happen with the conditions, then you could just use a standard $nearSphere query outside of aggregation instead. So I'm playing it safe here.
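The listing hard-codes the time window for the "origin" query. To derive it from a given time t and the "+-10 minutes" in the question, something like the sketch below could be used. originGeoNearStage is a hypothetical helper, and its parameter names are my own; it only builds the stage object and does not talk to the database.

```javascript
// Hypothetical helper: $geoNear stage for trips starting near a point around time t.
function originGeoNearStage(lng, lat, t, windowMinutes, maxMeters) {
  const windowMs = windowMinutes * 60 * 1000;
  return {
    $geoNear: {
      near: { type: "Point", coordinates: [lng, lat] },
      query: {
        // t minus/plus the window, as Date objects
        time: {
          $gte: new Date(t.getTime() - windowMs),
          $lte: new Date(t.getTime() + windowMs),
        },
        seq: 0, // only the first waypoint of each trip
      },
      maxDistance: maxMeters,
      distanceField: "distance",
      spherical: true,
    },
  };
}

const stage = originGeoNearStage(
  -73.958,
  40.8003,
  new Date("2015-03-08T04:45:00.000Z"),
  10,
  1000
);
console.log(stage.$geoNear.query.time);
```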
One thing to note is that even though "nedb" is used here, and though it does support dumping output to disk, the data is still accumulated in memory. If you are expecting large results, then rather than this type of "hash table" implementation you would need to output, in a similar fashion to what is shown, to another MongoDB collection and retrieve the matching results from there.
That doesn't change the overall logic though, and it is yet another reason to use "nedb" to demonstrate, since you would "upsert" to documents in the results collection in the same way.
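For small result sets, the merge step itself does not need a datastore at all. Here is a rough sketch of the same "both queries must match" logic using a plain Map keyed by trip; intersectTrips is a hypothetical helper, and the input arrays are assumed to have the shape produced by the $group stages above, i.e. one document per trip with "_id" set to the tripId. The same in-memory caveat applies.

```javascript
// Hypothetical helper: keep only trips present in both result sets.
function intersectTrips(origins, destinations) {
  // Index the origin matches by trip id for constant-time lookup.
  const byTrip = new Map(origins.map((o) => [o._id, o]));
  const matches = [];
  for (const d of destinations) {
    const o = byTrip.get(d._id);
    if (o) {
      matches.push({ trip: d._id, time: [o.time, d.time], loc: [o.loc, d.loc] });
    }
  }
  return matches;
}

// Made-up inputs shaped like the $group output (times as strings for brevity).
const origins = [
  { _id: "Trip A", time: "2015-03-08T04:47:43.589Z", loc: { type: "Point", coordinates: [-73.958, 40.8003] } },
  { _id: "Trip C", time: "2015-03-08T04:46:00.000Z", loc: { type: "Point", coordinates: [-73.957, 40.8001] } },
];
const destinations = [
  { _id: "Trip A", time: "2015-03-08T04:50:43.589Z", loc: { type: "Point", coordinates: [-73.9814, 40.7681] } },
];

const matched = intersectTrips(origins, destinations);
console.log(JSON.stringify(matched, null, 2));
```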