Why doesn't this Cloudant/couchdb $regex query work?

I am trying to pull (and delete) all records from our database that don't have a URL with the word 'box' in it. This is the query I'm using:
{
  "selector": {
    "$not": {
      "url": {
        "$regex": ".*box.*"
      }
    }
  },
  "limit": 50
}
This query returns no records. But if I remove the $not, I get all records that do have the word 'box' in the url, but that's the opposite of what I want. Why do I get no results when adding the $not?
I have tried adding a simple base to the query like "_id":{"$gte":0} but that doesn't help.

From the Cloudant docs:
You can create more complex selector expressions by combining
operators. However, for Cloudant NoSQL DB Query indexes of type json,
you cannot use 'combination' or 'array logical' operators such as
$regex as the basis of a query.
$not is a combination operator and therefore cannot be the basis of a query
I am able to get the following to work:
index
{
  "index": {
    "fields": ["url"]
  },
  "name": "url-json-index",
  "type": "json"
}
query
{
  "selector": {
    "url": {
      "$not": {
        "$regex": ".*box.*"
      }
    }
  },
  "limit": 50,
  "use_index": "url-json-index"
}
If you are still seeing problems, can you provide the output from _explain and the indexes you have in place?
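For reference, the _explain endpoint accepts the same request body as _find and reports which index the query planner selected. A minimal sketch against the index above (<db> is a placeholder for your database name):
POST /<db>/_explain
{
  "selector": {
    "url": {
      "$not": {
        "$regex": ".*box.*"
      }
    }
  },
  "limit": 50,
  "use_index": "url-json-index"
}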

The "no results" issue is due to a bug in text indexes that has been recently fixed. However, neither $not nor $regex operators are able to take advantage of global indexes so will always result in a full database or index scan.
The way to optimise this query is to use a partial index. A partial index filters documents at indexing time rather than at query time, creating an index over a subset of the database. You then need to tell the _find endpoint to explicitly use the partial index. For example, create an index which only includes documents not matching your regex:
POST /<db>/_index
{
  "index": {
    "partial_filter_selector": {
      "url": {
        "$not": {
          "$regex": ".*box.*"
        }
      }
    },
    "fields": ["type"]
  },
  "ddoc": "url-not-box",
  "type": "json"
}
then at query time:
{
  "selector": {
    "url": {
      "$not": {
        "$regex": ".*box.*"
      }
    }
  },
  "limit": 50,
  "use_index": "url-not-box"
}
You can see how many documents are scanned to fulfil the query in the Cloudant UI - the execution statistics are displayed in a popup underneath the query text area.
You may also find this article about partial indexes helpful.

Related

MongoDB not using Index on simple find

I have a collection called "EN" and I created an index as follow:
db.EN.createIndex( { "Prod_id": 1 } );
When I run db.EN.getIndexes() I get this:
[{ "v": 2, "key": {
"_id": 1 }, "name": "_id_" }, { "v": 2, "key": {
"Prod_id": 1 }, "name": "Prod_id_1" }]
However, when I run the following query:
db.EN.find({'Icecat-interface.Product.#Prod_id':'ABCD'})
.explain()
I get this:
{ "explainVersion": "1", "queryPlanner": {
"namespace": "Icecat.EN",
"indexFilterSet": false,
"parsedQuery": {
"ICECAT-interface.Product.Prod_id": {
"$eq": "ABCD"
}
},
"queryHash": "D12BE22E",
"planCacheKey": "9F077ED2",
"maxIndexedOrSolutionsReached": false,
"maxIndexedAndSolutionsReached": false,
"maxScansToExplodeReached": false,
"winningPlan": {
"stage": "COLLSCAN",
"filter": {
"ICECAT-interface.Product.Prod_id": {
"$eq": "ABCD"
}
},
"direction": "forward"
},
"rejectedPlans": [] }, "command": {
"find": "EN",
"filter": {
"ICECAT-interface.Product.Prod_id": "ABCD"
},
"batchSize": 1000,
"projection": {},
"$readPreference": {
"mode": "primary"
},
"$db": "Icecat" }, "serverInfo": {
It's using COLLSCAN instead of the index. Why is this happening?
MongoDB version is 5.0.9-8
Thanks
EDIT (and solution)
It turns out that the field name has "#" in front of it and the index was created without this character, so it was not picking the field up at all.
Once I created a new index using the field name as it was supposed to be, it worked OK.
It was interesting, though, to see how indexing works and what the best practices are.
Your find operation is defined as
.find({'Icecat-interface.Product.#Prod_id':'ABCD'})
What is Icecat-interface.Product.#?
The parsedQuery in the explain output confirms that MongoDB is attempting to look for a document that has a value of "ABCD" for a different field name than the one you have indexed. From the explain you've provided, that field name is "ICECAT-interface.Product.Prod_id". As the field name being queried and the one that is indexed are different, MongoDB cannot use the index to perform the operation.
Marginally related: the # character that is used in the find is absent in the explain output. This appears to be because the actual operation that was used to generate the explain was slightly different. This is also noticeable from the fact that the explain includes a batchSize of 1000, which is absent in the operation that was shown as the one being explained.
Depending on what the Icecat-interface.Product.# prefix is supposed to be, the solution is probably to simply remove that from the query predicate in the find itself.
Edit to respond to the comment and the edit to the question. Regarding the comment first:
When I run this: .find({'Prod_id':'ABCD'}) it uses COLLSCAN which to me is wrong, as I have an index on that field, unless I'm missing something here
MongoDB will look to use an index if its first key is used by the query. So an index on { y: 1 } would not be eligible for use by a query of .find({ x: 1 }). As in that generic x and y example, Icecat-interface.Product.Prod_id and Prod_id are different field names. So if you query on one but only have an index on the other, a collection scan is the only way for the database to execute the query.
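As a quick illustration of that prefix rule (a hypothetical sketch with fields x and y; for an index-backed find, the IXSCAN appears inside queryPlanner.winningPlan of the explain output):
// An index is eligible only when the query filters on its leading key.
db.EN.createIndex({ y: 1 })

db.EN.find({ x: 1 }).explain()  // winning plan is a COLLSCAN: no index starts with "x"
db.EN.find({ y: 1 }).explain()  // winning plan contains an IXSCAN over { y: 1 }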
This then overlaps some with the edit to the question. In the edited question the new explain plan shows the database successfully using an index. However, that index is { "ICECAT-interface.Product.Prod_id": 1 } which is not the index that you originally show being created or present on the collection ({ "Prod_id": 1 }).
Moreover, you also mention that you "don't get any result back, even with products I know are in the DB". Which field in the database contains the value that you are searching on ('ABCD')? This is going to directly inform what results you get back and what index is used to find the results. Remember that you can search on any arbitrary field in MongoDB, even if it doesn't exist in the database.
I would recommend some extra attention be paid to the namespaces and field names that are being used. Unless this { "ICECAT-interface.Product.Prod_id": 1 } index was created after the db.EN.getIndexes() output was gathered, you may be inadvertently connecting to different systems or namespaces since that index is definitely present somewhere.
Based on your live comments while I'm writing this, seems like you've solved the field name mystery.
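For completeness, a sketch of the fix that the edit describes: create the index on the field name exactly as it is queried. The "#"-prefixed name here is taken from the original find and may differ in your real schema:
// Index the field under the exact name used by the query, "#" included.
db.EN.createIndex({ "Icecat-interface.Product.#Prod_id": 1 })

// The same find should now show an IXSCAN in its winning plan.
db.EN.find({ "Icecat-interface.Product.#Prod_id": "ABCD" }).explain()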

What is the indexing strategy for a variable query?

The most common use case for this would probably be a user table, with name, lname, email, phone.
I might search for name contains "paul", email contains "2@yahoo"
I might search for phone = 01234567890
I might search for email = "foo@bar.com"
It is my understanding that a mongo index works in order. So an index that looks like
name:1, lname:1, email:1, phone:1 wouldn't work for any of the above queries?
What's the best indexing strategy to account for search tables like this?
So, Paul, you will need to create a search index definition before you can run the query. Creating your first search index definition in the collection view in the Atlas Data Explorer can be tricky.
Here's what I would recommend for an index definition:
{
  "mappings": {
    "fields": {
      "email": {
        "analyzer": "lucene.keyword",
        "type": "string"
      },
      "phone": {
        "analyzer": "lucene.keyword",
        "type": "string"
      },
      "name": {
        "analyzer": "lucene.keyword",
        "type": "string"
      },
      "lname": {
        "analyzer": "lucene.keyword",
        "type": "string"
      }
    }
  }
}
Here is what I would recommend for a contains-style query on the email and name fields:
{
  $search: {
    index: 'default',
    compound: {
      must: [{
        wildcard: {
          query: '*paul*',
          path: 'name'
        }
      }, {
        wildcard: {
          query: '*2@yahoo*',
          path: 'email'
        }
      }]
    }
  }
}
This should be a lightning fast query, even for a large index, and it already combines multiple clauses in a compound operator as you described. There are lots of features, like highlighting, that should be helpful as well. Let me know if you have any more trouble.
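For reference, here is how the stage above might be embedded in a complete aggregation call. This is a sketch: the collection name users is an assumption, and since the fields are indexed with an analyzer (lucene.keyword), the wildcard operator may also need allowAnalyzedField: true.
db.users.aggregate([
  {
    $search: {
      index: 'default',
      compound: {
        must: [
          { wildcard: { query: '*paul*', path: 'name', allowAnalyzedField: true } },
          { wildcard: { query: '*2@yahoo*', path: 'email', allowAnalyzedField: true } }
        ]
      }
    }
  },
  { $limit: 25 } // cap the result set; $search returns matches ranked by score
])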

How does 'fuzzy' work in MongoDB's $searchBeta stage of aggregation?

I'm not quite understanding how fuzzy works in the $searchBeta stage of aggregation. I'm not getting the desired result that I want when I'm trying to implement full-text search on my backend. Full text search for MongoDB was released last year (2019), so there really aren't many tutorials and/or references to go by besides the documentation. I've read the documentation, but I'm still confused, so I would like some clarification.
Let's say I have these 5 documents in my db:
{
  "name": "Lightning Bolt",
  "set_name": "Masters 25"
},
{
  "name": "Snapcaster Mage",
  "set_name": "Modern Masters 2017"
},
{
  "name": "Verdant Catacombs",
  "set_name": "Modern Masters 2017"
},
{
  "name": "Chain Lightning",
  "set_name": "Battlebond"
},
{
  "name": "Battle of Wits",
  "set_name": "Magic 2013"
}
And this is my aggregation in MongoDB Compass:
db.cards.aggregate([
  {
    $searchBeta: {
      search: { // search has been deprecated, but it works in MongoDB Compass; replace with 'text'
        query: 'lightn',
        path: ["name", "set_name"],
        fuzzy: {
          maxEdits: 1,
          prefixLength: 2,
          maxExpansion: 100
        }
      }
    }
  }
]);
What I'm expecting my result to be:
[
  {
    "name": "Lightning Bolt", // lightn is in 'Lightning'
    "set_name": "Masters 25"
  },
  {
    "name": "Chain Lightning", // lightn is in 'Lightning'
    "set_name": "Battlebond"
  }
]
What I actually get:
[] //empty array
I don't really understand why my result is empty, so it would be much appreciated if someone explained what I'm doing wrong.
What I think is happening:
db.cards.aggregate... looks in the "name" and "set_name" fields for words within a maximum edit distance of one character from the "lightn" query. Turning "lightn" into "lightning" takes three insertions (i, n, g), which is more than the maxEdits: 1 you specified (and more than the maximum of 2 that fuzzy allows), and therefore your expected result is an empty array. "Fuzzy is used to find strings which are similar to the search term or terms"; it is used with maxEdits and prefixLength.
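By contrast, a query term within the allowed edit distance does match. A sketch, using the current $search/text syntax rather than the deprecated $searchBeta, and assuming a default dynamic search index exists:
db.cards.aggregate([
  {
    $search: {
      text: {
        query: 'lightnin',            // one insertion away from "lightning"
        path: ['name', 'set_name'],
        fuzzy: { maxEdits: 1, prefixLength: 2 }
      }
    }
  }
])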
Have you tried the term operator with the wildcard option? I think the below aggregation would get you the results you were actually expecting.
e.g.
db.cards.aggregate([
  {
    $searchBeta: {
      term: {
        path: ["name", "set_name"],
        query: "l*h*",
        wildcard: true
      }
    }
  }
]).pretty()
You need to provide an index to use with your search query.
The index determines the analyzer that your query will use to process your results, depending on whether you want a full match of the text, a partial match, etc.
You can read more about analyzers in the Atlas Search documentation.
In your case, an index based on the STANDARD analyzer will help.
After you create your index, your code, modified below, will work:
db.cards.aggregate([
  {
    $search: {
      index: 'index_name_for_analyzer (STANDARD in your case)',
      text: { // the deprecated 'search' operator is replaced by 'text'
        query: 'lightn',
        path: ["name"], // since you only want to search in one field
        fuzzy: {
          maxEdits: 1,
          prefixLength: 2,
          maxExpansions: 100
        }
      }
    }
  }
]);
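For illustration, a standard-analyzer search index definition might look like this (the field mappings are assumptions based on the sample documents above; the index name is whatever you choose when creating it):
{
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": { "type": "string", "analyzer": "lucene.standard" },
      "set_name": { "type": "string", "analyzer": "lucene.standard" }
    }
  }
}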

Nested documents and _id indexes in mongodb

I have a collection with nested documents in it. Each document also has an _id field.
Here's an example of a documents structure
{
  "_id": ObjectId("top_level_doc"),
  "title": "Cadernos",
  "parent": "4fd55bbc5d1709793b000008",
  "criterias": {
    "0": {
      "_id": ObjectId("a_nested_doc"),
      "value": "caderno",
      "operator": "contains",
      "field": "design0"
    }
  }
}
I want to be able to find the nested document just by searching its _id.
With this query
{
"criterias._id" : ObjectId("a_nested_doc")
}
It returns the parent document (I just want the one that's nested).
Ideally I would do this
{
"_id" : ObjectId("a_nested_doc")
}
And it would return the document with that id (whether it's nested or not).
PS: I edited the "_id" values for the sake of simplicity in this example.
You may have to live with selecting criterias._id (without writing a wrapper around the query, at least), but you can select the document itself by simply retrieving a subset of the fields.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields
// The simplest case converted to your use case
db.collection.find({ "criterias._id": ObjectId("a_nested_doc") }, { "criterias": 1 });
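On newer servers (MongoDB 3.4+), an aggregation can go one step further and return only the matching nested document rather than the parent with a trimmed field list. A sketch, assuming criterias is stored as an array:
db.collection.aggregate([
  { $match: { "criterias._id": ObjectId("a_nested_doc") } },  // find parent documents cheaply first
  { $unwind: "$criterias" },                                  // one document per nested criteria
  { $match: { "criterias._id": ObjectId("a_nested_doc") } },  // keep only the matching criteria
  { $replaceRoot: { newRoot: "$criterias" } }                 // promote it to the top level
])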

Mongo using indexes with sort

I'm trying to optimize a mongodb query. I have an index on from_account_id, to_account_id, and created_at. But the following profiler entry shows the query doing a full collection scan.
{
  "ts": {
    "$date": "2012-03-18T20:29:27.038Z"
  },
  "op": "query",
  "ns": "heroku_app2281692.transactions",
  "query": {
    "$query": {
      "$or": [
        {
          "from_account_id": {
            "$oid": "4f55968921fcaf0001000005"
          }
        },
        {
          "to_account_id": {
            "$oid": "4f55968921fcaf0001000005"
          }
        }
      ]
    },
    "$orderby": {
      "created_at": -1
    }
  },
  "ntoreturn": 25,
  "nscanned": 2643718,
  "responseLength": 20,
  "millis": 10499,
  "client": "10.64.141.77",
  "user": "heroku_app2281692"
}
If I don't use the $or and only query from_account_id or to_account_id with an order on it, it's fast.
What's the best way to get the desired effect? Should I be keeping account_ids (both from and to) in one field like an array? Or perhaps there is a better way. Thanks!
Unfortunately, as you have discovered, an $or clause can make life difficult for the optimizer.
So, to work around this you have a couple options. Among them:
Divide your query into two and manually merge the results.
Change your data model to allow efficient querying. For example, you might add a "referenced_accounts" field that is an array of all the accounts referenced in the transaction, as sketched below.
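A sketch of that second option (the field and collection names are assumptions): store both parties in one array field and index it together with created_at, so the equality match and the sort can both use the index.
// Compound multikey index: equality on the array element, then the sort key.
db.transactions.createIndex({ referenced_accounts: 1, created_at: -1 })

// Each transaction carries both account ids:
// { from_account_id: X, to_account_id: Y, referenced_accounts: [X, Y], created_at: ... }

db.transactions.find({ referenced_accounts: ObjectId("4f55968921fcaf0001000005") })
               .sort({ created_at: -1 })
               .limit(25)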