How to improve performance of mongo simple search query - mongodb

This is a simple search in MongoDB that takes more than 2.4 seconds to retrieve data. If I add an index on the search params, it takes more than 5 seconds.
Query
db.CX_EMPLOYEES.find({ "$or" : [{ "AML_FULLNAME" : /RAJ/ },
{ "AML_FULLALIAS" : /RAJ/ }] })
Explain
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 79,
"nscannedObjects" : 504570,
"nscanned" : 504570,
"nscannedObjectsAllPlans" : 504570,
"nscannedAllPlans" : 504570,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 2423,
"indexBounds" : {},
"server" : "SERVER:27017"
}

Scheduled for the 2.6 version of MongoDB is a full text search feature. It's available as a development preview in current builds if enabled.
Given the nature of your query, it's likely to be the only option that might be effective using only MongoDB. As you're trying to do a "string contains" search according to the regular expressions you provided, the performance of matching strings across multiple fields will be terrible given the size of your collection. While it's a simple query in concept, translating it into an efficient query is very difficult: Mongo needs to scan every document for a match. Breaking the words apart doesn't help, as Mongo still needs to scan every document.
If you can anchor the regular expression, which changes it from a "string contains" to a "string starts with" search, the performance should be reasonable, provided you normalize the strings so that character case is ignored, and keeping in mind that matches are still exact. For example, a is not á and would need to be handled specially.
Mongo's support for this type of query is really limited for production use. You may find that the full text search feature isn't suitable either. If this query is important, I'd suggest considering alternative search mechanisms, such as Elasticsearch.

There is no reason to add an index for these search params, because you are using a regex. An index can improve a find with a regex only if the regex is anchored to the beginning of the string.
db.CX_EMPLOYEES.find({ "$or" : [{ "AML_FULLNAME" : /^RAJ/ }, { "AML_FULLALIAS" : /^RAJ/ }] })
From documentation:
$regex can only use an index efficiently when the regular expression has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match. Additionally, while /^a/, /^a.*/, and /^a.*$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a.*/, and /^a.*$/ are slower. /^a/ can stop scanning after matching the prefix.
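In shell terms that means creating ordinary ascending indexes and keeping the pattern anchored and case-sensitive (the index specs below are just a sketch):

```javascript
// Ordinary indexes on both fields; the anchored $or query above can then
// use them as range scans over the "RAJ" prefix:
//
//   db.CX_EMPLOYEES.createIndex({ "AML_FULLNAME": 1 })
//   db.CX_EMPLOYEES.createIndex({ "AML_FULLALIAS": 1 })
//
// The three documented patterns match the same strings...
["RAJ", "RAJESH"].forEach(function (s) {
  console.log(/^RAJ/.test(s), /^RAJ.*/.test(s), /^RAJ.*$/.test(s)); // true true true
});
// ...but only /^RAJ/ can stop scanning each value right after the prefix.
```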

There is not a lot you can do. You have half a million elements and you are doing a full scan on all of them. No surprise that it takes time. Moreover, your search is based on a regex that can match anywhere in the string, so indexes cannot help you in this case.
If your search is based on words, you can try to create an array from the string. For example, the string 'Salvador Domingo Dali' would be converted to ['Salvador', 'Domingo', 'Dali']. If you add an index on this array and look for 'Dali', the search will take advantage of this index.
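A sketch of that tokenization at write time (the collection and array field names here are made up for the example):

```javascript
// Split the full name into words and store them in an array field; an
// index on that array field is a "multikey" index, and an equality
// match on a single word can then use it:
//
//   db.people.createIndex({ name_words: 1 })
//   db.people.find({ name_words: "Dali" })   // uses the multikey index
//
function toWords(fullName) {
  return fullName.split(/\s+/);
}

console.log(toWords("Salvador Domingo Dali")); // [ 'Salvador', 'Domingo', 'Dali' ]
```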
P.S. Databases and indexes are not a silver bullet. Sometimes you need better logic to be able to deal with a lot of data.

Related

couchdb find function with string comparison does not work as expected

As you can see in the image, I am querying a date string d_lastOpened against d_updated. The goal is to show all documents that were updated after they were last opened, that is, d_lastOpened <= d_updated. Why is the 2nd entry of the search result shown when it clearly does not fit that criterion?
I tried swapping the fields around and using $gte, but that does not work at all, which is logical.
You're making a string comparison between the literal string "d_updated" and the values of d_lastOpened, which start with "2022".
"2022" < "d", therefore the $lte comparison returns the two results, but $gte fails.

MongoDB - find if a field value is contained in a given string

Is it possible to query documents where a specific field is contained in a given string?
for example if I have these documents:
{a: 'test blabla'},
{a: 'test not'}
I would like to find all documents where the value of field a is fully contained in the string "test blabla test", so only the first document would be returned.
I know I can do it with aggregation using $indexOfCP, and it is also possible with $where and mapReduce. I was wondering if it's possible to do it in a find query using the standard MongoDB operators (e.g., $gt, $in).
Thanks.
I can think of 2 ways you could do this:
Option 1
Using $where:
db.someCol.find( { $where: function() { return "test blabla test".indexOf(this.a) > -1; } } )
Explained: Find all documents whose value of field "a" is found WITHIN some given string.
This approach is universal, as you can run any code you like, but less recommended from a performance perspective. For instance, it cannot take advantage of indexes. Read full $where considerations here: https://docs.mongodb.com/manual/reference/operator/query/where/#considerations
Option 2
Using regex matching trickery, ONLY under certain circumstances; below is an example that only works when the field value is a starting substring of the given string:
db.someCol.find( { a : /^(t(e(s(t( (b(l(a(b(l(a( (t(e(s(t)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?$/ } )
Explained: Break up the components of your "should-be-contained-within" string and match against all sub-possibilities of that with regex.
For your case, this option is pretty much insane, but it's worth noting, as there may be specific cases (such as limited namespace matching) where you would not have to break up each letter, but only some very finite set of predetermined parts. In that case, this option could still make use of indexes and not suffer the $where performance penalties (as long as the complexity of the regex doesn't outweigh that benefit :)
You can use a regex to search:
db.company.findOne({"companyname" : {$regex : ".*hello.*"}});
If you are using Mongo v3.6+, you can use $expr.
As you mentioned, $indexOfCP can be used to get the index; here it would be:
{
  "$expr": {
    "$ne": [{ "$indexOfCP": ["test blabla test", "$a"] }, -1]
  }
}
The field name should be prefixed with a dollar sign ($), as $expr allows aggregation expressions inside the query.
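The `$indexOfCP` check mirrors plain `String.indexOf`; a quick sanity check of the logic on the example documents:

```javascript
// $indexOfCP: ["test blabla test", "$a"] returns the first position of the
// field's value inside the given string, or -1 if it is absent, just like
// JavaScript's indexOf:
function containedIn(haystack, fieldValue) {
  return haystack.indexOf(fieldValue) !== -1;
}

console.log(containedIn("test blabla test", "test blabla")); // true  -> matched
console.log(containedIn("test blabla test", "test not"));    // false -> filtered out
```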

Find and Delete documents in MongoDB using a Windows GUI tool

Okay, I'm a SQL Server based DBA, but there's a biz-critical app that uses MongoDB as a core component, and the problem is that it's grown too large. (The "large-ness" isn't critical yet, I'm trying to be proactive!)
Specifically, it has a "message log" collection which is over 400 GB, where the date-stamp is actually stored as a 2-element array [Int64, Int32], the 0th element being some measure of time (and the 1st element always just '0').
So for example, a document:
{
"_id" : ObjectId("55ef63618c782e0afcf346cf"),
"CertNumber" : null,
"MachineName" : "WORKERBEE1",
"DateTime" : [
NumberLong(635773487051900000),
0
],
"Message" : "Waited 00:00:30.0013381 to get queue lock: Specs to verify",
"ScopeStart" : false
}
And just because 2 is better than 1, another example document:
{
"_id" : ObjectId("55ef63618c782e0afcf323be"),
"CertNumber" : null,
"MachineName" : "WORKERBEE2",
"DateTime" : [
NumberLong(635773487056430453),
0
],
"Message" : "Waited 00:00:30.0012345 to get queue lock: Specs to verify",
"ScopeStart" : false
}
I need to figure out two things:
What the heck does that "DateTime" really mean? It's not Unix epoch time (neither seconds nor milliseconds); and even if I strip off the trailing 0's, it represents (in millis) 6/20/2171, so unless we're building a time machine here, it makes no sense. If I strip off the last 6 digits, it means 2/23/1990, but even that doesn't seem likely, as this application has only existed since the early 2000's. (AFAIK)
Assuming we figure out #1, can we use some kind of command to remove (delete) all documents in the collection that are older than, say, 1/1/2016?
Again, I'm a SQL guy, so try to explain using analogs in that vein, e.g. "this is like your WHERE clause" and such.
PS: Yes, I read thru questions such as Find objects between two dates MongoDB and How do I convert a property in MongoDB from text to date type? , but so far nothing has jumped out at me.

Is there any way to force a schema to be respected?

First, I'd like to say that I really love NoSQL & MongoDB but I've got some major concerns with its schema-less aspect.
Let's say I have 2 tables. Employees and Movies.
And... I have a very stupid data layer / framework that sometimes likes to save objects in the wrong tables.
So one day, a Movie gets saved in the Employees table. Like this:
> use mongoTests;
switched to db mongoTests
> db.employees.insert({ name : "Max Power", sex : "Male" });
> db.employees.find();
{ "_id" : ObjectId("4fb25ce6420141116081ae57"), "name" : "Max Power", "sex" : "Male" }
> db.employees.insert({ title : "Fight Club", actors : [{ name : "Brad Pitt" }, { name : "Edward Norton" }]});
> db.employees.find();
{ "_id" : ObjectId("4fb25ce6420141116081ae57"), "name" : "Max Power", "sex" : "Male" }
{ "_id" : ObjectId("4fb25db834a31eb59101235b"), "title" : "Fight Club", "actors" : [ { "name" : "Brad Pitt" }, { "name" : "Edward Norton" } ] }
This is VERY wrong.
Let's switch the context and think about Movies and CreditCards (for whatever reason, in this context credit cards would be stored in clear text inside the DB). That would be SUPER wrong:
The code would probably explode because it's trying to use one object structure and receives a totally unknown structure instead.
Even worse, the code actually works and the webstore visitors actually see credit card information in the "Rent a movie" list.
Is there anything built in that would prevent such a threat from ever happening? Like some way to "force" a schema to be respected for only some tables?
Or is there any way to force MongoDB to make a schema mandatory? (e.g., no new fields can be created in a table)
EDIT: For those who think I'm trolling, I'm really not; this is an important question for me and my team, because it's a big factor in deciding whether or not we're going to use NoSQL.
Thanks and have a nice day.
The schema-less aspect is one of the major positives.
A DB with a schema doesn't fully remove this kind of issue - e.g. there could be a bug in a system that uses a RDBMS that puts the wrong data in the wrong field/table.
IMHO, the bigger concern would be, how did that kind of bug make it through dev, testing and out into production?!
Having said that, you could set up a process that checks the "schema" of documents within a collection (e.g. look at newly added documents and check whether they have the fields you would expect to see), then flag them up for investigation. There is such a tool (node.js) here (I think; I've never used it):
http://dhendo.github.com/node-mongodb-schema-validator/
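As a minimal illustration of such a check (collection and field names follow the question; this is just a sketch, not the linked tool):

```javascript
// Flag documents in "employees" that are missing the fields an employee
// document is expected to have:
var expected = ["name", "sex"];

function missingFields(doc) {
  return expected.filter(function (f) { return !(f in doc); });
}

// In the shell you would run something like:
//   db.employees.find().forEach(function (doc) {
//     var missing = missingFields(doc);
//     if (missing.length) print(doc._id + " is missing: " + missing);
//   })
console.log(missingFields({ name: "Max Power", sex: "Male" })); // []
console.log(missingFields({ title: "Fight Club" }));            // [ 'name', 'sex' ]
```

Note that later MongoDB versions (3.2+) added built-in document validation on collections, which is what the JIRA item below eventually led to.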
Edit:
For those finding this question in the future, and so the link in my comment doesn't go overlooked, there's a JIRA item for this kind of thing here:
http://jira.mongodb.org/browse/SERVER-3536

MongoDB $ne Explanation

The official MongoDB docs say very little about $ne:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24ne
So when I encountered something like
db.papers.update({"authors cited" : {"$ne" : "Richie"}},
... {$push : {"authors cited" : "Richie"}})
I have no choice but to become utterly confused. Can someone please explain it to me?
This would add "Richie" to the list of authors cited for a paper that does not already have "Richie" as an author.
An alternative would be to use $addToSet.
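The behaviour can be sketched with plain arrays: the `$ne` guard makes the update match (and therefore push) only when "Richie" is absent, which is also what `$addToSet` does in one step:

```javascript
// $ne-guarded $push: the filter {"authors cited": {$ne: "Richie"}} only
// matches documents whose array does not contain "Richie", so the $push
// never produces a duplicate.
function pushIfNotCited(authorsCited, author) {
  if (authorsCited.indexOf(author) === -1) {   // the {$ne: "Richie"} filter
    authorsCited.push(author);                 // the {$push: ...} update
  }
  return authorsCited;
}

console.log(pushIfNotCited(["Knuth"], "Richie"));           // [ 'Knuth', 'Richie' ]
console.log(pushIfNotCited(["Knuth", "Richie"], "Richie")); // unchanged
```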
But then how would I know whether {"authors cited" : {"$ne" : "Richie"}} refers to the individual elements of the list stored under "authors cited", vs. the value stored under "authors cited" as a whole?
That is a bit confusing. Generally (I am sure there are exceptions, but those should be documented), all selectors target the individual values of multi-valued fields. In Mongo this is called "multikeys".
Note that this initially led me to assume that your query would target all papers that have at least one author who is not Richie. Then I checked, and this turned out to be wrong. +1 for your question, because this really needs to be documented better.