Can MongoDB do alphanumeric sort? - mongodb

I need to do an alphanumeric sort on one of the fields in my collection. This field contains chromosome info.
Chr1,
Chr2,
..,
..,
..,
Chr10,
Chr11,
..,
..,
ChrX,
ChrY
Instead of getting the values in this order, I get them in
Chr1,
Chr11
This is a common problem, and I was wondering whether MongoDB has a built-in solution for this sort or whether we should hack it as we normally do in Oracle.
If there is no built-in solution, what is the best way to get this sort? Natural sort is an acceptable solution, but I have already applied it to another field.
Any help would be greatly appreciated.

MongoDB has no built-in way of sorting data alphanumerically (at least in versions before 3.4, which introduced collation). But when all of your data has a field with a string following the same convention of "Chr" + number, you could put the number into a different field, keep it as an integer instead of a string, and sort by that.
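The idea of a separate numeric field can be sketched in plain JavaScript, without a database. This is only an illustration under assumptions: the helper withSortKey and the field name chrNum are hypothetical, and the sentinel value for ChrX/ChrY is one possible choice, not something the answer prescribes.

```javascript
// Hypothetical helper: split a label such as "Chr10" into the original
// string plus an integer that could be stored in its own field (chrNum).
function withSortKey(label) {
  const match = /^Chr(\d+)$/.exec(label);
  // Non-numeric labels (ChrX, ChrY) get a large sentinel so they sort last.
  const chrNum = match ? parseInt(match[1], 10) : Number.MAX_SAFE_INTEGER;
  return { chr: label, chrNum: chrNum };
}

const labels = ["Chr10", "ChrY", "Chr2", "Chr11", "ChrX", "Chr1"];

const sorted = labels
  .map(withSortKey)
  // Sort by the numeric key; fall back to the label when the keys are equal.
  .sort((a, b) =>
    a.chrNum === b.chrNum ? a.chr.localeCompare(b.chr) : a.chrNum - b.chrNum
  )
  .map((doc) => doc.chr);

console.log(sorted);
```

In a real collection the integer would be written once when the document is inserted, so the sort can run on the stored field.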

Please refer to "https://docs.mongodb.com/manual/reference/collation".
Mongo provides collation for alphanumeric sorting.
Example:
db.names.find({}).sort({username:1}).collation({locale:'en_US', numericOrdering:true})

You can also use collation if you are using aggregation in MongoDB:
db.names.aggregate([...query, {$sort:{username:1}}]).collation({locale:'en_US', numericOrdering:true});
Don't forget to include the $sort stage in the pipeline array; otherwise the results will not be sorted.
This usage is not covered in the documentation.
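What numeric ordering changes can be seen outside MongoDB with the standard JavaScript Intl.Collator, which offers a comparable numeric option. This is an analogy only, assuming a runtime with Intl support such as Node.js; it does not exercise MongoDB's own collation.

```javascript
// numeric: true asks the collator to compare runs of digits by value,
// similar in spirit to numericOrdering in a MongoDB collation document.
const numericCollator = new Intl.Collator("en", { numeric: true });
const plainCollator = new Intl.Collator("en");

const chromosomes = ["Chr10", "ChrY", "Chr2", "Chr11", "ChrX", "Chr1"];

// Copy the array before sorting so both orders can be compared.
const numericOrder = [...chromosomes].sort(numericCollator.compare);
const plainOrder = [...chromosomes].sort(plainCollator.compare);

console.log("numeric:", numericOrder);
console.log("plain:  ", plainOrder);
```

With the numeric option, Chr2 is placed before Chr10; without it, the strings are compared character by character and Chr10 comes first.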

Related

mongodb index on regex fields not working

I'm new to MongoDB and I'm facing a performance issue that I need help with. I have a collection with 400k records. Without an index on any field, each query takes 20-30s, so I created indexes on the fields that are usually used in search queries. The problem is that when I use $regex to search a string field that has an index, MongoDB does not use the index and still scans all records in the collection. I searched the internet for "index on regex fields mongodb" and found some answers saying that "MongoDB use prefix of RegEx to lookup indexes", which means you have to use the "^" prefix for the index to work, like "db.users.find({name: /^key word/})", but that is not working for me. Does "index on $regex field" need MongoDB Atlas to work? I ask because I'm using the community version of MongoDB. Thanks!
There's a lot to unpack here. We'll split the answer into two parts, the first to try and answer some of the direct questions about index usage and the second to explore solutions to satisfy the application requirements.
Index Usage with $regex
As is true with an index in any database that captures the full string value as the key, MongoDB can use the index for a $regex operation but its efficiency in doing so greatly depends on the regex being applied. That is what the Index Use documentation from the comments and the other answers you reference are describing.
In the comments you mention that an example query might be db.users.find({name: {$regex: '.*keyword.*', $options: 'i'}}). That means that the regex is both unanchored and case-insensitive. The aforementioned documentation states directly:
Case insensitive regular expression queries generally cannot use indexes effectively.
Why is this? Because the substring that you are searching for can be found in any string value captured by the index. So the document with matching value {name: 'a keyword'} could be located in one part of the index, {name: 'keyWord'} in another, and {name: 'Z keyword'} in yet another. The only way to ensure correct results is for the database to scan the index for all string values. So while it is still using the index, it may not be efficient, as most of the scanned values will not match and will be discarded.
You may always use .explain() to better understand how the database is answering the query, such as if and how it is using an index.
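The difference between an anchored prefix and an unanchored, case-insensitive pattern can be illustrated with a sorted array standing in for the index keys. This is a simplified sketch, not MongoDB's actual B-tree implementation; the function names prefixScan and substringScan are hypothetical.

```javascript
// A sorted array stands in for the index keys (code-unit order).
const indexKeys = ["Z keyword", "a keyword", "alpha", "keyWord", "keyboard", "zeta"].sort();

// Anchored prefix search: binary-search to the first candidate, then read
// keys only while they still start with the prefix.
function prefixScan(keys, prefix) {
  let lo = 0;
  let hi = keys.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (keys[mid] < prefix) lo = mid + 1;
    else hi = mid;
  }
  const matches = [];
  let examined = 0;
  for (let i = lo; i < keys.length; i++) {
    examined++;
    if (!keys[i].startsWith(prefix)) break;
    matches.push(keys[i]);
  }
  return { matches: matches, examined: examined };
}

// Unanchored, case-insensitive search: every key has to be tested.
function substringScan(keys, needle) {
  const re = new RegExp(needle, "i");
  let examined = 0;
  const matches = keys.filter((key) => {
    examined++;
    return re.test(key);
  });
  return { matches: matches, examined: examined };
}

const anchored = prefixScan(indexKeys, "key");
const unanchored = substringScan(indexKeys, "keyword");
console.log(anchored, unanchored);
```

The prefix search touches only the neighbouring keys, while the substring search examines the whole array, which mirrors the full index scan described above.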
Solutions
So what do we do about this?
Well, as @rickhg12hs suggests in the comments, it depends on exactly what you are trying to achieve. You reiterate that you are looking for 'full regex search capability', but that is really an approach/solution rather than a goal. If what you really need, for example, is just to match an exact string in a case-insensitive manner, then something as simple as a case-insensitive index would likely do the trick.
However, if you truly do wish to perform arbitrary substring searching, then you are really looking at search engine capabilities. In that situation your best bets would probably be to emulate their indexes directly in MongoDB (e.g. have the application manually tokenize the strings to be indexed), stand up something like Solr/Elasticsearch next to MongoDB, or use MongoDB's Atlas Search offering. The $text operator mentioned in the comment has limitations when it comes to substring searching (such as just part of a word), which may or may not be relevant for your needs.

What's the easiest way to return the results of a query for a given key/value pair in mongo as an array of the values returned?

I have a field called id (not _id) in documents from two collections. I need to compare the contents of the first collection with the second. Basically, I need to know which documents with a given value 'id' exist in collection 'A', but not 'B'. What's the easiest way to build an array of ids from collection A that I can use to do something like the following:
db.B.find({id:{$nin: array_of_ids_from_coll_A}})
Please don't get hung up over why I'm using 'id' in this case, and not '_id'. Thanks.
Strictly speaking, this doesn't answer the question of 'how to build an array that...', but I'd iterate over collection A and, for each element, try to find a match in B. If none is found, add to a list.
This has a lot of roundtrips to the database, so it's not very fast, but it's very simple. Also, if A contains a lot of elements, the array of ids might be too large to throw all of them in the $nin, which otherwise would have to be solved by splitting up the array of ids. To make matters worse, $nin isn't efficient with indexes anyway.
I incorrectly assumed that the function 'distinct' returned a set of distinct documents based on a given 'field'. In fact, it returns an array of distinct values, provided a specific field. So, I was able to construct the array I was looking for with db.A.distinct('id'). Thanks to anyone who took the time to read this question, anyway.
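The comparison itself can be sketched in plain JavaScript once the distinct values are available. The arrays collA and collB below are hypothetical in-memory stand-ins for the two collections, and distinctValues only mimics what db.A.distinct('id') returns.

```javascript
// Hypothetical stand-ins for the documents in collections A and B.
const collA = [{ id: 1 }, { id: 2 }, { id: 2 }, { id: 3 }];
const collB = [{ id: 2 }, { id: 4 }];

// Similar in spirit to distinct(): the unique values of one field.
function distinctValues(docs, field) {
  return [...new Set(docs.map((doc) => doc[field]))];
}

const idsInA = distinctValues(collA, "id");
const idsInB = new Set(distinctValues(collB, "id"));

// ids that exist in A but not in B.
const onlyInA = idsInA.filter((id) => !idsInB.has(id));

console.log(onlyInA);
```

Using a Set for the lookup keeps the difference linear in the number of ids, which matters when A is large.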

How can I filter by the length of an embedded document in MongoDB?

For example given the BlogPost/Comments schema here:
http://mongoosejs.com/
How would I find all posts with more than five comments? I have tried something along the lines of
where('comments').size.gte(5)
But I'm getting tripped up with the syntax
MongoDB doesn't support range queries with the $size operator. The documentation recommends creating a separate field that holds the size of the list and incrementing it yourself.
You cannot use $size to find a range of sizes (for example: arrays with more than 1 element). If you need to query for a range, create an extra size field that you increment when you add elements.
Note that for some queries, it may be feasible to just list all the sizes you want included or excluded using $or/$nor conditions.
In your example, the following query will give all documents with at least five comments (using standard MongoDB syntax, not Mongoose); add {"comments":{"$size":5}} to the $nor list to require strictly more than five:
db.col.find({"comments":{"$exists":true}, "$nor":[{"comments":{"$size":4}}, {"comments":{"$size":3}}, {"comments":{"$size":2}}, {"comments":{"$size":1}}, {"comments":{"$size":0}}]})
Obviously, this is very repetitive, so it only makes sense for small boundaries, if at all. Keeping a separate count variable, as recommended in the mongodb docs, is usually the better solution.
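A separate counter could be maintained roughly as follows. This sketch only builds the update and filter documents as plain objects; commentCount is a hypothetical field name and the two helper functions are illustrative, while $push, $inc and $gt are standard MongoDB operators.

```javascript
// Build an update document that appends a comment and bumps the counter
// in the same operation, so the two cannot drift apart.
function addCommentUpdate(comment) {
  return { $push: { comments: comment }, $inc: { commentCount: 1 } };
}

// Build a filter for posts with strictly more than n comments.
function moreCommentsThan(n) {
  return { commentCount: { $gt: n } };
}

const update = addCommentUpdate({ body: "Nice post" });
const filter = moreCommentsThan(5);

console.log(JSON.stringify(update));
console.log(JSON.stringify(filter));
```

The update document would typically be passed to updateOne and the filter to find; because commentCount is an ordinary field, it can also be indexed.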
It's slow, but you could also use the $where clause:
db.Blog.find({$where:"this.comments.length > 5"}).exec(...);

Sort a Range Query Using Zend Lucene

According to the documentation, Zend Lucene is supposed to sort lexicographically. I am finding this is not the case. If I have a query 'avg:[050 TO 300]', yes it will return all values in that range, but it will sort them according to the document id, not the value.
I have found that the find() function can accept additional parameters, allowing me to sort by a specific column (eg $hits = $index->find($query, 'avg', SORT_NUMERIC, SORT_ASC);). However, I am creating $query dynamically and do not want to sort every search by 'avg'.
How do I force Lucene to sort the results automatically, lexicographically, when I do a range search? And if that's not possible, how do I dynamically add a sort field to the find function?
Why don't you sort $hits yourself after getting the result from $index->find(...)? OK, this looks like a workaround and will be time-consuming for very large result sets, but I guess that this is the easiest way in most cases.

Normalize unicode

Let's say I have document indexed with Apache Solr that contains this string:
Klüft skräms inför
I want to be able to find it with search using this keyword (note the "u"-"ü"):
kluft
Is there a way to do this?
Use the ASCIIFoldingFilterFactory for both the index and query analyzers.
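The effect of folding on the example strings can be approximated in plain JavaScript. This is only a rough illustration and not Solr's implementation: it strips combining marks after Unicode decomposition, whereas the Solr filter maps a larger set of characters. The function name foldToAscii is hypothetical, and the lowercasing step is an assumption that stands in for a separate lowercase filter in the analyzer chain.

```javascript
// Decompose accented letters (NFD), drop the combining marks, and lowercase.
function foldToAscii(text) {
  return text
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase();
}

// Apply the same folding to the indexed text and to the query,
// mirroring the advice to use the filter in both analyzers.
const indexedText = foldToAscii("Klüft skräms inför");
const queryText = foldToAscii("kluft");

console.log(indexedText);
console.log(indexedText.includes(queryText));
```

Because both sides are folded the same way, the query "kluft" and the indexed "Klüft" end up as the same token.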