Full Text Search & Inverted Indexes in MongoDB - mongodb

I’m playing around with MongoDB for the moment to see what nice features it has. I’ve created a small test suite representing a simple blog system with posts, authors and comments, very basic.
I’ve experimented with a search function which uses the MongoRegex class (PHP driver), where I search through all post content and post titles for the phrase ‘lorem ipsum’, case-insensitive via the /i flag.
My code looks like this:
$regex = new MongoRegex('/lorem ipsum/i'); // the legacy PHP driver class is MongoRegex
$query = array('post' => $regex, 'post_title' => $regex); // note: this matches documents where BOTH fields contain the phrase
But I’m confused and stunned by what happens. I time every query (I set microtime() before and after the query and take the difference, to 15 decimal places).
For my first test I added 110,000 blog documents and 5,000 authors, all randomly generated. When I run my search, it finds 6,824 posts containing “lorem ipsum” and takes 0.000057935714722 seconds. This is right after restarting the MongoDB service (on Windows), and without any index other than the default on _id.
MongoDB uses a B-tree index, which most definitely isn’t very efficient for full text search. If I create an index on my post content attribute, the same query as above runs in 0.000150918960571 seconds, which, funnily enough, is slower than without any index (slower by 0.000092983245849 seconds). This can happen for several reasons because it uses a B-tree cursor.
I’ve tried to find an explanation for how it can query this fast. My guess is that it keeps everything in RAM (I’ve got 4 GB and the database is about 500 MB), which is why I restart the MongoDB service before each run, to get a clean result.
Can anyone with MongoDB experience help me understand what is going on with this kind of full text search, with or without an index, and definitely without an inverted index?
Sincerely
- Mestika

I think you simply didn't iterate over the results. With just a find(), the driver will not send a query to the server; you need to fetch at least one result for that. I don't believe MongoDB is this fast, and I believe the error is in your benchmark.
Secondly, for a regular expression search that is not anchored to the beginning of the field's value with ^, no index is used at all. You should play with explain() to see what is actually happening.
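To see why the anchor matters, here is a small standalone sketch (plain Node.js, not driver code; the sample keys are made up) of how a sorted B-tree-style key list helps a ^-anchored prefix but not an unanchored substring:

```javascript
// A B-tree keeps keys sorted, so a prefix like /^lorem/ maps to a contiguous
// key range that can be located without scanning everything; an unanchored
// /lorem/ can match anywhere in the string, so every key must be inspected.
const sortedKeys = ['alpha', 'ipsum dolor', 'lorem ipsum', 'lorem x', 'zeta'];

// Prefix search: binary-search to the first key >= prefix, then walk the range.
function prefixScan(keys, prefix) {
  let lo = 0, hi = keys.length;
  while (lo < hi) {                      // lower-bound binary search
    const mid = (lo + hi) >> 1;
    if (keys[mid] < prefix) lo = mid + 1; else hi = mid;
  }
  const out = [];
  for (let i = lo; i < keys.length && keys[i].startsWith(prefix); i++) {
    out.push(keys[i]);                   // only the matching range is touched
  }
  return out;
}

// Unanchored search: no choice but a full scan of all keys.
function fullScan(keys, substring) {
  return keys.filter(k => k.includes(substring));
}

console.log(prefixScan(sortedKeys, 'lorem')); // touches only the 'lorem…' range
console.log(fullScan(sortedKeys, 'lorem'));   // touches all 5 keys
```

Both return the same matches here, but prefixScan only inspects the keys in the matching range, which is the work an anchored regex can push into a B-tree index.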

Related

Elasticsearch vs MongoDB for full text search

This is a full text search question.
I was using Elasticsearch for my logging system. Then I heard that MongoDB also supports full text search, so I tested the performance.
I created a text index and tested it.
From a vocabulary of 10,000 words, I generated 10 million documents.
Then I looked up two words (e.g. "apple pineapple").
The results were surprising: MongoDB searches were faster.
Am I misunderstanding full text search in Elasticsearch? Did I do the test wrong?
In terms of full text search performance, is there any reason left why Elasticsearch should be used?
Please teach me.
If your use case is full text search only, I would still lean towards Elasticsearch, as it is designed for exactly that. I admit, however, that I haven't explored MongoDB's capabilities in this regard. Elasticsearch provides various search features (fuzzy matching, proximity matches, match phrases and more) which can be used depending on your use case.
Another difference between Elastic's and Mongo's data storage is that Elastic tries to keep everything in memory, while Mongo balances between disk and memory. So ideally Elastic should be faster if you load test it.
As for your test, please make sure that both the Mongo and Elastic clusters have equivalent resources; otherwise it is not an apples-to-apples comparison.
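For context on what both engines build under the hood, here is a minimal in-memory sketch (toy data, plain Node.js) of an inverted index: each term maps to a posting list of document ids, so a two-word query is answered by set operations on short lists instead of scanning millions of documents. (Engines differ on whether space-separated terms default to AND or OR; this sketch uses AND.)

```javascript
// Toy corpus standing in for the generated documents.
const docs = [
  { id: 0, text: 'apple pineapple juice' },
  { id: 1, text: 'banana smoothie' },
  { id: 2, text: 'apple pie' },
];

// Build: tokenize each document and append its id to every term's posting list.
const index = new Map();
for (const doc of docs) {
  for (const term of new Set(doc.text.split(/\s+/))) {
    if (!index.has(term)) index.set(term, []);
    index.get(term).push(doc.id);
  }
}

// Query: intersect the posting lists of all query terms (AND semantics).
function search(query) {
  const lists = query.split(/\s+/).map(t => index.get(t) || []);
  return lists.reduce((a, b) => a.filter(id => b.includes(id)));
}

console.log(search('apple pineapple')); // only doc 0 contains both terms
console.log(search('apple'));           // docs 0 and 2
```

The query cost is proportional to the length of the posting lists, not the corpus size, which is why both engines can answer a two-word lookup over 10 million documents quickly.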

How well do the search engines of databases like mongoDB and Postgres do compared to something like Elasticsearch? (in 2020)

I am working on a website where users can create entries in a database. These entries all have the same form, so I have been using SQLite (with FTS5) so far, which is the only database I know ;)
However it is crucial that users can search these entries on the website. The full text search is working decently well (the users know approximately what they are looking for) but I need two improvements:
Wrong spellings should still return the correct entry (I have seen the spellfix extension for SQLite for that, but I don't know how well it works)
More importantly, if a user enters a few query words on the website, I try to MATCH those with a SQL query. If the user enters too many words, it returns 0 matches:
For example, if a user inputs "sars covid 19" into the search bar:
CREATE VIRTUAL TABLE TEST USING FTS5(SomeText);
INSERT INTO TEST(SomeText) VALUES
('Covid 19');
SELECT SomeText
FROM TEST
WHERE SomeText MATCH 'sars covid 19';
=> 0 matches, but I would want it to return the 'Covid 19' entry.
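(For illustration, here is an in-memory sketch, in plain Node.js rather than SQLite, of why the MATCH above returns 0 rows: FTS5 implicitly ANDs the query terms, so 'sars covid 19' only matches rows containing all three terms, while an OR query like 'sars OR covid OR 19' matches rows containing any of them.)

```javascript
// The single row from the FTS5 example above.
const rows = ['Covid 19'];
const tokenize = s => s.toLowerCase().split(/\s+/);

// FTS5's default semantics: every query term must appear in the row.
const matchAnd = (row, terms) => terms.every(t => tokenize(row).includes(t));
// OR semantics: any query term is enough.
const matchOr = (row, terms) => terms.some(t => tokenize(row).includes(t));

const terms = tokenize('sars covid 19');
console.log(rows.filter(r => matchAnd(r, terms))); // like MATCH 'sars covid 19': nothing
console.log(rows.filter(r => matchOr(r, terms)));  // like MATCH 'sars OR covid OR 19'
```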
I realise that SQLite might be too simplistic for a database that needs to handle searches well. Now my question is: do Postgres or MongoDB have search engines that include the functionality I need? Or is it worth diving into solutions like Elasticsearch?
Most articles I found on this are 5-10 years old, so I am wondering what the current state of affairs is regarding search engines in databases. Any hints are greatly appreciated
A combination of ES + MongoDB works well: you index and perform full text search in ES, and you keep the original documents, with some key fields indexed, in MongoDB...
Elasticsearch will work for sure. You only have to think about how you index your documents, and you will be able to find them the way you indexed them. In your context it seems that the default text analysis will work with a match query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
MongoDB will work too in this simple case: https://docs.mongodb.com/manual/text-search/ , but Mongo won't let you customize the tokenizer, so if you later need to upgrade your text search, Mongo will be limiting.
PostgreSQL could do it too, using LIKE, but I am not familiar enough with it. If you have 10k entries, it will be fine for sure; if you expect 1 million, Mongo or ES would be better.
If you have to choose between MongoDB and ES, you have to be more specific in your question. For full text, ES is really nice, with a lot of features; MongoDB gives you some nice database tools too. Sometimes ES will be better, sometimes Mongo, depending on what you need. If you only want full text search, ES is a must.

How to efficiently query MongoDB for documents when I know that 95% are not used

I have a collection of ~500M documents.
Every time I execute a query, I receive one or more documents from this collection. Say I keep a counter for each document and increase it by 1 whenever the document is returned by a query. After a few months of running the system in production, I discovered that only 5% of the documents have a counter greater than 0 (zero). Meaning, 95% of the documents are never used.
My question is: Is there an efficient way to arrange these documents to speedup the query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
"~500M documents" is quite a solid figure; good job if that's true. So here is how I see the solution to the problem:
If you want to re-write/re-factor and rebuild the app's DB, you could use the versioning pattern.
What does that look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you would use find() only on the relevant-docs collection (which stores the 5% of useful docs), and if it returns nothing, fall back to Irrelevant.find(). This pattern also allows you to store old/historical data and manage it via a TTL index or a capped collection.
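As a rough illustration of that fallback, here is a standalone sketch (plain Node.js with in-memory Maps standing in for the two collections; the names and sample data are made up):

```javascript
// Small "hot" store holding the ~5% of documents that actually get used.
const relevantDocs = new Map([['a', { _id: 'a', hot: true }]]);
// Big "cold" store holding everything else.
const irrelevantDocs = new Map([['z', { _id: 'z', hot: false }]]);

function findDoc(id) {
  // Most lookups should be satisfied by the small collection...
  const hit = relevantDocs.get(id);
  if (hit) return hit;
  // ...and only misses pay the cost of touching the big one.
  return irrelevantDocs.get(id) || null;
}

console.log(findDoc('a')); // served from the relevant store
console.log(findDoc('z')); // falls back to the irrelevant store
```

With real collections, the same shape applies: RelevantDocs.find() first, then IrrelevantDocs.find() only on a miss.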
You could also add some Redis magic to it (which uses precisely the same logic).
This article can also be helpful (as many others, like this SO question).
But don't try to replace Mongo with Redis; team them up instead.
Using Indexes and .explain()
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
Yes, that will deal with your problem. To see it for yourself, download MongoDB Compass, add this boolean field to your schema (don't forget a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on one field, measure the performance by querying only that field.
If your index is being used (and actually speeds the query up), Compass will show you.
To measure the performance of the queries (with and without indexing), use the Explain tab.
Actually, all of this can be done without Compass, via .explain() and .createIndex() queries, but Compass has better visuals for the process, so it's nicer to use, especially since it is now completely free.

MongoDB, sort() and pagination

I know there are already some patterns for pagination with Mongo (skip() for few documents, range queries for many), but in my situation I need live sorting.
Update:
For clarity I'll change the point of the question. Can I make a query like this:
db.collection.find().sort({key: 1}).limit(n).sort({key: -1}).limit(1)
The main point is to sort the query in the "usual" order, limit the returned set of data, and then reverse the sort to get the last index of the paginated data. I tried this approach, but it seems that Mongo somehow optimises the query and ignores the first sort() operator.
I am having a hard time grasping your question.
From what I can tell, when a user refreshes the page, say 6 hours later, it should show not only the results that were there before but also the results that are there now.
As @JohnnyHK says, MongoDB does "live" sorting naturally, whereby this would be the case, and MongoDB would give you back the right results for your queries.
Now, I think one problem you might be trying to get at here (the question needs clarification, massively) is that, due to the data changing, the last _id you saw might no longer truly represent the page numbers etc., or even the diversity of the information; i.e. the last _id you saw is now in fact halfway through page 13.
You would probably spend more time and performance trying to solve these sorts of things than just letting the user understand that they have been AFK for a long time.
Edit
Aha, I think I see what you're trying to do now: you're trying to be sneaky by getting both the page and the last item in the list at the same time. Unfortunately, just like in SQL, this is not possible. Even if sort worked like that, it would not function as you'd hope, since you can only sort one way on a single field.
However, for future reference: sort() is exactly that, a method on a cursor, and until you actually open the cursor by starting to iterate it, calling sort() multiple times will just overwrite the cursor's sort property.
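The overwrite behaviour can be sketched with a toy cursor (plain Node.js, not the real driver class; sort specs are recorded but only applied on iteration):

```javascript
// Minimal stand-in for a driver cursor: sort() just records a spec and
// returns the cursor, so a second sort() before iteration replaces the first.
function Cursor(docs) {
  this.docs = docs;
  this.sortSpec = null;
  this.sort = function (spec) { this.sortSpec = spec; return this; };
  this.toArray = function () {           // the "query" runs only on iteration
    const [key] = Object.keys(this.sortSpec);
    const dir = this.sortSpec[key];
    return [...this.docs].sort((a, b) => (a[key] - b[key]) * dir);
  };
}

const c = new Cursor([{ key: 1 }, { key: 3 }, { key: 2 }]);
const out = c.sort({ key: 1 }).sort({ key: -1 }).toArray();
console.log(out.map(d => d.key)); // descending: only the last sort applies
```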
I am afraid this has to be done with two queries: get your page first, and then either scroll through the records client-side to find the last _id (I think you're looking for the max of that page), or just do a second query to get the last _id. It should be super duper fast.
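The two-step approach can be sketched over an in-memory collection (plain Node.js; the field names are illustrative): fetch the page, take the last _id client-side, and use it as the range anchor for the next page instead of skip().

```javascript
// Toy collection of documents with an _id and a sort key.
const collection = [5, 1, 4, 2, 3].map(k => ({ _id: k, key: k }));

// One "query": sort ascending, optionally anchored after a previous _id, take n.
function page(afterId, n) {
  return collection
    .filter(d => afterId === null || d._id > afterId)
    .sort((a, b) => a.key - b.key)
    .slice(0, n);
}

const first = page(null, 2);
const lastId = first[first.length - 1]._id; // the "max of that page", client side
const second = page(lastId, 2);             // range pagination, no skip()

console.log(first.map(d => d._id));  // first page of ids
console.log(lastId);                 // anchor for the next page
console.log(second.map(d => d._id)); // next page of ids
```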

How to store query output in temp db?

I am really new to programming, but I am studying it. I have a problem which I don't know how to solve.
I have a collection of docs in MongoDB and I'm using Elasticsearch to query the fields. The problem is that I want to store the output of a search back in MongoDB, but in a different DB. I know that I have to create a temporary DB which is updated with every search result, but how do I do this? Or point me to documentation so I can learn it. I would really appreciate your help!
Mongo does not natively support "temp" collections.
A typical thing to do here is to not write the entire results output to another DB, since that would be pointless: Elasticsearch does its own caching, so you don't need any layer on top of it.
Also, due to IO concerns, it is normally a bad idea to write, say, a result set of 10k records to Mongo or another DB.
There is a feature request for what you describe: https://jira.mongodb.org/browse/SERVER-3215 but it isn't planned as of yet.
Example
You could have a collection of results.
Within this collection you would have docs that look like:
{keywords: ['bok', 'mongodb']}
Each time you search and scroll through the result items, you would write a doc to this collection, populating the keywords field with keywords from that search result (one doc per search result). It would probably be best to just stream each search result to MongoDB as it comes in. I have never programmed Python (though I wish to learn), so here is an example in pseudocode:
var elastic_results = [ /* ...documents returned from Elasticsearch... */ ];
elastic_results.forEach(function (result) {
    // split the phrases in this result down into a keywords array
    var keywords_doc = { keywords: /* array formed from splitting down the result */ };
    // just lazily insert one at a time; no need for batching or shrinking the data, stream it in
    db.results_collection.insert(keywords_doc);
});
So as you go along your results you basically just insert as fast as possible, creating a sort of "stream" of input to MongoDB. It can handle this quite well.
This should then give you a shardable list of words and language verbs to process with things like MapReduce jobs, and to aggregate statistics about.
Without knowing more about your scenario, this is pretty much my best answer.
This does not use the temp table concept, but instead makes your data permanent, which sounds fine, since you wish to use Mongo as a storage engine for further tasks.
Actually, there is a MongoDB river plugin to work with Elasticsearch...
db.your_table.find().forEach(function(doc) { db.another_table.insert(doc); });