This RediSearch page refers to 5 scoring models listed below.
We are using MongoDB as our primary store, but using RediSearch for faster cached queries. We would like the same result for each.
Does one of the scoring models listed below for RedisSearch match the one in MongoDB? Do they both use Lucene under the covers?
Scoring model ΒΆ RediSearch comes with a few very basic scoring
functions to evaluate document relevance. They are all based on
document scores and term frequency. This is regardless of the ability
to use sortable fields (see below). Scoring functions are specified by
adding the SCORER {scorer_name} argument to a search request.
If you prefer a custom scoring function, it is possible to add more
functions using the Extension API .
These are the pre-bundled scoring functions available in RediSearch:
TFIDF (Default)
Basic TF-IDF scoring with document score and proximity boosting
factored in.
TFIDF.DOCNORM Identical to the default TFIDF scorer, with one
important distinction:
BM25
A variation on the basic TF-IDF scorer, see this Wikipedia article for
more info .
DISMAX
A simple scorer that sums up the frequencies of the matched terms; in
the case of union clauses, it will give the maximum value of those
matches.
DOCSCORE
A scoring function that just returns the priory score of the document
without applying any calculations to it. Since document scores can be
updated, this can be useful if you'd like to use an external score and
nothing further.
Related
This question is about choosing the type of database to run queries on for an application. Keeping other factors aside for the moment, and given that the choice is between mongodb and elastic, the key criterion is that the query should be resolved in near real time. The queries will be ad-hoc and as such can contain any of the fields in the JSON objects and will likely contain aggregations and subaggregations. Furthermore, there will not be nested objects and none of the fields will be containing 'descriptive' text (like movie reviews etc.), i.e., all the fields will be keyword type fields like State, Country, City, Name etc.
Now, I have read that elasticsearch performance is near real time and that elasticsearch uses inverted indices and creates them automatically for every field.
Given all the above, my questions are as follows.
(there is a similar question posted in stack but I do not think it answers my questions
elasticsearch v.s. MongoDB for filtering application)
1) Since the fields in the use case I mentioned do not contain descriptive text and hence would not require the full-text search capability and other additional features that elastic provides (especially for text search), what would be a better choice between elastic and mongo? How would elastic search and mongo query/aggregation performance compare if I were to create single field indices on all the available fields in mongo?
2) I am not familiar with advanced indexing, so I am assuming that it would be possible to create indices on all available fields in mongo (either using multiple single field indices or maybe compound indices?). I understand that this will come with a cost for storage and write speed which is true for elastic as well.
3) Also, in elastic the user can trade off write speed (indexing rate) with the speed with which the written document becomes available (refresh_interval) for a query. Is there a similar feature in mongo?
I think the size of your data set is also a very important aspect about choosing DB engine. According to this benckmark (2015), if you have over 10 millions of documents, Elasticsearch could be a better choice. If your data set is small there should be no obvious different about performance between Elasticsearch and MongoDB.
Can someone explain the exact difference of MLT and normal select query in Solr ? I know that Solr uses an advanced form of TF.IDF to score documents based on a select query for a textual field, but how does the scoring algorithm differ when MLT is being used ?
I'm not sure if the question really makes sense - More Like This is used to find more documents similar to one you already have. This is different from entering a query and wanting to get something back, they're used to solve very different modes of operation.
Behind the scenes they're both queries in the meaning of "looks up something in the index based on input", which for MLT is the terms from the existing document, instead of the query the user has entered.
You can see how the MLT query is built in MoreLikeThis.java. If I read the code correctly, a PriorityQueue is used to fetch the scores for all the terms, which are then added as boost queries to a large set of terms in a boolean query, where each term is set to SHOULD occur. That way the terms are boosted based on MLT semantics, while it uses the ClassicSimilarity behind the scenes.
But again, the use case for MLT is very different from when you'd use a regular query.
I would like to upon user request graph median values of many documents. I'd prefer not to transfer entire documents from the database to my application solely for purposes of determining median values.
I understand that development is still planned for a median aggregator in MongoDB, however I see that currently the following operations are supported:
sort
count
limit
Short of editing mongo source code, Is there any reasonable way I can combine these operations to obtain median values; for example, to sort values, count them, and limit to return median values?
It appears that editing Mongo source code is the only solution.
I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The records documents consist of GPS tracks such as start and end coordinates, total time and top speed and distance. The user document is has user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am confused if I should do a map reduce and create an aggregate collection for users, or if I should add these stats to the user document with some kind of cron job type soliuton. I have read many guides about map reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object as the same time as you update current co-oordinates, speed etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process, I simply mean calculate on update of the user object.
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as, "tell me what the total distance travelled for all users within x region", then depending on the number of combinations you may not be able to cover all those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applying, then they'll need to be calculated from some snapshot of data. The two approaches here are;
the aggregation framework in 2.2
MapReduce
You would need to use MapReduce say, if you've a LOT of historical data that you want to crunch and you can pre-calculate the results for fast reading later. By my definition, that data isn't changing frequently, but even if it did, you can also use incremental MR to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand, but it won't be as quick of course as pre-calculated values but way quicker than MR when executed on-demand. It can't cope with the high volume result-sets that you can do with MR, but it's better suited to queries where you don't know the parameter values in advance.
By way of example, if you wanted to calculate the aggregate sums of users stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do that on the fly.
If however, you wanted it by city, well you could conceivably use MR there because you could stick to a finite set of cities and just pre-calculate them all.
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you having to calculate on the fly.
I am looking for a way to update every document in a collection called "posts".
Posts get updated periodically with a popularity (sitewide popularity) and a strength (the estimated relevance to that particular user), each from different sources. What I need to do is multiply popularity and strength on each post to get a third field, relevance. Relevance is used for sorting the posts.
class Post
include Mongoid::Document
field :popularity
field :strength
field :relevance
...
The current implementation is as follows:
1) I map/reduce down to a separate collection, which stores the post id and calculated relevance.
2) I update every post individually from the map reduce results.
This is a huge amount of individual update queries, and it seems silly to map each post to its own result (1-to-1), only to update the post again. Is it possible to multiply in place, or do some sort of in-place map?
Is it possible to multiply in place, or do some sort of in-place map?
Nope.
The ideal here would be to have the Map/Reduce update the Post directly when it is complete. Unfortunately, M/R does not have that capability. In theory, you could issue updates from the "finalize" stage, but this will collapse in a sharded environment.
However, if all you are doing is a simple multiplication, then you don't really need M/R at all. You can just run a big for loop, or you can hook up the save event to update :relevance when :popularity or :strength are updated.
MongoDB doesn't have triggers, so it can't do this automatically. But you're using a business layer which is the exact place to put this kind of logic.