Are MongoDB text search scores comparable across queries?

I know I can sort the results of a query based on the text score that each result has been assigned using MongoDB text search. But, given two different queries A and B that retrieve different documents D1 and D2, if score(A, D1) > score(B, D2) does it mean that D1 is more related to query A than D2 is to query B?
In other words, are the scores relative to the query or also valid absolutely?

given two different queries A and B that retrieve different documents D1 and D2, if score(A, D1) > score(B, D2) does it mean that D1 is more related to query A than D2 is to query B?
Assuming both queries are against equivalent text search indexes, the same scoring algorithm is used so this seems a correct inference to make.
Factors that can influence the scoring for queries would include text index options like:
field weights
language
text index version (e.g. MongoDB 3.2 has text search enhancements associated with version 3 text indexes).
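For reference, here is a minimal sketch of how the score is typically requested from Python with PyMongo so that results from different queries can be inspected side by side. The helper name text_score_query is hypothetical, and the block only builds the filter, projection, and sort arguments, so it runs without a server.

```python
def text_score_query(search_terms):
    """Build (filter, projection, sort) arguments for a $text query.

    Hypothetical helper: it only assembles the documents that
    PyMongo's find() and sort() accept and never contacts a server.
    """
    query_filter = {"$text": {"$search": search_terms}}
    # Ask the server to return the relevance score in a "score" field.
    projection = {"score": {"$meta": "textScore"}}
    # Sort by that same score; textScore sorts from highest to lowest.
    sort = [("score", {"$meta": "textScore"})]
    return query_filter, projection, sort


# With a live connection (assumed, not shown here):
#   flt, proj, srt = text_score_query("coffee shop")
#   cursor = collection.find(flt, proj).sort(srt)
```

Running two different searches this way and comparing the returned score fields is only meaningful under the assumption stated above, namely that both queries hit equivalent text indexes.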

Related

What does the distinct on clause mean in Cloud Datastore and how does it affect the reads?

This is what the cloud datastore doc says but I'm having a hard time understanding what exactly this means:
A projection query that does not use the distinct on clause is a small operation and counts as only a single entity read for the query itself.
Grouping
Projection queries can use the distinct on clause to ensure that only the first result for each distinct combination of values for the specified properties will be returned. This will return only the first result for entities which have the same values for the properties that are being projected.
Let's say I have a table for questions and I only want to get the question text sorted by the created date. Would this be counted as a single read, with the rest as small operations?
If your goal is to just project the date and text fields, you can create a composite index on those two fields. When you query, this is a small operation with all the results as a single read. You are not trying to de-duplicate (so no distinct/on) in this case and so it is a small operation with a single read.

Query one document per association from MongoDB

I'm investigating how MongoDB would work for us. One of our most used queries gets the latest measurement (or the latest as of a given time) for each station. There are thousands of stations and each station has tens of thousands of measurements.
So we plan to have one collection for stations and another for measurements.
In SQL we would do the query with
SELECT * FROM measurements
INNER JOIN (
SELECT station_id, max(meas_time) AS meas_time
FROM measurements
WHERE meas_time <= 'time_to_query'
GROUP BY station_id
) t2 ON t2.station_id = measurements.station_id
AND t2.meas_time = measurements.meas_time
This returns one measurement for each station, and the measurement is the newest one before the 'time_to_query'.
What query should be used in MongoDB to produce the same result? We are actually using Rails and Mongoid, but that should not matter.
update:
This question is not about how to perform a JOIN in MongoDB. The fact that getting the right data out of the table requires a join in SQL doesn't necessarily mean that we would also need a join in MongoDB. There is only one table used in the query.
We came up with this query
db.measurements.aggregate([{$group:{ _id:{'station_id':"$station_id"}, time:{$max:'$meas_time'}}}]);
with indexes
db.measurements.createIndex({ station_id: 1, meas_time: -1 });
Even though it seems to give the right data it is really slow. Takes roughly a minute to get a bit over 3000 documents from a collection of 65 million.
Just found that MongoDB is not using the index in this query even though we are using the 3.2 version.
I guess the worst case solution would be something like this (off the top of my head):
measures = []
Station.all.each do |station|
  # Newest measurement for this station at or before the cutoff time
  measurement = Measurement.where(station_id: station.id, :meas_time.lte => time_to_query).order_by(meas_time: -1).first
  next if measurement.nil?  # station has no measurement before the cutoff
  measures << [station.name, measurement.measure, ....]
end
It depends on how much time the query is allowed to take. Either way, the data should be indexed by station_id and meas_time.
How much time does the SQL query take?
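One commonly used shape for this kind of "newest row per group" query is to filter, sort, and then keep the first document of each group. The sketch below builds such a pipeline as plain Python data (it could be passed to PyMongo's aggregate()) and pairs it with a pure-Python reference that shows the intended result. The function names are hypothetical, and whether the planner uses the (station_id, meas_time) index for it depends on the server version, which is not measured here.

```python
from datetime import datetime


def latest_per_station_pipeline(time_to_query):
    """Aggregation stages: newest measurement per station up to a cutoff."""
    return [
        # Keep only measurements taken at or before the cutoff.
        {"$match": {"meas_time": {"$lte": time_to_query}}},
        # Order so the newest measurement of each station comes first.
        {"$sort": {"station_id": 1, "meas_time": -1}},
        # Keep the first (newest) document seen for each station.
        {"$group": {"_id": "$station_id", "latest": {"$first": "$$ROOT"}}},
    ]


def latest_per_station(measurements, time_to_query):
    """Pure-Python reference for what the pipeline is meant to return."""
    latest = {}
    for m in measurements:
        if m["meas_time"] > time_to_query:
            continue  # taken after the cutoff, ignore
        best = latest.get(m["station_id"])
        if best is None or m["meas_time"] > best["meas_time"]:
            latest[m["station_id"]] = m
    return latest


# With a live connection (assumed, not shown here):
#   db.measurements.aggregate(latest_per_station_pipeline(datetime(2016, 1, 5)))
```

Unlike the $max-only grouping, this returns the whole measurement document and applies the time cutoff, which mirrors what the SQL join produces.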

With MongoDB, what are the fields that need to be picked for building indexes on?

If a query filters on fields a and b and then orders on c, do I need to build separate indexes for a, b, and c, or should I build a compound index on (a, b, c)? Also, should the sequence in the query match the sequence in the index? That is, if the query filters on b, then c, and then orders on a, would it be better to have a compound index on (b, c, a)?
Since MongoDB currently uses only one index per query, you will need a compound index.
The order of the index params does matter, although not necessarily in the way you mention in the question.
Since filtering happens first, if the index was (c,b,a), it wouldn't be very useful for filtering, especially if there are a lot of items in the collection. The fields used for sorting should be specified last in the index.
So the index should either be (a,b,c) or (b,a,c). Which one of those it should be depends on selectivity -- in other words, which field will eliminate items that don't match faster?
If there are 10,000 likely values for b, and only two likely values for a, then the index should be (b,a,c). Conversely, if there are many more possible values for a, then it should probably be (a,b,c). If the two fields are roughly the same in their ability to eliminate documents from the query, then it won't matter that much.
All of this and more is answered in the docs. See the sections on compound indexes.
The order of the query params doesn't matter.
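The ordering rule described in this answer (most selective filter field first, sort field last) can be sketched as a small helper. The function name and the idea of passing estimated distinct-value counts are assumptions made for illustration; the returned list has the (field, direction) shape that PyMongo's create_index() accepts.

```python
def compound_index_keys(filter_cardinality, sort_field):
    """Order filter fields by estimated distinct values, sort field last.

    filter_cardinality: dict mapping field name -> estimated number of
    distinct values (hypothetical input; higher is treated as more
    selective, as in the answer's 10,000-vs-2 example).
    """
    by_selectivity = sorted(
        filter_cardinality, key=filter_cardinality.get, reverse=True
    )
    keys = [(field, 1) for field in by_selectivity]  # 1 means ascending
    keys.append((sort_field, 1))  # the sort field goes last
    return keys


# With a live connection (assumed, not shown here):
#   collection.create_index(compound_index_keys({"a": 2, "b": 10000}, "c"))
```

For the example in the answer, with about 10,000 values for b and two for a, this yields the key order (b, a, c).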

How to best filter a MongoDB collection using a predicate

We would like to filter a MongoDB collection using an "overspecified" find() query. For example: collection A, the collection we want to filter, has documents that contain a set of requirements for attributes. Examples are the document a, which contains the requirement {req: age:{min:20,max:30}} and b which contains the requirement {req: gender:male}.
We also have a document, d, from collection D that contains the following attributes: d = {age:21, gender: male}.
In this case, both a and b should be in the set of documents that d is eligible for, as d fulfills the requirements for both.
However, if we include all of d's attributes in a find query, we get db.A.find({d.age > req.age.min, d.age < req.age.max, d.gender: req.gender}), which would exclude both a and b from our result.
What is the best way to select all the documents in A that d fulfills the requirements for, given that d may contain more attributes than the requirements of a document in A specify, and that the requirements in A and attributes in D are not fixed? We would like to avoid specifying every possible attribute in D in all A.req documents as we would like our requirements to be as flexible as possible.
There is no straightforward way to do this. The only route you can take is performing an existence check on each requirement, which doesn't result in the most elegant queries imaginable. Using your query format:
db.A.find({$or: [{"req.age.min": {$exists: false}}, {"req.age.min": {$lt: d.age}}], ....})
In other words, you modify your query so it follows "if D's attribute has a requirement in A, check that it meets that requirement". Frankly, I think looking at a more appropriate schema might be a more elegant route though.
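A rough sketch of how such a filter could be generated from d's attributes is shown below. It assumes, hypothetically, that numeric requirements are stored as req.<name>.min and req.<name>.max (both present) and that every other requirement is stored as a plain value under req.<name>. The helper name is made up for this example.

```python
def eligibility_query(attributes):
    """Build a find() filter from the attributes of document d.

    For each attribute, a document in A passes if it has no
    requirement on that attribute or if the requirement is met.
    """
    clauses = []
    for name, value in attributes.items():
        field = "req." + name
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number:
            # d.value > req.min and d.value < req.max, as in the question
            met = {field + ".min": {"$lt": value}, field + ".max": {"$gt": value}}
        else:
            met = {field: value}
        clauses.append({"$or": [{field: {"$exists": False}}, met]})
    return {"$and": clauses} if clauses else {}


# With a live connection (assumed, not shown here):
#   db.A.find(eligibility_query({"age": 21, "gender": "male"}))
```

Note that this only checks requirements on attributes that d actually has. A document in A that requires an attribute missing from d would still match, so that case needs separate handling.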

Sphinx Search: excluding index B results from index A results

Here's my issue:
I have 2 indexes:
A - product titles only
B - product titles and product descriptions
By default I search index A to categorize products (e.g. most bikes have "bike" in title).
Sometimes there are instances where, to determine the category (which might be a sub-category of something), we need to look at the description, mostly to exclude irrelevant results. For pagination on the search results page to work, I need to get this clean result as one array after running RunQueries().
But it does not work. It basically adds the results of both queries, and it looks like there's no way to subtract results. Does anyone have any ideas?
Tell me if I'm completely missing something, but it sounds to me like you're trying to include results whose product titles match a certain query and exclude results whose description matches another query?
If this is the case it seems to me that having 2 indexes is useless, and you can have one index with both product titles and descriptions and then run a full text search query as such:
@title queryA @description -queryB
You can use the same query to search for matches that have a title of queryA AND a description of queryB by simply removing the - symbol.
If this is off base the only other way I could think of doing it is using SphinxQL (I'm not well versed in any of the libraries since support for all the libraries which don't use SphinxQL is being phased out in the future as far as I've read)
Using SphinxQL you could run 2 queries, one which is like
SELECT id FROM indexB WHERE MATCH('@description queryB')
And then run a second query using the list of ids you got from the first query, as such
SELECT id FROM indexA WHERE id NOT IN(id1,id2,id3,...)
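If the subtraction ends up being done on the client instead, the two result lists can be combined in application code before paginating. This is a minimal sketch with a hypothetical helper name; it assumes both queries return plain lists of document ids and that index A's ids arrive in ranked order.

```python
def exclude_and_paginate(title_ids, excluded_ids, page, per_page):
    """Drop excluded ids from the title matches, then return one page.

    title_ids: ids matched in index A, in ranked order.
    excluded_ids: ids matched by the description query in index B.
    page is 1-based.
    """
    excluded = set(excluded_ids)  # set gives fast membership checks
    kept = [doc_id for doc_id in title_ids if doc_id not in excluded]
    start = (page - 1) * per_page
    return kept[start:start + per_page]
```

Because the exclusion happens before slicing, page boundaries stay consistent, at the cost of fetching all matching ids from both indexes first.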