I am trying to use Geofirestore to query results within a given radius of a center location. As far as I know, GeoFire can't limit the number of results of a query. My solution is to increment my radius in steps until I get X results. It seems that by doing this I will re-query the same results multiple times, which could be costly. If I query 10 results but want 20, would Firestore count this as 30+ reads?
The geoquery implementations I know of on top of Firestore all use geohashes, which means they combine latitude and longitude into a single value that can then be used to filter documents within a certain geographic range. While it may technically be possible to limit the number of documents returned by adding limit(...) to the query, the results will not be ordered by their distance to the center of the query. So you're likely to simply get results from one corner of the query's range, instead of the closest documents.
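To make this concrete, here is a minimal sketch of the usual pattern using the geofire-common helper library; the geohash, lat, and lng field names are assumptions about how the documents are stored:

    const geofire = require('geofire-common');

    // db is assumed to be an initialized Firestore instance.
    const center = [40.4406, -79.9959];
    const radiusInM = 5 * 1000;

    // Each "bound" is a [start, end] range over the geohash field; one radius
    // usually needs several such ranges, hence several queries.
    const bounds = geofire.geohashQueryBounds(center, radiusInM);

    const promises = bounds.map((b) =>
      db.collection('restaurants')
        .orderBy('geohash')
        .startAt(b[0])
        .endAt(b[1])
        .get()
    );

    Promise.all(promises).then((snapshots) => {
      const withinRadius = [];
      for (const snap of snapshots) {
        for (const doc of snap.docs) {
          // Geohash ranges over-read, so filter false positives by true distance.
          const distanceInKm = geofire.distanceBetween(
            [doc.get('lat'), doc.get('lng')],
            center
          );
          if (distanceInKm * 1000 <= radiusInM) withinRadius.push(doc);
        }
      }
      // Only after this client-side filter can you sort by distance and take
      // the closest N, which is why a plain limit(...) on the query doesn't help.
    });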
Firestore charges you for all documents read from the server. Thanks to its local cache, a subsequent, wider query may not need to read every document from the server again, as the documents from the previous query may already be in the local cache.
That said: be aware that using geohashes for geoqueries leads to significant over-reading of documents. In my experimentation this can mean reading between 3x and 10x more documents than are within the queried range. I'd highly recommend reading up on geohashes and geoqueries on Firestore, for example by watching this talk I gave a while ago on the topic: https://www.youtube.com/watch?v=mx1mMdHBi5Q.
Why does Firestore not support queries with range filters on multiple fields? I'd also like to know why I can't do full-text search. Is there a reason behind these limitations?
Firestore has a quite unique performance guarantee: the time it takes to retrieve data only depends on the amount of data you retrieve, not on the amount of data you retrieve it from. So no matter if there are a thousand, a million, or a billion documents in a collection, retrieving ten of those documents will always take the same amount of time.
In order to be able to guarantee this performance, Firestore has a limited set of (mostly query) capabilities. For example: Firestore only supports queries for which it can jump to the correct starting point in an index, and stream results from that starting point until the end of the query. This precludes it from supporting things like:
OR queries, since those would require skipping through the index, which would make it impossible to guarantee the performance. Note that work is under way to support a subset of possible IN queries, so that you could query for something like "give me all restaurants that serve either Thai or Italian food".
queries for a substring of text. Firestore only supports so-called prefix queries, where the value starts with a given substring, but not where the value merely contains it. So you could search for "give me all names that start with 'zla'", but not for "give me all names that contain 'lat'" (see the sketch after this list).
querying on multiple ranges, since those too might make it impossible to guarantee the performance.
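As an illustration of the prefix-query trick, here is a commonly used sketch; the collection and field names are made up, and '\uf8ff' is simply a very high Unicode code point that serves as an upper bound for the prefix:

    // Matches documents whose name starts with 'zla': the index can jump to
    // the first value >= 'zla' and stream until values pass the upper bound.
    db.collection('users')
      .where('name', '>=', 'zla')
      .where('name', '<=', 'zla' + '\uf8ff')
      .get()
      .then((snapshot) => {
        snapshot.forEach((doc) => console.log(doc.get('name')));
      });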
For another good explanation of this, see the Getting to know Cloud Firestore episode on queries.
From reading the MongoDB documentation, playing with indexes and queries, and monitoring the results, my understanding of how geolocation queries work in MongoDB is the following:
Start at the given location
Look at EVERY document, from close to far
Keep those matching the additional query criteria
Until either number_limit or distance_limit is reached
Example
To show what we are trying to do, let's take the MongoDB tutorial example as a base: https://docs.mongodb.com/manual/tutorial/geospatial-tutorial/
Let's assume we have a list of restaurants with location and much more information on top, like established_at, type (chinese, thai, italian, ...), priceOfACoke, numberOfWaiters, wheelchairAccess, ...
Problem:
Let's assume you want to query the collection of all restaurants in the US of A to return all Italian restaurants close to the city center of Pittsburgh that were established between 2 and 5 years ago, with wheelchair access and more than 50 waiters, where a coke is cheaper than $1.
This is a geo query with restrictive additional criteria and no distance limit; and since the "waiters > 50 and coke cheaper than $1" conditions filter out most or all of the results, this query seems to run through the whole collection and takes very long.
If run without "geoNear", assuming there is a combined index of the fields in question, this query is quite fast, even if there are only 10 results out of 1 million documents.
However, as soon as geoNear comes into play, the performance is terrible.
From what I understand, there can only be ONE geo index per collection and only ONE additional property in the geo index, so there is not much we can do to help MongoDB find the results with several criteria, as a traditional compound index does not seem to be used.
Also, when using aggregate, geo has to be the first filter...
Are there any hints or pointers to speed up queries like this?
If possible, I'd prefer not to get "Use ElasticSearch" or "Use multiple collections" responses - I still hope there is another way to help MongoDB reduce the number of documents to check BEFORE it starts doing the geoNear part.
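For reference, a sketch of the query in question on a recent MongoDB, with field names taken from the example above; note that $geoNear does accept a query option for the extra criteria, but as described, those filters are still evaluated document by document while walking outward, rather than via a compound index:

    // Requires a 2dsphere index on the location field.
    db.restaurants.aggregate([
      {
        // $geoNear must be the first stage of the pipeline.
        $geoNear: {
          near: { type: 'Point', coordinates: [-79.9959, 40.4406] }, // Pittsburgh
          distanceField: 'dist.calculated',
          spherical: true,
          // Extra criteria go in the query option, but are applied per
          // candidate document from near to far, not via a compound index.
          query: {
            type: 'italian',
            wheelchairAccess: true,
            numberOfWaiters: { $gt: 50 },
            priceOfACoke: { $lt: 1 }
          }
        }
      },
      { $limit: 10 }
    ]);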
For my thesis I'm currently investigating the speed (down to milliseconds) of Elasticsearch and MongoDB.
I've noticed that, compared to MongoDB, Elasticsearch is very consistent when it comes to the speed at which it returns data and the total number of items found. Whereas MongoDB takes longer to return data the more results are found, Elasticsearch's response time is almost always the same, regardless of the total number of items found.
My hypothesis is that in Elasticsearch, when using the size operator, the number of documents that are actually looked up and retrieved after the index search finishes is exactly the number set in the size operator. In MongoDB this is not the case: all documents that matched in the index are retrieved, and only the top X of them are eventually returned to the client, based on the cursor's batch_size and any limit() that is set.
I have no way, other than to spend hours looking through the source code, to figure out if this hypothesis is correct, or if something else is going on that I must have missed.
Thanks for taking the time to read this, any responses are appreciated and will help me further my research.
To make it a bit clearer how Elasticsearch actually retrieves results: it uses query-then-fetch.
So if you search for N results, the first phase queries all the shards involved, and each shard returns a list of its top N results containing only the score and the ID, no other information. In the second phase you fetch the global top N results by their ID. So you will retrieve more scores and IDs than you need, but you will only fetch the actual documents for the top N.
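A small sketch of what that means in practice, using the official Elasticsearch JavaScript client (v8-style API assumed; the index name is made up):

    const { Client } = require('@elastic/elasticsearch');
    const client = new Client({ node: 'http://localhost:9200' });

    async function topTen() {
      // With e.g. 5 shards and size: 10, the query phase returns up to
      // 5 * 10 = 50 (id, score) pairs to the coordinating node, which
      // merges them and then fetches only the global top 10 documents.
      const result = await client.search({
        index: 'records',
        size: 10,
        query: { match_all: {} }
      });
      return result.hits.hits;
    }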
I am developing an Android app that uses MongoDB to store user records in document format. I will have several records containing information about a GPS track, such as start longitude and latitude, finish longitude and latitude, total time, top speed and total distance.
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
I will have thousands of records that should be sorted by average speed, and the most reasonable approach seems to be to store the average speed in the document as well. However, that breaks away from traditional SQL/ACID thinking, where the speed would be calculated outside the DB.
The current document structure for the record collection is like this:
DocumentID (record)
DocumentID (user)
Start lng/lat
Finish lng/lat
Start time/date/gmt
End time/date/gmt
Total distance
Total time
Top speed
KMZ File
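In MongoDB document form, that structure might look roughly like this (all field names and values are illustrative):

    {
      _id: ObjectId("..."),               // record id
      userId: ObjectId("..."),            // reference to the user document
      startLngLat: [-79.9959, 40.4406],
      finishLngLat: [-79.9167, 40.4450],
      startTime: ISODate("2013-06-01T10:00:00Z"),
      endTime: ISODate("2013-06-01T11:30:00Z"),
      totalDistance: 42.5,                // km
      totalTime: 5400,                    // seconds
      topSpeed: 38.2,                     // km/h
      kmzFile: "tracks/abc123.kmz",
      avgSpeed: 28.3                      // km/h, only if you decide to store it
    }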
You should not talk about ACID properties once you have chosen a document-oriented DB such as Mongo. Now, you have answered the question yourself:
" the most reasonable seems to store the average speed in the document as well."
We programmers have a tendency to ignore the reasonable or simple approaches. We always question ourselves whenever the solution we find looks obvious or like common sense ;-).
Anyway, my suggestion is to store it, since you want the sort to be performed by the DB and not the application. This means that if any of the variables that influence the average speed change after the initial storage, you should remember to update the result field as well.
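A minimal sketch of keeping the derived field in sync, with collection and field names assumed from the structure above:

    // Recompute and store avgSpeed whenever distance or time change,
    // in a single atomic update.
    db.records.updateOne(
      { _id: recordId },
      {
        $set: {
          totalDistance: newDistance,
          totalTime: newTime,
          avgSpeed: newDistance / newTime
        }
      }
    );

    // Sorting thousands of records is then a plain (indexable) sort:
    db.records.find({ userId: someUserId }).sort({ avgSpeed: -1 });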
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
As @Panegea rightly said, MongoDB does not rely on ACID properties; it relies on your app being able to handle its distributed nature itself. That being said, calculating the average speed outside of the DB isn't all that bad, and using an atomic operator like $set will prevent oddities in the absence of full ACID transactions.
What you and @Panegea are talking about is a form of pre-aggregation of your needed value into a pre-defined field on the document. This is a widely recommended approach, not only in MongoDB but also in SQL (like the total shares on a Facebook wall post), where querying for the aggregation of a computed field would be tiresome and very difficult for the server, or just not wise.
Edit
You could achieve this with the aggregation framework as well: http://docs.mongodb.org/manual/applications/aggregation/ - you might want to take a look there, but pre-aggregation is by far the fastest method.
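For comparison, a sketch of computing the value at query time with the aggregation framework instead of storing it (slower than a stored field, but there is nothing to keep in sync; field names assumed as above):

    db.records.aggregate([
      {
        $project: {
          userId: 1,
          avgSpeed: { $divide: ['$totalDistance', '$totalTime'] }
        }
      },
      { $sort: { avgSpeed: -1 } }
    ]);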
I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The record documents consist of GPS tracks, such as start and end coordinates, total time, top speed and distance. The user document has a user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am confused about whether I should do a map-reduce and create an aggregate collection for users, or whether I should add these stats to the user document with some kind of cron-job-type solution. I have read many guides about map-reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object at the same time as you update the current coordinates, speed etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process; I simply mean calculate on update of the user object.
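A sketch of what "calculate on update" might look like when a track finishes (collection and field names hypothetical):

    // Fold the finished track's numbers into the user's running totals
    // with atomic operators, in the same round trip as the regular update.
    db.users.updateOne(
      { _id: userId },
      { $inc: { totalDistance: track.distance, totalTime: track.time } }
    );

    // Raise topSpeed only if the new track beat it (a conditional update,
    // which works without the newer $max update operator).
    db.users.updateOne(
      { _id: userId, topSpeed: { $lt: track.topSpeed } },
      { $set: { topSpeed: track.topSpeed } }
    );

    // Total average speed can then be derived as totalDistance / totalTime on read.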
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as, "tell me what the total distance travelled for all users within x region", then depending on the number of combinations you may not be able to cover all those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applying, then they'll need to be calculated from some snapshot of data. The two approaches here are:
the aggregation framework in 2.2
MapReduce
You would use MapReduce, say, if you have a LOT of historical data that you want to crunch and can pre-calculate the results for fast reading later. By my definition, that data isn't changing frequently, but even if it did, you can also use incremental MR to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand. It won't, of course, be as quick as pre-calculated values, but it's way quicker than MR when executed on demand. It can't cope with the high-volume result sets that you can handle with MR, but it's better suited to queries where you don't know the parameter values in advance.
By way of example, if you wanted to calculate the aggregate sums of user stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do it on the fly.
If, however, you wanted it by city, you could conceivably use MR there, because you could stick to a finite set of cities and just pre-calculate them all.
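For the on-demand side, a sketch of a per-city rollup with the 2.2 aggregation framework (field names hypothetical):

    db.records.aggregate([
      // Optional filter, e.g. only one region; $match can use an index
      // when it is the first stage.
      { $match: { city: 'Pittsburgh' } },
      {
        $group: {
          _id: '$city',
          totalDistance: { $sum: '$totalDistance' },
          totalTime: { $sum: '$totalTime' },
          topSpeed: { $max: '$topSpeed' }
        }
      }
    ]);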
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object, as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you from having to calculate on the fly.