Elasticsearch 'size:' vs MongoDB batch_size

For my thesis I'm currently investigating the speed (down to milliseconds) of Elasticsearch and MongoDB.
I've noticed that, compared to MongoDB, Elasticsearch is very consistent when it comes to the speed at which it returns data and the total number of items found. Whereas MongoDB takes longer to return data the more results are found, Elasticsearch's response time is almost always the same, regardless of the total number of requests sent.
My hypothesis is that in Elasticsearch, when using the size parameter, the number of documents that are actually looked up and retrieved after the index search has finished is exactly the number set in size. In MongoDB this is not the case: all documents that matched in the index are retrieved, and only the top X are eventually returned to the client, based on the cursor's batch_size and, eventually, the limit() that is set.
I have no way, other than spending hours looking through the source code, to figure out whether this hypothesis is correct, or whether something else is going on that I must have missed.
Thanks for taking the time to read this, any responses are appreciated and will help me further my research.

To make it a bit clearer how Elasticsearch actually retrieves results: It uses query then fetch.
So if you search for N results, the first phase queries all the shards involved, and each shard returns a list of its top N results containing only the score and the ID, no other information. In the second phase you fetch the top N global results by their ID. So you retrieve more scores and IDs than you need, but you only fetch the documents that are actually returned.
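For illustration, here is a minimal sketch of the two queries being compared, assuming the elasticsearch-py 8.x and pymongo clients and hypothetical index/collection and field names:

```python
from elasticsearch import Elasticsearch
from pymongo import MongoClient

es = Elasticsearch("http://localhost:9200")
mongo = MongoClient("mongodb://localhost:27017")

# Elasticsearch: the query phase ranks hits on every shard, but only the
# top `size` documents are fetched in the fetch phase.
es_hits = es.search(
    index="articles",                      # hypothetical index
    query={"match": {"body": "database"}},
    size=20,
)["hits"]["hits"]

# MongoDB: limit() caps what the client receives; batch_size() only controls
# how many documents come back per network round trip.
mongo_docs = list(
    mongo.testdb.articles                  # hypothetical collection
    .find({"category": "databases"})
    .limit(20)
    .batch_size(20)
)
```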

Related

Does Firestore count "Re-Reads" as document reads?

I am trying to use Geofirestore to query results within a given radius of a center location. As far as I know, GeoFire can't limit the number of results queried. My solution to this problem is to increment my radius in steps until I get X results. It seems that by doing this I will re-query results multiple times, which could be costly. If I query 10 results but want 20, would Firestore count this as 30+ reads?
The geoquery implementations I know of on top of Firestore all use geohashes, which means they combine latitude and longitude into a single value that can then be used to filter all documents within a certain geographical range. While it may technically be possible to limit the number of documents returned by adding a limit(...) to the query, the results will not be ordered by their distance from the center of the query. So you're likely to simply get results from one area of the query's range, instead of the closest documents.
Firestore charges you for all documents read from the server. Since it has a local cache, this may not be all the documents for your subsequent query, as the documents from the previous query may already be in the local cache.
That said: be aware that using geohashes for geoqueries leads to significant over-reading of documents. In my experimentation this can mean reading between 3x and 10x more documents than are within the queried range. I'd highly recommend reading up on (or watching about) geohashes and geoqueries on Firestore, for example this talk I gave a while ago on the topic: https://www.youtube.com/watch?v=mx1mMdHBi5Q.
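To make the mechanism concrete, here is a hedged sketch of a single geohash range query, assuming the pygeohash package and the google-cloud-firestore client, with hypothetical collection and field names; real libraries such as geofire query several neighbouring geohash ranges rather than just one:

```python
import pygeohash
from google.cloud import firestore

db = firestore.Client()

# Encode the query center into a geohash prefix; shorter prefixes cover
# larger cells.
center_hash = pygeohash.encode(37.7749, -122.4194, precision=5)

# Every document whose geohash starts with that prefix falls in roughly the
# same cell; '\uf8ff' is a high code point commonly used as an upper bound.
docs = (
    db.collection("places")                       # hypothetical collection
    .where("geohash", ">=", center_hash)
    .where("geohash", "<=", center_hash + "\uf8ff")
    .stream()
)

# Documents outside the true radius are discarded client-side after
# computing the real distance, which is why more reads are billed than
# results are kept.
```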

Timeseries storage in Mongodb

I have about 1,000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries. The timeseries are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
1. each day/sensor pair corresponds to one collection, thus adding 1,000 collections of about 100,000 documents each per day to my DB
2. each sensor corresponds to a collection; I have a fixed number of 1,000 collections that grow every day by about 100,000 documents each
Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a DB.
Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.
I am favoring 1 but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I now have one document containing 128 measurements plus startDate and nextDate fields. This reduces the number of documents, and thus the index size, but I am still not sure how to organize the collections.
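For reference, a bucketed document along those lines might look like the following sketch (hypothetical field names, using pymongo):

```python
from datetime import datetime
from pymongo import MongoClient

readings = MongoClient()["telemetry"]["readings"]   # hypothetical names

# One bucket per 128 measurements: one document (and one index entry)
# per 128 points instead of one per point.
bucket = {
    "sensorId": "s42",
    "startDate": datetime(2017, 3, 14, 8, 0, 0),
    "nextDate": datetime(2017, 3, 14, 8, 2, 8),   # first timestamp of the next bucket
    "measurements": [
        {"t": datetime(2017, 3, 14, 8, 0, 0), "v": 21.4},
        {"t": datetime(2017, 3, 14, 8, 0, 1), "v": 21.5},
        # ... up to 128 entries per document
    ],
}
readings.insert_one(bucket)
```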
When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up the reads. I currently have about 20,000 collections in my DB, and when I query the list of all collections it takes ages, which makes me think that having so many collections is not a good idea.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data that lives in many collections across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. With approach 2 and well-configured sharding, the mongos process does that hard work for you.
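As a rough sketch of what that looks like from Python (hypothetical database and collection names; the exact setup depends on your cluster), the shard key and routing might be configured like this:

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # connect to a mongos router

client.admin.command("enableSharding", "telemetry")

# Shard key matching the most common read pattern: one sensor on one day.
client.admin.command(
    "shardCollection",
    "telemetry.readings",
    key={"sensorId": 1, "startDate": 1},
)

# Queries that include both shard-key fields are routed by mongos to the
# relevant shard(s) only.
day_start = datetime(2017, 3, 14)
day_end = datetime(2017, 3, 15)
docs = client.telemetry.readings.find(
    {"sensorId": "s42", "startDate": {"$gte": day_start, "$lt": day_end}}
)
```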
Whilst MongoDB has no limit on collections, I tried an approach similar to 2 but moved away from it to a single collection for all sensor values, because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes. I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
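A minimal sketch of the "only store changes" idea mentioned above (plain Python, with a simple tolerance acting as a dead-band; field names are hypothetical):

```python
def compress_runs(samples, tolerance=0.0):
    """Keep only samples whose value differs from the last stored value by
    more than `tolerance`; when reading back, intermediate values can be
    assumed constant or interpolated."""
    stored = []
    last_value = None
    for timestamp, value in samples:
        if last_value is None or abs(value - last_value) > tolerance:
            stored.append((timestamp, value))
            last_value = value
    return stored

# Example: a thermostat set-point that rarely changes.
samples = [(0, 20.0), (1, 20.0), (2, 20.0), (3, 21.0), (4, 21.0)]
print(compress_runs(samples))   # [(0, 20.0), (3, 21.0)]
```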
When I query the data I'm only interested in getting data from a
given sensor on a given day. I don't do any cross-sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of our projects we were stuck with legacy MongoDB in a scenario similar to yours, except that the amount of new data per day was even lower.
We tried to change the data structure, spread the data over multiple MongoDB collections, change the replica set configuration, etc.
But we were still disappointed: as the data grew, performance degraded under the unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the improvement in read performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always run a Spark (map-reduce) job.
Moreover, thinking about the future, Cassandra provides straightforward scalability.
I believe that keeping a legacy technology is fine as long as it suits you well, but if not, it's more effective to change the technology stack.
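For comparison, a typical Cassandra layout for this access pattern partitions by (sensor, day) and clusters by timestamp. This is a sketch assuming the cassandra-driver package and hypothetical keyspace/table/column names:

```python
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("telemetry")   # hypothetical keyspace

# One partition per (sensor, day): exactly the "one sensor on one day" query,
# with rows ordered by timestamp inside the partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        day date,
        ts timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), ts)
    )
""")

rows = session.execute(
    "SELECT ts, value FROM readings WHERE sensor_id = %s AND day = %s",
    ("s42", date(2017, 3, 14)),
)
```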
If I understand right, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I think MongoDB is the wrong choice for this. In MongoDB there is no way to query documents across collections; you would have to write a complex mechanism to retrieve data. In my opinion, you should consider Elasticsearch, where you can create indices (collections) like sensor-data-s1-3-14-2017. There you can do a wildcard search across indices (e.g. sensor-data-s1-* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collection. While sharding, consider your query pattern so that you get optimal performance that does not degrade over time.
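A sketch of the wildcard-index idea in Python, assuming the elasticsearch-py 8.x client and a hypothetical per-sensor-per-day index naming scheme:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One index per sensor per day, e.g. sensor-data-s1-2017-03-14.
es.index(
    index="sensor-data-s1-2017-03-14",
    document={"ts": "2017-03-14T08:00:00", "value": 21.4},
)

# Wildcards fan the search out over all matching indices: every day for
# sensor s1 here, or "sensor-data-*" for all sensors.
hits = es.search(
    index="sensor-data-s1-*",
    query={"range": {"ts": {"gte": "2017-03-14", "lt": "2017-03-15"}}},
)["hits"]["hits"]
```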
Approach #1 is not ideal; the key to speeding things up is to divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place one signal in one collection and shard the signals across nodes to speed up reads. Multiple collections or signals can live on the same node.
How this will help
Signal processing usually works over a time span, e.g. processing a signal for 3 days. In that case you can read from 3 nodes in parallel for that signal and do parallel Apache Spark processing.
Cross-signal processing: most signal processing algorithms use the same period of 2 or more signals for analysis (e.g. cross-correlation). Since those 2 or more signals are fetched in parallel, that is also fast, and pre-processing of individual signals can be parallelized.

How to paginate query results in NoSQL databases when there are no unique fields included in the projection?

I've heard that using MongoDB's skip() to batch query results is a bad idea, because it can lead to the server becoming IO bound as it has to 'walk through' all the results up to the skip offset. I want to only return a maximum of 200 documents at a time, and then the user will be able to fetch the next 200 if they want (assuming they haven't limited it to less).
Initially I read up on paginating results and most things I read said the easiest way in MongoDB at least is to modify the query criteria to simulate skipping through.
For example, if a field called accNumber on the last document is 28022004, then the next query should include "accNumber > 28022004" in the criteria. But what if there are no unique fields included in the projection? What if the user wants to sort the records by a non-unique field?
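The approach described above, as a pymongo sketch (hypothetical collection and field names):

```python
from pymongo import MongoClient

accounts = MongoClient()["bank"]["accounts"]   # hypothetical names

PAGE_SIZE = 200
last_acc_number = 28022004   # accNumber of the last document on the previous page

# A range condition plus a matching sort simulates skipping without skip().
page = list(
    accounts.find({"accNumber": {"$gt": last_acc_number}})
    .sort("accNumber", 1)
    .limit(PAGE_SIZE)
)
```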

Paginating results in MongoDB without relying on .skip()

I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB themselves advise against using it, as it requires the cursor to iterate through all the records up to the one you're skipping to, which gets more and more expensive the further down the list you go.
I've seen a few tutorials that rely on the _id property of the results to be sequential, but this doesn't apply here - my database has tens of thousands of records, and each has a unique id, and the 4000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should spit out 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id; otherwise you can't guarantee the order for follow-up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
MongoDB ranged pagination has a good example as well.
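A sketch of that suggestion with pymongo (hypothetical collection name, _id used as the page cursor):

```python
from pymongo import MongoClient

results = MongoClient()["app"]["results"]   # hypothetical names

PAGE_SIZE = 20

def next_page(query, last_id=None):
    criteria = dict(query)
    if last_id is not None:
        criteria["_id"] = {"$gt": last_id}
    # Fetch one extra document so we know whether another page exists.
    docs = list(results.find(criteria).sort("_id", 1).limit(PAGE_SIZE + 1))
    return docs[:PAGE_SIZE], len(docs) > PAGE_SIZE

page, has_more = next_page({"category": "books"})
if has_more:
    next_docs, _ = next_page({"category": "books"}, last_id=page[-1]["_id"])
```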

Unable to fetch more than 10k records

I am developing an app where I have more than 10k records added to a class in Parse. Now I am trying to fetch those records using PFQuery (I am using the "skip" property), but I am unable to fetch records beyond 10k and I get the following error message:
"Skips larger than 10000 are not allowed"
This is a big problem for me since I need all the data.
Has anybody come across this problem? Please share your views.
Thanks
The problem is indeed due to the cost of Mongo's skip operations. You can formulate the query so that you don't need the skip operator. My preferred method is to order by objectId and add a condition that objectId > the last yielded objectId. This type of query can be indexed and remains fast, unlike skip pagination, which has an O(N^2) total cost in seeks across all pages.
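A sketch of that pattern against the Parse REST API, using the requests package; the server URL, keys, and class name are placeholders:

```python
import json
import requests

PARSE_URL = "https://YOUR_PARSE_SERVER/parse/classes/Record"   # hypothetical class
HEADERS = {
    "X-Parse-Application-Id": "YOUR_APP_ID",
    "X-Parse-REST-API-Key": "YOUR_REST_KEY",
}

def fetch_all():
    """Walk the whole class in objectId order without ever using skip."""
    last_id = None
    while True:
        where = {"objectId": {"$gt": last_id}} if last_id else {}
        resp = requests.get(
            PARSE_URL,
            headers=HEADERS,
            params={"where": json.dumps(where), "order": "objectId", "limit": 1000},
        )
        batch = resp.json()["results"]
        if not batch:
            break
        yield from batch
        last_id = batch[-1]["objectId"]
```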
My assumption would be that it's based on performance issues with MongoDB's skip implementation.
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to reach the offset (skip) position before it begins to return results. As the offset increases, cursor.skip() becomes slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.