Unable to fetch more than 10k records - iPhone

I am developing an app where I have more than 10k records added to a class in Parse. I am trying to fetch those records using PFQuery (using the "skip" property), but I cannot fetch records beyond 10k and I get the following error message:
"Skips larger than 10000 are not allowed"
This is a big problem for me since I need all the data.
Has anybody come across such a problem? Please share your views.
Thanks

The problem is indeed due to the cost of Mongo skip operations. You can formulate the query so that you don't need the skip operator at all. My preferred method is to order by objectId and then add a condition that objectId is greater than the last objectId already yielded. This type of query can use an index and stays fast, unlike skip pagination, whose total cost across all pages grows as O(N^2) in seeks.
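To make that concrete, here is a minimal sketch of the same keyset idea in Python with pymongo (the connection string, database, and collection names are assumptions for illustration); with the Parse iOS SDK the equivalent would be an ascending order on objectId combined with a greaterThan constraint on the last objectId you received.

# Keyset ("range-based") pagination: remember the last _id seen and ask only
# for documents after it, so the index walk starts where the last page ended.
from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017")["mydb"]["GameScore"]  # assumed names

PAGE_SIZE = 1000
last_id = None

def handle(doc):
    pass  # placeholder for whatever you do with each record

while True:
    filt = {} if last_id is None else {"_id": {"$gt": last_id}}
    batch = list(coll.find(filt).sort("_id", ASCENDING).limit(PAGE_SIZE))
    if not batch:
        break
    for doc in batch:
        handle(doc)
    last_id = batch[-1]["_id"]

Each iteration is one indexed seek to last_id followed by a scan of a single page, which is why the total work stays roughly linear instead of quadratic.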

My assumption would be that it's based on performance issues with MongoDB's skip implementation.
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get to the offset or skip position before it begins returning results. As the offset (e.g. the page number) increases, cursor.skip() becomes slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
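For contrast, this is the skip/limit pattern that warning is about, sketched with pymongo (names assumed); every page makes the server walk past all previously returned documents again before it can start producing results.

from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017")["mydb"]["GameScore"]  # assumed names
PAGE_SIZE = 1000

def page(page_number):
    # skip() still touches every skipped entry, so page k costs O(k * PAGE_SIZE)
    # on the server, and fetching all pages adds up to roughly O(N^2 / PAGE_SIZE).
    return list(
        coll.find()
            .sort("_id", ASCENDING)
            .skip(page_number * PAGE_SIZE)
            .limit(PAGE_SIZE)
    )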

Related

Does Firestore's 500 writes/second limit apply to updates to non-sequential fields in documents with indexed sequential fields?

Let's say we have the following database structure:
[collection]
<documentId>
- indexedSequentialField
- indexedNonSequentialField
- nonIndexedSequentialField
Firestore's 500 writes/second limit will apply to the creation of new documents if indexedSequentialField is present at creation time. Similarly, the limit should also apply to any updates that change indexedSequentialField, because that involves rewriting the index entry. This part is clear.
My understanding is that this limit comes from writing the index entries, not from writes to the collection itself.
If that's true, would it be correct to say that making more than 500 updates per second is fine as long as those updates change only indexedNonSequentialField or nonIndexedSequentialField and leave indexedSequentialField untouched, even if indexedSequentialField has been present in the documents and the index entries since creation?
For the sake of this question, please assume that there are no composite indices present that end up being sequential in nature.
Firestore's hotspots on writes occur when it needs to write data from multiple write operations close to each other on disk, because it has to synchronize those writes across multiple data centers while also isolating each write from the others (since its write operations are immediately consistent).
If your collection or collection group has an index with sequential fields, that can indeed trigger a hotspot. Note that the limit of 500 writes per second is a soft limit, and you may well be able to write much more than that before hitting a hotspot. Nowadays I recommend using the Key Visualizer to analyze the performance of your writes.
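To illustrate the distinction in code, here is a hedged Python sketch with the google-cloud-firestore client, reusing the field names from the structure above (the project, collection name, and document id are assumptions): the first update touches only a non-sequential field, while the second rewrites the sequential index entry that the soft limit is really about.

from google.cloud import firestore
import time

db = firestore.Client()  # assumes default credentials and project
doc_ref = db.collection("collection").document("someDocumentId")  # assumed id

# Touches only a non-sequential indexed field: no tightly-packed index range
# is rewritten, so these updates should not by themselves create a hotspot.
doc_ref.update({"indexedNonSequentialField": "some-random-value"})

# Touches the sequential indexed field: the index entry being rewritten sits
# next to its neighbours on disk, which is what the ~500 writes/second soft
# limit guards against.
doc_ref.update({"indexedSequentialField": time.time()})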

Elasticsearch 'size:' vs MongoDB batch_size

For my thesis I'm currently investigating the speed (down to milliseconds) of Elasticsearch and MongoDB.
I've noticed that, compared to MongoDB, Elasticsearch is very consistent when it comes to the speed at which it returns data and the total items found. Whereas MongoDB takes longer to return data the more results are found, Elasticsearch's response time is almost always the same, regardless of the total number of requests sent.
My hypothesis is that in Elasticsearch, when using the size operator, the number of documents that are actually looked up and retrieved once the search over the indexes has finished is exactly the amount set by size. In MongoDB this is not the case: all documents that matched in the index are retrieved, and only the top X are eventually returned to the client, based on the cursor's batch_size and any limit() that is set.
I have no way, short of spending hours digging through the source code, to figure out whether this hypothesis is correct, or whether something else is going on that I have missed.
Thanks for taking the time to read this, any responses are appreciated and will help me further my research.
To make it a bit clearer how Elasticsearch actually retrieves results: It uses query then fetch.
So if you search for N results, the first phase queries all the shards involved and each returns a list of its top N results containing only the score and the ID, no other information. In the second phase you fetch the top N global results by their ID. So you will retrieve more scores and IDs than you need, but you will only fetch the actual documents for those top N.
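A hedged sketch of the two calls being compared, in Python (index, collection, and connection details are assumptions; the Elasticsearch call uses the 8.x-style keyword arguments):

# Elasticsearch: size=10 means the query phase collects only (score, id) pairs
# per shard and the fetch phase retrieves exactly 10 documents.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(index="my-index", query={"match_all": {}}, size=10)
hits = resp["hits"]["hits"]

# MongoDB: batch_size only sets how many documents travel per network round
# trip; it is limit() that bounds how many matching documents are returned.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["mycoll"]
docs = list(coll.find({}).limit(10).batch_size(10))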

MongoDB: what is faster: single find() query or many find_one()?

I have the following problem related to MongoDB database design. Here is my situation:
- I have a collection with about 50k documents (15 kB each),
- every document has a dictionary storing data samples,
- my query always gets all the data from a document,
- every query uses an index,
- the collection has only one index (based on a single datetime field),
- in most cases I need to get data from many documents (typically 25 < N < 100),
- it is easier for me to perform many SELECT queries than a single one,
- there are a lot of updates in my database, but far fewer than SELECTs,
- I use the WiredTiger engine (the newest version of MongoDB),
- the server instance and the web application are on the same machine.
I have two possibilities for making a SELECT query:
- perform a single query retrieving all the documents I am interested in,
- perform N queries, each retrieving a single document, where typically 25 < N < 100 (and what about a different scenario where 100 < N < 1k, or 1k < N < 10k?).
So the question is whether there is any additional overhead when I perform many small queries instead of a single one. In relational databases making many queries is very bad practice, but is it in NoSQL? I am asking about general practice: should I avoid that many queries?
In the documentation I read that it is not the number of queries that matters but the number of document searches they perform. Is that true?
Thanks for help ;)
There is a similar question to the one you asked: Is it ok to query mongodb multiple times.
IMO, for your use case, i.e. 25 < N < 100, you should definitely go with batching.
In the case of single queries:
- looping in a single thread will not suffice, so you'll have to make parallel requests, which adds its own overhead,
- every request incurs TCP/IP overhead,
- there is a certain amount of setup and teardown for each query, creating and exhausting cursors, which adds unnecessary overhead.
As explained in the answer linked above, there appears to be a sweet spot for how many values to batch up versus the number of round trips, and that depends on your document type as well.
In broader terms, anything with 10 < N < 1000 should go with batching (with the remaining records forming part of other batches); querying a single document at a time would definitely create unnecessary overhead.
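A quick way to see the round-trip overhead yourself is a sketch like the one below (pymongo, with database and collection names assumed); the batched $in query makes one round trip where the loop of find_one() calls makes N.

import time
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["samples"]  # assumed names
ids = [d["_id"] for d in coll.find({}, {"_id": 1}).limit(100)]

start = time.perf_counter()
batched = list(coll.find({"_id": {"$in": ids}}))        # 1 round trip
print("single $in query  :", time.perf_counter() - start)

start = time.perf_counter()
one_by_one = [coll.find_one({"_id": i}) for i in ids]   # N round trips
print("N find_one() calls:", time.perf_counter() - start)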
The problem when you perform many small queries instead of one query is network overhead, that is, the round-trip network latency.
For a single request in a batch process it may not be much, but if you make many requests like these, or use this technique on the frontend, it will degrade performance.
Also, you may need to preprocess the data manually, for example sorting or aggregating it.

cursor.skip() is expensive, is there alternate approach

While going through the MongoDB documentation for cursor.skip() I read that this is an expensive approach, and I completely understand why: the cursor has to walk from the start to execute the skip. In the paragraph below that, they wrote:
Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.
I don't understand this part: how does that overcome the expensiveness of the skip() operation?
Thanks
When using cursor.skip(N) the server finds all the matching data and then skips over the first N matching documents.
When using range-based pagination (i.e. with a date range) the server will only find and return the matching documents. If the property you base your pagination on is indexed, the index will also be used.
The difference is the amount of data the server has to read in the two situations.
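As a sketch of what that looks like in practice (pymongo; collection and field names are assumptions), range-based pagination on an indexed datetime field replaces skip() with a filter on the last value already delivered; if the field is not unique you would pair it with _id to avoid skipping ties.

from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017")["mydb"]["events"]  # assumed names
coll.create_index([("created_at", ASCENDING)])

PAGE_SIZE = 50

def next_page(after=None):
    # The index seek jumps straight to "after"; no documents are walked just
    # to be thrown away, unlike skip().
    filt = {} if after is None else {"created_at": {"$gt": after}}
    return list(coll.find(filt).sort("created_at", ASCENDING).limit(PAGE_SIZE))

first = next_page()
if first:
    second = next_page(after=first[-1]["created_at"])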

Morphia: is there a difference between fetch and asList performance-wise

We are using Morphia 0.99 and Java driver 2.7.3. I would like to know whether there is any difference between fetching records one by one using fetch and retrieving the results with asList (assume there is enough memory to retrieve all records through asList).
We iterate over a large collection, and while using fetch I sometimes encounter a "cursor not found" exception on the server during the fetch operation, so I need to execute another query to continue. What could be the reason for this?
1) fetch a record
2) do some calculation on it
3) save it back to the database
4) fetch another record and repeat the steps until there are no more records.
So which one would be faster: fetching records one by one, or retrieving results in bulk using asList? Or is there no difference between them in the Morphia implementation?
Thanks for the answers
As far as I understand the implementation, fetch() streams results from the DB while asList() will load all query results into memory. So they will both get every object that matches the query, but asList() will load them all into memory while fetch() leaves it up to you.
For your use case, neither should be faster in terms of CPU, but fetch() should use less memory and not blow up in case you have a lot of DB records.
Judging from the source code, asList() uses fetch() and aggregates the results for you, so I can't see much difference between the two.
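Morphia aside, the underlying distinction is just streaming a cursor versus materializing it into a list; here is a rough pymongo analogy (names assumed), which also hints at why a long-running fetch can hit "cursor not found" if the server-side idle timeout expires between batches.

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["bigcollection"]  # assumed names

# fetch()-style: stream documents in server-side batches; memory stays bounded,
# but a cursor left idle too long between batches can expire on the server.
# no_cursor_timeout=True opts out of the default idle timeout (close it yourself).
cursor = coll.find({}, no_cursor_timeout=True)
try:
    for doc in cursor:
        pass  # process one document at a time
finally:
    cursor.close()

# asList()-style: pull everything into memory up front; nothing can expire
# mid-iteration, but the whole result set must fit in RAM.
all_docs = list(coll.find({}))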
One very useful difference would be if the following two conditions applied to your scenario:
You were using offset and limit in the query.
You were changing values on the object such that it would no longer be returned in the query.
So say you were doing a query on awesome=true, and you were using offset and limit to do multiple queries, returning 100 records at a time to make sure you didn't use up too much memory. If, in your iteration loop, you set awesome=false on an object and saved it, it would cause you to miss updating some records.
In a case like this, fetch() would be a better approach.
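Here is a small pymongo sketch (local test server, collection name, and document counts are all assumptions) that reproduces the effect described above: with 300 matching documents, paging by offset while flipping the filtered field leaves 100 of them untouched.

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["test"]["awesome_demo"]  # assumed names
coll.drop()
coll.insert_many([{"n": i, "awesome": True} for i in range(300)])

PAGE = 100
offset = 0
processed = 0
while True:
    page = list(coll.find({"awesome": True}).sort("n", 1).skip(offset).limit(PAGE))
    if not page:
        break
    for doc in page:
        # Changing the field the query filters on shrinks the result set,
        # so the next skip() jumps over documents that were never seen.
        coll.update_one({"_id": doc["_id"]}, {"$set": {"awesome": False}})
        processed += 1
    offset += PAGE

print("processed:", processed)                                          # 200, not 300
print("still awesome=true:", coll.count_documents({"awesome": True}))   # 100 missed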