Pagination issue with real-time data in Druid Scan query

I have gone through the following Druid Scan query documentation:
https://druid.apache.org/docs/0.20.0/querying/scan-query.html . I didn't understand the part where it says: "note that if the underlying datasource is modified in between page fetches in ways that affect overall query results, then the different pages will not necessarily align with each other."
In my case, data is added to Druid in real time. Suppose I query for the last hour of data (4-5 PM): earlier there were 40 records matching that query, but 10 new records arrive while the query is being paged through. My assumption is that the new records should be appended after the 40th record and should not affect the currently running paging offset. Please help me understand how real-time ingestion of data can affect Druid pagination, and what a possible fix could be.
offset : Together, "limit" and "offset" can be used to implement
pagination.
However, note that if the underlying datasource is modified in between page fetches in ways that affect overall query results, then
the different pages will not necessarily align with each other.

The docs describe offset/limit as application-side values. From the database's perspective, it runs the whole query again for every request and just returns the rows between offset and offset + limit.
So, if ordered by __time desc, new rows will appear at the top of the results and therefore shift the contents of each page.
If ordered by __time asc, and no rows with out-of-order timestamps are ingested between calls, the pages stay stable and new rows appear at the end.
Also remember that it is good practice to limit the overall timeframe you are querying.
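As a minimal sketch of the asc-ordered approach (Python; the broker URL, interval, and the datasource name "events" are placeholders you would replace), each page re-issues the same Scan query with a larger offset:

import requests

BROKER = "http://localhost:8888/druid/v2"   # placeholder broker URL
PAGE_SIZE = 20

def fetch_page(offset):
    query = {
        "queryType": "scan",
        "dataSource": "events",                        # hypothetical datasource
        "intervals": ["2021-03-01T16:00/2021-03-01T17:00"],
        "resultFormat": "compactedList",
        "order": "ascending",                          # order by __time asc
        "limit": PAGE_SIZE,
        "offset": offset,
        "columns": [],                                 # empty list = all columns
    }
    resp = requests.post(BROKER, json=query)
    resp.raise_for_status()
    return resp.json()

# Each call re-runs the full query on the broker and returns only rows
# [offset, offset + limit); with ascending time order, late-arriving rows
# land after the pages you have already fetched.
page_1 = fetch_page(0)
page_2 = fetch_page(PAGE_SIZE)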

Related

Elasticsearch 'size:' vs MongoDB batch_size

For my thesis I'm currently investigating the speed (down to milliseconds) of Elasticsearch and MongoDB.
I've noticed that, compared to MongoDB, Elasticsearch is very consistent in the speed at which it returns data and the total items found. Whereas MongoDB takes longer to return data the more results are found, Elasticsearch's response time is almost always the same, regardless of the total number of requests sent.
My hypothesis is that in Elasticsearch, when using the size operator, the number of documents actually looked up and retrieved after the search in the indexes has finished is exactly the number set in size. In MongoDB this is not the case: all documents that matched in the index are retrieved, and only the top X are eventually returned to the client, based on the cursor's batch_size and any limit() that is set.
I have no way, other than spending hours looking through the source code, to figure out whether this hypothesis is correct, or whether something else is going on that I must have missed.
Thanks for taking the time to read this, any responses are appreciated and will help me further my research.
To make it a bit clearer how Elasticsearch actually retrieves results: It uses query then fetch.
So if you search for N results, the first phase queries all the shards involved, and each returns a list of its N results containing only the score and the ID, no other information. In the second phase you fetch the global top N results by their IDs. So you retrieve more scores and IDs than you need, but you only fetch the actual documents for the top N.
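For illustration, a tiny sketch of the kind of request being measured (Python requests against a hypothetical "articles" index on localhost:9200; both names are assumptions), where only size documents are fetched in the second phase:

import requests

resp = requests.get(
    "http://localhost:9200/articles/_search",      # hypothetical index name
    json={
        "query": {"match": {"title": "pagination"}},
        "size": 20,   # query phase: (id, score) per shard; fetch phase: top 20 docs only
    },
)
hits = resp.json()["hits"]
print(hits["total"], [h["_id"] for h in hits["hits"]])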

Postgresql: queries with ordering by partitioning key

I have created in PostgreSQL a table partitioned (see here) by the received column. Let's use a toy example:
CREATE TABLE measurement (
    received timestamp without time zone PRIMARY KEY,
    city_id int not null,
    peaktemp int,
    unitsales int
);
I have created one partition for each month for several years (measurement_y2012m01 ... measurement_y2016m03).
I have noticed that PostgreSQL is not aware of the order of the partitions, so for a query like the one below:
select * from measurement where ... order by received desc limit 1000;
PostgreSQL performs an index scan over all partitions, even though it is very likely that the first 1000 results are located in the latest partition (or the latest two or three).
Do you have an idea how to take advantage of the partitions for such a query? I want to emphasize that the WHERE clause may vary, so I don't want to hardcode it.
The first idea is to iterate over the partitions in the proper order until 1000 records have been fetched or all partitions have been visited. But how can this be implemented in a flexible way? I want to avoid implementing that iteration in the application, but I don't mind if the app has to call a stored procedure.
Thanks in advance for your help!
Grzegorz
If you really don't know how many partitions you have to scan to get your desired 1000 rows, you could build up your result set in a stored procedure and fetch results by iterating over the partitions until your limit condition is satisfied.
Starting with the most recent partition would be a wise thing to do.
select * from measurement_y2016m03 where ... order by received desc limit 1000;
You could store the intermediate result set in a record, count it, and change the limit dynamically for the next scanned partition. For example, if you fetch 870 rows from the first partition, you would build a second query with limit 130, count again after that, and move on to the next partition if the 1000-row condition still isn't satisfied.
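A rough sketch of that loop from the application side (Python/psycopg2 purely for illustration; the same logic can be moved into a PL/pgSQL function), assuming the measurement_yYYYYmMM partition names from the question:

import psycopg2

PARTITIONS = ["measurement_y2016m03", "measurement_y2016m02",
              "measurement_y2016m01"]              # newest first

def fetch_latest(conn, want=1000):
    rows = []
    for part in PARTITIONS:
        remaining = want - len(rows)
        if remaining <= 0:
            break
        with conn.cursor() as cur:
            # add the caller's WHERE filters here as needed
            cur.execute(
                "SELECT * FROM " + part +
                " ORDER BY received DESC LIMIT %s",
                (remaining,))
            rows.extend(cur.fetchall())
    return rows

conn = psycopg2.connect("dbname=mydb")             # placeholder connection string
latest_1000 = fetch_latest(conn)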
Why doesn't Postgres know when to stop during planning?
The planner is unaware of how many partitions are needed to satisfy your LIMIT clause. Thus, it has to order the entire set by appending the results from each partition and then apply the limit (unless it already satisfies the condition at run time). The only way to do this in a single SQL statement would be to restrict the lookup to only a few partitions, but that may not be an option for you. Also, increasing the work_mem setting may speed things up if you're hitting disk during the lookups.
Key note
Also, remember that when you set up your partitioning, you should list the most frequently accessed partitions first. This speeds up your inserts, because Postgres checks the conditions one by one and stops at the first one that is satisfied.
Instead of iterating over the partitions, you could guess at the range of received that will satisfy your query and expand it until you get the desired number of rows (see the sketch below). Adding the range to the WHERE clause will exclude the unnecessary partitions (assuming constraint exclusion is enabled).
Edit
Correct, that's what I meant (could've phrased it better).
Simplicity seems like a pretty reasonable advantage. I don't see the performance being different, either way. This might actually be a little more efficient if you guess reasonably close to the desired range most of the time, but probably won't make a significant difference.
It's also a little more flexible, since you're not relying on the particular partitioning scheme in your query code.
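A sketch of that range-guessing variant (same Python/psycopg2 assumptions as above, with placeholder window sizes): start with a narrow received window so constraint exclusion prunes most partitions, and double it until enough rows come back:

from datetime import datetime, timedelta

def fetch_latest_by_range(conn, want=1000, start_days=7, max_days=3650):
    days = start_days
    while True:
        since = datetime.utcnow() - timedelta(days=days)
        with conn.cursor() as cur:
            cur.execute(
                "SELECT * FROM measurement WHERE received >= %s"
                " ORDER BY received DESC LIMIT %s",
                (since, want))
            rows = cur.fetchall()
        if len(rows) >= want or days >= max_days:
            return rows
        days *= 2          # widen the guess and retry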

Paginating results in MongoDB without relying on .skip()

I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB itself advises against using it, because it requires the cursor to iterate through all the records up to the one you're skipping to, which gets more and more expensive the further down the list you go.
I've seen a few tutorials that rely on the _id property of the results being sequential, but that doesn't apply here: my database has tens of thousands of records, each with a unique id, and the 4,000 results that match the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should return 20 records to my app. When the user scrolls to the bottom of the page, I could get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result, and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU-intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id, otherwise you can't guarantee order for follow up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
MongoDB ranged pagination has a good example as well.
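A minimal sketch of that approach with pymongo (the connection string, the "items" collection, and the "category" filter are hypothetical):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
coll = client.mydb.items                           # hypothetical collection
PAGE_SIZE = 20

def fetch_page(query, last_id=None):
    q = dict(query)
    if last_id is not None:
        q["_id"] = {"$gt": last_id}                # continue after the previous page
    docs = list(coll.find(q).sort("_id", ASCENDING).limit(PAGE_SIZE + 1))
    has_more = len(docs) > PAGE_SIZE               # the 21st doc only signals a next page
    return docs[:PAGE_SIZE], has_more

page, has_more = fetch_page({"category": "books"})                        # placeholder filter
if has_more:
    page2, has_more = fetch_page({"category": "books"}, last_id=page[-1]["_id"])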

Unable to fetch more than 10k records

I am developing an app where I have more than 10k records added to a class in Parse. Now I am trying to fetch those records using PFQuery (using the "skip" property), but I am unable to fetch records beyond 10k and I get the following error message:
"Skips larger than 10000 are not allowed"
This is a big problem for me since I need all the data.
Has anybody come across such a problem? Please share your views.
Thanks
The problem is indeed due to the cost of Mongo's skip operation. You can formulate a query such that you don't need the skip operator at all. My preferred method is to order by objectId and add the condition objectId > (last yielded objectId). A query like this can use an index and stays fast, unlike skip pagination, which costs O(N²) in seeks when you page through the whole collection.
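A sketch of that objectId walk over the Parse REST API (Python requests; the endpoint, the class name "Record", and the keys are all placeholders to adapt):

import json
import requests

HEADERS = {
    "X-Parse-Application-Id": "YOUR_APP_ID",       # placeholder
    "X-Parse-REST-API-Key": "YOUR_REST_KEY",       # placeholder
}
URL = "https://api.parse.com/1/classes/Record"     # placeholder class/endpoint

def fetch_all():
    rows, last_id = [], None
    while True:
        where = {"objectId": {"$gt": last_id}} if last_id else {}
        resp = requests.get(URL, headers=HEADERS, params={
            "where": json.dumps(where),
            "order": "objectId",                   # stable, indexable total order
            "limit": 1000,                         # Parse's per-request maximum
        })
        batch = resp.json()["results"]
        if not batch:
            return rows
        rows.extend(batch)
        last_id = batch[-1]["objectId"]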
My assumption would be that it's based on performance issues with MongoDB's skip implementation.
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to reach the offset or skip position before it begins to return results. As the offset increases, cursor.skip() will become slower and more CPU-intensive. With larger collections, cursor.skip() may become IO bound.

select records for each page in dataobjects.net

I have lots of records in the database, and I have a control that pages through those records. How do I select the records for each page? For example, I need to select records from the 51st to the 100th. And I can't use LINQ expressions. I am using DataObjects.Net 3.9.
So I start with
Query q = new Query("select SomeClass objects");
Use this query:
Query q = new Query("select top 100 SomeClass objects");
As far as I remember, there is no way to specify a .Skip-like condition in DO39, so you should do this manually (e.g. by applying .Skip to the enumerable you've got).
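The same idea in a language-agnostic sketch (Python, purely illustrative; fetch_top(n) is a hypothetical stand-in for running the "select top n" query):

def page_from_top(fetch_top, offset, page_size):
    # fetch_top(n) is assumed to return the first n rows of the ordered query
    # (the equivalent of "select top n SomeClass objects"); the offset is then
    # dropped client-side, since DO39 offers no server-side skip.
    rows = fetch_top(offset + page_size)
    return rows[offset:offset + page_size]

# records 51..100: fetch the top 100 and drop the first 50
# page = page_from_top(run_query, 50, 50)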
There is an obvious performance impact in this case, but it isn't essential in terms of computational complexity. The only effect is that more rows will be sent by SQL Server to the client; all the other work it must do remains the same.
An example illustrating this:
if you ask Google to show you the 1000th page of results, it will still find all the documents related to your query, compute a match rank for each of them, and sort them to get at least the first 1000 pages with the best match ranks; only after all this work can it give you the 1000th page.
So even if there are 1,000,000,000,000 documents, the computational complexity of sending 10K rows to the client is tiny in comparison with all the other work done.
Also note that the whole idea of paging is to show a tiny fraction of the whole data set. So if your users need to paginate to, e.g., the 1000th page, there is something wrong with the design. There are just two cases:
User must get a tiny fraction of data (i.e. perform some search)
User must get all the data (e.g. to make a backup)
There are no intermediate cases.