How can I achieve a paginated join, with the first page returned quickly? - postgresql

I am looking to join multiple big tables in the OLAP layer to power the UI. Since the tables are really large, response for each join query takes too long. I want to get results in less than 3 seconds. But the catch is I don't want the entire joined data at once because I am only displaying a small subset of the result in the UI at any particular point. Only user interaction would require me to show the next subset of the result.
I am looking for a strategy to build a system where I can run the same join queries, but initially only a small subset is joined and used to power the UI. Meanwhile, the remaining subsets are joined in the background and pulled into the UI when required. Is this the right way to approach the problem of performing really big joins? If so, how can I design such a system?

You can use a WITH HOLD cursor:
START TRANSACTION;
DECLARE c CURSOR WITH HOLD FOR SELECT /* big query */;
FETCH 50 FROM c;
COMMIT;
The COMMIT will take a long time, as it materializes the whole result set, but the FETCH 50 can be reasonably fast (or not, depending on the query).
You can then continue fetching from the cursor. Don't forget to close the cursor when you are done:
CLOSE c;

Related

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and rename columns so they are consistent, and then to use aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and the record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is to have a very complicated looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the easier it is for you to debug later and for any future readers/users of your code.
Unless absolutely necessary, I would try and avoid creating 7 copies of the same table. If you are dealing with large tables memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages e.g.:
res: select from trades;
res:aj[`ccy`arrivalTime;
  res;
  select ccy:currency, arrivalTime:dateTime, fxRate from usdRates
  ];
res:update someFunc fxRate from res;
Sean beat me to it, but an aj on the next time after an event (a reverse aj) is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol
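To illustrate the bin versus binr distinction mentioned above, here is a minimal sketch in Python (not q) of a backward and a forward as-of lookup. It assumes the quote times for a single symbol are sorted ascending; the function names and sample data are hypothetical. bisect_right minus one plays the role of bin (last quote at or before the trade), while bisect_left plays the role of binr (first quote at or after the trade).

```python
from bisect import bisect_left, bisect_right

def asof_prev(times, values, t):
    """Value of the last entry with time <= t, or None (the usual aj direction)."""
    i = bisect_right(times, t) - 1
    return values[i] if i >= 0 else None

def asof_next(times, values, t):
    """Value of the first entry with time >= t, or None (the reverse aj direction)."""
    i = bisect_left(times, t)
    return values[i] if i < len(times) else None

# Hypothetical sample data: quote times (in seconds) and prices for one symbol.
quote_times = [10, 20, 30]
quote_prices = [100.0, 101.5, 99.0]

# A trade arriving at t=25 sees 101.5 looking backward and 99.0 looking forward.
prev_price = asof_prev(quote_times, quote_prices, 25)
next_price = asof_next(quote_times, quote_prices, 25)
```

The forward lookup returns None when no later quote exists, which mirrors the null you would expect from a join with no match.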

Pagination Options in KDB

I am looking to support a use case that returns kdb datasets back to users. The users connect to kdb using the Java API, run the query synchronously and retrieve the results.
However, issues are coming up when returning larger datasets and therefore I would like to return the data from kdb to the java process in pages/slices. Unfortunately users need to be able to run queries that return millions of rows and it would be easier to handle if they were passed back in slices of say 100,000 rows (Cassandra and other DBs do this sort of thing).
The potential approaches I have come up with are as follows:
Run the "where" part of the query on the database and return only the indices/date partitions (if applicable) of the data required. The java process would then use these indices to select the data required slice by slice. This approach would control memory usage on the kdb side as it would not have to load all HDB data required at once. However, overall this would increase the run time of the query as data would have to be searched/queried multiple times. This could work well for simple selects but complicated queries may need to go through an "onboarding" process which I want to avoid.
Store results of the query in a global variable in kdb which the java process can then query slice by slice. This simpler method could support any query but could potentially hit limits on the kdb side (memory/timeout) if too large a dataset is queried.
Other points to consider:
It should support users running queries on any type of process - gateway, hdb, rdb etc
It should support more than just simple selects e.g.
((1!select sym, price from trade where sym=`AAA) uj
1!select sym,price from order where sym=`AAA)
lj select avgBid:avg bid by sym from quote where sym=`AAA
The paging functionality should be hidden from the end user
Does anyone have any views on whether there are any options available other than the ones listed above? Essentially I am looking for a select[m n] type approach that supports any query.

OLAP Approach for Backend redshift connection

We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we are getting many change requests to add a new dimension (which usually requires a join with a new table) or to get data for some more specific filter. To accommodate these requests we have to change our queries every time for each of our subprocesses.
I went through OLAP a little bit. I just want to know if it would be better in our use case, or whether there is a better way to design our system to handle such ad hoc requests without requiring a developer to change things every time.
Thanks for the suggestions in advance.
It would work, or rather it should work. Efficiency is the key here. There are a few things you need to monitor strictly to make sure your system (Redshift + Tableau) remains up and running.
Prefer Extract over Live Connection (in Tableau)
A live connection would query the system every time someone changes the filter or refreshes the report. Since you said the dataset is large and the queries are complex, prefer creating an extract. This'll make sure data is available upfront whenever someone accesses your dashboard. Do not forget to schedule the extract refresh, otherwise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query a large dataset. Make sure you write efficient queries. It's always better to first get a small dataset and join them rather than bringing everything in the memory and then joining / using where clause to filter the result.
A query like (select foo from table1 where ...) a left join (select bar from table2 where ...) b might be the key at times, where you first take out only the small, relevant data and then join.
Do not query infinite data.
Since this is analytical and not transactional data, have an upper bound on the data that Tableau will refresh. Historical data has an importance, but not from the time of inception of your product. Analysing the data for the past 3, 6 or 9 months can be the key rather than querying the universal dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table which has just one (or two) entries per user per day and introduce a column, count, which tells you the number of times the event has been triggered. By doing this, you'll be querying a significantly smaller dataset that is logically equivalent to what you were querying earlier.
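A minimal Python sketch of the roll-up described above, assuming raw events arrive as (user_id, day, event) tuples. The field names and sample rows are hypothetical, and in practice the same roll-up would more likely be a GROUP BY in Redshift that populates the aggregate table.

```python
from collections import Counter

def daily_rollup(events):
    """Collapse raw events into one row per (user, day) with an event count."""
    counts = Counter((user_id, day) for user_id, day, _event in events)
    return [
        {"user_id": user_id, "day": day, "count": n}
        for (user_id, day), n in sorted(counts.items())
    ]

# Hypothetical raw events: several rows per user per day.
raw_events = [
    ("u1", "2024-01-01", "click"),
    ("u1", "2024-01-01", "view"),
    ("u2", "2024-01-01", "click"),
    ("u1", "2024-01-02", "click"),
]

# One row per (user, day); the count column preserves the event volume.
rollup = daily_rollup(raw_events)
```

The count column is what keeps the smaller table logically equivalent for volume-style metrics: summing it recovers the number of raw events.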
As mentioned by Mr Prashant Momaya,
"While dealing with extracts,your storage requires (size)^2 of space if your dashboard refers to a data of size - **size**"
Be very cautious with whatever design you implement and do not forget to consider the most important factor: scalability.
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
metric: "unique pageviews"
,definition: "count(distinct cookie_id)"
,source: "public.pageviews"
,tscol: "timestamp"
,dimensions: [
['day']
,['day','country']
]
}
can be relatively easily translated to 2 scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
The complexity of a generator required to produce such SQL from such a config is quite low, but it increases exponentially as you need to add new joins etc. It's much better to keep your dimensions in the encoded form and just use a single wide table as the aggregation source, or produce views for every join you might need and use them as sources.
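A minimal Python sketch of such a generator, producing the two scripts shown above from the pageviews config. The function name build_sql and the target_schema parameter are hypothetical, and the sketch assumes every dimension list contains 'day' and that the config comes from a trusted source, since values are interpolated into the SQL without escaping.

```python
def build_sql(config, dims, target_schema="metrics_daily"):
    """Render the drop/create script for one dimension combination."""
    base = config["source"].split(".")[-1]            # "public.pageviews" -> "pageviews"
    extra = [d for d in dims if d != "day"]           # dimensions besides the day bucket
    suffix = "_by_" + "_".join(extra) if extra else ""
    table = f"{target_schema}.{base}{suffix}"
    alias = config["metric"].replace(" ", "_")

    columns = ["date_trunc('day',\"" + config["tscol"] + "\") as date"]
    columns += extra
    columns.append(config["definition"] + ' as "' + alias + '"')

    # Group by every selected column except the aggregate, by position.
    positions = ",".join(str(i) for i in range(1, len(extra) + 2))
    return "\n".join([
        f"drop table {table};",
        f"create table {table} as",
        "select",
        "\n,".join(columns),
        f"from {config['source']}",
        f"group by {positions};",
    ])

config = {
    "metric": "unique pageviews",
    "definition": "count(distinct cookie_id)",
    "source": "public.pageviews",
    "tscol": "timestamp",
    "dimensions": [["day"], ["day", "country"]],
}

# One script per dimension combination declared in the config.
scripts = [build_sql(config, dims) for dims in config["dimensions"]]
```

Adding a new dimension combination then means adding one list to the config rather than editing a query by hand, which is the point of the approach; joins are deliberately left out of the sketch.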

IBMDB2 select query for millions of data

I am new to DB2. I want to select around 2 million rows with a single query that selects and displays the first 5,000 rows, then fetches the next 5,000 in the background, and keeps going like that until all the data has been read. Please help me with how to write such a query, or how to do this with a function.
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
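A minimal sketch of application-level blocking, written in Python against an in-memory SQLite database purely so that it is runnable; it is not DB2-specific. The table name, row count and block size are hypothetical. The query is executed once, and the application pulls rows from the open cursor in fixed-size blocks using fetchmany, which is part of Python's DB-API (PEP 249), so the pattern should carry over to other drivers.

```python
import sqlite3

def fetch_in_blocks(cursor, block_size):
    """Yield lists of up to block_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(block_size)
        if not rows:
            break
        yield rows

# Hypothetical stand-in data: 12,000 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 12001)],
)

# Run the query once, then consume the result block by block.
cur = conn.execute("SELECT id, name FROM items ORDER BY id")
block_sizes = [len(block) for block in fetch_in_blocks(cur, 5000)]
conn.close()
```

In a real application each block would be handed to the display layer as it arrives, instead of only being counted as it is here.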
You can use one of the new features, that incorporates paging from Oracle or MySQL: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer by specifying OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to add the "FOR READ ONLY" clause to the query; this will increase concurrency, and the cursor will not be updatable. Also, choose a suitable isolation level; for this case you could eventually use "uncommitted read" (WITH UR). Locking the table beforehand (LOCK TABLE) may also help.
Do not forget the common practices like: index or cluster index, retrieve only the necessary columns, etc. and always analyze the access plan via the Explain facility.

Some basic questions on RDBMSes

I've skimmed thru Date and Silberschatz but can't seem to find answers to these specific questions of mine.
1. If 2 database users issue a query -- say, 'select * from AVERYBIGTABLE;' -- where would the results of the query get stored in general... i.e., independent of the size of the result set?
   a. In the OS-managed physical/virtual memory of the DBMS server?
   b. In a DBMS-managed temporary file?
2. Is the query result set maintained per connection?
3. If the query result set is indeed maintained per connection, then what if there's connection pooling in effect (by a layer of code sitting above the DBMS)? Won't, then, the result set be maintained per query (instead of per connection)?
4. If the database is changing in realtime while its users concurrently issue select queries, what happens to the queries that have already been executed but not yet (fully) 'consumed' by the query issuers? For example, assume the result set has 50,000 rows; the user is currently iterating at the 100th, when in parallel another user executes an insert/delete such that it would lead to more/less than 50,000 rows if the earlier query were to be re-issued by any user of the DBMS?
5. On the other hand, in case of a database that does not change in realtime, if 2 users issue identical queries each with identical but VERY LARGE result sets, would the DBMS maintain 2 identical copies of the result set, or would it have a single shared copy?
Many thanks in advance.
Some of this may be specific to Oracle.
1. The full results of the query do not need to be copied; each user gets a cursor (like a pointer) that tracks which rows have been retrieved and which rows still need to be fetched. The database will cache as much of the data as it can as it reads the data out of the tables. It is the same principle as two users holding read-only file handles on the same file.
2. The cursors are maintained per connection; the data for the next row may or may not already be in memory.
3. Connections for the most part are single threaded, so only 1 client can use a connection at a time. If the same query is executed twice on the same connection then the cursor position is reset.
4. If a cursor is open on a table that is being updated, then the old rows are copied into a separate space (undo in Oracle) and are maintained for the life of the cursor, or at least until the database runs out of space to maintain them. (Oracle will give a snapshot too old error)
5. The database will never duplicate the data stored in cache; in Oracle's case with cursor sharing there would be a single cached cursor and each client cursor would only have to maintain its position in the cached cursor.
Oracle Database Concepts
See 8 Memory for questions 1, 2, 5
See 13 Data Concurrency and Consistency (Questions 3, 4)
The reason you don't find this in Date etc is because they could change between DBMS products, there is nothing in the relational model theory about pooling connections to the database or how to maintain the result sets from a query (like caching etc). The only point which is partially covered is 4 - where the read level would come into play (eg read uncommitted), but this only applies until the result set has been produced.