I have a table of the following form:
[mytable]
id, min, max, funobj
-----------------------------------------
1   15   23   {some big object}
2   23   41   {another big object}
3   19   27   {next big object}
Now suppose I have a view created like this:
CREATE VIEW functionvalues AS
SELECT id, evaluate(funobj)
FROM mytable
where evaluate is a set-returning function that evaluates the large funobj. The result of the view could be something like this:
id, evaluate
--------------
1 15
1 16
1 ...
1 23
2 23
2 24
2 ...
2 41
...
I do not have any information on the specific values evaluate will return, but I know that they will always lie between the min and max values given in mytable (boundaries included).
Finally, I (or rather, a third-party application) run a query on the view:
SELECT * FROM functionvalues
WHERE evaluate BETWEEN somevalue AND anothervalue
In this case, Postgres evaluates the function for every row in mytable, even though, depending on the WHERE clause, a row whose min and max lie outside the given bounds would not need to be evaluated at all. As evaluate is a rather slow function, this gives me very bad performance.
The better way would be to query the table directly by using
SELECT *
FROM (
    SELECT id, evaluate(funobj)
    FROM mytable
    WHERE max BETWEEN somevalue AND anothervalue
       OR min BETWEEN somevalue AND anothervalue
       OR (min < somevalue AND max > anothervalue)
) AS innerquery
WHERE evaluate BETWEEN somevalue AND anothervalue
Is there any way I can tell Postgres to use a query like the one above (via clever indexes or something similar) without changing the way the third-party application queries the view?
P.S.: Feel free to suggest a better title to this question, the one I gave is rather... well... unspecific.
I have no complete answer, but some of your catchwords ring a distant bell in my head:
you have a view
you want a more intelligent view
you want to "rewrite" the view definition
That calls for the PostgreSQL Rule System, especially the part "Views and the Rules System". Perhaps you can use that to your advantage.
Be warned: This is treacherous stuff. First you will find it great, then you will pet it, then it will rip off your arm without warning while still purring. Follow the links in here.
Postgres cannot push the restrictions down the query tree into the function; the function always has to scan and return the entire underlying table. And rejoin it with the same table. sigh.
"Breaking up" the function's body and combining it with the rest of the query would require a macro-like feature instead of a function.
A better way would probably be to not use an unrestricted set-returning function, but to rewrite the function as a scalar function, taking only one data row as an argument and yielding its value.
There is also the problem of sort order: the outer query does not know about the order delivered by the function, so explicit sort and merge steps will be necessary, except maybe for very small result sets (for function results no statistics are available, only the cost and estimated row count, IIRC).
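To make those last two points concrete, here is a minimal sketch of the restructured approach, assuming the call site could be changed (which the original question rules out): a parameterized SQL function that applies the min/max overlap test before ever calling evaluate(). The function name functionvalues_between, the numeric types, and the use of LATERAL (PostgreSQL 9.3+) are all assumptions, not something taken from the question.

-- Hypothetical sketch, not a drop-in replacement for the view:
-- skip rows whose [min, max] range cannot contain a hit, then apply the real filter.
CREATE FUNCTION functionvalues_between(lo numeric, hi numeric)
  RETURNS TABLE (id integer, result numeric)
  LANGUAGE sql STABLE AS
$$
  SELECT t.id, e
  FROM   mytable AS t
         CROSS JOIN LATERAL evaluate(t.funobj) AS e
  WHERE  t.min <= hi AND t.max >= lo     -- cheap range test on the base table
    AND  e BETWEEN lo AND hi;            -- then filter the evaluated values
$$;

-- The caller would then run: SELECT * FROM functionvalues_between(somevalue, anothervalue);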
Sometimes, the right answer is "faster hardware". Given the way the PostgreSQL optimizer works, your best bet might be to move the table and its indexes onto a solid state disk.
From the documentation on tablespaces:
Second, tablespaces allow an administrator to use knowledge of the
usage pattern of database objects to optimize performance. For
example, an index which is very heavily used can be placed on a very
fast, highly available disk, such as an expensive solid state device.
At the same time a table storing archived data which is rarely used or
not performance critical could be stored on a less expensive, slower
disk system.
In October, 2011, you can get a really good 128 gig SSD drive for less than $300, or 300 gigs for less than $600.
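If you go that route, moving the hot objects is a one-liner each. A hedged sketch (the directory path and index name are made up; CREATE TABLESPACE needs superuser rights and an existing, postgres-owned directory):

-- Create an SSD-backed tablespace and move the table and its index onto it.
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';
ALTER TABLE mytable SET TABLESPACE fast_ssd;
ALTER INDEX mytable_pkey SET TABLESPACE fast_ssd;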
If you're looking for an improvement of two orders of magnitude, and you already know the bottleneck is your evaluate() function, then you will probably have to accept smaller gains from many sources. If I were you, I'd look at whether any of these things might help.
solid-state disk (speedup factor of 50 over HDD, but you say you're not IO bound, so let's estimate "2")
faster CPU (speedup of 1.2)
more RAM (speedup of 1.02)
different algorithms in evaluate() (say, 2)
different data structures in evaluate() (in a database function, probably 0)
different C compiler optimizations (0)
different C compiler (1.1)
rewrite critical parts of evaluate() in assembler (2.5)
different dbms platform
different database technology
Those estimates suggest a speedup by a factor of only 13. (But they're little more than guesswork.)
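For what it's worth, the 13 is simply the product of the non-zero factors above; any SQL prompt will confirm the arithmetic:

-- SSD * CPU * RAM * algorithm * compiler * assembler
SELECT 2 * 1.2 * 1.02 * 2 * 1.1 * 2.5 AS combined_speedup;  -- about 13.5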
I might even consider targeting the GPU for calculation.
Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translate column names to be consistent, and then use aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is to have a single, very complicated-looking query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the easier it is for you to debug later, and for any future readers/users of your code.
Unless absolutely necessary, I would try to avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages, e.g.:
res: select from trades;
res:aj[`ccy`arrivalTime;
res;
select ccy:currency, arrivalTime:dateTime, fxRate from usdRates
]
res:update someFunc fxRate from res;
Sean beat me to it, but an aj for a time after / reverse aj is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol
In Postgres, I can say select avg(size) from images and select max(size) from images.
But when I want the mode, I may not do this:
select mode(uploaded_by_id) from images
Instead I must do this:
select mode() within group (order by uploaded_by_id desc) from images
The syntax seems a little funky to me. Does anyone know why the other syntax was not permitted?
NOTE: I know that allowing order by enables the user to define which mode to take in the case of a tie, but I don't see why that needs to prohibit the other syntax entirely.
Thanks!
There is no "machine formula" for computing the mode the way there are for those other things. For the min or max, you just track of the min or max seen so far. For average, you can just keep track of the sum and count seen so far, for example. With the mode, you need to have all the data at your fingertips.
Using an ordered-set aggregate provides for such a use case automatically, including spooling the data to temp files on disk as it becomes large.
You could instead write code to aggregate the data into memory and then process it from there (as the other answer references), but this would become slow and prone to crashing as the amount of memory needed starts to exceed the amount available.
After looking at the documentation, it appears they moved away from a simple function in favour of the ordered-set aggregate, citing speed advantages as the reason.
https://wiki.postgresql.org/wiki/Aggregate_Mode
If you wanted to, you could just create a function yourself, but it seems as though the ordered-set aggregate is the fastest way to get a NOT NULL result back from the db.
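For completeness, rolling your own is only a few lines. A rough sketch using the question's table, with the second sort key mirroring the DESC tie-break in the ordered-set version:

-- Do-it-yourself mode: group, order by frequency, keep the top row.
SELECT uploaded_by_id
FROM   images
GROUP  BY uploaded_by_id
ORDER  BY count(*) DESC, uploaded_by_id DESC
LIMIT  1;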
I have to determine the complexity level (simple/medium/complex, etc.) of a SQL statement by counting the number of occurrences of specific keywords, sub-queries, derived tables, functions, etc. that constitute the SQL. Additionally, I have to syntactically validate the SQL.
I searched on the net and found that Perl has 2 modules named SQL::Statement and SQL::Parser which could be leveraged to achieve this. However, I found that these modules have several limitations (such as CASE WHEN constructs not being supported).
That being said, is it better to build a custom, concise SQL parser with Lex/Yacc or Flex/Bison instead? Which approach would be better and quicker?
Please share your thoughts on this. Also, can anyone point me to any online resources that discuss the same?
Thanks
Teradata has many non-ANSI features, and you're considering re-implementing its parser.
Instead, use the database server: put an EXPLAIN in front of your statements and process the result.
explain select * from dbc.dbcinfo;
1) First, we lock a distinct DBC."pseudo table" for read on a RowHash
to prevent global deadlock for DBC.DBCInfoTbl.
2) Next, we lock DBC.DBCInfoTbl in view dbcinfo for read.
3) We do an all-AMPs RETRIEVE step from DBC.DBCInfoTbl in view
dbcinfo by way of an all-rows scan with no residual conditions
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 432
rows (2,374,272 bytes). The estimated time for this step is 0.01
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.01 seconds.
This will also validate your SQL.
Why is an ASSIGN statement more efficient than not using ASSIGN?
Co-workers say that:
assign
a=3
v=7
w=8.
is more efficient than:
a=3.
v=7.
w=8.
why?
You could always test it yourself and see... but, yes, it is slightly more efficient. Or it was the last time I tested it. The reason is that the compiler combines the statements and the resulting r-code is a bit smaller.
But efficiency is almost always a poor reason to do it. Saving a micro-second here and there pales next to avoiding disk IO or picking a more efficient algorithm. Good reasons:
Back in the dark ages there was a limit of 63k of r-code per program. Combining statements with ASSIGN was a way to reduce the size of r-code and stay under that limit (ok, that might not be a "good" reason). One additional way this helps is that you could also often avoid a DO ... END pair and further reduce r-code size.
When creating or updating a record the fields that are part of an index will be written back to the database as they are assigned (not at the end of the transaction) -- grouping all assignments into a single statement helps to avoid inconsistent dirty reads. Grouping the indexed fields into a single ASSIGN avoids writing the index entries multiple times. (This is probably the best reason to use ASSIGN.)
Readability -- you can argue that grouping consecutive assignments more clearly shows your intent and is thus more readable. (I like this reason but not everyone agrees.)
basically doing:
a=3.
v=7.
w=8.
is the same as:
assign a=3.
assign v=7.
assign w=8.
which is 3 separate statements, so a little more overhead and therefore less efficient.
Progress treats ASSIGN as one statement whether 1 or more variables are being assigned. If you do not say ASSIGN then it is assumed, so you end up with 3 statements instead of 1. There is a 20% - 40% reduction in r-code and a 15% - 20% performance improvement when using one ASSIGN statement. Why this is can only be speculated on, as I cannot find any source with information on it. For database fields, and especially key/index fields, it makes perfect sense. For variables I can only assume it has to do with how Progress manages its buffers and copies data to and from them.
ASSIGN will combine multiple statements into one. If a, v and w are fields in your db, that means it will do something like INSERT INTO (a,v,w)...
rather than
INSERT INTO (a)...
INSERT INTO (v)
etc.
I am new to DB2. I want to select around 2 million rows with a single query, in such a way that it selects and displays the first 5000 rows, then fetches the next 5000 in the background, and so on until the end of the data. Please help me out with this: how do I write such a query, or which function should I use?
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
You can use one of the new features that incorporates paging, as in Oracle or MySQL: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer by indicating OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to specify the FOR READ ONLY clause in the query; this will increase concurrency, and the cursor will not be updatable. Also, choose a suitable isolation level; for this case you could use "uncommitted read" (WITH UR). A prior LOCK TABLE can also be good.
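Putting those pieces together, a hedged sketch of one 5000-row "page" (table and column names are placeholders; OFFSET needs a reasonably recent DB2 release or the compatibility feature described in the link above, otherwise ROW_NUMBER() can do the same job):

SELECT id, payload
FROM   myschema.mytable
ORDER  BY id
OFFSET 0 ROWS
FETCH FIRST 5000 ROWS ONLY
FOR READ ONLY WITH UR;

-- next page: OFFSET 5000 ROWS, then 10000 ROWS, and so on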
Do not forget common practices such as an index or clustering index, retrieving only the necessary columns, etc., and always analyze the access plan via the Explain facility.