Parsing SQL to determine complexity level - Perl

I have to determine the complexity level (simple/medium/complex, etc.) of a SQL statement by counting the number of occurrences of specific keywords, sub-queries, derived tables, functions, etc. that constitute the SQL. Additionally, I have to syntactically validate the SQL.
I searched the net and found that Perl has two modules, SQL::Statement and SQL::Parser, which could be leveraged to achieve this. However, I found that these modules have several limitations (for example, CASE WHEN constructs are not supported).
That being said, is it better to build a custom, concise SQL parser with Lex/Yacc or Flex/Bison instead? Which approach would be better and quicker?
Please share your thoughts on this. Also, can anyone point me to online resources that discuss the same?
Thanks

Teradata has many non-ANSI features, and you are considering re-implementing the parser for it.
Instead, use the database server: put an 'explain' in front of your statements and process the result.
explain select * from dbc.dbcinfo;
1) First, we lock a distinct DBC."pseudo table" for read on a RowHash
to prevent global deadlock for DBC.DBCInfoTbl.
2) Next, we lock DBC.DBCInfoTbl in view dbcinfo for read.
3) We do an all-AMPs RETRIEVE step from DBC.DBCInfoTbl in view
dbcinfo by way of an all-rows scan with no residual conditions
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 432
rows (2,374,272 bytes). The estimated time for this step is 0.01
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.01 seconds.
This will also validate your SQL.

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translating of column names to be consistent, and then to join them up using aj and aj0.
In this case there would be only 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is to have a single, very complicated-looking query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per the above) ajs (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the better for you to debug later and for any future readers/users of your code.
Unless absolutely necessary, I would try to avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern; particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages, e.g.:
res: select from trades;
res:aj[`ccy`arrivalTime;
    res;
    select ccy:currency, arrivalTime:dateTime, fxRate from usdRates
    ]
res:update someFunc fxRate from res;
Sean beat me to it, but an aj for a time after / reverse aj is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better, e.g. sym vs symbol.

Why does paginating with offset using PSQL make sense?

I've been looking into pagination (paginating by timestamp) with a PostgreSQL DBMS. My current approach is to build a B+-tree index to greatly reduce the cost of finding the start of the next chunk. But everywhere I look, in tutorials and in npm modules like express-paginate (https://www.npmjs.com/package/express-paginate), people seem to get chunks using OFFSET one way or another, or fetch all the data anyway and simply send it in chunks, which to me doesn't seem to be the full optimization that pagination is meant to provide.
I can see that they're still making an optimization by lazy loading and streaming the chunks (thus saving bandwidth and any download/processing time on the client side), but OFFSET in PostgreSQL still requires scanning the previous rows. In the worst case, where a user wants to view all the data, doesn't this approach have a very high server cost? With, say, n chunks, you're accessing the first chunk n times, the second chunk n-1 times, the third chunk n-2 times, and so on. I understand that this is really in terms of I/Os, so it's not that expensive, but it still bothers me.
Am I missing something very obvious here? I feel like I am, because there seem to be a lot of established and experienced engineers using this approach. I'm guessing there is some part of the equation or mechanism that I'm just missing from my understanding.
No, you understand this quite well.
The reason why so many people and tools still advocate pagination with OFFSET and LIMIT (or FETCH FIRST n ROWS ONLY, to use the standard's language) is that they don't know a lot about databases. It is easy to understand LIMIT and OFFSET even if the word “index” means nothing more to you than “the last pages of a book”.
There is another reason: to implement keyset pagination, you must have an ORDER BY clause in your query; that ORDER BY clause has to contain a unique column; and you have to create an index that supports that ordering.
Moreover, your database has to be able to handle conditions like
... WHERE (name, id) > ('last_found', 42)
and support a multi-column index scan for them.
Since many tools strive to support several database systems, they are likely to go for the simple but inefficient method that works with every query on most database systems.
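For illustration, a minimal keyset-pagination sketch in PostgreSQL (the items table, its columns, and the page size are hypothetical; the row comparison is the one shown above):
-- Composite index that supports both the ordering and the row comparison
CREATE INDEX items_name_id_idx ON items (name, id);

-- First page
SELECT id, name
FROM items
ORDER BY name, id
LIMIT 50;

-- Next page: pass in the last (name, id) seen on the previous page
SELECT id, name
FROM items
WHERE (name, id) > ('last_found', 42)
ORDER BY name, id
LIMIT 50;
Each page then starts with an index lookup on the last key seen, instead of re-scanning everything that comes before the requested OFFSET.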

Progress-4GL - What is the fastest way to summarize tables? (Aggregate Functions: Count, Sum etc.) - OpenEdge 10.2A

We are using OpenEdge 10.2A, and generating summary reports using progress procedures. We want to decrease the production time of the reports.
Since the ACCUMULATE and ACCUM functions are not really faster than defining variables to get summarized values, and their readability is much worse, we don't really use them.
We have tested our data using SQL commands over an ODBC connection, and the results are much faster than using procedures.
Let me give you an example. We run the below procedure:
DEFINE VARIABLE i AS INTEGER NO-UNDO.
ETIME(TRUE).
FOR EACH orderline FIELDS(ordernum) NO-LOCK:
    ASSIGN i = i + 1.
END.
MESSAGE "Count = " (i - 1) SKIP "Time = " ETIME VIEW-AS ALERT-BOX.
The result is:
Count= 330805
Time= 1891
When we run equivalent SQL query:
SELECT count(ordernum) from pub.orderline
The execution time is 141.
In short, when we compare the two results, the SQL time is more than 13 times faster than the procedure time.
This is just an example. We can do the same test with other aggregate functions and time ratio does not change much.
My question has two parts:
1) Is it possible to get aggregate values using procedures as fast as using SQL queries?
2) Is there any other method to get summarized values faster, other than using real-time SQL queries?
The 4GL and SQL engines use very different approaches to sending data to the client. By default SQL is much faster. To get similar performance from the 4GL you need to adjust several parameters. I suggest:
-Mm 32600 # message size, default 1024, max 32600
-prefetchDelay # don't send the first record immediately, instead bundle it
-prefetchFactor 100 # try to fill message 100%
-prefetchNumRecs 10000 # if possible pack up to 10,000 records per message, default 16
Prior to 11.6 changing -Mm requires BOTH the client and the server to be changed. Starting with 11.6 only the server needs to be changed.
You need at least OpenEdge 10.2b06 for the -prefetch* parameters.
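As a rough sketch, the client-side settings could go in a connection parameter file, for example (database name, host and port are hypothetical; remember that before 11.6 the broker also needs the matching -Mm):
# client.pf -- hypothetical client/server connection parameters
-db sports -H dbhost -S 20000
-Mm 32600
-prefetchDelay
-prefetchFactor 100
-prefetchNumRecs 10000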
Although there are caveats (among other things joins will not benefit) these parameters can potentially greatly improve the performance of "NO-LOCK queries". A simple:
FOR EACH table NO-LOCK:
/* ... */
END.
can be greatly improved by the use of the parameters above.
Use of a FIELDS list can also help a lot because it reduces the amount of data and thus the number of messages that need to be sent. So if you only need some of the fields rather than the whole record you can code something like:
FOR EACH customer FIELDS ( name balance ) NO-LOCK:
or:
FOR EACH customer EXCEPT ( photo ) NO-LOCK:
You are already using FIELDS and your sample query is a simple NO-LOCK so it should benefit substantially from the suggested parameter settings.
The issue at hand seems to be to "decrease the production time of the reports.".
This raises some questions:
How slow are the reports now and how fast do you want them?
Has running time increased compared to, for instance, last year?
Has the data amount also increased?
Has something changed? Servers, storage, clients, etc?
It will be impossible to answer your question without more information. Data access from ABL will most likely be fast enough if:
You have correct indexes (indices) set up in your database.
You have "good" queries.
You have enough system resources (memory, cpu, disk space, disk speed)
You have a database running with a decent setup (-spin, -B parameters etc).
The time it takes for a simple command like FOR EACH <table> NO-LOCK: or SELECT COUNT(something) FROM <somewhere> might not indicate how fast or slow your real super complicated query might run.
Some additional suggestions:
It is possible to write your example as
DEFINE VARIABLE i AS INTEGER NO-UNDO.
ETIME(TRUE).
select count(*) into i from orderline.
MESSAGE "Count = " (i - 1) SKIP "Time = " ETIME VIEW-AS ALERT-BOX.
which should yield a moderate performance increase. (This is not using an ODBC connection. You can use a subset of SQL in plain 4GL procedures. It is debatable if this can be considered good style.)
There should be a significant performance increase by accessing the database through shared memory instead of TCP/IP, if you are running the code on the server (which you do) and you are not already doing so (which you didn't specify).
Another option is a PRESELECT query; the result list is built when the query is opened, so NUM-RESULTS gives the row count directly:
open query q preselect each orderline no-lock.
message num-results("q") view-as alert-box.

IBM DB2 select query for millions of rows

I am new to DB2. I want to select around 2 million rows with a single query, in such a way that it selects and displays the first 5,000 rows, then fetches the next 5,000 in the background, and so on until all the data has been read. Please help me out with how to write such a query or which function to use.
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
You can use one of the newer features that incorporates paging in the style of Oracle or MySQL: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer by indicating OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to specify FOR READ ONLY in the query; this will increase concurrency, and the cursor will not be updatable. Also, choose a suitable isolation level; for this case you could possibly use uncommitted read (WITH UR). Issuing a LOCK TABLE beforehand can also help.
Do not forget common practices like an index or clustering index, retrieving only the necessary columns, etc., and always analyze the access plan via the Explain facility.
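As a rough sketch (table and column names are hypothetical), a read-only chunked query combining those clauses might look like this; the application remembers the last key it saw and binds it into the next query:
-- Fetch one 5,000-row chunk from a hypothetical "orders" table;
-- ? is the last order_id seen in the previous chunk (0 for the first one).
SELECT order_id, customer_id, total
FROM orders
WHERE order_id > ?
ORDER BY order_id
FETCH FIRST 5000 ROWS ONLY
FOR READ ONLY
OPTIMIZE FOR 5000 ROWS
WITH UR;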

Automatic reformulation of condition in PostgreSQL view

I have a table of the following form:
[mytable]
id, min, max, funobj
-----------------------------------------
1    15    23    {some big object}
2    23    41    {another big object}
3    19    27    {next big object}
Now suppose I have a view created like this:
CREATE VIEW functionvalues AS
SELECT id, evaluate(funobj)
FROM mytable
where evaluate is a set-returning function evaluating the large funobj. The result of the view could be something like this:
id, evaluate
--------------
1 15
1 16
1 ...
1 23
2 23
2 24
2 ...
2 41
...
I do not have any information on the specific values evaluate will return, but I know that they will always be between the min and max values given in mytable (boundaries included).
Finally, I (or rather, a third-party application) run a query on the view:
SELECT * FROM functionvalues
WHERE evaluate BETWEEN somevalue AND anothervalue
In this case, Postgres evaluates the function evaluate for every row in mytable, even though, depending on the WHERE clause, the function would not have to be evaluated for rows whose min and max are not within the given values. As evaluate is a rather slow function, this gives me very bad performance.
The better way would be to query the table directly by using
SELECT *
FROM (
    SELECT id, evaluate(funobj)
    FROM mytable
    WHERE max BETWEEN somevalue AND anothervalue
       OR min BETWEEN somevalue AND anothervalue
       OR (min < somevalue AND max > anothervalue)
) AS innerquery
WHERE evaluate BETWEEN somevalue AND anothervalue
Is there any way I can tell postgres to use a query like the above (by clever indices or something like that) without changing the way the third party application queries the view?
P.S.: Feel free to suggest a better title to this question, the one I gave is rather... well... unspecific.
I have no complete answer, but some of your catchwords ring a distant bell in my head:
you have a view
you want a more intelligent view
you want to "rewrite" the view definition
That calls for the PostgreSQL Rule System, especially the part "Views and the Rules System". Perhaps you can use that for your advantage.
Be warned: this is treacherous stuff. First you will find it great, then you will pet it, then it will rip off your arm without warning while still purring. Follow the links in there.
Postgres cannot push the restrictions down the query tree into the function; the function always has to scan and return the entire underlying table. And rejoin it with the same table. sigh.
"Breaking up" the function's body and combining it with the rest of the query would require a macro-like feature instead of a function.
A better way would probably be to not use an unrestricted set-returning function, but to rewrite the function as a scalar function, taking only one data row as an argument and yielding its value.
There is also the problem of sorting-order: the outer query does not know about the order delivered by the function, so explicit sort and merge steps will be necessary, except maybe for very small result sets (for function results statistics are not available, only the cost and estimated rowcount, IIRC.)
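For what it's worth, the question's own min/max pre-filter can be packaged as a parameterized set-returning SQL function (a sketch; the column types are assumed to be numeric and the function name is made up). Note that the caller would then have to query the function instead of the view, which is exactly what the question wants to avoid, so this only illustrates the shape of the rewrite:
-- The overlap test below is equivalent to the three OR'd conditions
-- in the hand-written query from the question.
CREATE FUNCTION functionvalues_between(somevalue numeric, anothervalue numeric)
    RETURNS TABLE (id integer, evaluate numeric)
    LANGUAGE sql
AS $$
    SELECT innerquery.id, innerquery.evaluate
    FROM (
        SELECT mytable.id, evaluate(mytable.funobj) AS evaluate
        FROM mytable
        WHERE mytable.max >= somevalue      -- skip rows whose [min, max] range
          AND mytable.min <= anothervalue   -- cannot contain the requested values
    ) AS innerquery
    WHERE innerquery.evaluate BETWEEN somevalue AND anothervalue;
$$;

-- Caller: SELECT * FROM functionvalues_between(15, 20);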
Sometimes, the right answer is "faster hardware". Given the way the PostgreSQL optimizer works, your best bet might be to move the table and its indexes onto a solid state disk.
Documentation for Tablespaces
Second, tablespaces allow an administrator to use knowledge of the
usage pattern of database objects to optimize performance. For
example, an index which is very heavily used can be placed on a very
fast, highly available disk, such as an expensive solid state device.
At the same time a table storing archived data which is rarely used or
not performance critical could be stored on a less expensive, slower
disk system.
In October, 2011, you can get a really good 128 gig SSD drive for less than $300, or 300 gigs for less than $600.
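A minimal sketch of moving the hot table and its index onto such a tablespace (the tablespace name, directory, and index name are hypothetical):
-- The directory must already exist and be owned by the database server's OS user
CREATE TABLESPACE fastspace LOCATION '/mnt/ssd/pgdata';
ALTER TABLE mytable SET TABLESPACE fastspace;
ALTER INDEX mytable_min_max_idx SET TABLESPACE fastspace;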
If you're looking for an improvement of two orders of magnitude, and you already know the bottleneck is your evaluate() function, then you will probably have to accept smaller gains from many sources. If I were you, I'd look at whether any of these things might help.
solid-state disk (speedup factor of 50 over HDD, but you say you're not IO bound, so let's estimate "2")
faster CPU (speedup of 1.2)
more RAM (speedup of 1.02)
different algorithms in evaluate() (say, 2)
different data structures in evaluate() (in a database function, probably 0)
different C compiler optimizations (0)
different C compiler (1.1)
rewrite critical parts of evaluate() in assembler (2.5)
different dbms platform
different database technology
Those estimates suggest a speedup by a factor of only 13. (But they're little more than guesswork.)
I might even consider targeting the GPU for calculation.