What is the maximum NUM_OF_PARAMS in Perl DBI placeholders

What is the maximum number of placeholders allowed in a single statement, i.e. the upper limit of the NUM_OF_PARAMS attribute?
I'm experiencing an odd issue while trying to tune the maximum number of rows in a multi-row insert: setting the number to 20,000 gives me an error because $sth->{NUM_OF_PARAMS} becomes negative.
Reducing the max inserts to 5000 works fine.
Thanks.

As far as I am aware, the only limitation in DBI itself is that the value is stored in a Perl scalar, so the limit is whatever a scalar can hold. For individual DBDs it is totally different; I doubt many databases, if any, support 20,000 parameters. BTW, NUM_OF_PARAMS is read-only, so I've no idea what you mean by "set the number to 20,000". I presume you mean you create a SQL statement with 20,000 placeholders, then read NUM_OF_PARAMS and it gives you a negative value. If so, I suggest you report that (with an example) on rt.cpan.org, as it does not sound right at all.
I cannot imagine a SQL statement with 20,000 parameters being very efficient in any database. Far better to try to reduce that to a range or something similar if you can. In ODBC, 20,000 parameters would mean 20,000 IPDs and APDs, and those are quite big structures. Since the DB2 CLI library is very like ODBC, I would imagine you are going to eat up loads of memory.
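For illustration, here is a rough sketch (not from the answer) of keeping the placeholder count bounded by batching the multi-row insert; the DSN, table, columns, and batch size are all invented:
# Hypothetical sketch: batch a multi-row insert so each statement carries
# a bounded number of placeholders. Table/column names and the DSN are
# made up for illustration.
use strict;
use warnings;
use DBI;
use List::Util qw(min);

my $dbh = DBI->connect('dbi:DB2:SAMPLE', 'user', 'pass', { RaiseError => 1 });

my @rows  = ( [ 1, 'a' ], [ 2, 'b' ] );   # ... many more [col1, col2] pairs
my $batch = 1000;                         # 2 placeholders per row => 2,000 per statement

for (my $i = 0; $i < @rows; $i += $batch) {
    my @chunk = @rows[ $i .. min($i + $batch - 1, $#rows) ];
    my $sql   = 'INSERT INTO mytable (col1, col2) VALUES '
              . join(', ', ('(?, ?)') x @chunk);
    my $sth   = $dbh->prepare($sql);
    $sth->execute(map { @$_ } @chunk);    # flatten the rows into the placeholder list
}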

Given that 20,000 rows' worth of placeholders sends the value negative and 5,000 doesn't, there is probably a signed 16-bit integer somewhere in the system, which would cap the total number of placeholders at 32,767; with more than one placeholder per row, 20,000 rows overflows that while 5,000 does not.
However, the limit depends on the underlying DBMS and the API used by the DBD module for the DBMS (and possibly the DBD code itself); it is not affected by DBI.
Are you sure that's the best way to deal with your problem?

Related

Why does paginating with offset using PSQL make sense?

I've been looking into pagination (paginating by timestamp) with a PostgreSQL DBMS. My current approach is to build a B+ index to greatly reduce the cost of finding the start of the next chunk. But everywhere I look, in tutorials and in NPM modules like express-paginate (https://www.npmjs.com/package/express-paginate), people seem to fetch chunks using OFFSET one way or another, or to fetch all the data anyway and simply send it in chunks, which to me doesn't seem to be the complete optimization that pagination is for.
I can see that they're still making an optimization by lazy loading and streaming the chunks (thus saving bandwidth and any download/processing time on the client side), but OFFSET on PostgreSQL still requires scanning the previous rows. In the worst case, where a user wants to view all the data, doesn't this approach have a very high server cost? If you have, say, n chunks, you're accessing the first chunk n times, the second chunk n-1 times, the third chunk n-2 times, and so on. I understand this is really in terms of IOs, so it's not that expensive, but it still bothers me.
Am I missing something very obvious here? I feel like I am, because there seem to be a lot of established and experienced engineers using this approach. I'm guessing there is some part of the equation or mechanism that I'm just missing.
No, you understand this quite well.
The reason why so many people and tools still advocate pagination with OFFSET and LIMIT (or FETCH FIRST n ROWS ONLY, to use the standard's language) is that they don't know a lot about databases. It is easy to understand LIMIT and OFFSET even if the word "index" means nothing to you beyond "the last pages in a book".
There is another reason: to implement key set pagination, you must have an ORDER BY clause in your query, that ORDER BY clause has to contain a unique column, and you have to create an index that supports that ordering.
Moreover, your database has to be able to handle conditions like
... WHERE (name, id) > ('last_found', 42)
and support a multi-column index scan for them.
Since many tools strive to support several database systems, they are likely to go for the simple but inefficient method that works with every query on most database systems.
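As a concrete illustration (not from the answer), keyset pagination might look like this from Perl DBI; the table and column names are invented, and an index on (name, id) is assumed:
# Keyset-pagination sketch: fetch the page that starts strictly after the
# last (name, id) pair seen on the previous page. Names are illustrative.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass', { RaiseError => 1 });

my $page = $dbh->prepare(q{
    SELECT name, id
    FROM   people
    WHERE  (name, id) > (?, ?)   -- row-value comparison, can use the index
    ORDER  BY name, id
    LIMIT  50
});

# Bind the last row of the previous page; for the very first page you
# could bind an empty string and 0.
$page->execute('last_found', 42);
while (my ($name, $id) = $page->fetchrow_array) {
    # render the row; remember ($name, $id) as the cursor for the next page
}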

Progress-4GL - What is the fastest way to summarize tables? (Aggregate Functions: Count, Sum etc.) - OpenEdge 10.2A

We are using OpenEdge 10.2A, and generating summary reports using progress procedures. We want to decrease the production time of the reports.
Since the ACCUMULATE and ACCUM functions are not really faster than defining variables to get summarized values, and their readability is much worse, we don't really use them.
We have tested our data with SQL commands over an ODBC connection, and the results are much faster than with procedures.
Let me give you an example. We run the below procedure:
DEFINE VARIABLE i AS INTEGER NO-UNDO.
ETIME(TRUE).
FOR EACH orderline FIELDS(ordernum) NO-LOCK:
    ASSIGN i = i + 1.
END.
MESSAGE "Count = " i SKIP "Time = " ETIME VIEW-AS ALERT-BOX.
The result is:
Count= 330805
Time= 1891
When we run equivalent SQL query:
SELECT count(ordernum) from pub.orderline
The execution time is 141.
In short, when we compare the two results, the SQL time is more than 13 times faster than the procedure time.
This is just an example; we can do the same test with other aggregate functions and the time ratio does not change much.
And my question has two parts:
1-) Is it possible to get aggregate values using procedures as fast as using SQL queries?
2-) Is there any other method to get summarized values faster, other than using real-time SQL queries?
The 4gl and SQL engines use very different approaches to sending the data to the client. By default SQL is much faster. To get similar performance from the 4gl you need to adjust several parameters. I suggest:
-Mm 32600 # message size, default 1024, max 32600
-prefetchDelay # don't send the first record immediately, instead bundle it
-prefetchFactor 100 # try to fill message 100%
-prefetchNumRecs 10000 # if possible pack up to 10,000 records per message, default 16
Prior to 11.6 changing -Mm requires BOTH the client and the server to be changed. Starting with 11.6 only the server needs to be changed.
You need at least OpenEdge 10.2b06 for the -prefetch* parameters.
Although there are caveats (among other things, joins will not benefit), these parameters can potentially greatly improve the performance of "NO-LOCK queries". A simple:
FOR EACH table NO-LOCK:
/* ... */
END.
can be greatly improved by the use of the parameters above.
Use of a FIELDS list can also help a lot because it reduces the amount of data and thus the number of messages that need to be sent. So if you only need some of the fields rather than the whole record you can code something like:
FOR EACH customer FIELDS ( name balance ) NO-LOCK:
or:
FOR EACH customer EXCEPT ( photo ) NO-LOCK:
You are already using FIELDS and your sample query is a simple NO-LOCK so it should benefit substantially from the suggested parameter settings.
The issue at hand seems to be to "decrease the production time of the reports".
This raises some questions:
How slow are the reports now and how fast do you want them?
Has the running time increased compared to, for instance, last year?
Has the data amount also increased?
Has something changed? Servers, storage, clients, etc?
It will be impossible to answer your question without more information. Data access from ABL will most likely be fast enough if:
You have correct indexes (indices) set up in your database.
You have "good" queries.
You have enough system resources (memory, CPU, disk space, disk speed).
You have a database running with a decent setup (-spin, -B parameters etc).
The time it takes for a simple command like FOR EACH <table> NO-LOCK: or SELECT COUNT(something) FROM <somewhere> might not indicate how fast or slow your real super complicated query might run.
Some additional suggestions:
It is possible to write your example as
DEFINE VARIABLE i AS INTEGER NO-UNDO.
ETIME(TRUE).
select count(*) into i from orderline.
MESSAGE "Count = " (i - 1) SKIP "Time = " ETIME VIEW-AS ALERT-BOX.
which should yield a moderate performance increase. (This is not using an ODBC connection. You can use a subset of SQL in plain 4GL procedures. It is debatable if this can be considered good style.)
There should be a significant performance increase by accessing the database through shared memory instead of TCP/IP, if you are running the code on the server (which you do) and you are not already doing so (which you didn't specify).
Another option is a preselect query, which makes the row count available via NUM-RESULTS:
open query q preselect each orderline no-lock.
message num-results("q") view-as alert-box.

What about expected performance in Pentaho?

I am using Pentaho to create ETLs and I am very focused on performance. I developed an ETL process that copies 163,000,000 rows from SQL Server 2008 to PostgreSQL and it takes 17 hours.
I do not know how good or bad this performance is. Do you know how to judge whether the time a process takes is reasonable? At least as a reference to know whether I need to keep working heavily on performance or not.
Furthermore, I would like to know if it is normal that in the first 2 minutes of the ETL process it loads 2M rows. From that I calculated how long it would take to load all the rows; the expected result was 6 hours, but then the performance drops and it ends up taking 17 hours.
I have been searching Google and I have not found any time references or explanations about this kind of performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours; this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exact same columns (in PostgreSQL: CREATE TABLE dummy (LIKE target_table))
No indexes, No constraints (I think it is the default, double check the created table)
A BEFORE INSERT trigger on it which returns NULL and drops the row (see the sketch below).
The rows will be processed, just not inserted.
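If it helps, here is a rough sketch (not from the answer) of setting up such a dummy table and trigger on the PostgreSQL side, driven here from Perl DBI; the table, function, and trigger names are invented:
# Hypothetical sketch: a dummy table with the same columns as the real
# target, no indexes or constraints, plus a BEFORE INSERT trigger that
# discards every row so only the non-insert work is measured.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=target_db', 'user', 'pass',
                       { RaiseError => 1 });

# Same columns as the real target table (no indexes or constraints copied).
$dbh->do('CREATE TABLE dummy_target (LIKE real_target)');

# Trigger function that returns NULL, so rows are processed but not written.
$dbh->do(q{
    CREATE FUNCTION discard_row() RETURNS trigger AS $$
    BEGIN
        RETURN NULL;   -- drop the row instead of inserting it
    END;
    $$ LANGUAGE plpgsql
});

$dbh->do(q{
    CREATE TRIGGER drop_everything BEFORE INSERT ON dummy_target
    FOR EACH ROW EXECUTE PROCEDURE discard_row()
});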
Is it fast now? OK, so the problem was insertion.
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These do not have any crash-resistance because they don't use the journal, but for importing data it's OK.... if it crashes during the insert you're gonna wipe it out and restart anyway.
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
With the table loaded with data, add indexes and constraints one by one, find out if you got, say, a CHECK that uses a slow SQL function, or a FK into a table which has no index, that kind of stuff. Just check how long it takes to create the constraint.
Note: on an import like this you would normally add indices and constraints after the import.
My gut feeling is that PG is checkpointing like crazy because of the large volume of data and too-low checkpoint settings in the config, or some issue like that, probably related to random IO writes. You put the WAL on a fast SSD, right?
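One quick way to test that hunch from Perl (a sketch, not part of the answer; the DSN is invented) is to compare PostgreSQL's checkpoint counters before and after a test load; a large number of requested checkpoints during the load usually means the checkpoint settings are too low:
# Read PostgreSQL's checkpoint counters. Run before and after a load and
# compare: many checkpoints_req (forced checkpoints) suggests
# checkpoint_timeout / max_wal_size (or checkpoint_segments on old
# versions) are set too low for this workload.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=target_db', 'user', 'pass',
                       { RaiseError => 1 });

my ($timed, $requested) = $dbh->selectrow_array(
    'SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter'
);
print "timed checkpoints: $timed, requested checkpoints: $requested\n";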
17H is too much. Far too much. For 200 Million rows, 6 hours is even a lot.
Hints for optimization:
Check the memory size: edit spoon.bat, find the line containing -Xmx and change it to half your machine's memory size. Details vary with the Java version; the example is for PDI V7.1.
Check that the query against the source database is not taking too long (because it is too complex, because of the server memory size, or something else).
Check the target commit size (try 25,000 for PostgreSQL), that Use batch update for inserts is on, and that the indexes and constraints are disabled.
Play with Enable lazy conversion in the Table input step. Warning: you may produce errors that are difficult to identify and debug due to data casting.
In the transformation properties you can tune the Nr of rows in rowset (click anywhere, select Properties, then the Miscellaneous tab). On the same tab, check that the transformation is NOT transactional.

Perl script fails when selecting data from big PostgreSQL table

I'm trying to run a SELECT statement on PostgreSQL database and save its result into a file.
The code runs on my environment but fails once I run it on a lightweight server.
I monitored it and saw that the reason it fails after several seconds is due to a lack of memory (the machine has only 512MB RAM). I didn't expect this to be a problem, as all I want to do is to save the whole result set as a JSON file on disk.
I was planning to use fetchrow_array or fetchrow_arrayref functions hoping to fetch and process only one row at a time.
Unfortunately I discovered that, as far as the actual fetching from the server goes, there is no difference between the two above and fetchall_arrayref when you use DBD::Pg. My script fails at the $sth->execute() call, even before it has a chance to call any fetch... function.
This suggests to me that the implementation of execute in DBD::Pg actually fetches ALL the rows into memory, leaving only the formatting of the returned data to the fetch... functions.
A quick look at the DBI documentation gives a hint:
If the driver supports a local row cache for SELECT statements, then this attribute holds the number of un-fetched rows in the cache. If the driver doesn't, then it returns undef. Note that some drivers pre-fetch rows on execute, whereas others wait till the first fetch.
So in theory I would just need to set the RowCacheSize parameter. I've tried, but this feature doesn't seem to be implemented by DBD::Pg:
Not used by DBD::Pg
I find this a huge general limitation (the execute() call pre-fetches all rows?) and am more inclined to believe that I'm missing something here than that this is actually a true limitation of interacting with PostgreSQL databases from Perl.
Update (2014-03-09): My script works now thanks to the workaround described in my comment on Borodin's answer. The maintainer of the DBD::Pg library got back to me on the issue, saying the root cause is deeper and lies within libpq, the PostgreSQL client library used by DBD::Pg. Also, I think a very similar issue to the one described here affects pgAdmin. Being a native PostgreSQL tool, it still doesn't offer, in its Options, a way to define a default limit on the number of result rows. This is probably why the Query tool sometimes waits a good while before presenting results from bulky queries, potentially breaking the app in some cases too.
In the section Cursors, the documentation for the database driver says this
Therefore the "execute" method fetches all data at once into data structures located in the front-end application. This fact must to be considered when selecting large amounts of data!
So your supposition is correct. However the same section goes on to describe how you can use cursors in your Perl application to read the data in chunks. I believe this would fix your problem.
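A minimal sketch of that cursor approach (untested; the table name, cursor name, and batch size are invented) might look like this:
# Declare a server-side cursor and fetch in modest batches so the whole
# result set never has to sit in the client's memory at once.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 0 });   # cursors need a transaction

$dbh->do('DECLARE big_cur CURSOR FOR SELECT * FROM big_table');

my $fetch = $dbh->prepare('FETCH 1000 FROM big_cur');
while (1) {
    $fetch->execute;
    last if $fetch->rows == 0;                 # cursor exhausted
    while (my @row = $fetch->fetchrow_array) {
        # serialize/process one row at a time here
    }
}

$dbh->do('CLOSE big_cur');
$dbh->commit;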
Another alternative is to use OFFSET and LIMIT clauses on your SELECT statement to emulate cursor functionality. If you write
my $select = $dbh->prepare('SELECT * FROM table OFFSET ? LIMIT 1');
then you can say something like (all of this is untested)
my $i = 0;
while ($select->execute($i++)) {
    my @data = $select->fetchrow_array
        or last;    # stop once OFFSET has run past the end of the table
    # Process @data
}
to read your tables one row at a time.
You may find that you need to increase the chunk size to get an acceptable level of efficiency.

IBMDB2 select query for millions of data

I am new to DB2. I want to select around 2 million rows with a single query, in such a way that it selects and displays the first 5,000 rows and then, in a background process, keeps selecting the next 5,000 rows, and so on until the end of the data. Please help me out with how to write such a query, or which function to use.
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
You can use one of the newer features that brings Oracle/MySQL-style paging (LIMIT/OFFSET) to DB2: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer by specifying OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to add the FOR READ ONLY clause to the query; this increases concurrency, and the cursor will not be updatable. Also, choose a suitable isolation level; for this case you could eventually use uncommitted read (WITH UR). Issuing a LOCK TABLE beforehand can also help.
Do not forget common practices like an index or clustering index, retrieving only the necessary columns, etc., and always analyze the access plan via the Explain facility.
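As a rough illustration (not from the answer), the clauses above could be paired with a simple key-based paging predicate from Perl DBI; the DSN, schema, table, and column names are all invented, and an index on the key column is assumed:
# Read a large DB2 table in 5,000-row pages. MYSCHEMA.BIGTABLE and the
# key column ID are hypothetical.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:DB2:SAMPLE', 'user', 'pass', { RaiseError => 1 });

my $sth = $dbh->prepare(q{
    SELECT id, payload
    FROM   myschema.bigtable
    WHERE  id > ?                    -- resume after the last key already shown
    ORDER  BY id
    FETCH FIRST 5000 ROWS ONLY
    FOR READ ONLY
    OPTIMIZE FOR 5000 ROWS
    WITH UR
});

my $last_id = 0;
while (1) {
    $sth->execute($last_id);
    my $page = $sth->fetchall_arrayref;
    last unless @$page;              # no more rows
    $last_id = $page->[-1][0];       # remember the last key for the next page
    # ... display / process this 5,000-row page ...
}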