IBM i (System i) V7R1 upgrade causes cursor not open errors on ODBC connection - db2

For years, at least eight, our company has been running a daily process that has never failed. Nothing on the client side has changed, but we recently upgraded to V7R1 on the System i. The very first run of the old process now fails with a Cursor not open message reported back to the client, and that's all that's in the job log as well. I have seen Error -501, SQLSTATE 24501 on occasion.
I got both IBM and DataDirect (provider of the ODBC driver) involved. IBM stated it was a client issue; DataDirect dug through logs and found that the error occurs when the client requests the next block of records from a cursor. They saw no indication that the System i had alerted the client that the cursor was closed.
In troubleshooting, I noticed that the ODBC driver has a WITH HOLD option which is checked by default. If I uncheck it, this particular issue goes away, but it introduces another issue (infinite loops) which is even more serious.
There's no single common theme that causes these errors. The only pattern I see is that they happen while doing some processing inside a loop over a fairly large result set. It doesn't seem to be related to timing, or to a particular table or table type. The outer loops are sometimes over large tables with many data types, sometimes over tiny tables with nothing but CHAR(10) and CHAR(8) columns.
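For reference, that driver option corresponds to declaring cursors WITH HOLD in SQL, which keeps a cursor open across commits. A minimal sketch (table and host variable names are made up):
DECLARE C1 CURSOR WITH HOLD FOR SELECT ORDER_ID, STATUS FROM ORDERS;
OPEN C1;
-- fetch some rows ...
COMMIT;                            -- a held cursor survives this commit
FETCH C1 INTO :ORDER_ID, :STATUS;  -- without WITH HOLD this fetch would fail after the commit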
I don't really expect an answer on here since this is a very esoteric situation, but there's always some hope.
There were other issues that IBM has already addressed by having us apply PTFs to take the database level to 36. I am by no means a System i expert, just a Java programmer who has to deal with this issue that has nothing to do with Java at all.
Thanks

This is for anyone else out there who may run across a similar issue. It turns out it was a bug in the QRWTSRVR code that caused the issue. The driver opened several connections within a single job and used the same name for cursors in at least two of those connections. Once one of those cursors was closed, QRWTSRVR would mistakenly attempt to use the closed cursor and return the error. Here is the description from the PTF cover letter:
DESCRIPTION OF PROBLEM FIXED FOR APAR SE62670 :
A QRWTSRVR job with 2 cursors named C01 takes a MSGSQL0501
error when trying to fetch from the one that is open. The DB2
code is trying to use the cursor which is pseudo closed.
The PTF SI57756 fixed the issue. I do not know whether this PTF will be generally released, but if you find this post because of a similar issue, hopefully this will assist you in getting it corrected.

This is how I fix DB problems on the iSeries.
Start journaling the tables on the iSeries, or change the connection to the iSeries to use commit = *NONE.
For the journaling, I recommend using two journals, each with its own receiver:
One journal for tables with relatively few changes, like a table of US states or a table that gets fewer than 10 updates a month. This is so you can determine when the data was changed for an audit. Keep all the receivers for this journal online forever.
One journal for tables with many changes throughout the day. Delete the receivers for these journals when you can no longer afford the space they take up.
If journaling or commit *NONE doesn't fix it, you'll need to look at the SYSIXADV table; long-running queries can wreck an ODBC connection.
SELECT SYS_TNAME, TBMEMBER, INDEX_TYPE, LASTADV, TIMESADV, ESTTIME,
REASON, "PAGESIZE", QUERYCOST, QUERYEST, TABLE_SIZE, NLSSNAME,
NLSSDBNAME, MTIUSED, MTICREATED, LASTMTIUSE, QRYMICRO, EVIVALS,
FIRSTADV, SYS_DNAME, MTISTATS, LASTMTISTA, DEPCNT FROM sysixadv
ORDER BY ESTTIME desc
Also try ordering by TIMESADV DESC.
Fix those queries, and perhaps create the advised indexes.

Which ODBC driver are you using?
If you're using the IBM i Access ODBC driver, then this problem may be fixed by APAR SE61342. The driver didn't always handle the return code from the server that indicated the result set was closed, so during the SQLCloseCursor function the driver would send a close command to the server, which would return an error since the server had already closed the cursor. Note that you don't have to be at SP11 to hit this condition; SP11 just made it easier to hit, since pre-fetch was enabled in more cases in that fix pack. An easy test to see whether that is the problem is to disable pre-fetch for the DSN or pass PREFETCH=0 on the connection string.
If you're using the DB2 Connect driver, I can't really offer much help, sorry.

Related

PostgreSQL: Backend processes are active for a long time

Now I am hitting a very big roadblock.
I use PostgreSQL 10 and its new table partitioning.
Sometimes many queries don't return, and when that happens I see many active backend processes when I check pg_stat_activity.
At first I thought these processes were just waiting for locks, but the transactions contain only SELECT statements, and no other backend runs any query that requires an ACCESS EXCLUSIVE lock. The queries themselves are fine in terms of plan, and usually they work well. Computer resources (CPU, memory, IO, network) are also not a problem, so these transactions should never conflict. I thoroughly checked the locks of these transactions with pg_locks and pg_blocking_pids(), and in the end I couldn't find any lock that would make the queries much slower. Many of the active backends hold only ACCESS SHARE locks because they run only SELECTs.
Now I think this phenomenon is not caused by locking, but by something related to the new table partitioning.
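For reference, the kind of check I ran looked roughly like this (a sketch; the pid value is just an example):
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active';
-- and, for a specific backend, whether anything blocks it and what it holds:
SELECT pg_blocking_pids(12345);                     -- 12345 is an example pid
SELECT mode, granted, relation::regclass
FROM pg_locks
WHERE pid = 12345;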
So, why are many backends active?
Could anyone help me?
Any comments are highly appreciated.
The figure below is part of the result of pg_stat_activity.
If you want any additional information, please tell me.
EDIT
My query doesn't handle large data. The return type is like this:
uuid UUID
,number BIGINT
,title TEXT
,type1 TEXT
,data_json JSONB
,type2 TEXT
,uuid_array UUID[]
,count BIGINT
Because it has a JSONB column I cannot calculate the exact size, but it is not large JSON.
Normally these queries are moderately fast (around 1.5 s), so there is absolutely no problem; however, when other processes are working, the phenomenon happens.
If the statistics information were wrong, the queries would always be slow.
EDIT2
This is the stat. There are almost 100 connections, so I couldn't show all of them.
To me it looks like an application problem, not a PostgreSQL one. An active status means that your transaction has still not been committed.
So why might your application not be sending a commit to the database?
Try to review where in your application code you open a transaction, read data, commit the transaction, and roll back the transaction.
EDIT:
By the way, to be sure, try to check resource usage before the problem appears and when your queries start hanging. Run top and iotop to check whether postgres really starts eating your CPU or disk like crazy when the problem appears. If not, I would suggest looking for the problem in your application.
Thank you everyone.
I finally solved this problem.
I noticed that one backend process held too many locks. When I executed the query SELECT COUNT(*) FROM pg_locks WHERE pid = <pid>, the result was about 10000.
The max_locks_per_transaction parameter is 64 and max_connections is about 800.
So, if many queries each hold a large number of locks, a shared-memory shortage occurs: the lock table is sized for roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) entries (see the shared-memory calculation code inside PostgreSQL if you are interested).
Too many locks were taken when I executed a query like SELECT * FROM (partitioned table). Imagine you have a partitioned table foo with 1000 partitions. When you execute SELECT * FROM foo WHERE partition_id = <id>, the backend process will take about 1000 table locks (plus index locks). So, I changed the query from SELECT * FROM foo WHERE partition_id = <id> to SELECT * FROM foo_(partition_id), addressing the partition directly. As a result, the problem looks solved.
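A rough sketch of the difference, with hypothetical names (foo partitioned by partition_id into foo_1 ... foo_1000):
-- querying through the parent takes a lock on (roughly) every partition:
BEGIN;
SELECT * FROM foo WHERE partition_id = 42;
SELECT COUNT(*) FROM pg_locks WHERE pid = pg_backend_pid();   -- on the order of 1000
COMMIT;
-- querying the partition directly locks only that table and its indexes:
BEGIN;
SELECT * FROM foo_42;
SELECT COUNT(*) FROM pg_locks WHERE pid = pg_backend_pid();   -- just a handful
COMMIT;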
You say
Sometimes many queries don't return
...however, when other processes are working, the phenomenon happens. If the statistics
information were wrong, the queries would always be slow.
They don't return / are slow when you connect directly to the Postgres instance and run the query, or only when running the queries from an application? The backend processes that are running, are you able to kill them successfully with pg_terminate_backend($PID), or does that have issues? To rule out issues with the statement itself, make sure statement_timeout is set to a reasonable amount to kill off long-running queries. After that is ruled out, perhaps you are running into a case of an application hanging and never allowing the send calls from PostgreSQL to finish. To avoid a situation like that, if you are able to (depending on OS), you can tune the keep-alive time: https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-KEEPALIVES-IDLE (the default is 2 hours)
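For example, hedged sketches of those knobs (the pid, role name, and timeout values are placeholders):
-- kill one stuck backend:
SELECT pg_terminate_backend(12345);                    -- 12345 is a placeholder pid
-- cap statement run time for the current session:
SET statement_timeout = '5min';
-- or persistently for a specific role (takes effect on new connections):
ALTER ROLE app_user SET statement_timeout = '5min';    -- app_user is hypothetical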
Let us know if playing with any of that gives any more insight into your issue.
Sorry for the late post. As #Konstantin pointed out, this might be because of your application (which is why I asked for your EDIT2). Adding a few points:
Table partitioning has no effect on these locks; that is a totally different concept and does not hold up locks in your case.
In your application, check whether the connection is properly close()d after read(), and whether that happens in a finally block (from a Java perspective). I am not sure of your application tier.
Check whether a SELECT .. FOR UPDATE or any similar statement was written erroneously recently and is causing this.
Check whether any table has grown in size recently and the columns you filter on are not indexed. This is a very important and frequent cause of SELECT statements running for minutes; see the sketch after these points. I'd also suggest using timeouts for SELECT statements in your application. https://www.postgresql.org/docs/9.5/gin-intro.html can give you a head start.
Another thing that seems fishy to me is the JSONB column: maybe your JSONB values are pretty long, or the queries are unnecessarily selecting the JSONB value even when it is not required?
Finally, if you don't need the special features of the JSONB data type, you can use the JSON data type, which is faster (sometimes dramatically so, up to 50x!).
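Regarding the table-growth/missing-index point above, a quick way to spot candidates is to look for large tables that are mostly sequentially scanned (a sketch, not a definitive diagnosis):
SELECT relname, n_live_tup, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC, n_live_tup DESC
LIMIT 20;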
It looks like the pooled connections are not getting closed properly and a few queries might be taking a huge amount of time to respond. As pointed out in other answers, it is a problem with the application and could be a connection leak. Most likely it is transactions piling up on top of already pending and unresolved transactions, leading to a number of unclosed transactions.
In addition, PostgreSQL generally has one or more "helper" processes like the stats collector, background writer, autovacuum daemon, walsender, etc., all of which show up as "postgres" instances.
One thing I would suggest: check in which part of the code you initiate the queries. Try to dry-run your queries outside the application and do some benchmarking of query performance.
Secondly, you can set a timeout for certain queries, if not all of them.
Thirdly, you can kill idle-in-transaction sessions after a certain timeout by using:
SET SESSION idle_in_transaction_session_timeout = '5min';
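To see which sessions that setting would affect, a query like this lists backends sitting idle in a transaction (a sketch):
SELECT pid, usename, xact_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;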
I hope it might work. Cheers!

Does SQL0289N affect other users?

I am getting this error:
com.ibm.db2.jcc.a.SqlException: DB2 SQL Error: SQLCODE=-289, SQLSTATE=57011, SQLERRMC=XXX32KTMP, DRIVER=3.51.90
on a select statement that has a couple of dozen sub-selects.
SQL0289N usually means the current table space size is not enough for allocating new pages for new data.
I want to modify my select such that it does not use as much table space.
While modifying the select I presumably will get this error several more times until I am successful.
My questions are:
A) Does this error only affect my select?
B) Are other users of the database more likely to have a problem because I am running this select?
The context of those questions is that I want to know if I have to move my work to a different database to be reasonably sure that I am not impacting other users.
I am wary because the error description is not clear about whether it is running out of space that is shared between all users, or space that is only allocated to my connection.
Note: I am NOT asking how to increase table space or what this error means. I am NOT asking for help modifying my select (hence, I did not show the select). Any answers to that effect would be off topic.
Without knowing exactly how the tablespace in question is defined and why your query needs it, it is hard to give you a definite answer.
In the best case the error affects any SQL statement, executed in any session, that requires the use of the same tablespace, especially if it is a system temporary tablespace.
In the worst case, e.g. if it is an SMS tablespace and it shares the file system with other tablespaces and log files, it might even bring the entire DB2 instance down.
Tuning your statement in a different database does not necessarily mean that it will resolve the problem in the original database.
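If you want to see how much headroom the tablespace in question has, on a reasonably recent DB2 for Linux/UNIX/Windows you can query the monitoring table function (a sketch; adjust for your platform and version):
SELECT TBSP_NAME, TBSP_TYPE, TBSP_PAGE_SIZE,
       TBSP_TOTAL_PAGES, TBSP_USED_PAGES, TBSP_FREE_PAGES
FROM TABLE(MON_GET_TABLESPACE(NULL, -2)) AS T
ORDER BY TBSP_FREE_PAGES;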

DB2 Cursor Stability - deadlocks

I've always thought that READ_COMMITTED (DB2 calls it Cursor Stability (CS)) meant that you do NOT lock on reads, and that you only read the committed data. A situation just came up that makes me realize that DB2 is locking, albeit very briefly, on reads in CS.
Because it's brief, it's vulnerable to deadlocking. We have client code that updates row X and then calls a web service that, among many other things, needs to read row X. Because the client code needs to be able to roll back if the web service fails, it must hold that lock during the web service call.
With the service running at READ_COMMITTED, I thought it would not wait for the record, but would just read the old data and continue (which is fine).
Is there a way, short of running in READ_UNCOMMITTED (which gives me the willies), to make my service NOT go into a lock wait on a read? I am primarily operating through Java & Hibernate, but if you know tricks purely from the DB2 side, even that would be informative.
With the default behaviour of CURSOR STABILITY (the DB2 way), readers are blocked by writers. However, with the newer Oracle-like features, this behavior can be changed.
Depending on your DB2 version, and on whether you migrated an existing database, you probably simply need to switch to the currently committed (CUR_COMMIT) level.
Please take a look at this article, which explains the behavior very clearly.
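As a quick check on DB2 LUW, you can see whether currently committed is already enabled by looking at the database configuration (a sketch; enabling it is done with UPDATE DB CFG ... USING CUR_COMMIT ON):
SELECT NAME, VALUE
FROM SYSIBMADM.DBCFG
WHERE NAME = 'cur_commit';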
More information is available in the product documentation.

SQL Server inserts slow down over time

Not sure if this is a question for here or the DBA forum, but here is some background to my problem. I have an application written in C# which uses the Gentle Framework to interface with our database. The issue I am running into has shown up on two different servers, both running SQL Server 2008 R2. One server is running Windows Server 2003 with 16 GB of RAM, and the other is running Windows Server 2008 R2 with 64 GB of RAM. Also, we are on a gigabit intranet, so I doubt it is either a resource or a network issue.
That being said, my issue is that when inserting into the database, each insert takes a little more time, starting at about 30 ms and building up to about 1900 ms before suddenly taking 189549 ms (a little over 3 minutes). After that 3-minute insert, the time drops back down to about 10 ms and starts building up again. Here is a link to my log file showing the time in ms and the query. Unfortunately, due to what has been called "proprietary", I can't share the exact inserts with you, but I can answer general questions about the details.
Some additional details:
The log file linked only shows queries which took over 10 ms; there are many more queries in the log between the inserts, but they take only 1-2 ms. I can link one with all of the queries.
I have looked at other questions regarding similar issues, but they have either been about RAID or about issuing multiple INSERT statements vs. a single INSERT with multiple VALUES.
These are parameterized queries
The tables are indexed, and it is not possible to remove the indexes because other applications use these tables and depend on them.
I don't think I have much choice in changing how records are inserted because Gentle is generating the SQL.
I found this question which I think is close.
I believe that Gentle is doing all of this in 1 transaction, and I have been told that "it should be done in 1 transaction, and we are not going to break it apart"

SQL Server & ADO.NET: how to automatically cancel long running user query?

I have a .NET Core 2.1 application that allows users to search a large database, with the possibility of using lots of parameters. The data access is done through ADO.NET. Some of the generated queries end up running for a very long time (several hours). Obviously, the user gives up on waiting, but the query chugs along in SQL Server.
I realize that the root cause is the design of the app, but I would like a quick solution for now, if possible.
I have tried many solutions, but none seem to work as expected.
What I have tried:
CommandTimeout
CommandTimeout works as expected with ExecuteNonQuery but does not work with ExecuteReader, as discussed in this forum:
When you execute command.ExecuteReader(), you don't get this exception because the server responds on time. The application doesn't respond because it reads data to the memory, and the ExecuteReader() method doesn't return control until all the data is read.
I have also tried using SqlDataAdapter, but this does not work either.
SQL Server query governor
SQL Server's query governor works off of the estimated execution plan, and while it does work sometimes, it does not always catch inefficient queries.
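For reference, this is the instance-wide cost limit set through sp_configure (the value below is arbitrary):
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'query governor cost limit', 300;
RECONFIGURE;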
SQL Server execution time-out
Tools > Options > Query Execution > SQL Server > General
I'm not sure what this does, but after entering a value of 1, SQL Server still allows queries to run as long as they need. I tried restarting the server instance, but that did not make any difference.
Again, I realize that the cause of this problem is the way the queries are generated, but with so many parameters and so much data, fine-tuning a solution in the design of the application may take some time. As of now, we are manually killing any SPID associated with this app that has run for over 10 or so minutes.
EDIT:
I abandoned the hope of finding a simple solution. If you're having a similar issue, here is what we did to address it:
We created a .NET Core console app that polls the database for queries running over a certain allotted time. The app looks at the login name and the amount of time the query has been running and determines whether to kill the process.
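The core of the polling is just a query against the dynamic management views; a sketch of the kind of detection query involved (the login name and threshold are placeholders):
SELECT r.session_id, s.login_name, r.start_time, r.status,
       r.total_elapsed_time / 60000 AS elapsed_minutes,
       t.text AS query_text
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE s.login_name = 'app_login'                 -- placeholder application login
  AND r.total_elapsed_time > 10 * 60 * 1000;     -- running longer than ~10 minutes
-- each offending session_id is then ended with KILL <session_id>;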
https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlcommand.cancel?view=netframework-4.7.2
Looking through the documentation on SqlCommand.Cancel, I think it might solve your issue.
If you were to create and start a Timer before you call ExecuteReader(), you could then keep track of how long the query is running, and eventually call the Cancel method yourself.
(Note: I wanted to add this as a comment but I don't have the reputation to be allowed to yet)