SQL Query intermittently timing out, Any ALTER immediately fixes it - tsql

I have a pretty basic SQL query I'm running (SQL Server 2016). I'm pulling about 15 columns from some simple inner and left joins on 8 tables that all have WITH (NOLOCK) on the join. My WHERE clause checks a few string, uniqueidentifier, and bit values. The stored procedure has no calculations, no loops/cursors/case statements, etc. It is very straightforward.
For some reason our application keeps intermittently freezing up because the SQL call is timing out. When I grab the call in Profiler and run it in SSMS it runs in sub-second time, but from the application it just won't finish.
However, if I script out an ALTER on the procedure, just add a blank line, and execute the change, the problem goes away and the calls run instantaneously. The change is nothing substantive; just the act of changing the query in some way seems to unlock it.
Does anyone have any idea what could be causing this? I don't see any locks when it is timing out. I was thinking maybe a bad execution plan is being cached?
The only other oddity is that this is running from an old legacy application, our last one in Classic ASP, but it doesn't seem related to the web server or architecture, just the database.
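If a bad cached plan is indeed the culprit, one way to inspect and evict it without ALTERing the procedure might be something along these lines (dbo.usp_GetData is a placeholder name, not the real procedure):
SELECT ps.plan_handle, ps.execution_count, ps.last_execution_time, qp.query_plan
FROM sys.dm_exec_procedure_stats AS ps
CROSS APPLY sys.dm_exec_query_plan(ps.plan_handle) AS qp
WHERE ps.object_id = OBJECT_ID('dbo.usp_GetData');  -- placeholder procedure name
EXEC sp_recompile 'dbo.usp_GetData';  -- drops the cached plan so the next call compiles a fresh one
If the recompile clears the problem the same way the no-op ALTER does, that points at a stale plan rather than blocking.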

Related

Postgres server-side cursor with LEFT JOIN does not return on Heroku PG

I have a Heroku app that uses a psycopg server-side cursor together with a LEFT JOIN query running on Heroku PG 13.5.
The query basically says “fetch items from one table, that don’t appear in another table”.
My data volume is pretty stable, and this has been working well for some time.
This week these queries stopped returning. In pg_stat_activity they appeared as active indefinitely (17+ hours), similarly in heroku pg:ps. There appeared to be no deadlocks. All the Heroku database metrics and logs appeared healthy.
If I run the same queries directly in the console (without a cursor) they return in a few seconds.
I was able to get it working again in the cursor by making the query a bit more efficient (switching from LEFT JOIN to NOT EXISTS; dropping one of the joins).
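For context, the two shapes of that query look roughly like this (items, processed_items, and the id columns are made-up names, not my real schema):
-- original shape: LEFT JOIN and keep rows with no match
SELECT i.*
FROM items i
LEFT JOIN processed_items p ON p.item_id = i.id
WHERE p.item_id IS NULL;
-- rewritten shape: NOT EXISTS anti-join
SELECT i.*
FROM items i
WHERE NOT EXISTS (SELECT 1 FROM processed_items p WHERE p.item_id = i.id);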
My questions are:
Why might the original query perform fine in the console, but not return with a psycopg server-side cursor?
How might I debug this?
What might have changed this week to trigger the issue?
I can say that:
However I write the query (LEFT JOIN, Subquery, NOT EXISTS), the query plan involves a Nested Loop Anti Join
I don’t believe this is related to the Heroku outage the following day (and which didn’t affect Heroku PG)
Having Googled extensively, the closest thing I can find to a hypothesis to explain this is a post on the PG message boards from 2003 entitled "left join in cursor", where the response is "Some plan node types don't cope very well with being run backwards."
Any advice appreciated!
If you are using a cursor, PostgreSQL assumes that only 10% of the query result will actually be fetched, so it prefers plans that return the first few rows quickly, at the expense of the total query cost.
You can disable this optimization by setting the PostgreSQL parameter cursor_tuple_fraction to 1.0.
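For example, at the session level (it can also be set per role or in postgresql.conf):
SET cursor_tuple_fraction = 1.0;  -- plan for fetching the entire result, not just the first rows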

Function in same Session w/diff params becomes increasingly slow

Issue:
PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
Client is Data Grip and same behavior from my report server which uses the driver that comes with Jaspersoft
I am running the same function for reporting with different parameters multiple times under the same session. It also does the same thing using the same parameters.
These are being run one after the other, not at the same time.
The result only has a few rows but does read from quite a few tables, no writes.
It is just table joins and selects, no inserts or updates to the tables themselves (I would like to be able to post the query but can't for security reasons).
After I run the function a few times it starts to slow down and gets to an unacceptable level. For example, one of the functions goes from 1 second to over 90 seconds (that is where I stopped testing).
Troubleshooting:
I have gone to the server and terminated the session and after that it starts to run normally for a few runs.
The standard report does use a temp table but I have removed that for testing.
I have run the following after it starts having the issue, to try and fix it.
VACUUM all touched tables; -- I know this should not be required as there are no major changes to these tables but was trying pretty much anything.
Language plpgsql.
SET SESSION AUTHORIZATION DEFAULT;
RESET ALL;
DEALLOCATE ALL;
CLOSE ALL;
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD PLANS;
DISCARD SEQUENCES;
DISCARD TEMP;
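For what it's worth, most of the commands above (apart from the VACUUM) are exactly what a single DISCARD ALL performs, so it can be used as a shortcut:
DISCARD ALL;  -- resets session state: plans, prepared statements, temp tables, advisory locks, LISTEN registrations, etc.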
There does not seem to be a good way to terminate idle connections without a separate application or script run through a job.
It turns out there is a known but seldom talked about issue with parameterized execution plans: starting with the 6th run in the same session, PostgreSQL may switch to a cached generic plan that can take a lot longer to run. The solution is to set plan_cache_mode to force_custom_plan, and it works fine, or did in my case.
ALTER FUNCTION xy SET plan_cache_mode = force_custom_plan;
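The same thing can also be tried per session before calling the function (plan_cache_mode exists from PostgreSQL 12 on):
SET plan_cache_mode = force_custom_plan;  -- re-plan with the actual parameter values on every execution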

DB2 Tables Not Loading when run in Batch

I have been working on a reporting database in DB2 for a month or so, and I have it set up to a pretty decent degree of what I want. I am, however, noticing small inconsistencies that I have not been able to work out.
Less important, but still annoying:
1) Users claim it takes two login attempts to connect, first always fails, second is a success. (Is there a recommendation for what to check for this?)
More importantly:
2) Whenever I want to refresh the data (which will be nightly), I have a script that drops and then recreates all of the tables. There are 66 tables, each ranging from tens of records to just under 100,000 records. The data is not massive, and it takes about 2 minutes to run all 66 tables.
The issue is that once it says it completed, there are usually at least 3-4 tables that did not load any data. So the table is deleted and then created, but is empty. The log shows that the command completed successfully, and if I run them independently they populate just fine.
If it helps, 95% of the commands are just CAST functions.
While I am sure I am not doing it the recommended way, is there a reason why a number of my tables are not populating? Are the commands executing too fast? Should I delay the CREATE after the DROP?
(This is DB2 Express-C 11.1 on Windows 2012 R2, The source DB is remote)
Example of my SQL:
DROP TABLE TEST.TIMESHEET;
CREATE TABLE TEST.TIMESHEET AS (
SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER(34))TIMESHEET_ID ....
.. (for 5-50 more columns)
FROM REMOTE_DB.TIMESHEET
)WITH DATA;
It is possible to configure DB2 to tolerate certain SQL errors in nested table expressions.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyfqetnint.html
When the federated server encounters an allowable error, the server allows the error and continues processing the remainder of the query rather than returning an error for the entire query. The result set that the federated server returns can be a partial or an empty result.
However, I assume that your REMOTE_DB.TIMESHEET is simply a nickname, and not a view with nested table expressions, and so any errors when pulling data from the source should be surfaced by DB2. Taking a look at the db2diag.log is likely the way to go - you might even be hitting a Db2 issue.
It might be useful to change your script to TRUNCATE and INSERT into your local tables and see if that helps avoid the issue.
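A rough sketch of that approach for the example table, assuming the target table already exists and Db2 LUW syntax (column list abbreviated as in the original script):
TRUNCATE TABLE TEST.TIMESHEET IMMEDIATE;  -- empties the table; must be the first statement in its unit of work
INSERT INTO TEST.TIMESHEET
SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER) AS TIMESHEET_ID  -- same columns/casts as the CREATE ... AS version
FROM REMOTE_DB.TIMESHEET;
COMMIT;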
As you say, you are maybe not doing things the most efficient way. You could consider using cache tables to take a periodic copy of your remote data: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyvfed_tuning_cachetbls.html

Deal with PostgreSQL error "canceling statement due to conflict with recovery" in psycopg2

I'm creating a reporting engine that makes a couple of long queries against a standby server and processes the result with pandas. Everything works fine, but sometimes I have issues with the execution of those queries using a psycopg2 cursor: the query is cancelled with the following message:
ERROR: cancelling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed
I was investigating this issue
PostgreSQL ERROR: canceling statement due to conflict with recovery
https://www.postgresql.org/docs/9.0/static/hot-standby.html#HOT-STANDBY-CONFLICT
but all solutions suggest fixing the issue by making modifications to the server's configuration. I can't make those modifications (we won the last football game against the IT guys :) ), so I want to know how I can deal with this situation from the perspective of a developer. Can I resolve this issue using Python code? My temporary solution is simple: catch the exception and retry all the failed queries. Maybe it could be done better (I hope so).
Thanks in advance
There is nothing you can do to avoid that error without changing the PostgreSQL configuration (from PostgreSQL 9.1 on, you could e.g. set hot_standby_feedback to on).
You are dealing with the error in the correct fashion – simply retry the failed transaction.
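For reference, if whoever administers the standby can change settings, the parameter mentioned above would be set on the standby roughly like this (ALTER SYSTEM requires 9.4+; on older versions it goes into postgresql.conf):
ALTER SYSTEM SET hot_standby_feedback = on;  -- standby tells the primary about its oldest running query
SELECT pg_reload_conf();                     -- a reload is enough, no restart needed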
The table data on the hot standby (slave) server is modified while a long-running query is running. A solution (PostgreSQL 9.1+) to make sure the table data is not modified is to suspend replication on the standby and resume it after the query:
select pg_xlog_replay_pause(); -- suspend
select * from foo; -- your query
select pg_xlog_replay_resume(); -- resume
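Note that from PostgreSQL 10 on these functions were renamed, so on current versions the equivalent calls are:
select pg_wal_replay_pause();   -- suspend
select * from foo;              -- your query
select pg_wal_replay_resume();  -- resume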
I recently encountered a similar error and was also in the position of not being a dba/devops person with access to the underlying database settings.
My solution was to reduce the time of the query wherever possible. Obviously this requires deep knowledge of your tables and data, but I was able to solve my problem with a combination of a more efficient WHERE filter, a GROUP BY aggregation, and more extensive use of indexes.
By reducing the amount of server side execute time and data, you reduce the chance of a rollback error occurring.
However, a rollback can still occur during your shortened window, so a comprehensive solution would also make use of some retry logic for when a rollback error occurs.
Update: A colleague implemented said retry logic as well as batching the query to make the data volumes smaller. These three solutions have made the problem go away entirely.
I got the same error. What you CAN do (if the query is simple enough) is divide the data into smaller chunks as a workaround.
I did this within a Python loop that calls the query multiple times with the LIMIT and OFFSET parameters, like:
query_chunk = f"""
SELECT *
FROM {database}.{datatable}
LIMIT {chunk_size} OFFSET {i_chunk * chunk_size}
"""
where database and datatable are the names of your sources.
The chunk_size depends on your case, and keeping it at a not-too-high value is crucial for the query to finish.
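One caveat with this approach: without an ORDER BY, LIMIT/OFFSET chunks are not guaranteed to be stable between calls, so rows can be skipped or duplicated. A sketch of the generated SQL with a stable ordering (assuming the table has an id column; names are placeholders) would be:
SELECT *
FROM myschema.mytable        -- stands in for {database}.{datatable}
ORDER BY id                  -- any stable, ideally indexed, column
LIMIT 10000 OFFSET 20000;    -- OFFSET advances by chunk_size on each iteration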

SQL queries running slowly or stuck after DBCC DBReindex or Alter Index

All,
SQL 2005 SP3, database is about 70 GB in size. Once in a while when I reindex all of the indexes in all of my tables, the front end seems to freeze up or run very slowly. These are queries coming from the front end, not stored procedures in SQL Server. The front end uses a JTDS JDBC connection to access the SQL Server. If we stop and restart the web services sending the queries, the problem seems to go away. It is my understanding that we have a connection pool in which we reuse connections and don't establish a new connection each time.
This problem does not happen every time we reindex. I have tried it both ways, with DBCC DBREINDEX and with ALTER INDEX using ONLINE = ON and SORT_IN_TEMPDB = ON.
Any insight into why this problem occurs once in a while and how to prevent this problem would be very helpful.
Thanks in advance,
Gary Abbott
When this happens next time, look into sys.dm_exec_requests to see what is blocking the requests from the clients. The blocking_session_id will indicate who is blocking, and the wait_type and wait_resource will indicate what it is blocked on. You can also use the Activity Monitor to the same effect.
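A minimal query along those lines (the columns come from sys.dm_exec_requests; the join to sys.dm_exec_sql_text is optional but handy):
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_resource,
       r.wait_time,
       t.text AS current_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;  -- only requests that are currently blocked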
On a pre-grown database an online index rebuild will not block normal activity (select/insert/update/delete). The load on the server may increase as a result of the online index rebuild, and this could result in overall slower responses, but it should not cause blocking.
If the database is not pre-grown, though, the extra allocations of the index rebuild will trigger database growth events, which can be very slow if left at the default 10% increments and without instant file initialisation enabled. During a database growth event all activity in that database is frozen, and this may be your problem even if the indexes are rebuilt online. Again, Activity Monitor and sys.dm_exec_requests would both clearly show this happening.
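To check whether the data and log files are still on small percentage growth increments, something like this could be run (the database name is a placeholder):
SELECT name AS logical_file_name,
       type_desc,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + ' percent'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_setting
FROM sys.master_files
WHERE database_id = DB_ID('YourDatabase');  -- placeholder name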