I'm developing a pipeline that communicates a lot to my postgres database.
I'm trying to figure out how I can create a "dry run" mode for my pipeline that essentially uses the fewest resources possible just to see if the pipeline can and would run properly.
I understand that I could just not commit my changes to my database at the end of my script, but it takes about an hour for my pipeline to run all the way through. So it'd be cool if I could just simulate a lot of my postgres queries without actually taking the time to run all of them.
I'll give you an extremely simple example so you can see what I mean...
Let's say the only things in my pipeline are refreshing a bunch of materialized views. But they take a long time to refresh, and there's a lot of them. So I'd like to simply ask the postgres db... "hey, if I ran these refreshes right now, would it go through correctly?"
I feel like this is kinda what happens during the "query planning" phase, but this is just an intuition.
I also acknowledge that the only way to know for absolute sure is to actually run it, but let's just say for now that I know and accept the risks that come along with that.
For INSERT, UPDATE, DELETE and SELECT statements, you could prepend EXPLAIN to the statement. That won't return the proper result set, but when it succeeds, the statement should run (unless there is a runtime error).
However, that does not work with REFRESH MATERIALIZED VIEW...
I guess the best option you have is to use a database that has all the tables, views, functions etc., but is empty. Most statements will complete very quickly on an empty database.
Related
I'm debugging a DB performance issue. There's a lead that the issue was introduced after a certain deploy, e.g. when DB started to serve some new queries.
I'm looking to correlate deployment time with the performance issues, and would like to identify the queries that are causing this.
Using pg_stat_statements has been very handy so far. Unfortunately it does not store the time stamp of the first occurrence of each query.
Is there any auxiliary tables I could look into to see the time of first occurrence of queries?
Ideally, if this information would have been available in pg_stat_statements, I'd make a query like this:
select queryid from where date(first_run) = '2020-04-01';
Additionally, it'd be cool to see last_run as well, so to filter out some old queries that no longer execute at all, but remain in pg_stat_statements. That's more of a nice thing that's a necessity though.
This information is not stored anywhere, and indeed it would not be very useful. If the problem statement is a new one, you can easily identify it in your application code. If it is not a new statement, but something made the query slower, the first time the query was executed won't help you.
Is your source code not under version control?
Short story: A report running against a Progress database (OpenEdge Release 10.1C03) takes hours to complete. I suspect that it does not take advantage of existing data indexes. Would like to understand how it scans the data to then try to add an index that will make it run faster.
Source code of the report is not available. The code is native Progress 4GL, not SQL.
If it were an SQL database I would try to do a dump of SQL queries and would then go from that. With 4GL I did not find any such functionality. Is it possible to somehow peek at what gets executed at the low level?
What else can be done if there is no source code?
Thanks!
There are several things you can do:
If I recall correctly 10.1C should have the _usertablestat and _userindexstat virtual system tables available. These allow you to observe, at runtime, what tables and indexes are being accessed by a particular session. You can either write your own 4GL program to query them or you can use the screens in PROMON, R&D, 3 "Other Displays", 5 "I/O Operations by User by Table" and 6 "I/O Operations by User by Index". That will show you what tables and indexes are actually in use and how much use they are getting. If the observed data seems wrong it will probably give you a clue. (If the VSTs are missing it might be because the db was upgraded from an older version -- add them with proutil dbname -C updatevsts.)
You could also use the session startup parameters -clientlog "filename" and -logentrytypes QryInfo to obtain more detailed information about the queries being executed.
Keep in mind that Progress is not SQL. Unlike most SQL databases the 4gl uses a static, compile-time, optimizer. Index selection happens when the code is compiled. So unless you can recompile (and you seem to not have source so that seems unlikely) you won't be able to improve things by adding missing indexes. You might, however, at least be able to show the person who does have source where the problem is.
Another tool that can help is the profiler. This will identify where in the code the time is being spent. That can also be good information to provide to the original vendor if they need help finding the problem. For more information on the profiler: http://dbappraise.com/ppt/profiler.pptx
I need to synchronize two tables across databases, whenever either one changes, on either update, delete or insert. The tables are NOT identical.
So far the easiest and best solution i have been able to find is adding SQL triggers.
Which i slowly started adding, and seems to be working fine. But before i continue finishing it, I want to be sure that this is a good idea? And in general good practice.
If not, what a better option for this scenario?
Thank you in advance
Regards
Daniel.
Triggers will work, but there are quite a few different options available to consider.
Are all data modifications to these tables done through stored procedures? If so, consider putting the logic in the stored procedures instead of in a trigger.
Do the updates have to be real-time? If not, consider a job that regularly synchronizes the tables instead of a trigger. This probably gets tricky with deletes, though. Not impossible, just tricky.
We had one situation where the tables were very similar, but had slightly different column names or orders. In that case, we created a view to the original table that let the application use the view instead of the second copy of the table. We were also able to use a Synonym one time to point to the original table, but that requires that the table structures be the same.
Generally speaking, a lot of people try to avoid unnecessary triggers as they're just too easy to miss when doing other work in the database. That doesn't make them bad, but can lead to interesting times when trying to troubleshoot problems.
In your scenario, I'd probably briefly explore other options before continuing with the triggers. Just watch out for cascading trigger effects where your one update results in the second table updating, passing the update back to the first table, then the second, etc. You can guard for this a little with nesting levels. Otherwise you run the risk of hitting that maximum recursion level and throwing errors.
I've got a really long running SQL query (data import, etc). It's crap - it uses cursors and it running slowly. It's doing it, so I'm not too worried about performance.
Anyways, can I pause it for a while (instead of canceling the query)?
It chews up a a bit of CPU so i was hoping to pause it, do some other stuff ... then resume it.
I'm assuming the answer is 'NO' because of how rows and data gets locked, etc.
I'm using Sql Server 2008, btw.
The best approximation I know for what you're looking for is
BEGIN
WAITFOR DELAY 'TIME';
EXECUTE XXXX;
END;
GO
Not only can you not pause it, doing so would be bad. SQL queries hold locks (for transactional integrity), and if you paused the query, it would have to hold any locks while it was paused. This could really slow down other queries running on the server.
Rather than pause it, I would write the query so that it can be terminated, and pick up from where it left off when it is restarted. This requires work on your part as a query author, but it's the only feasible approach if you want to interrupt and resume the query. It's a good idea for other reasons as well: long running queries are often interrupted anyway.
Click the debug button instead of execute. SQL 2008 introduced the ability to debug queries on the fly. Put a breakpoint at convenient locations
When working on similar situations, where I was trying to go through an entire list of data, which could be huge, and could tell which ones I have visited already, I would run the processing in chunks.
update or whatever
where (still not done)
limit 1000
And then I would just keep running the query until there are no rows being modified. This breaks the locks up into reasonable time chunks and can allow you to do thinks like move tens of millions of rows between tables while a system is in production.
Jacob
Instead of pausing the script, perhaps you could use resource governor. That way you could allow the script to run in the background without severely impacting performance of other tasks.
MSDN-Resource Governor
I have a .NET Core 2.1 application that allows users to search a large database, with the possibility of using lots of parameters. The data access is done through ADO.NET. Some of the queries generated result in long running queries (several hours). Obviously, the user gives up on waiting, but the query chugs along in SQL Server.
I realize that the root cause is the design of the app, but I would like a quick solution for now, if possible.
I have tried many solutions, but none seem to work as expected.
What I have tried:
CommandTimeout
CommandTimeout works as expected with ExecuteNonQuery but does not work with ExecuteReader, as discussed in this forum
When you execute command.ExecuteReader(), you don't get this exception because the server responds on time. The application doesn't respond because it reads data to the memory, and the ExecuteReader() method doesn't return control until all the data is read.
I have also tried using SqlDataAdapter, but this does not work either.
SQL Server query governor
SQL Server's query governor works off of the estimated execution plan, and while it does work sometimes, it does not always catch inefficient queries.
SQL Server execution time-out
Tools > Options > Query Execution > SQL Server > General
I'm not sure what this does, but after entering a value of 1, SQL Server still allows queries to run as long as they need. I tried restarting the server instance, but that did not make any difference.
Again, I realize that the cause of this problem is the way that the queries are generated, but with so many parameters and so much data, fine tuning a solution in the design of the application may take some time. As of now, we are manually killing any spid associated with this app that has run over 10 or so minutes.
EDIT:
I abandoned the hope of finding a simple solution. If you're having a similar issue, here is what we did to address it:
We created a .net core console app that polls the database for queries running over a certain allotted time. The app looks at the login name and the amount of time it's been running and determines whether to kill the process.
https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlcommand.cancel?view=netframework-4.7.2
Looking through the documentation on SqlCommand.Cancel, I think it might solve your issue.
If you were to create and start a Timer before you call ExecuteReader(), you could then keep track of how long the query is running, and eventually call the Cancel method yourself.
(Note: I wanted to add this as a comment but I don't have the reputation to be allowed to yet)