Scheduling a slow query - tsql

I have a query that runs via cursors, and it is slow because it monitors a lot of fields.
I would like to run this query automatically every week; it takes around 2 minutes. So do you think a scheduled job is the way to go? Or would a query that sends an email do? Or do you have any other options? Thank you!
In the query below it checks only for the name field, but I have 50 more fields like that.
SELECT
a.invnumber, a.Accnum,
i.audit_field, i.field_after, name, i.maxdate AS Modified_date
FROM
#Iam a
JOIN
(SELECT
a.invnumber, a.Accnum, a.field_after, audit_field, maxdate
FROM
#Iam_audit a WITH(nolock)
INNER JOIN
(SELECT
Accnum, invnumber, MAX(Modified_Date) AS maxdate
FROM
#Iam_audit a2 WITH(nolock)
WHERE
a2.Audit_field = 'name'
GROUP BY
Accnum, invnumber) AS aa ON aa.Accnum = a.Accnum
AND aa.invnumber = a.invnumber
AND aa.maxdate = a.modified_Date
WHERE
a.Audit_Field = 'name') i ON i.audit_field = 'name'
AND i.Accnum = a.Accnum
AND i.invnumber = a.invnumber
AND a.name <> i.field_after

If you are using SQL Server, then scheduling a job using the SQL Agent would definitely be the way to go.
As stated in the comments, 2 minutes is not a very long time for the query. Simply schedule the job for a time when the system is not under heavy use.
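For example, here is a minimal sketch of creating such a weekly job in T-SQL. The job name, schedule, mail profile, recipient and the dbo.usp_AuditFieldCheck procedure (assumed to wrap your query) are all placeholders to adapt to your environment:

-- Minimal sketch: a weekly SQL Server Agent job that runs the query and e-mails the results.
-- All names, times and the mail profile below are placeholders.
USE msdb;

EXEC dbo.sp_add_job
     @job_name = N'Weekly audit-field check';

EXEC dbo.sp_add_jobstep
     @job_name      = N'Weekly audit-field check',
     @step_name     = N'Run query and mail results',
     @subsystem     = N'TSQL',
     @database_name = N'YourDatabase',
     @command       = N'EXEC msdb.dbo.sp_send_dbmail
                            @profile_name = ''YourMailProfile'',
                            @recipients   = ''you@example.com'',
                            @subject      = ''Weekly audit-field changes'',
                            @query        = ''EXEC dbo.usp_AuditFieldCheck'',
                            @attach_query_result_as_file = 1;';

EXEC dbo.sp_add_schedule
     @schedule_name          = N'Weekly - Sunday 02:00',
     @freq_type              = 8,        -- weekly
     @freq_interval          = 1,        -- Sunday
     @freq_recurrence_factor = 1,        -- every week
     @active_start_time      = 020000;   -- 02:00:00

EXEC dbo.sp_attach_schedule
     @job_name      = N'Weekly audit-field check',
     @schedule_name = N'Weekly - Sunday 02:00';

EXEC dbo.sp_add_jobserver
     @job_name = N'Weekly audit-field check';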
But if you want to move toward more modern SQL, I would look at implementing set-based data operations rather than cursors. This may or may not improve the query timings, but it is something I would definitely look at doing for a production server.
Cursors are generally frowned upon for production operations, although you may be able to justify them in some cases.
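As a rough, hedged illustration of the set-based idea against the tables in your question: the field list in the CROSS APPLY (VALUES ...) clause is assumed (the address field is purely hypothetical), the point being that all 50 monitored fields are checked in one pass instead of a cursor loop:

-- Sketch of a set-based check across many audit fields at once.
-- The field list in CROSS APPLY (VALUES ...) is assumed; extend it to all 50 fields.
;WITH latest_audit AS
(
    SELECT a.Accnum, a.invnumber, a.Audit_Field, a.field_after, a.Modified_Date,
           ROW_NUMBER() OVER (PARTITION BY a.Accnum, a.invnumber, a.Audit_Field
                              ORDER BY a.Modified_Date DESC) AS rn
    FROM #Iam_audit AS a
)
SELECT i.invnumber, i.Accnum, la.Audit_Field, la.field_after,
       f.field_value, la.Modified_Date
FROM #Iam AS i
CROSS APPLY (VALUES ('name', i.name)   -- one row per monitored field
                    /* , ('address', i.address), ... 49 more */
            ) AS f (Audit_Field, field_value)
JOIN latest_audit AS la
  ON  la.Accnum      = i.Accnum
  AND la.invnumber   = i.invnumber
  AND la.Audit_Field = f.Audit_Field
  AND la.rn          = 1
WHERE f.field_value <> la.field_after;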

Related

PostgreSQL how to GROUP BY single field from returned table

So I have a complicated query; to simplify, let it be something like:
SELECT
t.*,
SUM(a.hours) AS spent_hours
FROM (
SELECT
person.id,
person.name,
person.age,
SUM(contacts.id) AS contact_count
FROM
person
JOIN contacts ON contacts.person_id = person.id
) AS t
JOIN activities AS a ON a.person_id = t.id
GROUP BY t.id
Such a query works fine in MySQL, but Postgres needs to know that the GROUP BY field is unique, and even though it actually is, in this case I would need to GROUP BY all the fields returned from the derived table t.
I can do that, but I don't believe it will work efficiently with big data.
I can't JOIN with activities directly in the first query, because a person can have several contacts, which would make the query count the hours of activity several times, once per joined contact.
Is there a Postgres way to make this query work? Maybe a way to force Postgres to treat t.id as unique, or some other solution that achieves the same thing the Postgres way?
This query will not work on either database system as written: there is an aggregate function in the inner query, but you are not grouping by anything (unless you use window functions). There is a special case for MySQL: you can allow this by disabling sql_mode=only_full_group_by. So MySQL accepts the query because of that engine setting, but you cannot do the same in PostgreSQL.
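For reference, that setting can be relaxed per session like this (a sketch; relaxing it only hides the grouping problem, it does not fix it):

-- MySQL only: drop ONLY_FULL_GROUP_BY from the current session's sql_mode.
SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));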
I knew MySQL allowed indeterminate grouping, but I honestly never knew how it implemented it... it always seemed imprecise to me, conceptually.
So depending on what that means (I'm too lazy to look it up), you might need one of two possible solutions, or maybe a third.
If your intent is to see all rows (perform the aggregate function but not consolidate/group rows), then you want a windowing function, invoked with PARTITION BY. Here is a really dumbed-down version using your query:
SELECT
t.*,
SUM (a.hours) over (partition by t.id) AS spent_hours
FROM t
JOIN activities AS a ON a.person_id = t.id
This means you want all records in table t, not one record per t.id. But each row will also contain the sum of the hours for all rows with that value of id.
For example the sum column would look like this:
Name Hours Sum Hours
----- ----- ---------
Smith 20 120
Jones 30 30
Smith 100 120
Whereas a group by would have had Smith once and could not have displayed the hours column in detail.
If you really did only want one row per t.id, then Postgres will require you to tell it how to determine which row. In the example above for Smith, do you want to see the 20 or the 100?
There is another possibility, but I think I'll let you reply first. My gut tells me option 1 is what you're after and you want the analytic function.
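For completeness, here is a hedged sketch of option 1 applied to the full query shape from the question. It assumes person.id is the primary key (which lets Postgres accept GROUP BY person.id in the inner query) and uses COUNT for contact_count where the original had SUM(contacts.id):

SELECT DISTINCT
       t.*,
       SUM(a.hours) OVER (PARTITION BY t.id) AS spent_hours
FROM (
        SELECT person.id,
               person.name,
               person.age,
               COUNT(contacts.id) AS contact_count
        FROM person
        JOIN contacts ON contacts.person_id = person.id
        GROUP BY person.id            -- allowed: id is the primary key of person
     ) AS t
JOIN activities AS a ON a.person_id = t.id;

The DISTINCT collapses the result back to one row per person, because the windowed sum is identical on every activity row belonging to that person.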

How can I get the total run time of a query in redshift, with a query?

I'm in the process of benchmarking some queries in redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need!
The GUI dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with SQL SELECTs?
To add to Alex's answer, I want to point out that the stl_query table has the inconvenience that if the query sat in a queue before running, the queue time is included in its elapsed time, so that elapsed time is not a very good indicator of the query's performance.
To get the actual runtime of the query, check stl_wlm_query for total_exec_time.
select total_exec_time
from stl_wlm_query
where query = <query_id>;   -- substitute the numeric query id; total_exec_time is in microseconds
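A hedged sketch combining the two tables, so queued time and execution time can be compared side by side for the queries you are benchmarking (the LIKE filter is illustrative; stl_wlm_query reports times in microseconds):

select q.query,
       trim(q.querytxt) as querytxt,
       datediff(milliseconds, q.starttime, q.endtime) as elapsed_ms,  -- includes queue time
       w.total_queue_time / 1000.0 as queue_ms,
       w.total_exec_time  / 1000.0 as exec_ms                         -- execution only
from stl_query q
join stl_wlm_query w on w.query = q.query
where q.querytxt like '%my_benchmark_table%'                          -- illustrative filter
order by q.starttime desc;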
There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of said scripts stripped out to give you query run times in milliseconds. Play with the filters, ordering etc to show the results you are looking for:
select userid,
       label,
       stl_query.query,
       trim(database) as database,
       trim(querytxt) as qrytext,
       starttime,
       endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
       aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (
       select query, trim(split_part(event, ':', 1)) as event
       from STL_ALERT_EVENT_LOG
       group by query, trim(split_part(event, ':', 1))
     ) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%')
-- and database = ''
order by starttime desc
limit 100

Amazon redshift query planner

I'm facing a situation with Amazon Redshift that I haven't been able to explain to myself yet. The query planner seems unable to handle the same table appearing in the subqueries of two derived tables in a join.
I have essentially four tables, Source_A, Source_B, Target_1, Target_2, and a query like
SELECT A.a, A.b, B.c, B.d FROM
(
SELECT a,b FROM Source_A where date > (SELECT max(date) FROM Target_1)
) AS A
INNER JOIN
(
SELECT c,d FROM Source_B where date > (SELECT max(date) FROM Target_2)
) AS B
ON A.a = B.c
The query works fine as long as Target_1 and Target_2 are different tables. If I change the query so that Target_2 = Target_1, something happens: the query starts to take about 10 times longer, and when I look at the performance monitor I can see that all this extra time is spent with only the leader node active.
When I take the EXPLAIN of both options I see practically no difference in the output; all the steps are the same. But there is one difference: the EXPLAIN itself takes seconds for one option and almost half an hour for the other, where the Target tables are the same.
So, to summarise what I think I have observed: on a join, if I use the same table in a subquery of each derived table, the query planner goes nuts.
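One workaround sketch, offered only as an assumption rather than documented planner behaviour: materialize the scalar once in a temp table, so Target_1 appears a single time in the final query (table and column names follow the question):

-- Compute the cutoff once, then reference the tiny temp table twice.
CREATE TEMP TABLE max_target AS
SELECT MAX(date) AS max_date FROM Target_1;

SELECT A.a, A.b, B.c, B.d
FROM (SELECT a, b FROM Source_A WHERE date > (SELECT max_date FROM max_target)) AS A
INNER JOIN (SELECT c, d FROM Source_B WHERE date > (SELECT max_date FROM max_target)) AS B
    ON A.a = B.c;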

Performance issue in merge statement

I have a merge statement like below
MERGE DESTINATION AS DST
USING ( SELECT <Some_Columns> FROM TABLEA WITH(NOLOCK) INNER JOIN TableB .....
) AS SRC
ON(
<some conditions>
)
WHEN MATCHED THEN
UPDATE SET column1 = src.column1
...............
,Modified_By = @PackageName
,Modified_Date = GETDATE()
WHEN NOT MATCHED THEN
INSERT (<Some_Columns>)
VALUES(<Some_Columns>)
OUTPUT
$action, inserted.key AS 'inserted'
INTO #tableVar
;
For the first set of records (around 300,000) it is working perfectly and executing in just 30 seconds.
But for the second set of records (around 300,000) it is taking more than an hour.
Two days back I loaded 50 sets like that and the same query was lightning fast, but since today it is terribly slow. I have no idea what is going wrong.
Note: Query
SELECT <Some_Columns> FROM TABLEA WITH(NOLOCK) INNER JOIN TableB .....
is taking 20 seconds in all scenarios.
While the MERGE syntax is nicer, and it seems to promise better atomicity and no race conditions (but doesn't, unless you add HOLDLOCK, as this Dan Guzman post demonstrates), I still feel like it is better to hang onto the old-style, separate insert/update methodology. The primary reason is not that the syntax is hard to learn (which it is), or that it is difficult to work around the concurrency issues (it isn't), but rather that there are several unresolved bugs - even in SQL Server 2012 still - involving this operation. I point over a dozen of them out in this post that talks about yet another MERGE bug that has been fixed recently. I also go into more detail in a cautionary tip posted here and several others chime in here.
As I suggested in a comment, I don't want to sound the alarms that the sky is falling, but I'm really not all that interested in switching to MERGE until there are far fewer active bugs. So I would recommend sticking with an old-fashioned UPSERT for now, even though the actual syntax you're using might not be the source of the performance problem you're having anyway.
UPDATE dest SET column1 = src.column1
FROM dbo.DestinationTable AS dest
INNER JOIN (some join) AS src
ON (some condition);
INSERT dest(...) SELECT cols
FROM (some join) AS src
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.DestinationTable
WHERE key = src.key
);
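If you do stay with that pattern and also need it to be safe under concurrent writers, a hedged sketch of the usual hint-based variant looks like this; dbo.DestinationTable([key], column1) and #Staging are hypothetical names standing in for your real target and source:

BEGIN TRANSACTION;

UPDATE dest
   SET column1 = src.column1
  FROM dbo.DestinationTable AS dest WITH (UPDLOCK, SERIALIZABLE)
 INNER JOIN #Staging AS src
    ON src.[key] = dest.[key];

INSERT dbo.DestinationTable ([key], column1)
SELECT src.[key], src.column1
  FROM #Staging AS src
 WHERE NOT EXISTS
 (
     SELECT 1
       FROM dbo.DestinationTable AS d WITH (UPDLOCK, SERIALIZABLE)
      WHERE d.[key] = src.[key]
 );

COMMIT TRANSACTION;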
For the performance issue, you'll want to look into your session's waits (Plan Explorer* can help with that, by firing up an Extended Events session for you), or at the very least, blocking_session_id and wait_type in sys.dm_exec_requests while the query is running.
*Disclaimer: I used to work for SQL Sentry.
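A minimal sketch of that sys.dm_exec_requests check, run from a second session while the MERGE is executing (the session_id value is a placeholder for the spid of the slow MERGE):

SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.last_wait_type,
       r.wait_time,
       r.blocking_session_id,
       r.cpu_time,
       r.total_elapsed_time
FROM sys.dm_exec_requests AS r
WHERE r.session_id = 53;   -- replace 53 with the session running the MERGE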

PostgreSQL slow COUNT() - is trigger the only solution?

I have a table with posts, which are categorized by:
type
tag
language
All of those "categories" are stored in separate tables (posts_types) and connected via link tables (posts_types_assignment).
COUNTing in PostgreSQL is really slow (I have more than 500k records in that table), and I need to get the number of posts categorized by any combination of type/tag/lang.
If I solved it with triggers, it would be full of multi-level loops, which doesn't look nice and is hard to maintain.
Is there any other way to efficiently get the current number of posts in any given type/tag/language combination?
Let me get this straight.
You have a table posts. You have a table posts_types. The two have a many to many join on posts_types_assignment. And you have some query like this that is slow:
SELECT count(*)
FROM posts p
JOIN posts_types_assignment pta1
ON p.id = pta1.post_id
JOIN posts_types pt1
ON pt1.id = pta1.post_type_id
AND pt1.type = 'language'
AND pt1.name = 'English'
JOIN posts_types_assignment pta2
ON p.id = pta2.post_id
JOIN posts_types pt2
ON pt2.id = pta2.post_type_id
AND pt2.type = 'tag'
AND pt2.name = 'awesome'
And you would like to know why it is painfully slow.
My first note is that PostgreSQL would have to do a lot less work if you had the identifiers in the posts table rather than in the joins. But that is a moot issue, the decision has been made.
My more useful note is that I believe that PostgreSQL has a similar query optimizer to Oracle. In which case to limit the combinatorial explosion of possible query plans that it has to consider, it only considers plans that start with some table, and then repeatedly joins on one more data set at a time. However no such query plan will work here. You can start with pt1, get 1 record, then go to pta1, get a bunch of records, join p, wind up with the same number of records, then join pta2, and now you get a huge number of records, then join to pt2, get just a few records. Joining to pta2 is the slow step, because the database has no idea which records you want, and therefore has to create a temporary result set for every combination of a post and a piece of metadata (type, language or tag) on it.
If this is indeed your problem, then the right plan looks like this. Join pt1 to pta1, put an index on it. Join pt2 to pta2, then join to the result of the first query, then join to p. Then count. This means that we don't get huge result sets.
If this is the case, there is no way to tell the query optimizer that just this once you want it to think up a new type of execution plan. But there is a way to force it.
CREATE TEMPORARY TABLE t1
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
WHERE pt.type = 'language'
AND pt.name = 'English';
CREATE INDEX idx1 ON t1 (post_id);
CREATE TEMPORARY TABLE t2
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
JOIN t1
ON t1.post_id = pta.post_id
WHERE pt.type = 'tag'
AND pt.name = 'awesome';
SELECT COUNT(*)
FROM posts p
JOIN t2
ON p.id = t2.post_id;
Barring random typos, etc, this is likely to perform somewhat better. If it doesn't, double check the indexes on your tables.
As btilly notes, and if he has correctly guessed the schema, the table design does not help: it seems (at first sight, at least) that, for example, having three tables posts_tag(post_id, tag), post_lang(post_id, lang) and post_type(post_id, type) would be more natural and much more efficient.
Apart from that (or in addition to that), one could think of a table or materialized view that summarizes all the possible counts, with columns (lang, type, tag, nposts). Of course, computing this in full would be VERY slow, but (apart from the first time) it can be maintained either by recomputing it in full "in the background" at some interval (if the data does not vary much and you don't require exact counts), or eagerly with triggers.
See for example here
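Concretely, a hedged sketch of that summary idea, using the alternative layout suggested above (posts_tag, post_lang and post_type are the hypothetical tables named in this answer, and posts(id) is assumed):

-- Summary of counts per (lang, type, tag); refresh it in the background at some interval.
CREATE MATERIALIZED VIEW post_counts AS
SELECT pl.lang, pt.type, pg.tag, COUNT(*) AS nposts
FROM posts p
JOIN post_lang pl ON pl.post_id = p.id
JOIN post_type pt ON pt.post_id = p.id
JOIN posts_tag pg ON pg.post_id = p.id
GROUP BY pl.lang, pt.type, pg.tag;

-- Accepting slightly stale counts between refreshes:
REFRESH MATERIALIZED VIEW post_counts;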