Convert a query with a cursor to a more optimized methodology - sql-server-2008-r2

I have a table with many columns, but two of them are of interest in this case. One column holds a Subversion commit number, and the other holds a timestamp of when an automated process ran using the data from that commit. There are many rows with the same commit number, and each commit number has one or more timestamps. I need to get a list of distinct commit numbers, together with the earliest timestamp in the table for each one.
I can do this with a cursor that iterates over the distinct commit numbers and finds the TOP 1 timestamp for each commit, but this is very slow because there are 56 million rows in the table. I feel certain there must be a more efficient way.
Below is my T-SQL:
DECLARE @CommitDates TABLE (CommitNumber int, LastUpdate date)
DECLARE @commit int

DECLARE db_cursor CURSOR FOR
SELECT DISTINCT [CommitNumber] FROM ProcessHistory ORDER BY [CommitNumber] DESC

OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @commit
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO @CommitDates ([CommitNumber], [LastUpdate])
    SELECT TOP 1 [CommitNumber], LastUpdate FROM ProcessHistory WHERE [CommitNumber] = @commit ORDER BY LastUpdate ASC
    FETCH NEXT FROM db_cursor INTO @commit
END
CLOSE db_cursor
DEALLOCATE db_cursor

SELECT * FROM @CommitDates
Expected results: to be able to know quickly the first date on which a given commit number appears in the table, without having to pull up the Subversion log viewer. In this case, I would define "quickly" as executing in no more than 60 seconds.
Actual results: it takes more than 7 minutes 30 seconds to execute this code, which returns only 176 rows as of today.

Well, I feel silly; I just figured it out:
SELECT [CommitNumber], MIN([LastUpdate]) AS LastUpdate FROM ProcessHistory GROUP BY [CommitNumber]
It executes in literally 00:00:02.
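If the aggregate ever slows down as the table grows, a covering index should let SQL Server answer the query from the index alone. A minimal sketch, assuming no suitable index already exists (the index name here is made up):
-- Hypothetical covering index for the GROUP BY/MIN query above: SQL Server
-- can compute MIN(LastUpdate) per CommitNumber from this index alone,
-- without scanning the 56-million-row base table.
CREATE NONCLUSTERED INDEX IX_ProcessHistory_Commit_LastUpdate
ON ProcessHistory (CommitNumber, LastUpdate);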

Related

Using SQL "seek" with a UUID for sorting in a PL/pgSQL Query

I have a table that looks like the following:
CREATE TABLE tmp (
id uuid primary key,
other_id uuid,
...
);
This table has millions of entries, and I need to loop through them all, check and compare the values of some of their fields with the values of another table, and correct the values.
I did not want to use the standard ORDER BY ... LIMIT ... OFFSET ... approach, as its performance suffers greatly for big offsets. Hence, I tried to use the "seek index" approach, example here.
My problem is that I am getting off-by-one errors, and I am not sure (conceptually) how to solve these in PL/pgSQL code. Something like this:
-- Get initial offset
SELECT id INTO _id_offset
FROM tmp
WHERE ...
ORDER BY id DESC
LIMIT 1;

WHILE ... LOOP  -- loop until some fixed high value to prevent an infinite loop, just in case
    SELECT id, other_id, ... INTO rows_to_update
    FROM tmp
    WHERE id < _id_offset AND (...)  -- latter part is the same condition as above
    ORDER BY id DESC
    FETCH NEXT _batch_size ROWS ONLY;

    -- Get next offset
    SELECT id INTO _id_offset
    FROM rows_to_update
    ORDER BY id ASC  -- ASC to get the "last" id from above; cannot simply use a
    LIMIT 1;         -- _batch_size offset as there may be fewer entries left

    -- Update relevant records, check # of updated records to see
    -- if we can terminate the loop early, update the loop condition
    ...
END LOOP;
Unsurprisingly, the first and last entries are skipped due to the < condition. It would have been rather simple to correct this behaviour in application code, but I'm not sure what it should look like in PL/pgSQL.
Is there a simpler way to loop over an entire table in an efficient manner using PL/pgSQL?
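One way to avoid the off-by-one entirely is to leave the bound unset (NULL) on the first pass, so the < filter only applies once a batch has actually been processed. A minimal PL/pgSQL sketch under that assumption; the filter (other_id IS NOT NULL) and the SET clause are hypothetical stand-ins for the real comparison and correction logic:
DO $$
DECLARE
    _batch_size int := 1000;
    _last_id    uuid;    -- NULL on the first pass = no upper bound yet
    _rows       int;
BEGIN
    LOOP
        WITH batch AS (
            SELECT id
            FROM tmp
            WHERE (_last_id IS NULL OR id < _last_id)  -- no row skipped on pass 1
              AND other_id IS NOT NULL                 -- hypothetical filter
            ORDER BY id DESC
            LIMIT _batch_size
        ), corrected AS (
            UPDATE tmp t
            SET other_id = t.other_id                  -- illustrative no-op "fix"
            FROM batch b
            WHERE t.id = b.id
            RETURNING t.id
        )
        SELECT min(id), count(*) INTO _last_id, _rows FROM corrected;

        EXIT WHEN _rows < _batch_size;  -- final (possibly partial) batch done
    END LOOP;
END $$;
Because the next bound is taken as min(id) of the rows just handled, each batch resumes strictly below the previous one, and no boundary row is processed twice or skipped.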

Add date ranges to a table for individual values using a cursor

I have a calendar table called CalendarInformation that gives me a list of dates from 2015 to 2025. This table has a column called BusinessDay that shows which dates are weekends or holidays. I have another table called OpenProblemtimeDiffTable with a column called Number for my problem number, a date for when the problem was opened called ProblemNew, and another date for the current time called Now. What I want to do, for each problem number, is grab its date range, find the dates in between, and then sum them up to give me the number of business days. Then I want to insert these values into another table with the problem number associated with the business days.
Thanks in advance and I hope I was clear.
TRUNCATE TABLE ProblemsMoreThan7BusinessDays

DECLARE @date AS date
DECLARE @businessday AS INT
DECLARE @Startdate AS DATE, @EndDate AS DATE

DECLARE CONTACT_CURSOR CURSOR FOR
SELECT date, businessday
FROM CalendarInformation

OPEN CONTACT_CURSOR
FETCH NEXT FROM CONTACT_CURSOR INTO @date, @businessday
WHILE (@@FETCH_STATUS = 0)
BEGIN
    SELECT @EndDate = now FROM OpenProblemtimeDiffTable
    SELECT @Startdate = problemnew FROM OpenProblemtimeDiffTable
    SET @date = @Startdate
    PRINT @EndDate
    PRINT @Startdate
    SELECT @businessday = SUM(businessday) FROM CalendarInformation WHERE date > @Startdate AND date <= @EndDate
    INSERT INTO ProblemsMoreThan7BusinessDays (businessdays, number)
    SELECT @businessday, number
    FROM OpenProblemtimeDiffTable
    FETCH NEXT FROM CONTACT_CURSOR INTO @date, @businessday
END
CLOSE CONTACT_CURSOR
DEALLOCATE CONTACT_CURSOR
I tried this code using a cursor and I'm close, but I cannot get the date ranges to change for each row.
So if I have a problem number with a date range between 02-07-2018 and 05-20-2019, I would want my new table to hold the sum of business days from the calendar along with the problem number. So my output would be number PROB0421 with businessdays (the correct sum). Then for the next problem, PROB0422, with a date range of 11-6-18 to 5-20-19, my output would be PROB0422 with the correct sum of business days.
Rather than doing this with a cursor, you should approach it in a set-based manner. That you already have a calendar table makes this a lot easier. The basic approach is to select from your data table and join into your calendar table to return all the rows in the calendar table that sit within your date range. From there you can aggregate as you require.
This would look something like the below, though apply it to your situation and adjust as required:
select p.ProblemNew
      ,p.Now
      ,sum(c.BusinessDay) as BusinessDays
from dbo.Problems as p
join dbo.calendar as c
    on c.CalendarDate between p.ProblemNew and p.Now
    and c.BusinessDay = 1
group by p.ProblemNew
        ,p.Now
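If you also need the problem number in the output, as the question asks, include it in both the select and the group by. A hypothetical variant, assuming the column is called Number and the open date is ProblemNew:
select p.Number
      ,sum(c.BusinessDay) as BusinessDays
from dbo.Problems as p
join dbo.calendar as c
    on c.CalendarDate between p.ProblemNew and p.Now
    and c.BusinessDay = 1
group by p.Number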
I think you can do this without a cursor. It should only require a single INSERT...SELECT statement.
I assume your businessday column is just a bit or flag-type field that is 1 if the date is a business day and 0 if not? If so, this should work (or something close to it, if I'm not understanding your environment properly):
insert ProblemsMoreThan7BusinessDays
(
    businessdays
    , number
)
select
    sum( ci.businessday )   -- or count(*)
    , op.number
from OpenProblemtimeDiffTable op
inner join CalendarInformation ci
    on ci.[date] >= op.problemnew
    and ci.[date] <= op.[now]
    and ci.businessday = 1
group by
    op.number
I usually try to avoid using cursors and working with data in a procedural manner, especially if I can handle the task as above. Don't think of the data as thousands of individual rows; think of it as just two sets of data. How do they relate?

historical result of SELECT statement

I want to query a large number of rows and display them to the user. However, the user will see only, for example, 10 rows at a time: I will use LIMIT and OFFSET, and when he presses the 'next' button in the user interface the next 10 rows will be fetched.
The database is updated all the time. Is there any way to guarantee that the user will see the data of the next 10 rows as it was at the time of the first select, so that any changes to the data will not be reflected if he chooses to see the next 10 rows of the result?
This is like using the SELECT statement as a snapshot of the past: any updates after the first SELECT would not be visible to the subsequent SELECT ... LIMIT ... OFFSET queries.
You can use cursors. Example:
drop table if exists test;
create table test(id int primary key);
insert into test
select i from generate_series(1, 50) i;
declare cur cursor with hold for
select * from test order by id;
move absolute 0 cur; -- get 1st page
fetch 5 cur;
id
----
1
2
3
4
5
(5 rows)
truncate test;
move absolute 5 cur; -- get 2nd page
fetch 5 cur; -- works even though the table is already empty
id
----
6
7
8
9
10
(5 rows)
close cur;
Note that this is a rather expensive solution. A cursor declared WITH HOLD causes the server to materialize the complete result set (a snapshot) so that it can outlive the transaction. This can cause significant server load. Personally, I would rather search for alternative approaches (with dynamic results).
Read the documentation: DECLARE, FETCH.
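For contrast, one common alternative with dynamic results is keyset pagination: no server-side state is held, but later pages reflect the data as it currently is rather than a snapshot. A sketch against the same test table (this is my illustration, not from the documentation above):
-- Keyset (seek) pagination: remember the last id shown on the previous
-- page and start the next page just after it. Concurrent inserts and
-- deletes are reflected in later pages.
SELECT *
FROM test
WHERE id > 5      -- last id from the previous page (page 1 ended at id 5)
ORDER BY id
LIMIT 5;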

sqlite3 trigger to auto-add new month record

I need to automatically insert a row in a stats table that is identified by the month number, if the new month does not exist as a row.
'cards' is a running count of individual IDs that stores a current value (which gets reset at rollover time), a rollover count, and a running total of all events on that ID.
'stats' keeps a running count of all IDs' events, and how many rollovers occurred in a given month.
CREATE TABLE IDS (ID_Num VARCHAR(30), Curr_Count INT, Rollover_Count INT, Total_Count INT);
CREATE TABLE stats(Month char(10), HitCount int, RolloverCount int);
CREATE TRIGGER update_Tstats BEFORE UPDATE OF Total_Count ON IDS
WHEN 0=(SELECT HitCount FROM stats WHERE Month = strftime('%m','now'))
BEGIN
    INSERT INTO stats (Month, HitCount, RolloverCount) VALUES (strftime('%m','now'), 0, 0);
END;
(I also tried an IS NULL at the other end of the WHEN clause... still no joy.)
I did have it working to a point, but as the rollover count was updated twice per cycle (the value is changed up and down via an SQL query I have in a Python script), it gave me double-ups in the stats rollover count. So now I'm running a double query in my script. However, this all falls over if the current month number does not exist in the stats table.
All I need to do is check whether a blank record exists for the current month for the Python script's UPDATE queries to run against, and if not, INSERT one. The script itself can't do a 'run once' type of query on initial startup, because it may run for days, including spanning a new-month changeover.
Any assistance would be hugely appreciated.
To check whether a record exists, use EXISTS:
CREATE TRIGGER ...
WHEN NOT EXISTS (SELECT 1 FROM stats WHERE Month = ...)
BEGIN
INSERT INTO stats ...
END;
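Applied to the schema in the question, the full trigger might look like the sketch below; this is pieced together from the asker's own trigger, so treat it as an assumption rather than tested code:
CREATE TRIGGER update_Tstats BEFORE UPDATE OF Total_Count ON IDS
WHEN NOT EXISTS (SELECT 1 FROM stats WHERE Month = strftime('%m','now'))
BEGIN
    -- Seed a blank row for the current month so the Python script's
    -- UPDATE queries always have a target row.
    INSERT INTO stats (Month, HitCount, RolloverCount)
    VALUES (strftime('%m','now'), 0, 0);
END;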

SQLite - a smart way to remove and add new objects

I have a table in my database, and I want each row in my table to have a unique id and the rows to be numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" occurs. And afterwards I add more data, but the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every row position so I can access my table arbitrarily by index.
Is there a way in SQLite to do this? Or do I have to manually manage the removing and adding of data?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you rethink your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using a PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing with.
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum