Optimizing removal of SQL duplicates using ROW_NUMBER - tsql

I'm attempting to remove redundant rows from an SQL table, [InfoBucket], with columns:
[ID] (varchar(16)), [column1], ... [columnN], [Speed] (bigint)
([column1...N] have datatypes ranging from integers to varchar()s of varying lengths.)
There are rows in the table that have the same value in the [ID] and some [column1...N] columns.
I'm taking all these duplicates and deleting all but the row that has the greatest [Speed].
There are approximately 400 million rows in the [InfoBucket].
To split the work into manageable chunks, I have another table, [UniqueIDs], with one column:
[ID] (varchar(16))
and which is populated like so:
begin
insert into [UniqueIDs]
select distinct [ID] from [InfoBucket]
end
go
There are approximately 15 million rows in [UniqueIDs].
I have been using Martin Smith's excellent answer to a similar question.
My procedure currently looks like this:
begin
    declare @numIDs int
    set @numIDs = 10000

    ;with toRemove as
    (
        select ROW_NUMBER() over (partition by
                   [ID],
                   [column1],
                   ...
                   [columnN]
               order by [Speed] desc) as 'RowNum'
        from [InfoBucket]
        where [ID] in
        (
            select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
        )
    )
    delete toRemove
    where RowNum > 1
    option (maxdop 1)
    ;

    ;with IDsToRemove as
    (
        select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
    )
    delete IDsToRemove
    option (maxdop 1)
end
go
There are nonclustered indexes on [ID] in both [InfoBucket] and [UniqueIDs], and the "partition by ..." in the over clause only includes the columns that need to be compared.
Now, my problem is that it takes a little over six minutes for this procedure to run. Adjusting the value of @numIDs changes the running time roughly linearly (i.e. when @numIDs is 1,000 the procedure runs for approximately 36 seconds (6 min. / 10), and when @numIDs is 1,000,000 it runs for approximately 10 hours (6 min. * 100)). This means that removing all duplicates in [InfoBucket] would take days.
I've tried adding a uniqueidentifier column, [UI_ID], to [InfoBucket] and creating a clustered index on it (so [InfoBucket] had one clustered index on [UI_ID] and one nonclustered index on [ID]), but that actually increased the running time.
Is there any way I can further optimize this?

The key is to find the sweet spot for deleting the rows. Play with @numIDs to find the fastest increment, and then just let it churn.
It's 400 million rows; it's not going to complete the whole process in minutes, maybe not even in hours. It's going to take time. As long as the table does not fill up faster than you can remove dupes, you are OK.
Find the sweet spot, then schedule it to run often and off peak. Then check the process from time to time to make sure the sweet spot stays sweet.
The only other thing I can think of is to calculate the dupes separately from deleting them. This will save some time, especially if you can calculate the dupes in one SQL statement, put that data into yet another table (e.g. [DupeIdsToDelete]), and then run a delete loop against those IDs, as sketched below.
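A rough sketch of that staging approach (it assumes a unique row key on [InfoBucket], here called [RowKey], e.g. an identity column added just for this job; [DupeIdsToDelete] simply holds those keys):

-- Compute the duplicate rows once and stash their keys.
insert into [DupeIdsToDelete] ([RowKey])
select [RowKey]
from
(
    select [RowKey],
           ROW_NUMBER() over (partition by [ID], [column1], /* ... */ [columnN]
                              order by [Speed] desc) as RowNum
    from [InfoBucket]
) ranked
where RowNum > 1;

-- Then delete in batches so each transaction stays small.
while 1 = 1
begin
    delete top (10000) ib
    from [InfoBucket] ib
    join [DupeIdsToDelete] d on d.[RowKey] = ib.[RowKey];

    if @@ROWCOUNT = 0 break;
end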

Related

Is there a way to add the same row multiple times with different ids into a table with postgresql?

I am trying to add the same data for a row into my table x number of times in PostgreSQL. Is there a way of doing that without manually entering the same values x number of times? I am looking for the equivalent of GO [count] in SQL Server for Postgres... if that exists.
Use the function generate_series(), e.g.:
insert into my_table
select id, 'alfa', 'beta'
from generate_series(1,4) as id;
Test it in db<>fiddle.
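Here, my_table just stands for any target table with three compatible columns; a hypothetical definition to make the example runnable:

create table my_table (id int, col_a text, col_b text);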
Idea
Produce a resultset of a given size and cross join it with the record that you want to insert x times. What would still be missing is the generation of proper PK values. A specific suggestion would require more details on the data model.
Query
The sample query below presupposes that your PK values are autogenerated.
CREATE TABLE test ( id SERIAL, a VARCHAR(10), b VARCHAR(10) );
INSERT INTO test (a, b)
WITH RECURSIVE Numbers(i) AS (
SELECT 1
UNION ALL
SELECT i + 1
FROM Numbers
WHERE i < 5 -- This is the value `x`
)
SELECT adhoc.*
FROM Numbers n
CROSS JOIN ( -- This is the single record to be inserted multiple times
SELECT 'value_a' a
, 'value_b' b
) adhoc
;
See it in action in this db fiddle.
Note / Reference
The solution is adapted from here with minor modifications (there are a host of other solutions for generating x consecutive numbers with SQL hierarchical / recursive queries, so the choice of reference is somewhat arbitrary).

Best way to delete large no of random rows in PostgreSQL

I have a table which contains about 900K rows. I want to delete about 90% of the rows. I tried using TABLESAMPLE to select them randomly but didn't get much of a performance improvement. Here are the queries I have tried and their times:
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users ORDER BY random() LIMIT 5000
)
[2017-11-22 11:35:39] 5000 rows affected in 1m 11s 55ms
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE BERNOULLI (5)
)
[2017-11-22 11:55:07] 5845 rows affected in 1m 13s 666ms
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE SYSTEM (5)
)
[2017-11-22 11:57:59] 5486 rows affected in 1m 4s 574ms
Deleting only 5% of the data takes about a minute, so this is going to take very long for larger amounts. Please suggest whether I am doing things right, or if there is a better way to do this.
Deleting a large number of rows is always going to be slow. The way you identify them won't make much difference.
Instead of deleting a large number of rows, it's usually a lot faster to create a new table that contains the rows you want to keep, e.g.:
create table users_to_keep
as
select *
from users
tablesample system (10);
then truncate the original table and insert the rows that you stored away:
truncate table users;
insert into users
select *
from users_to_keep;
If you want, you can do all of that in a single transaction, as sketched below.
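For example (TRUNCATE is transactional in PostgreSQL, so the whole swap either completes or rolls back):

begin;

create table users_to_keep
as
select *
from users
tablesample system (10);

truncate table users;

insert into users
select *
from users_to_keep;

drop table users_to_keep;

commit;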
As a_horse_with_no_name pointed out, the random selection itself is a relatively minor factor. And much of the cost associated with a deletion (e.g. foreign key checks) is not something you can avoid.
The only thing which stands out as an unnecessary overhead is the id-based lookup in the DELETE statement; you just visited the row during the random selection step, and now you're looking it up again, presumably via an index on id.
Instead, you can perform the lookup using the row's physical location, represented by the hidden ctid column:
DELETE FROM users WHERE ctid = ANY(ARRAY(
SELECT ctid FROM users TABLESAMPLE SYSTEM (5)
))
This gave me a ~6x speedup in an artificial test, though it will likely be dwarfed by other costs in most real-world scenarios.

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this piece of code I detect whether a vehicle has stopped for at least 5 minutes.
It works, but with a large amount of data it starts to get slow.
I did a lot of tests and I'm sure that my problem is in the not exists block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
SELECT
*
FROM
messages m
WHERE
m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where
vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
)
I can't figure out how to build the condition in a more performant way.
Edit DAY2:
I added a CTE like this to pre-filter the rows used in the subquery:
WITH messagesx as (
SELECT
vehicleid,
messagedate
FROM
messages
WHERE
speedeffective > 5
)
and now it works better. I think I'm just missing a little detail.
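This is roughly how I'm wiring the filtered CTE back into the query (the intent is still: no message with speedeffective > 5 in the following 5 minutes):

WITH messagesx AS (
    SELECT vehicleid, messagedate
    FROM messages
    WHERE speedeffective > 5
)
SELECT m.*
FROM messages m
WHERE m.speedeffective > 0
  AND m.next_speedeffective = 0
  AND NOT EXISTS (
        SELECT 1
        FROM messagesx x
        WHERE x.vehicleid = m.vehicleid
          AND x.messagedate > m.messagedate
          AND x.messagedate <= m.messagedate + interval '5 minutes'
      );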
Typically, a 'NOT EXISTS' can slow down your query, since it may require probing the table once for each of the outer rows. Try to incorporate the same functionality within a join (I'm rewriting the query here without knowing the table, so I might make a mistake):
SELECT
    m1.*
FROM
    messages m1
LEFT JOIN
    messages m2
    ON  m1.vehicleid = m2.vehicleid
    AND m2.speedeffective > 5
    AND m2.messagedate > m1.messagedate
    AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
    m1.speedeffective > 0
    and m1.next_speedeffective = 0
    and m2.vehicleid IS NULL
Take note that the NOT EXISTS is rewritten as the non-hit of the join condition.
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and after reading about NOT IN, NOT EXISTS and LEFT JOIN (where the joined row is NULL):
For PostgreSQL, NOT EXISTS and LEFT JOIN are both anti-joins and work the same way. (This is the reason why @CountZukula's answer gives almost the same result as mine.)
The problem was in the kind of join operation chosen: nested loop or hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran a VACUUM ANALYZE on the messages table and the same query now runs much faster.
So, after the VACUUM, PostgreSQL can plan the query better.
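For reference, the statement was simply:

VACUUM ANALYZE messages;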

SQL limit query

I'm having an issue with limiting a SQL query. I'm using SQL Server 2000, so I can't use any of the features like ROW_NUMBER(), CTEs, or OFFSET ... FETCH.
I have tried the SELECT TOP limit * FROM approach and excluded the already-shown results, but that way the query is very slow because sometimes my query fetches more than 10,000 records.
Also I have tried the following approach:
SELECT * FROM (
    SELECT DISTINCT TOP 100 PERCENT im.name, im.location, im.image,
        ( SELECT count(DISTINCT i.id) FROM image AS i WHERE i.id <= im.id ) AS recordnum
    FROM images AS im
    order by im.location asc, im.name asc) as tmp
WHERE recordnum between 5 AND 15
Same problem here, plus an issue because I couldn't add an ORDER BY option in the subquery for recordnum. I have placed both solutions in a stored procedure but the query execution is still very slow.
So my question is:
Is there an efficient way to limit the query to pull 20 records per page in SQL 2000 for large amounts of data, i.e. more than 10,000 rows?
Thanks.
This way the subquery is only run once, and the where im2.id is null condition skips the first 40 rows:
SELECT top 25 im1.*
FROM images im1
left join ( select top 40 id from images order by id ) im2
on im1.id = im2.id
where im2.id is null
order by im1.id
Query-wise, there is no approach that will perform great. If performance is critical and the data will always be grouped/ordered the same way, you could add an int column and set its value with a trigger based on the grouping/ordering (sketched below). Index it and it should be extremely fast for reads; writes will be a bit slower.
Also, make sure you have indexes on the Id columns on image and images.
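A rough sketch of that precomputed-position idea (the position_no column, the index names, and the paging values are all hypothetical; the ordering follows the location/name sort from the question):

-- Hypothetical precomputed position column, refreshed by a batch job
-- (or maintained by a trigger) whenever rows are added or removed.
ALTER TABLE images ADD position_no int NULL

CREATE INDEX IX_images_position_no ON images (position_no)
CREATE INDEX IX_images_id ON images (id)

-- Recompute positions in the desired display order.
-- The correlated count is O(n^2), so run it off-peak.
UPDATE images
SET position_no = (
    SELECT COUNT(*)
    FROM images AS i2
    WHERE i2.location < images.location
       OR (i2.location = images.location AND i2.name <= images.name)
)

-- Page 3 at 20 rows per page then becomes a simple range seek:
SELECT name, location, image
FROM images
WHERE position_no BETWEEN 41 AND 60
ORDER BY position_no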

SQLite - a smart way to remove and add new objects

I have a table in my database and I want each row to have a unique id and the rows to be numbered sequentially.
For example: I have 10 rows, each with an id, starting at 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. And afterwards, when I add more data, the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every position so I can access my table arbitrarily.
Is there a way in SQLite to do this? Or do I have to manually manage the removing and adding of data?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you re-think your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing with.
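For example, to fetch the 5th row (1-based) of mytable:

SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET 4;  -- offset = 5 - 1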
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum
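A minimal sketch of both options (note that VACUUM may renumber rowids in tables that lack an explicit INTEGER PRIMARY KEY):

-- Reclaim free pages automatically; takes effect for a new database,
-- or after a full VACUUM on an existing one.
PRAGMA auto_vacuum = FULL;

-- Rebuild and compact the database file right now.
VACUUM;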