I'm using this query to find duplicate dates, but I'm not sure how to sum the values for each duplicated date, average them, and remove the duplicate rows.
DB Schema
SQL Query
SELECT date_time, COUNT(date_time)
FROM mytable -- table name is not shown in the question; mytable is a placeholder
GROUP BY date_time
HAVING COUNT(date_time) > 1
ORDER BY COUNT(date_time)
I would create a new table to replace the old one. That is easier and might even perform better:
CREATE TABLE mytable2 (LIKE mytable);
INSERT INTO mytable2 (date_time, datapoint_1, datapoint_2)
SELECT m.date_time, avg(m.datapoint_1), avg(m.datapoint_2)
FROM mytable AS m
GROUP BY m.date_time;
Then you can drop mytable and rename mytable2 to replace it.
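As a minimal sketch of the copy-and-swap approach, the following runs the same steps against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for the real database, the sample rows are invented for illustration, and the new table is created with an explicit column list because SQLite has no CREATE TABLE ... (LIKE ...).

```python
import sqlite3

# In-memory SQLite database standing in for the real one (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (date_time TEXT, datapoint_1 REAL, datapoint_2 REAL)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("2021-06-30", 10.0, 1.0),
        ("2021-06-30", 20.0, 3.0),  # same date_time as the row above
        ("2021-07-01", 5.0, 7.0),
    ],
)

# Build the replacement table: one row per date_time holding the averages.
conn.execute("CREATE TABLE mytable2 (date_time TEXT, datapoint_1 REAL, datapoint_2 REAL)")
conn.execute(
    """
    INSERT INTO mytable2 (date_time, datapoint_1, datapoint_2)
    SELECT date_time, avg(datapoint_1), avg(datapoint_2)
    FROM mytable
    GROUP BY date_time
    """
)

# Drop the old table and rename the new one into its place.
conn.execute("DROP TABLE mytable")
conn.execute("ALTER TABLE mytable2 RENAME TO mytable")

rows = conn.execute("SELECT * FROM mytable ORDER BY date_time").fetchall()
print(rows)
```

On a live system you would probably want the drop and rename inside one transaction so readers never see a missing table.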
To prevent new rows from creating duplicates, you could change the way you insert data:
-- to keep track of counts
ALTER TABLE mytable ADD numval integer DEFAULT 1;
-- to prevent duplicates
ALTER TABLE mytable ADD UNIQUE (date_time);
-- to insert new rows
INSERT INTO mytable (date_time, datapoint_1, datapoint_2)
VALUES ('2021-06-30', 42.0, -34.9)
ON CONFLICT (date_time)
DO UPDATE SET numval = mytable.numval + 1,
datapoint_1 = mytable.datapoint_1 + excluded.datapoint_1,
datapoint_2 = mytable.datapoint_2 + excluded.datapoint_2;
-- to select the averages
SELECT date_time,
datapoint_1 / numval AS datapoint_1,
datapoint_2 / numval AS datapoint_2
FROM mytable;
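To see the running-sum idea end to end, here is a small sketch using Python's sqlite3 module. It assumes a SQLite build that supports ON CONFLICT ... DO UPDATE (3.24 or later); the table is created with the extra column and unique constraint up front, and the inserted values are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mytable (
        date_time   TEXT UNIQUE,        -- rejects a second row for the same date
        datapoint_1 REAL,
        datapoint_2 REAL,
        numval      INTEGER DEFAULT 1   -- number of values summed into the row
    )
    """
)

# On a duplicate date_time, add the new values to the stored sums
# and bump the counter instead of inserting a second row.
upsert = """
    INSERT INTO mytable (date_time, datapoint_1, datapoint_2)
    VALUES (?, ?, ?)
    ON CONFLICT (date_time)
    DO UPDATE SET numval = numval + 1,
                  datapoint_1 = datapoint_1 + excluded.datapoint_1,
                  datapoint_2 = datapoint_2 + excluded.datapoint_2
"""
sample = [
    ("2021-06-30", 42.0, -34.0),
    ("2021-06-30", 40.0, -30.0),  # duplicate date: folded into the first row
    ("2021-07-01", 1.0, 2.0),
]
for row in sample:
    conn.execute(upsert, row)

# Averages are recovered by dividing the stored sums by the counter.
averages = conn.execute(
    """
    SELECT date_time, datapoint_1 / numval, datapoint_2 / numval
    FROM mytable
    ORDER BY date_time
    """
).fetchall()
print(averages)
```

Storing sums plus a count keeps each insert cheap, at the cost of every reader having to divide.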
When you use GROUP BY, you can also use aggregate functions to reduce multiple rows to a single one (COUNT, which you already used, is one such function). In your case the query would be:
SELECT date_time, avg(datapoint_1), avg(datapoint_2)
FROM mytable -- table name is not shown in the question; mytable is a placeholder
GROUP BY date_time
For every distinct date_time you will get a single row with the average of datapoint_1 and datapoint_2.
I have a list of values:
and a 'template' row in a table.
I want to do this:
for value in valuelist:
insert into table1 (field1, field2, field3, field4)
select value1, value2, value3, (value)
from table1
where ID = (ID of template row)
I know how I would do this in code, in C# for instance, but I'm not sure how to 'loop' this while passing a new value in to the insert statement. (I know that code makes no sense; I'm just trying to convey what I'm trying to accomplish.)
There is no need to loop here. SQL is a set-based language: you apply your operations to entire sets of data at once, as opposed to looping through row by row.
insert statements can take their rows either from an explicit list of values or from the result of a regular select statement, for example:
insert into table1(col1, col2)
select col3
      ,col4
from table2;
There is nothing stopping you selecting your data from the same place you are inserting to, which will duplicate all your data:
insert into table1(col1, col2)
select col1
      ,col2
from table1;
If you want to edit one of these column values - say by incrementing the value currently held, you simply apply this logic to your select statement and make sure the resultant dataset matches your target table in number of columns and data types:
insert into table1(col1, col2)
select col1
,col2+1 as col2
from table1;
Optionally, if you only want to do this for a subset of those values, just add a standard where clause:
insert into table1(col1, col2)
select col1
,col2+1 as col2
from table1
where col1 = <your value>;
If this isn't enough for you to work it out by yourself, you can join your dataset to your values list to get a version of the data to be inserted for each value in that list. Because you want each row to join to each value, you can use a cross join:
create table #v (value int);
insert into #v values(56957),(85697),(56325),(45698),(21367),(56397),(14758),(39656);
insert into table1(col1, col2, value)
select t.col1
      ,t.col2
      ,v.value
from table1 as t
cross join #v as v;
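The cross join can be tried out with a small sketch in Python's sqlite3 module. SQLite replaces SQL Server here, the id column, the template row contents and the shortened value list are all assumptions made for the demo, and the WHERE clause restricts the join to the template row as the question asks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, value INTEGER)"
)
# Hypothetical template row whose col1/col2 should be copied for every value.
conn.execute("INSERT INTO table1 (id, col1, col2, value) VALUES (1, 'a', 'b', NULL)")

# The list of values lives in its own (temporary) table.
conn.execute("CREATE TEMP TABLE v (value INTEGER)")
conn.executemany("INSERT INTO v VALUES (?)", [(56957,), (85697,), (56325,)])

# One set-based insert: the template row is paired with every value.
conn.execute(
    """
    INSERT INTO table1 (col1, col2, value)
    SELECT t.col1, t.col2, v.value
    FROM table1 AS t
    CROSS JOIN v
    WHERE t.id = 1
    """
)

new_rows = conn.execute(
    "SELECT col1, col2, value FROM table1 WHERE id <> 1 ORDER BY value"
).fetchall()
print(new_rows)
```

Without the WHERE clause every existing row would be multiplied by every value, which is rarely what you want on a populated table.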
I have a table that will have 500,000+ records.
Each record has a LineNumber field which is not unique and not part of the primary key.
Each record has a CreatedOn field.
I need to update all 500,000+ records to identify repeat records.
A repeat record is defined as a record that has the same LineNumber as an earlier record created within the seven days before its own CreatedOn value.
In the diagram above row 4 is a repeat because it occurred only five days since row 1.
Row 6 is not a repeat even though it occurs only four days since row 4, but row 4 itself is already a repeat, so Row 6 can only be compared to Row 1 which is nine days prior to Row 6, therefore Row 6 is not a repeat.
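To pin the rule down before looking at set-based answers, here is a procedural sketch in Python. It is not a replacement for an SQL solution, only a reference for what the result should be. The function name, the tuple layout and the sample dates are hypothetical, and treating a gap of exactly seven days as a repeat is an assumption, since the question does not say which side the boundary falls on.

```python
from datetime import datetime, timedelta

def mark_repeats(jobs):
    """jobs: iterable of (job_id, line_number, created_on).

    Returns {job_id: is_repeat}. A row is a repeat when it falls within
    seven days of the most recent NON-repeat row with the same line number.
    """
    window = timedelta(days=7)
    last_non_repeat = {}  # line_number -> created_on of the latest non-repeat
    result = {}
    # Process in chronological order; job_id breaks ties on created_on.
    for job_id, line, created in sorted(jobs, key=lambda j: (j[2], j[0])):
        if line is None:
            continue  # rows without a line number are left unmarked
        anchor = last_non_repeat.get(line)
        if anchor is not None and created - anchor <= window:
            result[job_id] = True
        else:
            result[job_id] = False
            last_non_repeat[line] = created  # this row becomes the new anchor
    return result

# Rows 1, 4 and 6 from the question, with made-up dates 5 and 9 days after row 1.
sample = [
    (1, "1005", datetime(2009, 7, 1)),
    (4, "1005", datetime(2009, 7, 6)),
    (6, "1005", datetime(2009, 7, 10)),
]
flags = mark_repeats(sample)
print(flags)
```

Row 4 is flagged because it is five days after row 1; row 6 is compared with row 1 rather than row 4, so it is not flagged.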
I do not know how to update the IsRepeat field without stepping through each record one-by-one via a cursor or something similar.
I do not believe cursors are the way to go, but I'm stuck for any other possible solution.
I have considered that Common Table Expressions may be of help, but I have no experience with them and have no idea where to start.
Basically this same process needs to be done on the table every day as the table is truncated and re-populated every single day. Once the table is re-populated, I have to go through and re-mark each record if it is a repeat or not.
Some assistance would be most appreciated.
Here is a script to create a table and insert test data
USE [Test]
/****** Object: Table [dbo].[Job] Script Date: 08/18/2009 07:55:25 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
DROP TABLE [dbo].[Job]
USE [Test]
/****** Object: Table [dbo].[Job] Script Date: 08/18/2009 07:55:25 ******/
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
CREATE TABLE [dbo].[Job](
[JobID] [int] IDENTITY(1,1) NOT NULL,
[LineNumber] [nvarchar](20) NULL,
[IsRepeat] [bit] NULL,
[CreatedOn] [smalldatetime] NOT NULL
)
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-01 07:52:08')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-01 08:30:01')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-01 09:30:35')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-01 10:51:10')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-02 09:22:30')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-02 10:27:28')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-02 11:15:33')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-02 13:01:13')
INSERT INTO dbo.Job VALUES ('1014',NULL,'2009-07-03 12:05:56')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-03 13:57:34')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-03 15:38:54')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-04 16:32:20')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-05 13:46:46')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-05 15:08:35')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-05 15:19:50')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-05 16:37:19')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-05 17:14:09')
INSERT INTO dbo.Job VALUES ('1009',NULL,'2009-07-05 20:55:08')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-06 08:29:29')
INSERT INTO dbo.Job VALUES ('1002',NULL,'2009-07-07 11:22:38')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-07 12:25:23')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-08 09:32:07')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-08 09:46:33')
INSERT INTO dbo.Job VALUES ('1016',NULL,'2009-07-08 10:09:08')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-09 10:45:04')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-09 11:31:23')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-09 13:10:06')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-09 15:04:06')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-09 17:32:16')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-09 19:51:28')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-10 15:09:42')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-10 16:15:31')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-10 21:55:43')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-11 08:49:03')
INSERT INTO dbo.Job VALUES ('1022',NULL,'2009-07-11 16:47:21')
INSERT INTO dbo.Job VALUES ('1026',NULL,'2009-07-11 18:23:16')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-11 19:49:31')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-12 11:57:26')
INSERT INTO dbo.Job VALUES ('1003',NULL,'2009-07-13 08:32:20')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-13 09:31:32')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 09:52:54')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 11:22:31')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-14 11:54:14')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-14 15:17:08')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-15 13:27:08')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-15 14:10:56')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-15 15:20:50')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-15 15:39:18')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-15 16:06:17')
INSERT INTO dbo.Job VALUES ('1017',NULL,'2009-07-16 11:52:08')
This ignores rows where LineNumber is null. How should IsRepeat be handled in that case?
It works for the test data. Whether it will be efficient enough for production volumes is untested.
In the case of duplicate (LineNumber, CreatedOn) pairs, it arbitrarily chooses one (the one with the minimum JobId).
Basic idea:
- Get all JobId pairs that are more than seven days apart, by line number.
- For each pair, count the number of rows that are seven or more days from the left side, up to and including the right side (CNT).
- Then we know that if JobId X is not a repeat, the next non-repeat is the right side of the pair with X on the left side and CNT = 1.
- Use a recursive CTE that starts with the first row for each LineNumber.
- The recursive element uses the pairs with counts to get the next row.
- Finally update, setting IsRepeat to 0 for all non-repeats and 1 for everything else.
; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber
, LeftJobId
, RightJobId
, BeginCreatedOn
, EndCreatedOn) as
(select l.LineNumber
, l.JobId
, r.JobId
, dateadd(day, 7, l.CreatedOn)
, r.CreatedOn
from Job l
inner join Job r
on l.LineNumber = r.LineNumber
and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
and l.JobId <> r.JobId)
-- Count the number of rows within from BeginCreatedOn
-- up to and including EndCreatedOn
-- In the case of CreatedOn = EndCreatedOn,
-- include only jobId <= jobid, to handle ties in CreatedOn
, AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as
(select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
from AllPairsByLineNumberAtLeast7DaysApart ap
inner join Job j
on j.LineNumber = ap.LineNumber
and ap.BeginCreatedOn <= j.createdOn
and (j.CreatedOn < ap.EndCreatedOn
or (j.CreatedOn = ap.EndCreatedOn
and j.JobId <= ap.RightJobId))
group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
, Step1 (LineNumber, JobId, CreatedOn, RN) as
(select LineNumber, JobId, CreatedOn
, row_number() over
(partition by LineNumber order by CreatedOn, JobId)
from Job)
, Results (JobId, LineNumber, CreatedOn) as
-- Start with the first rows.
(select JobId, LineNumber, CreatedOn
from Step1
where RN = 1
and LineNumber is not null
-- get the next row
union all
select j.JobId, j.LineNumber, j.CreatedOn
from Results r
inner join AllPairsCount apc on apc.LeftJobId = r.JobId
inner join Job j
on j.JobId = apc.RightJobId
and apc.CNT = 1)
update j
set IsRepeat = case when R.JobId is not null then 0 else 1 end
from Job j
left outer join Results r
on j.JobId = R.JobId
where j.LineNumber is not null
After I turned off the computer last night I realized I had made things more complicated than they needed to be. A more straightforward (and, on the test data, slightly more efficient) query:
Basic idea:
- Generate PotentialSteps (FromJobId, ToJobId). These are the pairs where, if FromJobId is not a repeat, then ToJobId is also not a repeat (the first row by LineNumber more than seven days from FromJobId).
- Use a recursive CTE to start from the first JobId for each LineNumber and then step, using PotentialSteps, to each non-repeating JobId.
; with PotentialSteps (FromJobId, ToJobId) as
(select FromJobId, ToJobId
from (select f.JobId as FromJobId
, t.JobId as ToJobId
, row_number() over
(partition by f.JobId order by t.CreatedOn, t.JobId) as RN -- first qualifying row per FromJobId
from Job f
inner join Job t
on f.LineNumber = t.LineNumber
and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
where RN = 1)
, NonRepeats (JobId) as
(select JobId
from (select JobId
, row_number() over
(partition by LineNumber order by CreatedOn, JobId) as RN
from Job) Start
where RN = 1
union all
select J.JobId
from NonRepeats NR
inner join PotentialSteps PS
on NR.JobId = PS.FromJobId
inner join Job J
on PS.ToJobId = J.JobId)
update J
set IsRepeat = case when NR.JobId is not null then 0 else 1 end
from Job J
left outer join NonRepeats NR
on J.JobId = NR.JobId
where J.LineNumber is not null
UPDATE Jobs
SET Jobs.IsRepeat = 0 -- mark all of them IsRepeat = false
UPDATE Jobs
SET Jobs.IsRepeat = 1
WHERE EXISTS
(SELECT TOP 1 i.LineNumber FROM Jobs i WHERE i.LineNumber = Jobs.LineNumber
AND i.CreatedOn <> Jobs.CreatedOn and i.CreatedOn BETWEEN Jobs.CreatedOn - 7
AND Jobs.CreatedOn)
NOTE: I hope this helps you somewhat. Let me know if you come across any discrepancy on a larger data set.
I'm not proud of this. It makes many assumptions (e.g. that CreatedOn is date only, and that (LineNumber, CreatedOn) is a key). Some tuning may be required, and it has only been tried with the test data.
In other words, I created this more out of intellectual curiosity than because I think it's a genuine solution. The final select could be turned into an update that sets IsRepeat in the base table, based on the existence of rows in V4. A final note before letting people see the evil: please post test data in the comments for data sets that it doesn't work for. It might be possible to turn this into a real solution:
with V1 as (
select t1.LineNumber,t1.CreatedOn,t2.CreatedOn as PrevDate from
T1 t1 inner join T1 t2 on t1.LineNumber = t2.LineNumber and t1.CreatedOn > t2.CreatedOn and DATEDIFF(DAY,t2.CreatedOn,t1.CreatedOn) < 7
), V2 as (
select v1.LineNumber,v1.CreatedOn,V1.PrevDate from V1
union all
select v1.LineNumber,v1.CreatedOn,v2.PrevDate from v1 inner join v2 on V1.LineNumber = v2.LineNumber and v1.PrevDate = v2.CreatedOn
), V3 as (
select LineNumber,CreatedOn,MIN(PrevDate) as PrevDate from V2 group by LineNumber,CreatedOn
), V4 as (
select LineNumber,CreatedOn from V3 where DATEDIFF(DAY,PrevDate,CreatedOn) < 7
)
select T1.LineNumber, T1.CreatedOn,
CASE WHEN V4.LineNumber is Null then 0 else 1 end as IsRepeat
from T1
left join
V4 on
T1.LineNumber = V4.LineNumber and
T1.CreatedOn = V4.CreatedOn
order by T1.CreatedOn,T1.LineNumber
option (maxrecursion 7)
I am searching for a query that selects the maximum date (a datetime column) for each id and keeps that row's id and row_id. The goal is to DELETE all the other rows from the source table.
Source Data
id   date         row_id (unique)
1    11/11/2009   1
1    12/11/2009   2
1    13/11/2009   3
2    1/11/2009    4
Expected Survivors
id   date         row_id
1    13/11/2009   3
2    1/11/2009    4
What query would I need to achieve the results I am looking for?
Tested on PostgreSQL:
delete from table where (id, date) not in (select id, max(date) from table group by id);
There are various ways of doing this, but the basic idea is the same:
- Identify the rows you want to keep
- Compare each row in your table to the ones you want to keep
- Delete any that don't match
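-- Option 1: self join on the row that holds the latest date for its id
DELETE [source]
FROM yourTable AS [source]
LEFT JOIN yourTable AS [keep]
ON [keep].row_id = [source].row_id
AND [keep].date = (SELECT MAX(date) FROM yourTable WHERE id = [keep].id)
WHERE [keep].id IS NULL
-- Option 2: join to a derived table of (id, latest date)
DELETE yourTable
FROM yourTable
LEFT JOIN (
SELECT id, MAX(date) AS date FROM yourTable GROUP BY id
) AS [keep]
ON [keep].id = [yourTable].id
AND [keep].date = [yourTable].date
WHERE [keep].id IS NULL
-- Option 3: compare row_id with the TOP 1 row for the id
DELETE [source]
FROM yourTable AS [source]
WHERE [source].row_id != (SELECT TOP 1 row_id FROM yourTable WHERE id = [source].id ORDER BY date DESC)
-- Option 4: correlated NOT EXISTS against the latest date for the id
DELETE [source]
FROM yourTable AS [source]
WHERE NOT EXISTS (SELECT id FROM yourTable GROUP BY id HAVING id = [source].id AND MAX(date) = [source].date)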
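The identify, compare, delete pattern can be checked quickly with a sketch in Python's sqlite3 module, using the sample rows from the question. SQLite stands in for SQL Server here, the dates are rewritten as ISO strings (read as day/month/year) so that MAX compares them correctly as text, and the NOT IN formulation is just one of several equivalent ways to express the delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, date TEXT, row_id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO yourTable (id, date, row_id) VALUES (?, ?, ?)",
    [
        (1, "2009-11-11", 1),
        (1, "2009-11-12", 2),
        (1, "2009-11-13", 3),
        (2, "2009-11-01", 4),
    ],
)

# 1. Identify the rows to keep: those carrying the latest date for their id.
# 2. Delete every row whose row_id is not among them.
conn.execute(
    """
    DELETE FROM yourTable
    WHERE row_id NOT IN (
        SELECT t.row_id
        FROM yourTable AS t
        JOIN (SELECT id, MAX(date) AS date FROM yourTable GROUP BY id) AS keep
          ON keep.id = t.id AND keep.date = t.date
    )
    """
)

survivors = conn.execute("SELECT id, date, row_id FROM yourTable ORDER BY row_id").fetchall()
print(survivors)
```

If two rows for the same id share the latest date, both survive with this approach; a tie-breaker on row_id would be needed to keep exactly one.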
Because you are using SQL Server 2000, you're not able to use the ROW_NUMBER() OVER technique of setting up a sequence to identify the top row for each unique id.
So, your proposed technique is to use a datetime column to get the top 1 row to remove duplicates. That might work, but there is a possibility that you might still get duplicates having the same datetime value. That's easy enough to check for.
First check the assumption that all rows are unique based on the id and date columns:
CREATE TABLE #TestTable (rowid INT IDENTITY(1,1), thisid INT, thisdate DATETIME)
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '11/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '12/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '12/12/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (2, '1/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (2, '1/11/2009')
SELECT COUNT(*) AS thiscount
FROM #TestTable
GROUP BY thisid, thisdate
HAVING COUNT(*) > 1
This example returns a value of 2, indicating that you will still end up with duplicates even after using the date column to remove duplicates. If the query returns no rows, then you have proven that your proposed technique will work.
When de-duping production data, I think one should take some precautions and test before and after. You should create a table to hold the rows you plan to remove so you can recover them easily if you need to after the delete statement has been executed.
Also, it's a good idea to know beforehand how many rows you plan to remove so you can verify the count before and after - and you can gauge the magnitude of the delete operation. Based on how many rows will be affected, you can plan when to run the operation.
To test before the de-duping process, find the occurrences.
-- Get occurrences of duplicates
SELECT thisid, COUNT(*) AS thiscount
FROM #TestTable
GROUP BY thisid
HAVING COUNT(*) > 1
ORDER BY thisid
That gives you each id that appears in more than one row, along with its row count. Capture the rows from this query into a temporary table and then run a query using SUM to get the total number of rows that are not unique based on your key.
To get the number of rows you plan to delete, you need the count of rows that are duplicate based on your unique key, and the number of distinct rows based on your unique key. You subtract the distinct rows from the count of occurrences. All that is pretty straightforward - so I'll leave you to it.
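As a rough sketch of that arithmetic, the following uses Python's sqlite3 module with the same five test rows, treating thisid alone as the key. SQLite stands in for SQL Server, the table is a regular one named TestTable rather than a temp table, and the dates are written as ISO strings; all of that is an assumption for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestTable (thisid INTEGER, thisdate TEXT)")
conn.executemany(
    "INSERT INTO TestTable (thisid, thisdate) VALUES (?, ?)",
    [
        (1, "2009-11-11"),
        (1, "2009-12-11"),
        (1, "2009-12-12"),
        (2, "2009-01-11"),
        (2, "2009-01-11"),
    ],
)

# Rows that belong to a duplicated key, and how many distinct keys those are.
dup_rows, dup_keys = conn.execute(
    """
    SELECT SUM(thiscount), COUNT(*)
    FROM (
        SELECT thisid, COUNT(*) AS thiscount
        FROM TestTable
        GROUP BY thisid
        HAVING COUNT(*) > 1
    )
    """
).fetchone()

# One row per key survives, so the rest are the rows the delete should remove.
rows_to_delete = dup_rows - dup_keys
print(dup_rows, dup_keys, rows_to_delete)
```

Comparing rows_to_delete with the row count reported by the actual delete statement gives a cheap before-and-after check.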
Try this
create table #t (id int, dt DATETIME, rowid INT IDENTITY(1,1))
INSERT INTO #t (id,dt) VALUES (1, '11/11/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/12/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/13/2009')
INSERT INTO #t (id,dt) VALUES (2, '11/01/2009')
delete from #t where rowid not in(
select t.rowid from #t t
inner join(
select MAX(dt)maxdate
from #t
group by id) X
on t.dt = X.maxdate )
select * from #t
id dt rowid
1 2009-11-13 00:00:00.000 3
2 2009-11-01 00:00:00.000 4
delete from temp where row_id not in (
select t.row_id from temp t
right join
(select id,MAX(dt) as dt from temp group by id) d
on t.dt = d.dt and t.id = d.id)
I have tested this answer.
INSERT INTO #t (id,dt) VALUES (1, '11/11/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/12/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/13/2009')
INSERT INTO #t (id,dt) VALUES (2, '11/01/2009')
select * from #t
select dense_rank() over(partition by id order by dt desc)NO,DT,ID,rowid from #t )