I have a query which is returning duplicates. How can I amend the query to prevent this, or put a fix in place so it does not happen? I have tried using distinct, but to no avail.
This is the query:
select sigfind.*
from [RM-JOB]
inner join (
    select [JOB-NO], [LINK-JOB-NO]
    from [RM-LINK-JOBS]
    where [RM-LINK-JOBS].[reason-code] = 'FRA'
) linkedjob on [RM-JOB].[JOB-NO] = linkedjob.[JOB-NO]
inner join [RM-JOB] [RMLinkedJob]
    on linkedjob.[LINK-JOB-NO] = [RMLinkedJob].[JOB-NO]
    and [linkedjob].[JOB-NO] = [RM-JOB].[JOB-NO]
inner join (
    select ar.JobNo, ar.[ActionRef], ar.ActionStatus,
           ar.ActionDueDate [LINKED VISIT TARGET DATE],
           ar.SignificantFindings [SignificantFindings], ar.SyncDate,
           ar.dateFirRiskIdentified [Date of FRA],
           ard.caseref [Linked Case Number]
    from [CYHSQL01].[TM_FireRiskAssessment].[dbo].[ActionRequired] ar
    inner join [CYHSQL01].[TM_FireRiskAssessment].[dbo].[ActionRequireddetails] ard
        on ar.ResultID = ard.ResultID and ard.actionref = ar.[ActionRef]
) sigfind on sigfind.JobNo = linkedjob.[JOB-NO]
where [RMLinkedJob].[JOB-TYPE] in ('FSRE')
    and [RM-JOB].[JOB-STATUS] in ('06','90')
This is the output:
JobNo ActionRef ActionStatus LINKED VISIT TARGET DATE SignificantFindings SyncDate Date of FRA Linked Case Number
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP2 Closed 2019-10-25 Large screen tv has been fly tipped in the communal hallway. 2019-10-18 11:11:41.460 2019-10-18 NULL
10265985 CP2 Closed 2019-10-25 Large screen tv has been fly tipped in the communal hallway. 2019-10-18 11:11:41.460 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265986 CP1 Closed 2019-10-28 There is a 5 litre petrol can stored under the stairs. This must be removed as a matter of urgency as it’s potentially highly flammable. 2019-10-21 13:05:56.533 2019-10-21 NULL
10265986 CP1 Closed 2019-10-28 There is a 5 litre petrol can stored under the stairs. This must be removed as a matter of urgency as it’s potentially highly flammable. 2019-10-21 13:05:56.533 2019-10-21 NULL
10265986 CP2 Open 2019-11-18 Shoe rack is being used in the communal area. This is a potential trip hazard in an evacuation situation. 2019-10-21 13:05:56.543 2019-10-21 NULL
10265986 CP2 Open 2019-11-18 Shoe rack is being used in the communal area. This is a potential trip hazard in an evacuation situation. 2019-10-21 13:05:56.543 2019-10-21 NULL
10265986 CP3 Open 2019-12-23 Christmas wreaths are a potential fire risk so must be removed. 2019-10-21 13:05:56.553 2019-10-21 Test Reference 12345
10265986 CP3 Open 2019-12-23 Christmas wreaths are a potential fire risk so must be removed. 2019-10-21 13:05:56.553 2019-10-21 Test Reference 12345
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
10265985 CP1 Closed 2019-01-18 Shoe cabinet is in the communal hallway door is a potential trip hazard when evacuating. 2019-10-18 11:11:41.457 2019-10-18 NULL
Could I use a window function to remove the duplicates? I was thinking of using JobNo, ActionRef and SyncDate to produce a row number of 1 and filtering on that.
Thanks in advance for any help.
Your duplicates may be caused by the join conditions, although I have to admit I have trouble understanding your SQL code. If adding distinct between select and sigfind.* does not work, try putting the whole query into a WITH clause and then selecting distinct from it:
with myResult as (
    -- Your Query here
)
select distinct column1, column2, ..., columnX from myResult
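The windowing approach from your question should also work on top of the same CTE; a rough sketch (assuming JobNo, ActionRef and SyncDate identify a unique action row, which may need adjusting):
select *
from (
    select r.*,
           row_number() over (partition by JobNo, ActionRef, SyncDate
                              order by SyncDate) as rn
    from myResult r
) dedup
where rn = 1
This keeps one row per JobNo/ActionRef/SyncDate combination, although the cleaner fix is still to work out which join is multiplying the rows.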
It might be better to go through your query logic, fix it, and try to make it more readable.
We're getting deadlocks in a situation where I thought they wouldn't happen due to sorting.
2019-09-11T20:21:59.505804531Z 2019-09-11 20:21:59.505 UTC [67] ERROR: deadlock detected
2019-09-11T20:21:59.505824424Z 2019-09-11 20:21:59.505 UTC [67] DETAIL: Process 67 waits for ShareLock on transaction 1277067; blocked by process 35.
2019-09-11T20:21:59.505829400Z Process 35 waits for ShareLock on transaction 1277065; blocked by process 67.
2019-09-11T20:21:59.505833648Z Process 67: UPDATE "records" SET "last_data_at" = '2019-09-11 20:21:58.493184' WHERE "records"."id" IN (SELECT "records"."id" FROM "records" WHERE "records"."id" IN ($1, $2) ORDER BY id asc)
2019-09-11T20:21:59.505843428Z Process 35: UPDATE "records" SET "last_data_at" = '2019-09-11 20:21:58.496318' WHERE "records"."id" IN (SELECT "records"."id" FROM "records" WHERE "records"."id" IN ($1, $2) ORDER BY id asc)
Here, since the ids from the (admittedly unnecessary) subquery will be sorted, I'd think a deadlock shouldn't be possible. Does IN not follow the ordering of the passed array? If not, how can I fix this?
(The subquery is coming from our ORM.)
What's the ORM you're using?
You could use advisory locking to mitigate the deadlocks:
UPDATE
"records"
SET
"last_data_at" = '2019-09-11 20:21:58.496318'
WHERE
"records"."id" IN ($1, $2)
--pg_try_advisory_xact_lock() acquires a transaction-scoped
--advisory lock on the id if it is free and returns TRUE;
--otherwise it returns FALSE immediately and the row is skipped
AND pg_try_advisory_xact_lock("records"."id")
Honestly, IMHO relying on an ORDER BY clause to avoid deadlocks seems like a fragile solution.
More info about advisory locking functions here.
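If your ORM lets you control the statements, another commonly used pattern (not part of the answer above, just a sketch) is to lock the rows in a deterministic order with SELECT ... FOR UPDATE before updating them, since IN by itself does not guarantee the order in which rows are locked:
BEGIN;
-- both sessions lock the rows in ascending id order,
-- so they queue on the same row instead of deadlocking
SELECT id
FROM records
WHERE id IN ($1, $2)
ORDER BY id ASC
FOR UPDATE;
UPDATE records
SET last_data_at = '2019-09-11 20:21:58.496318'
WHERE id IN ($1, $2);
COMMIT;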
In Postgres, if I do the following:
select (now() - created_at) from my_table
I get results like this:
854 days 12:04:50.29658
Whereas, if I do:
select age(now(), created_at) from my_table
I get results like this:
2 years 4 mons 3 days 12:04:50.29658
According to pg_typeof(...) they are both of type interval
But if I try to extract the years:
select extract(years from age(now(), created_at)) from my_table
I get:
2
Whereas, with:
select extract(years from (now() - created_at)) from my_table
I get:
0
Is there a consistent way to extract the number of years from an interval value (no matter how it was generated)?
Note: I don't have write access to the db, so can't define stored procedures, etc. Needs to be a select statement.
------ UPDATE ------
justify_interval(...) was suggested below, but unfortunately it seems to be inaccurate in its calculations.
E.g:
select age('2018-01-03'::timestamp, '2016-01-05'::timestamp);
gives the correct answer:
1 year 11 mons 29 days
Whereas:
select justify_interval('2018-01-03'::timestamp - '2016-01-05'::timestamp);
gives:
2 years 9 days
I believe this is because it (incorrectly) assumes that all months have 30 days in them
(see justify_days
here: https://www.postgresql.org/docs/current/static/functions-datetime.html)
The function justify_interval does what you want. Use it in combination with EXTRACT to get the years:
SELECT EXTRACT(years FROM
justify_interval(INTERVAL '1 year 900 days 700 hours'));
date_part
-----------
3
(1 row)
If 30 days = 1 month isn't accurate enough for you, you'll have to use EXTRACT to get the number of days and divide by 365.25.
There is a theoretical limit how exact you can be, because the number of years in an interval somewhat depends on between which dates that interval is.
The two-element age function gives a precise result for the number of years between two dates.
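A sketch of the day-count approach for a plain timestamp subtraction (floor() and the 365.25 divisor are only an approximation, and created_at is assumed to be a timestamp column):
select floor(extract(day from (now() - created_at)) / 365.25) as approx_years
from my_table;
If only a SELECT is available and exact calendar years are required, extract(year from age(now(), created_at)) as in the question remains the safer choice.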
I have the following table
id error
- ----------------------------------------
1 Error 1234eee5, can not write to disk
2 Error 83457qwe, can not write to disk
3 Error 72344ee, can not write to disk
4 Fatal barier breach on object 72fgsff
5 Fatal barier breach on object 7fasdfa
6 Fatal barier breach on object 73456xcc5
I want to be able to get a result that counts by similarity, where similarity of > 80% means two errors are equal. I've been using the pg_trgm extension, and its similarity function works perfectly for me; the only thing I can't figure out is how to produce the grouping result below.
Error Count
------------------------------------- ------
Error 1234eee5, can not write to disk, 3
Fatal barier breach on object 72fgsff, 3
Basically you could join the table with itself to find similar strings; however, this approach will result in a terribly slow query on a larger dataset. Also, using similarity() may cause inaccuracy in some cases (you need to find an appropriate limit value).
You should try to find patterns. For example, if all variable words in strings begin with a digit, you can mask them using regexp_replace():
select id, regexp_replace(error, '\d\w+', 'xxxxx') as error
from errors;
id | error
----+-------------------------------------
1 | Error xxxxx, can not write to disk
2 | Error xxxxx, can not write to disk
3 | Error xxxxx, can not write to disk
4 | Fatal barier breach on object xxxxx
5 | Fatal barier breach on object xxxxx
6 | Fatal barier breach on object xxxxx
(6 rows)
so you can easily group the data by error message:
select regexp_replace(error, '\d\w+', 'xxxxx') as error, count(*)
from errors
group by 1;
error | count
-------------------------------------+-------
Error xxxxx, can not write to disk | 3
Fatal barier breach on object xxxxx | 3
(2 rows)
The above query is only an example as the specific solution depends on the data format.
Using pg_trgm
A solution based on the OP's idea (see the comments below). The limit of 0.8 for similarity() is certainly too high; it seems it should be somewhere around 0.6.
The table for unique errors (I've used a temporary table, but it could also be a regular one, of course):
create temp table if not exists unique_errors(
    id serial primary key,
    error text,
    ids int[]);
The ids column is to store id of rows of the base table which contain similar errors.
do $$
declare
    e record;
    found_id int;
begin
    truncate unique_errors;

    for e in select * from errors loop
        -- look for an already-registered error similar to the current one
        select min(id)
        into found_id
        from unique_errors u
        where similarity(u.error, e.error) > 0.6;

        if found_id is not null then
            -- similar error found: append this row's id to its ids array
            update unique_errors
            set ids = ids || e.id
            where id = found_id;
        else
            -- no similar error yet: register a new unique error
            insert into unique_errors (error, ids)
            values (e.error, array[e.id]);
        end if;
    end loop;
end $$;
The final results:
select *, cardinality(ids) as count
from unique_errors;
id | error | ids | count
----+---------------------------------------+---------+-------
1 | Error 1234eee5, can not write to disk | {1,2,3} | 3
2 | Fatal barier breach on object 72fgsff | {4,5,6} | 3
(2 rows)
For this particular case you could just group by left(error, 5), which would lead to two groups, one containing all the strings starting with Error, the other containing all the strings starting with Fatal. This criterion would have to be updated if you plan to add more error types.
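A quick sketch of that idea, with min(error) picking one representative message per group (the 5-character prefix is specific to this sample data):
select min(error) as error, count(*)
from errors
group by left(error, 5);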
Hello colleagues, I have a question about whether a query can do this. I have a table called sale with a field called sales_date, and the field is populated more or less like this:
--------------------------------------------------
sales_date
--------------------------------------------------
2013-02-03
2013-02-05
2014-06-07
2015-03-04
2015-01-04
2016-04-07
2016-09-03
2016-04-09
I would like to know how to write a select that shows only the years, without repeats:
--------------------------------------------------
sales_date
--------------------------------------------------
2013
2014
2015
2016
Thanks for any help. My database is PostgreSQL version 9.5.
You can use extract() to get the year and distinct to remove the duplicates:
select distinct extract(year from sales_date) as sales_date
from sale;
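If you also want the years as integers and in ascending order, a small variation (just a sketch) is:
select distinct extract(year from sales_date)::int as sales_year
from sale
order by sales_year;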
I am trying to reduce the size of the data a T-SQL query returns to Reporting Services. For example, let's say we have the following row set:
ID Country City
1 Germany Berlin
2 Germany Berlin
3 Germany Berlin
4 Germany Berlin
5 Germany Hamburg
6 Germany Hamburg
7 Germany Hamburg
8 Germany Hamburg
9 Germany Berlin
10 Germany Berlin
It can easily be transformed to this:
ID Country City
1 Germany Berlin
2 NULL NULL
3 NULL NULL
4 NULL NULL
5 NULL Hamburg
6 NULL NULL
7 NULL NULL
8 NULL NULL
9 NULL Berlin
10 NULL NULL
As I may have thousands upon thousands of duplicated values (and hundreds of columns), I know that transforming the data using NULLs like this dramatically reduces the size of the returned data.
Is it possible to implement a formula which gets the previous row's column value if the current one is NULL?
I want to test whether it will be faster to just render the huge data set, or to work with the smaller data set but apply such an expression.
Why not just use GROUP BY?
SELECT Min(Id) as min, Max(ID) as max,Country,City
FROM myTable
GROUP BY Country,City
Or
SELECT count(Id) as rowRepeatNum,Country,City
FROM myTable
GROUP BY Country,City
The first approach would work if your IDs were sequential, although I'm not sure whether the ID is important here.
The second way gives you a number you can loop over to regenerate the repeated rows in your application fairly quickly, and it returns significantly fewer rows.
Can you give me more info on your use case?
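If you do want exactly the NULL-ed representation from your question instead, a sketch using LAG() (assuming SQL Server 2012 or later and that ID defines the row order) could look like this:
SELECT
    ID,
    -- blank out a value when it repeats the previous row's value
    CASE WHEN Country = LAG(Country) OVER (ORDER BY ID) THEN NULL ELSE Country END AS Country,
    CASE WHEN City    = LAG(City)    OVER (ORDER BY ID) THEN NULL ELSE City    END AS City
FROM myTable;
Reconstructing the repeated values would then have to happen on the report side, which is the part your "previous row value" formula would cover.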