Deleting duplicate rows but keeping the latest using JOINs throws an error - PostgreSQL

I have tried several ways of deleting duplicate rows while keeping one, but all of them take a long time to run. The JOIN-based approach was the only one that ran quickly. The problem is that the query works as a SELECT, but the DELETE version does not work in pgAdmin 4 (PostgreSQL 14.0). How can I resolve this?
delete s1 FROM persons s1,
persons s2
where
s1.personid > s2.personid
AND s1.lastname = s2.lastname
order by personid desc
limit 100
It throws an error saying: syntax error at or near "s1".
How can I solve this issue?
+----------+-----------+----------+-----------+
| personid | firstname | lastname | email     |
+----------+-----------+----------+-----------+
| 1        | shanny    | edward   | shane#123 |
| 2        | abc       | way      | abc#123   |
+----------+-----------+----------+-----------+

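PostgreSQL does not support the MySQL-style DELETE s1 FROM ... multi-table syntax, and its DELETE statement does not accept ORDER BY or LIMIT, which is why the statement fails at "s1". You may use the following EXISTS logic when deleting instead: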
DELETE
FROM persons p1
WHERE EXISTS (
SELECT 1
FROM persons p2
WHERE p2.lastname = p1.lastname AND
p2.personid < p1.personid
);
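The above logic will spare, for each group of records sharing a last name, the single record having the smallest personid value. To keep the record with the largest personid instead, flip the comparison to p2.personid > p1.personid.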
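As a rough illustration of the EXISTS delete, here is a minimal sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for PostgreSQL, and the extra sample rows and the helper name surviving_ids are assumptions made for the example. The outer table is referenced by its full name rather than an alias to keep the statement portable.

```python
import sqlite3

# Illustrative rows: the first two come from the question, the rest are made up.
ROWS = [
    (1, "shanny", "edward"),
    (2, "abc", "way"),
    (3, "sam", "edward"),
    (4, "xyz", "way"),
    (5, "lee", "edward"),
]


def surviving_ids(keep_latest):
    """Delete duplicates per lastname and return the personid values left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE persons (personid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT)"
    )
    conn.executemany("INSERT INTO persons VALUES (?, ?, ?)", ROWS)
    # "<" deletes every row that has a smaller sibling (keeps the smallest id);
    # ">" deletes every row that has a larger sibling (keeps the largest id).
    comparison = ">" if keep_latest else "<"
    conn.execute(
        f"""
        DELETE FROM persons
        WHERE EXISTS (
            SELECT 1
            FROM persons p2
            WHERE p2.lastname = persons.lastname
              AND p2.personid {comparison} persons.personid
        )
        """
    )
    ids = [row[0] for row in conn.execute("SELECT personid FROM persons ORDER BY personid")]
    conn.close()
    return ids


print(surviving_ids(keep_latest=False))  # [1, 2]
print(surviving_ids(keep_latest=True))   # [4, 5]
```

With these rows, the "<" form keeps personid 1 and 2, while the ">" form keeps 4 and 5, i.e. the latest row per last name.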

Related

ADF - Dataflow, using Join to send new values

There are two tables.
tbl_1 is the source data:
ID | Submission_id
--------------------
1 | A00_1
2 | A00_2
3 | A00_3
4 | A00_4
5 | A00_5
6 | A00_6
7 | A00_7
tbl_2 is the destination. In this table, Submission_id is a unique key.
ID | Submission_id
--------------------
1 | A00_1
2 | A00_2
3 | A00_3
4 | A00_4
tbl_1 is the input and tbl_2 is the destination (sink). The expected result is that only A00_5, A00_6 and A00_7 are sent to tbl_2. The original post showed the Join and AlterRow settings as screenshots, which are not available here.
Expected output:
tbl_2
ID | Submission_id
--------------------
1 | A00_1
2 | A00_2
3 | A00_3
4 | A00_4
5 | A00_5 -->(new)
6 | A00_6 -->(new)
7 | A00_7 -->(new)
However, the AlterRow output contains every Submission_id. It should contain only the rows that satisfy the not-equal comparison stated in the AlterRow condition:
notEquals(DC__Submission_ID_BigInt, SrcStgDestination@{_Submission_ID})
How can I solve this problem in an Azure Data Factory data flow using a Join?
I followed the same procedure and got the same result (all rows were inserted). The join itself worked as intended, but I could not get the required output from the AlterRow step. You can use the approach below instead, which is still based on a join.
In general, to get the records from table1 that are not present in table2, we run the following query (in SQL Server):
select t1.id,t1.submission_id from t1 left outer join t2 on t1.submission_id = t2.submission_id where t2.submission_id is NULL
In the data flow, the join can be configured the same way as in your setup. Then, instead of the AlterRow transformation, I used a Filter transformation to express the t2.submission_id IS NULL condition, with the following filter expression:
isNull(d1@submission_id) && isNull(d1@id)
Now configure the sink (tbl_2). The data preview should show only the new records (the screenshot from the original post is not available here).
Publish and run the dataflow activity in your pipeline to get the desired results.
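The left-join-and-filter idea behind this answer can be checked outside ADF. The sketch below reproduces the question's two tables in an in-memory SQLite database (an assumption made for illustration; the real sink would be whatever tbl_2 lives in) and runs the same anti-join query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_1 (id INTEGER, submission_id TEXT)")
conn.execute("CREATE TABLE tbl_2 (id INTEGER, submission_id TEXT UNIQUE)")

# tbl_1 holds A00_1 .. A00_7, tbl_2 already holds A00_1 .. A00_4.
conn.executemany("INSERT INTO tbl_1 VALUES (?, ?)", [(i, f"A00_{i}") for i in range(1, 8)])
conn.executemany("INSERT INTO tbl_2 VALUES (?, ?)", [(i, f"A00_{i}") for i in range(1, 5)])

# Left join source to destination and keep only the source rows with no match.
new_rows = conn.execute(
    """
    SELECT t1.id, t1.submission_id
    FROM tbl_1 AS t1
    LEFT OUTER JOIN tbl_2 AS t2
      ON t1.submission_id = t2.submission_id
    WHERE t2.submission_id IS NULL
    ORDER BY t1.id
    """
).fetchall()
print(new_rows)  # [(5, 'A00_5'), (6, 'A00_6'), (7, 'A00_7')]

# Writing only those rows to the sink does not violate the unique key.
conn.executemany("INSERT INTO tbl_2 VALUES (?, ?)", new_rows)
sink_count = conn.execute("SELECT COUNT(*) FROM tbl_2").fetchone()[0]
print(sink_count)  # 7
```

The Filter transformation in the data flow plays the role of the WHERE ... IS NULL clause here.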

Postgres best way to delete duplicates in large table with no primary key

I have a table that logs scan events, in which I store only the first and last event for each scan. Each night at midnight, all the scan events from the previous day are added to the table, duplicates are dropped, and a query is run to delete anything other than the scan events with the minimum and maximum timestamps.
One of the problems is that the data provider recycles scan IDs every 45 days, so this table does not have a primary key. Here is an example of what the table looks like in its final state:
|scaneventID|scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
But before the cleanup queries are run it can look like this:
|scaneventID|scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 19:30:32|Received |12345 |
|isdijh23452|2020-01-02 04:50:22|Confirmed|12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
This happens because the vendor's data sometimes overlaps, and there is nothing we can really do about that. I currently run the following query to delete duplicates:
DELETE FROM scans T1
USING scans T2
WHERE EXTRACT(DAY FROM current_timestamp-T1.scandatetime) < 2
AND T1.ctid < T2.ctid
AND T1.scaneventID = T2.scaneventID
AND T1.scandatetime = T2.scandatetime
;
And to retain only the min/max timestamps:
delete from scans
where EXTRACT(DAY FROM current_timestamp-scandatetime) < 2 and
scandatetime <> (select min(tt.scandatetime) from scans tt where tt.scaneventID = scans.scaneventID) and
scandatetime <> (select max(tt.scandatetime) from scans tt where tt.scaneventID = scans.scaneventID)
;
However, the table is quite large (hundreds of millions of scans over multiple years), so these queries run quite slowly. How can I speed this up?

Drools - Finding a single matching condition for a table of products ranked by consumers

I have a table displaying information for the top four ratings of produce in a store. I want to be able to find specific products in this rating table. Here is the structure of the table:
----------------------------------------------------------------------------
sectId | product_code | product_category | consumer_ranking
10444 | 11222 | PRODUCE | RATING_1
10444 | 45555 | PRODUCE | RATING_1
10444 | 10005 | PRODUCE | RATING_1
20555 | 11344 | PRODUCE | RATING_2
20555 | 94003 | PRODUCE | RATING_2
... and so on.
I wrote a rule to find inserted products, but it is not working the way I want, i.e. it does not find only the targeted fact that was inserted. Here is the rule I put together:
rule "find by product codes rating_1"
when
$product_table: ProductRanking( $rank1: this.getProductCodesRankFirst())
$product1 : Product( this.product_code memberOf $rank1, $product_code: product_code )
$product2 : Product( this.product_code == 10444,this.product_code != $product_code ,$product_code2: product_code)
then
System.out.println("Found Products for product_codes "+$product_code+ " "+$product_code2 ) ;
end
Unfortunately, this returns 3 rows. I inserted the product in row 2 into the session, i.e. the product with code 45555, and the rule does find row 2. However, it also brings in row 1 and row 3.
I can see why it's doing that: all three product codes belong to the same sectId, 10444. However, I want to bring in only the row that I inserted, which is sectId 10444, product_code 45555. How can I achieve that?
I solved it by using a global to filter out the extra products. In the first line that brings the rankings, I eliminate the extra-matching products this way:
global ProductHelper productHelper
$product_table: ProductRanking( $rank1: productHelper.getProductCodesRankFirst(),
productCode != productHelper.getProductCodeFruitCategory() && productCode!=
productHelper.productCodeVegetableCategory())
The ProductHelper identifies the product codes I want to eliminate and hence the extra 2 products brought in are ignored, creating a single match. I'm sure there is a better way, but since I'm no expert, this is what I was able to come up with.

Postgres Query for Beginners

OK, I deleted my previous post and will try again. I don't know this topic well, and I'm not sure whether this calls for a loop or a stored function, or how to get what I'm looking for. Here are sample data and the expected output.
I have a single table A with the following fields: date created, unique person key, type, location.
I need a Postgres query that, for a given month (a parameter, based on date created) and a given location (a parameter, based on the location field), returns the fields below for rows whose unique person key also appears within plus or minus 30 days of the date created, for the same type but across all locations.
Example Data
Date Created | Unique Person | Type | Location
---------------------------------------------------
2/5/2017 | 1 | Admit | Hospital1
2/6/2017 | 2 | Admit | Hospital2
2/15/2017 | 1 | Admit | Hospital2
2/28/2017 | 3 | Admit | Hospital2
3/3/2017 | 2 | Admit | Hospital1
3/15/2017 | 3 | Admit | Hospital3
3/20/2017 | 4 | Admit | Hospital1
4/1/2017 | 1 | Admit | Hospital2
Output for the month of March for Hospital1:
DateCreated| UniquePerson | Type | Location | +-30days | OtherLoc.
------------------------------------------------------------------------
3/3/2017 | 2 | Admit| Hospital1 | 2/6/2017 | Hospital2
Output for the month of March for Hospital2:
None, because no one was seen at Hospital2 in March
Output for the month of March for Hospital3:
DateCreated| UniquePerson | Type | Location | +-30days | otherLoc.
------------------------------------------------------------------------
3/15/2017 | 3 | Admit| Hospital3 | 2/28/2017 | Hospital2
Version 1
I would use a WITH clause. Please notice that I've added a column id as a primary key to simplify the query. It is only there to prevent rows from being matched with themselves.
WITH x AS (
SELECT
id,
date_created,
unique_person_id,
type,
location
FROM
a
WHERE
location = 'Hospital1' AND
date_trunc('month', date_created) = date_trunc('month', '2017-03-01'::date)
)
SELECT
x.date_created,
x.unique_person_id,
x.type,
x.location,
a.date_created AS "+-30days",
a.location AS other_location
FROM
x
JOIN a
USING (unique_person_id, type)
WHERE
x.id != a.id AND
abs(x.date_created - a.date_created) <= 30;
Now a little explanation:
First, we select the reference data with a WITH clause. Think of it as a temporary table that we can reference in the main query. In this example it holds the "main visits" at the given hospital.
Then we join the "main visits" with other visits of the same person and type (the JOIN condition) that fall within 30 days (the WHERE condition).
Notice that the WITH query contains the limits you want to check (location and date). I use the date_trunc function, which truncates the date to the specified precision (a month in this case).
Version 2
As @Laurenz Albe suggested, there is no particular need to use a WITH clause, so here is a second version.
SELECT
x.date_created,
x.unique_person_id,
x.type,
x.location,
a.date_created AS "+-30days",
a.location AS other_location
FROM
a AS x
JOIN a
USING (unique_person_id, type)
WHERE
x.location = 'Hospital1' AND
date_trunc('month', x.date_created) = date_trunc('month', '2017-03-01'::date) AND
x.id != a.id AND
abs(x.date_created - a.date_created) <= 30;
This version is shorter than the first one but, in my opinion, the first is easier to understand. I don't have a big enough data set to test which one runs faster (the query planner shows similar estimates for both).
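To check the self-join logic against the sample data from the question, here is a small sketch using Python's sqlite3 module. SQLite stands in for PostgreSQL, so date_trunc and date subtraction are replaced by strftime and julianday. The id column, the ISO date strings, and the function name repeat_visits are assumptions added for the example.

```python
import sqlite3

# Sample data from the question, with an added id column and ISO dates.
VISITS = [
    (1, "2017-02-05", 1, "Admit", "Hospital1"),
    (2, "2017-02-06", 2, "Admit", "Hospital2"),
    (3, "2017-02-15", 1, "Admit", "Hospital2"),
    (4, "2017-02-28", 3, "Admit", "Hospital2"),
    (5, "2017-03-03", 2, "Admit", "Hospital1"),
    (6, "2017-03-15", 3, "Admit", "Hospital3"),
    (7, "2017-03-20", 4, "Admit", "Hospital1"),
    (8, "2017-04-01", 1, "Admit", "Hospital2"),
]


def repeat_visits(location, month):
    """Return visits at `location` in `month` (YYYY-MM) paired with other
    visits by the same person and type within 30 days, at any location."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE a (id INTEGER PRIMARY KEY, date_created TEXT, "
        "unique_person_id INTEGER, type TEXT, location TEXT)"
    )
    conn.executemany("INSERT INTO a VALUES (?, ?, ?, ?, ?)", VISITS)
    rows = conn.execute(
        """
        SELECT x.date_created, x.unique_person_id, x.type, x.location,
               o.date_created, o.location
        FROM a AS x
        JOIN a AS o
          ON o.unique_person_id = x.unique_person_id
         AND o.type = x.type
        WHERE x.location = ?
          AND strftime('%Y-%m', x.date_created) = ?
          AND x.id != o.id
          AND abs(julianday(x.date_created) - julianday(o.date_created)) <= 30
        ORDER BY x.date_created, o.date_created
        """,
        (location, month),
    ).fetchall()
    conn.close()
    return rows


print(repeat_visits("Hospital1", "2017-03"))
print(repeat_visits("Hospital2", "2017-03"))
print(repeat_visits("Hospital3", "2017-03"))
```

For March 2017 this yields one row for Hospital1 (person 2, paired with the 2017-02-06 visit at Hospital2), no rows for Hospital2, and one row for Hospital3 (person 3, paired with the 2017-02-28 visit at Hospital2), which matches the expected output in the question.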

Postgresql query results to depend on few rows of same table

I'm working on an application, and we're using Postgres as our DB. I don't have a lot of experience with SQL, and I've now run into a problem that I can't find an answer to.
So here's the problem:
We have privacy settings stored in a separate table, and the accessibility of each row of data depends on several rows of this privacy table.
Basically, the structure of the privacy table is:
entityId | entityType | privacyId | privacyType | allow | deletedAt
-------------------------------------------------------------------
5 | user | 6 | user | f | //example entry
5 | user | 1 | user_all | t |
In short, these settings mean that user id 5 allows everybody except user id 6 to access his data.
So I get the available data with a query like:
SELECT <some_relevant_fields> FROM <table>
JOIN <join>
WHERE
(privacy."privacyId"=6 AND privacy."privacyType"='user' AND privacy.allow=true)
OR (
(privacy."privacyType"='user_all' AND privacy."deletedAt" IS NOT NULL)
AND
(privacy."privacyType"='user' AND privacy."privacyId"=6 AND privacy.allow!=false)
);
I know this query is incorrect in this form, but I want you to get the idea of what I'm trying to achieve.
It must check for a row with the matching type/id and allow=true, OR check that user_all is not deleted (the deletedAt field is null) and that there is no row restricting access for this user with allow=false.
But Postgres evaluates the whole expression against each row, so the condition privacy."privacyType"='user_all' conflicts with 'user' later in the expression, and the query either returns no results or returns data even if the user is "blocked", because the 'user_all' row exists.
Is there a way to write a WHERE clause that returns a result when an AND expression is true across two different rows? For example, in the code above, (privacy."privacyType"='user_all' AND privacy."deletedAt" IS NOT NULL) would be true for one row AND (privacy."privacyType"='user' AND privacy."privacyId"=6 AND privacy.allow!=false) would be true for another. Alternatively, is there a way to check for the absence of a row with these values?
Is this what you want?
select <some_fields> from <table> where
"privacyType"='user_all' AND "deletedAt" IS NOT NULL
union
select <some_fields> from <table> where
"privacyType"='user' AND "privacyId"=6 AND allow<>'f';
You can left join the table with itself and use the WHERE clause to find the rows that do not have a match. Here p2 is the row that blocks user 6, so a user_all row is returned only when no such blocking row exists:
SELECT p1.*
FROM privacy p1
LEFT JOIN privacy p2
  ON p1."entityId" = p2."entityId"
 AND p2."privacyType" = 'user'
 AND p2."privacyId" = 6
 AND p2.allow = false
WHERE p1."privacyType" = 'user_all'
  AND p1."deletedAt" IS NULL
  AND p2."privacyId" IS NULL;
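One way to read the "no matching row" idea is as an anti-join: left join each user_all row to a possible blocking row for the viewer, and keep it only when no blocking row was found. The sketch below tries this on the two example rows from the question, using an in-memory SQLite database as a stand-in for PostgreSQL. The function name visible_entities and the use of 0/1 for the boolean allow column are assumptions, and the explicit allow=true branch from the question is left out for brevity.

```python
import sqlite3

# The two example rows from the question (allow stored as 0/1, deletedAt unset).
PRIVACY_ROWS = [
    (5, "user", 6, "user", 0, None),      # user 5 blocks user 6
    (5, "user", 1, "user_all", 1, None),  # user 5 allows everybody else
]


def visible_entities(viewer_id):
    """Return entityIds whose user_all rule applies and is not overridden
    by a blocking row (allow = false) for this viewer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE privacy ("entityId" INTEGER, "entityType" TEXT, '
        '"privacyId" INTEGER, "privacyType" TEXT, allow INTEGER, "deletedAt" TEXT)'
    )
    conn.executemany("INSERT INTO privacy VALUES (?, ?, ?, ?, ?, ?)", PRIVACY_ROWS)
    rows = conn.execute(
        """
        SELECT p1."entityId"
        FROM privacy AS p1
        LEFT JOIN privacy AS p2
          ON p2."entityId" = p1."entityId"
         AND p2."privacyType" = 'user'
         AND p2."privacyId" = :viewer
         AND p2.allow = 0
        WHERE p1."privacyType" = 'user_all'
          AND p1."deletedAt" IS NULL
          AND p2."privacyId" IS NULL   -- no blocking row was joined
        ORDER BY p1."entityId"
        """,
        {"viewer": viewer_id},
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(visible_entities(6))  # [] because user 6 is blocked
print(visible_entities(7))  # [5] because user 7 falls under user_all
```

The filters on p1 sit in the WHERE clause rather than the ON clause, because conditions on the left table inside a LEFT JOIN's ON clause do not remove left-side rows.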