Function to flag duplicates within a query in PostgreSQL

I would like to write a function that flags duplicates in specified columns in postgresql.
For example, if I had the following table:
country | landscape | household
--------------------------------
TZA | L01 | HH02
TZA | L01 | HH03
KEN | L02 | HH01
RWA | L03 | HH01
I would like to be able to run the following query:
SELECT country,
landscape,
household,
flag_duplicates(country, landscape) AS flag
FROM mytable
And get the following result:
country | landscape | household | flag
---------------------------------------
TZA | L01 | HH02 | duplicated
TZA | L01 | HH03 | duplicated
KEN | L02 | HH01 |
RWA | L03 | HH01 |
Inside the body of the function, I think I need something like:
IF (country || landscape) IN (SELECT country || landscape FROM mytable
GROUP BY country || landscape
HAVING count(*) > 1) THEN 'duplicated'
ELSE NULL
But I am confused about how to pass all of those as arguments. I appreciate the help. I am using postgresql version 9.3.

You don't need a function to accomplish that. Calling a function for every row in the result set is not a good idea for performance reasons. A much better solution is to use pure SQL (even with subqueries) and give the database engine a chance to optimize it. In your example it would be something like this:
SELECT t.country, t.landscape, t.household,
CASE WHEN duplicates.count > 1 THEN 'duplicated' END AS flag
FROM mytable t JOIN (
-- the subquery must expose country and landscape so the join condition can use them
SELECT country, landscape, count(household) AS count FROM mytable GROUP BY country, landscape
) duplicates ON duplicates.country = t.country AND duplicates.landscape = t.landscape
which produces exactly the same result.
Update: if you want to use a function at all costs, here is a working example:
CREATE FUNCTION find_duplicates(arg_country varchar, arg_landscape varchar) returns varchar AS $$
BEGIN
RETURN CASE WHEN count(household)>1 THEN 'duplicated' END FROM mytable
WHERE country=arg_country AND landscape=arg_landscape
GROUP BY country,landscape;
END
$$
LANGUAGE plpgsql STABLE;

select
*,
(count(*) over (partition by country, landscape)) > 1 as flag
from
mytable;
For a function, look at @MarcinH's answer, but add STABLE to the function's definition to make its calls faster.
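To try the window-function approach without a PostgreSQL server, here is a minimal sketch in Python. It is only an illustration under assumptions: the standard-library sqlite3 module stands in for PostgreSQL (window functions need SQLite 3.25 or later), and the CASE wrapper that turns the boolean into the 'duplicated' label is my addition, not part of the answer above.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (country TEXT, landscape TEXT, household TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("TZA", "L01", "HH02"), ("TZA", "L01", "HH03"),
     ("KEN", "L02", "HH01"), ("RWA", "L03", "HH01")],
)

# Count the rows that share each (country, landscape) pair and
# label every row whose pair occurs more than once.
flagged = conn.execute("""
    SELECT country, landscape, household,
           CASE WHEN count(*) OVER (PARTITION BY country, landscape) > 1
                THEN 'duplicated' END AS flag
    FROM mytable
    ORDER BY household, country
""").fetchall()

for row in flagged:
    print(row)
```

Rows whose pair is unique get NULL (None in Python) as the flag, which matches the blank cells in the desired output from the question.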

Related

Using a PostgreSQL function inside a loop

Suppose I have a PostgreSQL function that takes 2 parameters: id (INT), email (TEXT) and can be called like this:
SELECT * FROM my_function(101, 'myemail@gmail.com')
I want to run a SELECT query from a table that would return multiple id's:
SELECT id FROM mytable
 id
-----
 101
 102
 103
How would I loop through and plug each of the returned ids into my function in a query? For this example, just assume the default email is always 'myemail@gmail.com'.
I'm on mobile so I can't test it, but I think this might work (note that a subquery in FROM needs an alias):
SELECT * FROM (SELECT my_function(id, 'myemail@gmail.com') FROM mytable) AS t;
You can use a cross join:
SELECT *
FROM mytable mt
cross join lateral my_function(mt.id, 'myemail@gmail.com') as mf

PostgreSQL. Stored function or query improvements

I'm using PostgreSQL and have an employee table:
employeeid | FirstName | LastName | Department | Position
1 | Aaaa | | |
2 | Bbbb | | |
3 | | | |
. | | | |
Reports table:
employeeid | enter | exit
1 | 2020-11-08 09:02:21 | 2020-11-08 18:12:01
. | ... |
Now I'm querying for employees who were absent on a certain date, using a function something like this:
for i in select employeeid from employee
loop
if select not exists(select 1 from reports where enter::date = '2020-11-08' and employeeid = i) then
return query select employeeid, lastname, firstname, department, position from employee where employeeid = i;
end if;
end loop;
Seems to me, it's not an ideal solution. Is there any better approach to achieve the same result?
Thanks.
Your code is quite inefficient. It could hardly be written worse :-). The reply from @a_horse_with_no_name is absolutely correct; just for completeness, I'll fix your plpgsql code.
DECLARE _e employee;
BEGIN
FOR _e IN SELECT * FROM employee
LOOP
IF NOT EXISTS(SELECT *
FROM reports r
WHERE r.enter::date = date '2020-11-08'
AND r.employeeid = _e.employeeid)
THEN
RETURN NEXT _e;
END IF;
END LOOP;
END;
When you write a stored procedure (or any query-based application), an important metric is the number of queries: fewer is better (with exceptions: sometimes an overly complex query can be slower because of its complexity). Your example runs employees * 2 + 1 queries (with the larger plpgsql overhead, since RETURN QUERY is more expensive than RETURN NEXT). The solution proposed by @a_horse_with_no_name is a single query (without any plpgsql overhead). My example runs employees + 1 queries (with lower plpgsql overhead).
Your example is a good illustration of a common SQL antipattern: "using ISAM style".
You are right: more often than not, a loop is a bad way to do things in a relational database.
You can do this in plain SQL:
select e.employeeid, e.lastname, e.firstname, e.department, e.position
from employee e
where not exists (select *
from reports r
where r.enter::date = date '2020-11-08'
and e.employeeid = r.employeeid);
This will return all employees for which no row exists in the reports table on 2020-11-08.
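As a rough way to check the NOT EXISTS query, here is a small Python sketch. Everything in it is an assumption made for illustration: sqlite3 stands in for PostgreSQL (so date(...) replaces the ::date cast), the tables are trimmed to a few columns, and the sample rows are made up.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (employeeid INTEGER PRIMARY KEY, firstname TEXT);
    CREATE TABLE reports (employeeid INTEGER, "enter" TEXT, "exit" TEXT);
    INSERT INTO employee VALUES (1, 'Aaaa'), (2, 'Bbbb'), (3, 'Cccc');
    INSERT INTO reports VALUES
        (1, '2020-11-08 09:02:21', '2020-11-08 18:12:01'),
        (2, '2020-11-07 08:55:10', '2020-11-07 17:30:00');
""")

# One set-based query instead of a loop: keep the employees that have
# no report row whose entry timestamp falls on the given day.
missing = conn.execute("""
    SELECT e.employeeid
    FROM employee e
    WHERE NOT EXISTS (SELECT 1
                      FROM reports r
                      WHERE date(r."enter") = ?
                        AND r.employeeid = e.employeeid)
    ORDER BY e.employeeid
""", ("2020-11-08",)).fetchall()

print(missing)
```

Employee 1 has a report on that day, so only employees 2 and 3 come back.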

How do I write postgres conditional SELECT query?

I have a table that has 4 columns.
id | name | score | approve
--------------------
1 | foo | 90 | f
2 | foo | 80 | t
I want to
SELECT id WHERE name='foo'
with these conditions:
if approve is True, then return that one (only one will be true for the same name)
otherwise select the one that has highest score
I was looking into IF...ELSE but cannot even come up with a query that executes (let alone a working one...)
How do I write a query for this kind of condition?
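In SQL, you can often express this kind of logic with the right ORDER BY and LIMIT. Because true sorts after false, approve desc puts an approved row first (assuming approve is never NULL); if no row is approved, the highest score wins: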
select id
from my_table
where name = 'foo'
order by approve desc, score desc
limit 1
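Here is a quick sketch of how that ordering plays out, written in Python with sqlite3 as a stand-in for PostgreSQL. The assumptions: SQLite has no boolean type, so approve is stored as 0/1; the two 'bar' rows and the helper name pick are hypothetical additions for the "nothing approved" case.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; approve is stored as 0/1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER, name TEXT, score INTEGER, approve INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [(1, "foo", 90, 0), (2, "foo", 80, 1),   # rows from the question
     (3, "bar", 70, 0), (4, "bar", 95, 0)],  # hypothetical rows, none approved
)

def pick(name):
    """Return the approved id for name, else the id with the highest score."""
    row = conn.execute(
        "SELECT id FROM my_table WHERE name = ? "
        "ORDER BY approve DESC, score DESC LIMIT 1",
        (name,),
    ).fetchone()
    return row[0] if row else None

print(pick("foo"))  # approved row wins even though its score is lower
print(pick("bar"))  # nothing approved, so the highest score wins
```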

Postgres: How to JOIN

I am working on Postgres 9.3. I have two tables, the first for payment items:
Table "public.prescription"
Column | Type | Modifiers
-------------------+-------------------------+--------------------------------------------------------------------
id | integer | not null default nextval('frontend_prescription_id_seq'::regclass)
presentation_code | character varying(15) | not null
presentation_name | character varying(1000) | not null
actual_cost | double precision | not null
pct_id | character varying(3) | not null
And the second for organisations:
Table "public.pct"
Column | Type | Modifiers
-------------------+-------------------------+-----------
code | character varying(3) | not null
name | character varying(200) |
I have a query to get all the payments for a particular code:
SELECT sum(actual_cost) as total_cost, pct_id as row_id
FROM prescription
WHERE presentation_code='1234' GROUP BY pct_id
Here is the query plan for that query.
Now, I'd like to annotate each row with the name property of the associated organisation. This is what I'm trying:
SELECT sum(prescription.actual_cost) as total_cost, prescription.pct_id, pct.name as row_id
FROM prescription, pct
WHERE prescription.presentation_code='0212000AAAAAAAA'
GROUP BY prescription.pct_id, pct.name;
Here's the ANALYSE for that query. It's incredibly slow: what am I doing wrong?
I think there must be a way to annotate each row with the pct.name AFTER the first query has run, which would be faster.
Use a JOIN (a LEFT JOIN in this case, because we want the row even if there is no matching pct):
SELECT
sum(prescription.actual_cost) as total_cost,
prescription.pct_id,
pct.name as row_id
FROM prescription
LEFT JOIN pct ON pct.code = prescription.pct_id
WHERE
prescription.presentation_code='0212000AAAAAAAA'
GROUP BY
prescription.pct_id,
pct.name;
I haven't tested this query, so I'm not sure how well it performs.
You are taking data from 2 tables, but you do not join the tables in any way. Effectively, you make a cross join, resulting in the Cartesian product of both tables. If you look at your ANALYZE statistics, you see that your nested loop processes 62 million rows. That takes time.
Add in a join condition to make this all fast:
SELECT sum(prescription.actual_cost) as total_cost, prescription.pct_id, pct.name as row_id
FROM prescription
JOIN pct ON pct.code = prescription.pct_id
WHERE prescription.presentation_code = '0212000AAAAAAAA'
GROUP BY prescription.pct_id, pct.name;
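To see the difference between the comma-separated FROM list and a proper join condition on a small scale, here is a sketch in Python. It rests on assumptions: sqlite3 stands in for PostgreSQL, the tables are cut down to the columns the query uses, and all rows (codes, names, costs) are invented for the demo.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pct (code TEXT, name TEXT);
    CREATE TABLE prescription (presentation_code TEXT, actual_cost REAL, pct_id TEXT);
    INSERT INTO pct VALUES ('A01', 'Org A'), ('B02', 'Org B'), ('C03', 'Org C');
    INSERT INTO prescription VALUES
        ('1234', 10.0, 'A01'), ('1234', 5.0, 'A01'),
        ('1234', 2.5, 'B02'), ('9999', 7.0, 'C03');
""")

# Without a join condition every matching prescription is paired with every pct row.
cross_rows = conn.execute("""
    SELECT count(*) FROM prescription, pct
    WHERE prescription.presentation_code = '1234'
""").fetchone()[0]

# With the join condition each prescription meets only its own organisation.
joined_rows = conn.execute("""
    SELECT count(*) FROM prescription
    JOIN pct ON pct.code = prescription.pct_id
    WHERE prescription.presentation_code = '1234'
""").fetchone()[0]

totals = conn.execute("""
    SELECT sum(p.actual_cost) AS total_cost, p.pct_id, pct.name
    FROM prescription p
    JOIN pct ON pct.code = p.pct_id
    WHERE p.presentation_code = '1234'
    GROUP BY p.pct_id, pct.name
    ORDER BY p.pct_id
""").fetchall()

print(cross_rows, joined_rows)
print(totals)
```

With 3 matching prescriptions and 3 organisations, the cross product has 9 rows while the join has 3; on real tables that gap is what turns into millions of rows.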

Adding the results of two select queries into one table row with PostgreSQL

I am attempting to return the result of two distinct select statements into one row in PostgreSQL. For example, I have two queries each that return the same number of rows:
Select tableid1, tableid2, tableid3 from table1
+----------+----------+----------+
| tableid1 | tableid2 | tableid3 |
+----------+----------+----------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
+----------+----------+----------+
Select table2id1, table2id2, table2id3, table2id4 from table2
+-----------+-----------+-----------+-----------+
| table2id1 | table2id2 | table2id3 | table2id4 |
+-----------+-----------+-----------+-----------+
| 7 | 8 | 9 | 15 |
| 10 | 11 | 12 | 19 |
+-----------+-----------+-----------+-----------+
Now I want to concatenate these tables keeping the same number of rows. I do not want to join on any values. The desired result would look like the following:
+----------+----------+----------+-----------+-----------+-----------+-----------+
| tableid1 | tableid2 | tableid3 | table2id1 | table2id2 | table2id3 | table2id4 |
+----------+----------+----------+-----------+-----------+-----------+-----------+
| 1 | 2 | 3 | 7 | 8 | 9 | 15 |
| 4 | 5 | 6 | 10 | 11 | 12 | 19 |
+----------+----------+----------+-----------+-----------+-----------+-----------+
What can I do to the two queries above (select * from table1) and (select * from table2) to return the desired result?
Thanks!
You can use row_number() for the join, but I'm not sure you have any guarantee that the order of the rows will stay the same as in the tables. So it's better to add an ORDER BY to the over() clause.
with cte1 as (
select
tableid1, tableid2, tableid3, row_number() over() as rn
from table1
), cte2 as (
select
table2id1, table2id2, table2id3, table2id4, row_number() over() as rn
from table2
)
select *
from cte1 as c1
inner join cte2 as c2 on c2.rn = c1.rn
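A variant of the query with an explicit ordering inside over() might look like the sketch below, run from Python. The assumptions are mine: sqlite3 stands in for PostgreSQL (window functions need SQLite 3.25 or later), and ordering each table by its first column is just a guess at the intended pairing; use whatever columns really define the order in your data.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (tableid1 INTEGER, tableid2 INTEGER, tableid3 INTEGER);
    CREATE TABLE table2 (table2id1 INTEGER, table2id2 INTEGER,
                         table2id3 INTEGER, table2id4 INTEGER);
    INSERT INTO table1 VALUES (1, 2, 3), (4, 5, 6);
    INSERT INTO table2 VALUES (7, 8, 9, 15), (10, 11, 12, 19);
""")

# Number the rows of each table in a deterministic order, then pair equal numbers.
# Assumption: the first column of each table defines the pairing order.
paired = conn.execute("""
    WITH cte1 AS (
        SELECT tableid1, tableid2, tableid3,
               row_number() OVER (ORDER BY tableid1) AS rn
        FROM table1
    ), cte2 AS (
        SELECT table2id1, table2id2, table2id3, table2id4,
               row_number() OVER (ORDER BY table2id1) AS rn
        FROM table2
    )
    SELECT c1.tableid1, c1.tableid2, c1.tableid3,
           c2.table2id1, c2.table2id2, c2.table2id3, c2.table2id4
    FROM cte1 AS c1
    JOIN cte2 AS c2 ON c2.rn = c1.rn
    ORDER BY c1.rn
""").fetchall()

for row in paired:
    print(row)
```

The inner join silently drops unpaired rows if the two tables have different row counts, so this is only reasonable when the counts are known to match.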
You can't have what you want, as you wrote the question. Your two SELECTs don't have any ORDER BY clause, so the database can return the rows in whatever order it feels like. If it currently matches up, it does so only by accident, and will stop matching up as soon as you UPDATE a row.
You need a key column. Then you need to join on the key column. Anything else is attempting to invent unreliable and unsafe joins without actually using a join.
Frankly, this seems like a pretty dodgy schema. Lots of numbered integer columns like this, and the desire to concatenate them, may be a sign you should be looking at using integer arrays, or using a side-table with a foreign key relationship, instead.
Sample data in case anyone else wants to play:
CREATE TABLE table1(tableid1 integer, tableid2 integer, tableid3 integer);
INSERT INTO table1 VALUES (1,2,3), (4,5,6);
CREATE TABLE table2(table2id1 integer, table2id2 integer, table2id3 integer, table2id4 integer);
INSERT INTO table2 VALUES (7,8,9,15), (10,11,12,19);
Depending on what you're actually doing you might really have wanted arrays.
I think you might need to read these two posts:
Join 2 sets based on default order
How keep data don't sort?
which explain that SQL tables just don't have an order. So you cannot fetch them in a particular order.
DO NOT USE THE FOLLOWING CODE, IT IS DANGEROUS AND ONLY INCLUDED AS A PROOF OF CONCEPT:
As it happens you can use a set-returning function hack to very inefficiently do what you want. It's incredibly ugly and completely unsafe without an ORDER BY in the SELECTs, but I'll include it for completeness. I guess.
CREATE OR REPLACE FUNCTION t1() RETURNS SETOF table1 AS $$ SELECT * FROM table1 $$ LANGUAGE sql;
CREATE OR REPLACE FUNCTION t2() RETURNS SETOF table2 AS $$ SELECT * FROM table2 $$ LANGUAGE sql;
SELECT (t1()).*, (t2()).*;
If you use this in any real code then kittens will cry. It'll produce insane and bizarre results if the number of rows in the tables differ and it'll produce the rows in orderings that might seem right at first, but will randomly start coming out wrong later on.
THE SANE WAY is to add a primary key properly, then do a join.