Find duplicate rows in PostgreSQL with additional criteria - postgresql

I have a table called entries that has the following columns: case_id, number and filed_on.
If I were only looking for duplicates where the case_id and number were the same, I would use the following query:
SELECT case_id, number, count(*) FROM entries GROUP BY case_id, number HAVING count(*) > 1;
But I would like to filter by an additional criterion, namely, that at least 1 of the duplicate rows has a filed_on that is null.
I thought the following query would work, but I think it gives me duplicate rows where ALL the duplicates have filed_on set to null, instead of duplicate rows where 1 or more of the rows have filed_on of null:
SELECT case_id, number, count(*) FROM entries WHERE filed_on IS NULL GROUP BY case_id, number HAVING count(*) > 1;
Any ideas for how I can modify this query to get what I want?

You want a condition that is checked after grouping, not before, i.e. HAVING instead of WHERE. Note that a HAVING condition must be built from grouping columns or aggregates (just like the SELECT list). You can count the rows in each group that satisfy the condition, like this:
SELECT case_id, number, count(*)
FROM entries
GROUP BY case_id, number
HAVING (count(*) > 1) AND (count(CASE WHEN filed_on IS NULL THEN 1 END) >= 1)
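To see how the HAVING condition above treats different groups, here is a minimal sketch. It runs the same query against an in-memory SQLite database through Python's sqlite3 module, used here only as a stand-in for PostgreSQL so the snippet is self-contained; the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (case_id INTEGER, number INTEGER, filed_on TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        (1, 1, "2020-01-01"),  # duplicate pair, one NULL filed_on: wanted
        (1, 1, None),
        (2, 1, "2020-01-02"),  # duplicate pair, no NULL filed_on: not wanted
        (2, 1, "2020-01-03"),
        (3, 1, None),          # single row with NULL: not a duplicate
        (4, 1, None),          # duplicate pair, all NULL: wanted
        (4, 1, None),
    ],
)

# The CASE expression yields NULL for rows with a filed_on value,
# and count() skips NULLs, so it counts only the NULL filed_on rows.
query = """
SELECT case_id, number, count(*)
FROM entries
GROUP BY case_id, number
HAVING count(*) > 1
   AND count(CASE WHEN filed_on IS NULL THEN 1 END) >= 1
ORDER BY case_id, number
"""
duplicates = conn.execute(query).fetchall()
print(duplicates)
```

Groups where only some rows have a NULL filed_on are kept, which is the case the WHERE-based query from the question missed.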

Related

Postgres: counting records in two groups (existing foreign key or null)

I have a table items and a table batches. A batch can have n items associated by items.batch_id.
I'd like to write a query that returns item counts in two groups, batched and unbatched:
items WHERE batch_id IS NOT NULL (batched)
items WHERE batch_id IS NULL (unbatched)
The result should look like this
batched | unbatched
1200000 | 100
Any help appreciated, thank you!
EDIT:
I got stuck with using GROUP BY which turned out to be the wrong tool for the job.
You can use COUNT with a FILTER (WHERE ...) clause.
This is called a conditional count.
CREATE TABLE items (item_id int, batch_id int);
INSERT INTO items VALUES (1, NULL), (2, NULL), (3, 1);
CREATE TABLE batch (batch_id int);
SELECT
  count(*) FILTER (WHERE batch_id IS NOT NULL) AS "matched",
  count(*) FILTER (WHERE batch_id IS NULL) AS "unmatched"
FROM items;
Result:
matched | unmatched
1 | 2
The count() function seems to be the most likely basic tool here. Given an expression, it returns a count of the number of rows where that expression evaluates to non-null. Given the argument *, it counts all rows in the group.
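As a small illustration of that difference between count(expression) and count(*), the sketch below uses the three sample rows from the other answer. It runs on an in-memory SQLite database via Python's sqlite3 module, which is an assumption made only to keep the example runnable; the same aggregate semantics apply in PostgreSQL.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; same three sample rows as above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id INTEGER, batch_id INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, None), (2, None), (3, 1)],
)

# count(*) counts every row; count(batch_id) skips rows where batch_id is NULL,
# so the difference between the two is the number of unbatched items.
total, batched, unbatched = conn.execute(
    "SELECT count(*), count(batch_id), count(*) - count(batch_id) FROM items"
).fetchone()
print(total, batched, unbatched)
```

This computes both counts in a single pass over the table, which is one more way to get them into the same result row.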
To the extent that there is a trick, it is getting the batched and unbatched counts in the same result row. There are at least three ways to do that:
Using subqueries:
select
(select count(batch_id) from items) as batched,
(select count(*) from items where batch_id is null) as unbatched
-- no FROM
That's pretty straightforward. Each subquery is executed and produces one column of the result. Because no FROM clause is given in the outer query, there will be exactly one result row.
Using window functions:
select
count(batch_id) over () as batched,
(count(*) over () - count(batch_id) over ()) as unbatched
from items
limit 1
That will compute the batched and unbatched results for the whole table on every result row, one per row of the items table, but then only one result row is actually returned. It is reasonable to hope (though you would definitely want to test) that postgres doesn't actually compute those counts for all the rows that are culled by the limit clause. You might, for example, compare the performance of this option with that of the previous option.
Using count() with a filter clause, as described in detail in another answer.

Syntax error when trying to populate column with count of unique values in another column

I'm trying to count the number of unique pool operators for every permit # in a table but am having trouble putting this value in a new column dedicated to that count.
So I have 2 tables: doh_analysis; doh_pools.
Both of these tables have a "permit" column (TEXT), but doh_analysis has about 1000 rows with duplicates in the permit column but occasional unique values in the operator column (TEXT).
I'm trying to fill a column "operator_count" in the table "doh_pools" with a count of unique values in "pooloperator" for each permit #.
So I tried the following code but am getting a syntax error at or near "(":
update doh_pools
set operator_count = select count(distinct doh_analysis.pooloperator)
from doh_analysis
where doh_analysis.permit ilike doh_pools.permit;
When I remove the "select" from before the "count" I get "SQL Error [42803]: ERROR: aggregate functions are not allowed in UPDATE".
I can successfully query a list of distinct permit-pooloperator pairs using:
select distinct permit, pooloperator
from doh_analysis;
And I can query the # of unique pooloperators per permit 1 at a time using:
select count(distinct pooloperator)
from doh_analysis
where permit ilike '52-60-03054';
But I'm struggling to insert a count of unique pairs for each permit # in the operator_count column.
Is there a way to do this?
There is certainly a better way of doing this, but I accomplished my goal by creating 2 intermediate tables and then updating the target table with values from the 2nd intermediate table, like so:
select distinct permit, pooloperator
into doh_pairs
from doh_analysis;
select permit, count(distinct pooloperator)
into doh_temp
from doh_pairs
group by permit;
select count(distinct permit)
from doh_temp;
update doh_pools
set operator_count = doh_temp.count
from doh_temp
where doh_pools.permit ilike doh_temp.permit
and doh_pools.permit is not NULL
returning count;
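One possible shorter route, sketched below, is to keep the original UPDATE but wrap the SELECT in parentheses so that it becomes a correlated scalar subquery; the missing parentheses are what trigger the syntax error near "(". The sketch runs on an in-memory SQLite database through Python's sqlite3 module purely so it can be executed on its own. The sample rows are invented, and a plain = comparison is used instead of ILIKE, which SQLite does not have.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doh_analysis (permit TEXT, pooloperator TEXT)")
conn.execute("CREATE TABLE doh_pools (permit TEXT, operator_count INTEGER)")
conn.executemany(
    "INSERT INTO doh_analysis VALUES (?, ?)",
    [("A", "x"), ("A", "y"), ("A", "x"), ("B", "z")],
)
conn.executemany(
    "INSERT INTO doh_pools (permit) VALUES (?)",
    [("A",), ("B",), ("C",)],
)

# The parenthesised SELECT is evaluated once per doh_pools row and
# returns a single value, so it is allowed on the right side of SET.
conn.execute("""
UPDATE doh_pools
SET operator_count = (
    SELECT count(DISTINCT a.pooloperator)
    FROM doh_analysis AS a
    WHERE a.permit = doh_pools.permit
)
""")

rows = conn.execute(
    "SELECT permit, operator_count FROM doh_pools ORDER BY permit"
).fetchall()
print(rows)
```

Note that a permit with no matching rows in doh_analysis gets a count of 0 rather than staying NULL, which differs from the intermediate-table approach.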

How to get distinct results with max of a field

I have a query :
select distinct(donorig_cdn),cerhue_num_rfa,max(cerhue_dt) from t_certif_hue
group by donorig_cdn,cerhue_num_rfa
order by donorig_cdn
It returns some repeated ids with different cerhue_num_rfa values.
How do I return only one row for each repeated id, the one whose cerhue_num_rfa belongs to the max date (cerhue_dt), so that I end up with only 10 results instead of 15?
Postgres has SELECT DISTINCT ON to the rescue. It only returns the first row found for each value of the given column. So, all you need is an order that ensures the latest entry comes first. No need for grouping.
SELECT DISTINCT ON (donorig_cdn) donorig_cdn,cerhue_num_rfa,cerhue_dt
FROM t_certif_hue
ORDER BY donorig_cdn, cerhue_dt DESC;
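To make the "first row per value under the given order" behaviour concrete, here is a rough model of it in plain Python. It is not how PostgreSQL implements DISTINCT ON, just a sketch of the semantics; the sample rows and their values are invented.

```python
from itertools import groupby
from operator import itemgetter

# Invented sample rows using the column names from the question.
rows = [
    {"donorig_cdn": 1, "cerhue_num_rfa": "A", "cerhue_dt": "2020-01-01"},
    {"donorig_cdn": 2, "cerhue_num_rfa": "C", "cerhue_dt": "2019-03-15"},
    {"donorig_cdn": 1, "cerhue_num_rfa": "B", "cerhue_dt": "2021-06-30"},
]

# Emulate ORDER BY donorig_cdn, cerhue_dt DESC with two stable sorts:
# first by date descending, then by id (ties keep the date order).
ordered = sorted(rows, key=itemgetter("cerhue_dt"), reverse=True)
ordered = sorted(ordered, key=itemgetter("donorig_cdn"))

# Emulate DISTINCT ON (donorig_cdn): keep only the first row of each id group.
latest = [next(group) for _, group in groupby(ordered, key=itemgetter("donorig_cdn"))]
print(latest)
```

Each id appears once, paired with the cerhue_num_rfa from its most recent cerhue_dt.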

T-SQL Count returns ALL records Instead of Subset

trying to figure out this dumb syntax issue
select distinct Count(Mgr) from CarsManager
This returns a count of all records, but it should be a subset count.
Select Mgr, Count(*)
From CarsManager
Group By Mgr
You did not specify on what the subset count should be made. Given your example, I assumed it was the Mgr column.
If what you seek is a count of unique managers, then you can do:
Select Count(Distinct Mgr)
From CarsManager
Difference between Count(*) and Count(SomeColumn)
In comments, you asked about the difference between Count(*) and Count(SomeCol). The difference isn't in performance but logic. Count(*) counts rows regardless of column. Count(SomeCol) counts non-null values in SomeCol.
From the COUNT (Transact-SQL) documentation:
"COUNT(ALL expression) evaluates expression for each row in a group and returns the number of nonnull values."
In the case of Count(SomeCol), ALL is implied.
Another way to count the distinct managers is a subquery:
SELECT count(*) FROM (SELECT DISTINCT Mgr FROM CarsManager) as tbl1
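To compare the queries from this thread side by side, the sketch below runs them on an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for SQL Server here, and the Car column and the sample rows are invented for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; the Car column and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CarsManager (Mgr TEXT, Car TEXT)")
conn.executemany(
    "INSERT INTO CarsManager VALUES (?, ?)",
    [("Ann", "a"), ("Ann", "b"), ("Bob", "c")],
)

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# DISTINCT is applied to the one-row result of Count, so it changes nothing.
distinct_of_count = scalar("SELECT DISTINCT count(Mgr) FROM CarsManager")

# DISTINCT inside the aggregate counts each manager once.
count_of_distinct = scalar("SELECT count(DISTINCT Mgr) FROM CarsManager")

# The subquery form removes duplicates first and then counts the rows.
via_subquery = scalar(
    "SELECT count(*) FROM (SELECT DISTINCT Mgr FROM CarsManager) AS tbl1"
)

# One count per manager.
per_manager = conn.execute(
    "SELECT Mgr, count(*) FROM CarsManager GROUP BY Mgr ORDER BY Mgr"
).fetchall()

print(distinct_of_count, count_of_distinct, via_subquery, per_manager)
```

With this data the two distinct-count forms agree. If Mgr contained NULLs they would differ by one, because Count(Distinct Mgr) skips NULLs while SELECT DISTINCT returns NULL as a row of its own.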

To delete records beyond 20 from a table

At any time, I want my table to display the latest 20 rows and delete the rest.
I tried rownum > 20, but it said "0 rows deleted" even though my table had 50 records. However, on trying rownum < 20, the first 19 records were deleted.
Please help.
ROWNUM is a pseudo-column which is assigned 1 for the first row produced by the query, 2 for the next, and so on. If you say "WHERE ROWNUM > 20", no row will be matched: the first candidate row would have ROWNUM=1, so the predicate rejects it, and because ROWNUM only advances when a row is accepted, every following row is also tested as ROWNUM=1. The query therefore returns no rows.
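The behaviour described above can be modelled in a few lines of Python. This is only a sketch of the counting rule, not Oracle's implementation; the function name select_where_rownum and the 50-row table are made up for the example.

```python
def select_where_rownum(rows, predicate):
    """Model of a ROWNUM filter: the counter advances only when a row is accepted."""
    accepted = []
    rownum = 1  # number the next accepted row would receive
    for row in rows:
        if predicate(rownum):
            accepted.append(row)
            rownum += 1  # only accepted rows consume a ROWNUM value
    return accepted

table = list(range(1, 51))  # stand-in for a 50-row table

# ROWNUM > 20: the first candidate is tested as 1, fails, and the counter never moves.
none_match = select_where_rownum(table, lambda n: n > 20)

# ROWNUM < 20: rows are accepted until the counter reaches 20.
first_19 = select_where_rownum(table, lambda n: n < 20)

print(len(none_match), len(first_19))
```

This reproduces both observations from the question: zero rows for rownum > 20 and exactly 19 rows for rownum < 20.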
If you want to query just the latest 20 rows, you'd need some way of determining what order they were inserted into the table. For example, if each row gets a timestamp when it is inserted, this would usually be pretty reliable (unless you get thousands of rows inserted every second).
For example, a table with definition MYTABLE(ts TIMESTAMP, mycol NUMBER), you could query the latest 20 rows with a query like this:
SELECT * FROM (
SELECT ts, mycol FROM MYTABLE ORDER BY ts DESC
)
WHERE ROWNUM <= 20;
Note that if two or more rows with exactly the same timestamp are tied for the 20th spot, this query picks among them non-deterministically.
If you have an index on ts it is likely to use the index to avoid a sort, and Oracle will use stopkey optimisation to halt the query once it's found the 20th row.
If you want to delete the older rows, you could do something like this, assuming mycol is unique:
DELETE MYTABLE
WHERE mycol NOT IN (
SELECT mycol FROM (
SELECT ts, mycol FROM MYTABLE ORDER BY ts DESC
)
WHERE ROWNUM <= 20
);
The performance of this delete, if the number of rows to be deleted is large, will probably be helped by an index on mycol.
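For a quick check of the "keep the latest 20, delete the rest" pattern, the sketch below applies the same NOT IN idea to an in-memory SQLite database through Python's sqlite3 module. SQLite has no ROWNUM, so LIMIT 20 is swapped in for the ROWNUM <= 20 filter, and an integer ts column stands in for a timestamp; both are assumptions made only to keep the example runnable.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; LIMIT replaces the ROWNUM filter
# and ts is an integer instead of a TIMESTAMP.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (ts INTEGER, mycol INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(i, i) for i in range(1, 51)],  # 50 rows, mycol unique, larger ts = newer
)

# Delete every row whose mycol is not among the 20 newest rows.
conn.execute("""
DELETE FROM mytable
WHERE mycol NOT IN (
    SELECT mycol FROM mytable ORDER BY ts DESC LIMIT 20
)
""")

remaining = [row[0] for row in conn.execute("SELECT mycol FROM mytable ORDER BY ts")]
print(len(remaining), remaining[0], remaining[-1])
```

As in the Oracle version, this relies on mycol being unique and non-NULL; a NULL inside the NOT IN list would prevent any row from matching.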