Count distinct id with case - tsql

So say I have a list of IDs with values, like so:
ID  VALUE
1   A
1   NULL
1   B
2   NULL
3   A
And I want to count the distinct IDs that have at least one value equal to A or B.
The answer should be 2, since there are two IDs (1 and 3) that have either an A or a B value.
If I do a COUNT(DISTINCT(CASE WHEN ... I'm unable to get the unique IDs.
Is there another solution?

You should be able to filter for values A or B, and then count the distinct IDs, as below:
declare @t table (id int not null,
                  value char(1) null);

insert into @t
values
    (1, 'A'),
    (1, NULL),
    (1, 'B'),
    (2, NULL),
    (3, 'A');

select count(distinct id)
from @t
where value in ('A','B');

select count(distinct Id)
from tbl
where value in ('A','B')
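For completeness, the conditional-aggregation form the asker attempted also works, provided the CASE returns the id rather than the value: non-matching rows become NULL, and COUNT(DISTINCT ...) ignores NULLs. A minimal sketch against the same tbl:

-- Rows whose value is neither A nor B yield NULL, which COUNT(DISTINCT ...) skips
select count(distinct case when value in ('A','B') then id end)
from tbl;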


Prioritize one row over another in Postgres

create table items (
    name varchar(15) not null,
    id1 integer,
    id2 integer,
    UNIQUE(id1),
    UNIQUE(id2)
);

insert into items (name, id1, id2) values
    ('a', 1, null),
    ('b', 2, null),
    ('c', null, 2);

select * from items where id1 = 2
union
select * from items where id2 = 2
  and id2 not in (select id2 from items where id1 = 2);
I have a table where multiple fields can contain the unique id of a given item. In my example, either id1 or id2 contains this value. My goal is to rely on id2 only if the item cannot be found via id1, so I would expect to always get back b in my example.
I have managed to get this working via a union, but it seems like a very hacky solution with bad performance. A better solution seems to me to be filtering on the client side. What do you think?
We can use ROW_NUMBER here:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY COALESCE(id1, id2) ORDER BY id1) rn
    FROM items
)
SELECT name, id1, id2
FROM cte
WHERE rn = 1;
Each logical pair of records shares the same COALESCE(id1, id2) value (if a pair exists). Within each pair, we choose the record with the lowest id1 value; since NULLs sort last in ascending order, that is the record whose id1 is non-null.
I would use
SELECT *
FROM (SELECT * FROM items WHERE id1 = 2
      UNION ALL
      SELECT * FROM items WHERE id2 = 2) AS q
ORDER BY id1 IS NULL
FETCH FIRST 1 ROWS ONLY;
That relies on the fact that FALSE < TRUE, so the row matched via id1 sorts first.
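If the UNION feels heavy, the same priority can be expressed in a single scan. A sketch, using the same lookup value 2 as above:

-- One pass: match on either column, prefer the row found via id1
SELECT *
FROM items
WHERE 2 IN (id1, id2)
ORDER BY id1 IS NULL   -- FALSE < TRUE, so id1 matches sort first
FETCH FIRST 1 ROWS ONLY;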

Postgres 14 delete with count in where clause

I wanted to delete all records except the one with the highest value, so I did:
CREATE TABLE code (
    id SERIAL,
    name VARCHAR(255) NOT NULL,
    value INT NOT NULL
);

INSERT INTO code (name, value) VALUES ('name', 1);
INSERT INTO code (name, value) VALUES ('name', 2);
INSERT INTO code (name, value) VALUES ('name', 3);
INSERT INTO code (name, value) VALUES ('name1', 3);
INSERT INTO code (name, value) VALUES ('name2', 1);
INSERT INTO code (name, value) VALUES ('name2', 3);
For each name, I want to delete all records except the one with the highest value in the value column.
I am expecting to get this result:

name   3
name1  3
name2  3
I tried doing
DELETE FROM code where value != (select MAX(value) value from code where count(code) > 1)
But I'm getting an error like:
ERROR: aggregate functions are not allowed in WHERE
LINE 1: ...value != (select MAX(value) value from code where count(code...
Combining everyone's ideas with this pattern from a linked example:

SELECT dept, SUM(expense) FROM records
WHERE ROW(year, dept) IN (SELECT x, y FROM otherTable)
GROUP BY dept;

I was able to make the query I want.
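Since the demo itself isn't reproduced here, a plausible reconstruction of that final query, applying the row-wise pattern to the code table (an assumption, not the asker's verbatim query):

-- Keep only each name's max value; delete every other (name, value) pair
DELETE FROM code
WHERE ROW(name, value) NOT IN (SELECT name, MAX(value) FROM code GROUP BY name);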
Your query makes no sense. Try this:
DELETE FROM code
WHERE value <> (SELECT value
                FROM (SELECT count(*) AS count,
                             value
                      FROM code
                      GROUP BY value) AS q
                ORDER BY count DESC
                FETCH FIRST 1 ROWS ONLY);
The fast and easy solution would be:

BEGIN;
SELECT name, max(value) INTO TEMP t FROM code GROUP BY 1;
TRUNCATE code;
INSERT INTO code (name, value) SELECT * FROM t;
END;
Or you can do it like:

BEGIN;
DELETE FROM code USING (SELECT name, max(value) FROM code GROUP BY 1) a
    WHERE code.name = a.name AND code.value != a.max;
END;

postgresql procedure for fetching top 10%,20% and 30% values of the total values

I have a table named Scoreboard which contains a field named score, an array containing the values 27,56,78,12,89,77,34,23,90,87,33,55,30,67,76,87,56, and I want to write a PostgreSQL procedure to fetch three categories:
category 1 = top 10% of the total number of values in the array
category 2 = top 20% of the total number of values in the array
category 3 = top 30% of the total number of values in the array
and put them in an array in the same format, i.e.
[category 1 values, category 2 values, category 3 values]
Something like this should do; ntile(10) splits the ordered values into ten buckets, so the first three buckets cover the top 10%, 20%, and 30%:
t=# with p as (
with ntile as (
with v as (
select unnest('{27,56,78,12,89,77,34,23,90,87,33,55,30,67,76,87,56}'::int[]) a
)
select a,ntile(10) over(order by a desc)
from v
)
select distinct string_agg(a::text,',') over (partition by ntile),ntile
from ntile
where ntile <=3 order by ntile
)
select array_agg(concat('category ',ntile,' ',string_agg))
from p;
array_agg
------------------------------------------------------------
{"category 1 90,89","category 2 87,87","category 3 78,77"}
(1 row)
Time: 0.493 ms
I am assuming you have a table with one column as id and another of an array type. Based on that assumption,
I have created the table below and inserted two rows into it.
create table test_array (id int, scores int[]);  -- "values" is reserved in Postgres, so the array column is named scores

insert into test_array values (1, '{27,56,78,12,89,77,34,23,90,87,33,55,30,67,76,87,56}');
insert into test_array values (2, '{72,65,84,21,98,77,43,32,9,78,41,66,3,76,67,88,56}');
Below is a function that finds the categories you describe. If you do not have an id column in your table,
you can add one with the window function row_number() (see the sketch after the function).
create or replace function find_category() returns table(category text[]) as
$$
BEGIN
return query
with unnestColumn as (
    -- flatten each array, then rank its values into ten buckets per id
    select t.id, u.val,
           ntile(10) over (partition by t.id order by u.val desc) as ntilenumber
    from test_array t
    cross join lateral unnest(t.scores) as u(val)
), groupedCategory as (
    select id, ntilenumber,
           string_agg(val::text, ',' order by val desc) as combinedvalues
    from unnestColumn
    where ntilenumber <= 3
    group by id, ntilenumber
)
select array_agg(concat('Category ', ntilenumber, ' ', combinedvalues))
from groupedCategory
group by id;
END;
$$
language plpgsql;
Execute the function to check the output:
select * from find_category();
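If the table has no id column, a minimal sketch of the row_number() fallback mentioned above (test_array_noid is an assumed name for an id-less variant of the table):

-- Hypothetical: synthesize an id with row_number(), then reuse the same logic
with numbered as (
    select row_number() over () as id, scores
    from test_array_noid
)
select id, scores from numbered;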

Most effective way to get value if select count(*) = 1 with grouping

Let's say I have a table with ID int, VALUE string:
ID | VALUE
1  | abc
2  | abc
3  | def
4  | abc
5  | abc
6  | abc
If I do select value, count(*) group by value I should get:

VALUE | COUNT
abc   | 5
def   | 1
Now the tricky part: if the count is 1, I need to get that ID from the first table. Should I be using a CTE? Creating a result set with an ID column set to null and running an update b.ID = a.ID where count = 1?
Or is there another, easier way?
EDIT:
I want to have a result table like this:

ID   | VALUE | count
null | abc   | 5
3    | def   | 1
If your ID values are unique, you can simply check whether max(id) = min(id). If so, use either one; otherwise return null. Like this:
Select Case When Min(id) = Max(id) Then Min(id) Else Null End As Id,
       Value, Count(*) As [Count]
From YourTable
Group By Value
Since you are already performing an aggregate, including the MIN and MAX functions is not likely to take any extra (noticeable) time. I encourage you to give this a try.
The way I would do it would indeed be a CTE:
WITH grp AS (SELECT value, COUNT(*) AS cnt FROM MyTable GROUP BY value HAVING COUNT(*) = 1)
SELECT MyTable.ID, grp.value, grp.cnt FROM MyTable
JOIN grp ON grp.value = MyTable.value
When using GROUP BY, you can add a HAVING clause after the GROUP BY statement.
So:
SELECT MIN([ID]) AS [ID]
FROM table
GROUP BY [VALUE]
HAVING COUNT(*) = 1
Since each group with COUNT(*) = 1 has exactly one row, MIN([ID]) returns that row's ID.
Edit: with regard to your edited question, this uses some fun joins and unions:
CREATE TABLE #table
    (ID int IDENTITY,
     VALUE varchar(3))

INSERT INTO #table (VALUE)
VALUES ('abc'),('abc'),('def'),('abc'),('abc'),('abc')

SELECT * FROM (
    SELECT NULL AS ID, VALUE, COUNT(*) AS [Count]
    FROM #table
    GROUP BY VALUE
    HAVING COUNT(*) > 1
    UNION ALL
    SELECT t.ID, t.VALUE, p.[Count]
    FROM #table t
    JOIN (SELECT VALUE, COUNT(*) AS [Count]
          FROM #table
          GROUP BY VALUE
          HAVING COUNT(*) = 1) p
      ON t.VALUE = p.VALUE
) a

DROP TABLE #table
maybe not the most efficient but something like this works:

SELECT MAX(Id) AS ID, Value
FROM Table
GROUP BY Value
HAVING COUNT(*) = 1

TSQL: Remove duplicates based on max(date)

I am searching for a query that selects the maximum date (a datetime column) per id and keeps that row's id and row_id. The goal is to DELETE all the other rows from the source table.
Source Data

id  date        row_id (unique)
1   11/11/2009  1
1   12/11/2009  2
1   13/11/2009  3
2   1/11/2009   4

Expected Survivors

1   13/11/2009  3
2   1/11/2009   4
What query would I need to achieve the results I am looking for?
Tested on PostgreSQL:

delete from yourTable where (id, date) not in (select id, max(date) from yourTable group by id);
There are various ways of doing this, but the basic idea is the same:
- Identify the rows you want to keep
- Compare each row in your table to the ones you want to keep
- Delete any that don't match
DELETE
    [source]
FROM
    yourTable AS [source]
LEFT JOIN
    yourTable AS [keep]
        ON  [keep].row_id = [source].row_id
        AND [keep].date = (SELECT MAX(date) FROM yourTable WHERE id = [keep].id)
WHERE
    [keep].id IS NULL
DELETE
    [yourTable]
FROM
    [yourTable]
LEFT JOIN
    (SELECT id, MAX(date) AS date FROM yourTable GROUP BY id) AS [keep]
        ON  [keep].id = [yourTable].id
        AND [keep].date = [yourTable].date
WHERE
    [keep].id IS NULL
DELETE
    [source]
FROM
    yourTable AS [source]
WHERE
    [source].row_id != (SELECT TOP 1 row_id FROM yourTable WHERE id = [source].id ORDER BY date DESC)
DELETE
    [source]
FROM
    yourTable AS [source]
WHERE
    NOT EXISTS (SELECT id FROM yourTable GROUP BY id HAVING id = [source].id AND MAX(date) = [source].date)
Because you are using SQL Server 2000, you're not able to use the ROW_NUMBER() OVER technique to number rows and identify the top row for each unique id.
So, your proposed technique is to use a datetime column to get the top 1 row to remove duplicates. That might work, but there is a possibility that you might still get duplicates having the same datetime value. But that's easy enough to check for.
First check the assumption that all rows are unique based on the id and date columns:
CREATE TABLE #TestTable (rowid INT IDENTITY(1,1), thisid INT, thisdate DATETIME)
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '11/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '12/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '12/12/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (2, '1/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (2, '1/11/2009')
SELECT COUNT(*) AS thiscount
FROM #TestTable
GROUP BY thisid, thisdate
HAVING COUNT(*) > 1
This example returns a row with a count of 2 - indicating that you will still end up with duplicates even after using the date column to remove duplicates. If the query returns no rows, then you have proven that your proposed technique will work.
When de-duping production data, I think one should take some precautions and test before and after. You should create a table to hold the rows you plan to remove so you can recover them easily if you need to after the delete statement has been executed.
Also, it's a good idea to know beforehand how many rows you plan to remove so you can verify the count before and after - and you can gauge the magnitude of the delete operation. Based on how many rows will be affected, you can plan when to run the operation.
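For example, a sketch of capturing the rows slated for removal first (table and column names follow the #TestTable example above):

-- Back up every row that is not the max-date row for its id before deleting
SELECT t.*
INTO #RemovedRows
FROM #TestTable t
WHERE t.thisdate <> (SELECT MAX(thisdate) FROM #TestTable WHERE thisid = t.thisid)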
To test before the de-duping process, find the occurrences.
-- Get occurrences of duplicates
SELECT thisid, COUNT(*) AS thiscount
FROM #TestTable
GROUP BY thisid
HAVING COUNT(*) > 1
ORDER BY thisid
That gives you the rows with more than one row with the same id. Capture the rows from this query into a temporary table and then run a query using the SUM to get the total number of rows that are not unique based on your key.
To get the number of rows you plan to delete, you need the count of rows that are duplicate based on your unique key, and the number of distinct rows based on your unique key. You subtract the distinct rows from the count of occurrences. All that is pretty straightforward - so I'll leave you to it.
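A sketch of that arithmetic, reusing #TestTable (names are illustrative):

-- Capture the per-id occurrence counts for ids that appear more than once
SELECT thisid, COUNT(*) AS thiscount
INTO #Occurrences
FROM #TestTable
GROUP BY thisid
HAVING COUNT(*) > 1

-- Rows to delete = total rows in duplicated groups minus one survivor per id
SELECT SUM(thiscount) - COUNT(*) AS rows_to_delete
FROM #Occurrences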
Try this:

create table #t (id int, dt datetime, rowid int identity(1,1))

INSERT INTO #t (id,dt) VALUES (1, '11/11/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/12/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/13/2009')
INSERT INTO #t (id,dt) VALUES (2, '11/01/2009')
Query:
delete from #t where rowid not in (
    select t.rowid from #t t
    inner join (
        select id, MAX(dt) maxdate
        from #t
        group by id) X
      on t.id = X.id and t.dt = X.maxdate )
select * from #t
Output:
id  dt                       rowid
1   2009-11-13 00:00:00.000  3
2   2009-11-01 00:00:00.000  4
delete from temp where row_id not in (
select t.row_id from temp t
right join
(select id,MAX(dt) as dt from temp group by id) d
on t.dt = d.dt and t.id = d.id)
I have tested this answer (it reuses the #t table from the previous answer):
INSERT INTO #t (id,dt) VALUES (1, '11/11/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/12/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/13/2009')
INSERT INTO #t (id,dt) VALUES (2, '11/01/2009')
select * from #t
;WITH T AS (
    select dense_rank() over (partition by id order by dt desc) NO, dt, id, rowid from #t )
DELETE T WHERE NO > 1