In Redshift/Postgres, how to count rows that meet a condition? - postgresql

I'm trying to write a query that counts only the rows that meet a condition.
For example, in MySQL I would write it like this:
SELECT
COUNT(IF(grade < 70, 1, NULL))
FROM
grades
ORDER BY
id DESC;
However, when I attempt to do that on Redshift, it returns the following error:
ERROR: function if(boolean, integer, "unknown") does not exist
Hint: No function matches the given name and argument types. You may need to add explicit type casts.
I checked the documentation for conditional statements, and I found
NULLIF(value1, value2)
but it only compares value1 and value2 and if such values are equal, it returns null.
I couldn't find a simple IF statement, and at first glance I couldn't find a way to do what I want to do.
I tried to use the CASE expression, but I'm not getting the results I want:
SELECT
CASE
WHEN grade < 70 THEN COUNT(rank)
ELSE COUNT(rank)
END
FROM
grades
This is the way I want to count things:
failed (grade < 70)
average (70 <= grade < 80)
good (80 <= grade < 90)
excellent (90 <= grade <= 100)
and this is how I expect to see the results:
+========+=========+======+===========+
| failed | average | good | excellent |
+========+=========+======+===========+
| 4 | 2 | 1 | 4 |
+========+=========+======+===========+
but I'm getting this:
+========+=========+======+===========+
| failed | average | good | excellent |
+========+=========+======+===========+
| 11 | 11 | 11 | 11 |
+========+=========+======+===========+
I hope someone could point me to the right direction!
If it helps, here's some sample data:
CREATE TABLE grades(
grade integer DEFAULT 0
);
INSERT INTO grades(grade) VALUES (69), (50), (55), (60), (75), (70), (87), (100), (100), (98), (94);

First, the problem is that your CASE expression says: "If the grade is less than 70, the value of this expression is count(rank). Otherwise, the value of this expression is count(rank)." So either branch gives you the same value.
SELECT
CASE
WHEN grade < 70 THEN COUNT(rank)
ELSE COUNT(rank)
END
FROM
grades
count() only counts non-null values, so typically the pattern you'll see to accomplish what you're trying is this:
SELECT
count(CASE WHEN grade < 70 THEN 1 END) as grade_less_than_70,
count(CASE WHEN grade >= 70 and grade < 80 THEN 1 END) as grade_between_70_and_80
FROM
grades
That way the case expression will only evaluate to 1 when the test expression is true and will be null otherwise. Then the count() will only count the non-null instances, i.e. when the test expression is true, which should give you what you need.
Edit: As a side note, notice that this is exactly the same as how you had originally written this using count(if(test, true-value, false-value)), only re-written as count(case when test then true-value end) (and null is the stand in false-value since an else wasn't supplied to the case).
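Applied to the four buckets from the question, the same pattern might look like the sketch below. It assumes the grades table and the sample rows from the question; the column aliases are just the labels the question asks for.

```sql
-- Each CASE yields 1 for rows in its bucket and NULL otherwise,
-- so each count() only sees the rows that belong to that bucket.
SELECT
    count(CASE WHEN grade < 70 THEN 1 END)                   AS failed,
    count(CASE WHEN grade >= 70 AND grade < 80 THEN 1 END)   AS average,
    count(CASE WHEN grade >= 80 AND grade < 90 THEN 1 END)   AS good,
    count(CASE WHEN grade >= 90 AND grade <= 100 THEN 1 END) AS excellent
FROM grades;
```

With the eleven sample grades from the question this gives 4, 2, 1 and 4, which matches the expected output.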
Edit: Postgres 9.4 was released a few months after this original exchange. That version introduced aggregate filters, which can make scenarios like this look a little nicer and clearer. This answer still gets occasional upvotes, so if you've stumbled upon it and are using a newer Postgres (i.e. 9.4+), you might want to consider this equivalent version:
SELECT
count(*) filter (where grade < 70) as grade_less_than_70,
count(*) filter (where grade >= 70 and grade < 80) as grade_between_70_and_80
FROM
grades

Another method:
SELECT
sum(CASE WHEN grade < 70 THEN 1 else 0 END) as grade_less_than_70,
sum(CASE WHEN grade >= 70 and grade < 80 THEN 1 else 0 END) as grade_between_70_and_80
FROM
grades
Works just fine in case you want to group the counts by a categorical column.
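To illustrate the grouping case, here is a minimal sketch. It assumes a hypothetical course column on grades, which is not part of the question's table.

```sql
-- "course" is a hypothetical categorical column used only for illustration.
-- The conditional sums are computed separately for each course.
SELECT
    course,
    sum(CASE WHEN grade < 70 THEN 1 ELSE 0 END)                 AS grade_less_than_70,
    sum(CASE WHEN grade >= 70 AND grade < 80 THEN 1 ELSE 0 END) AS grade_between_70_and_80
FROM grades
GROUP BY course
ORDER BY course;
```

Because of the ELSE 0, a course with no matching rows shows 0 rather than NULL in that column.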

The solution given by @yieldsfalsehood works perfectly:
SELECT
count(*) filter (where grade < 70) as grade_less_than_70,
count(*) filter (where grade >= 70 and grade < 80) as grade_between_70_and_80
FROM
grades
But since you mentioned NULLIF(value1, value2), there's a way to get the same result with nullif. It returns NULL when the comparison is false, so count() only sees the rows where grade < 70:
select count(nullif(grade < 70, false)) as failed from grades;

Redshift only
For lazy typers, here's a "COUNTIF" sum integer casting version built on top of @user1509107's answer:
SELECT
SUM((grade < 70)::INT) AS grade_less_than_70,
SUM((grade >= 70 AND grade < 80)::INT) AS grade_between_70_and_80
FROM
grades

Related

Retain only 3 highest positive and negative records in a table

I am new to databases and postgres as such.
I have a table called names which has 2 columns name and value which gets updated every x seconds with new name value pairs. My requirement is to retain only 3 positive and 3 negative values at any point of time and delete the rest of the rows during each table update.
I use the following query to delete the old rows and retain the 3 positive and 3 negative values ordered by value.
delete from names
using (select *,
row_number() over (partition by value > 0, value < 0 order by value desc) as rn
from names ) w
where w.rn >=3
I am hesitant to use a condition like value > 0 in a PARTITION BY clause. Is this approach correct?
For example,
A table like this prior to delete :
name | value
--------------
test | 10
test1 | 11
test1 | 12
test1 | 13
test4 | -1
test4 | -2
My table after delete should look like :
name | value
--------------
test1 | 13
test1 | 12
test1 | 11
test4 | -1
test4 | -2
This generally works as expected: value > 0 splits the rows into two groups, all values > 0 and all values <= 0. The ORDER BY value then orders the rows within each group.
So, the only things I would change:
row_number() over (partition by value >= 0 order by value desc)
remove: , value < 0 (Why split the positive group again into negative and other? There are no negative numbers in the positive group and vice versa.)
change: value > 0 to value >= 0, so that 0 sorts last in the non-negative group instead of first in the negative group
For deleting, if you want to keep the top 3 values in each direction:
you should change w.rn >= 3 to w.rn > 3 (so the 3rd element is kept as well)
you need to join the subquery to the table's rows. In real cases you should use id columns for that. In your example you could use the value column: where n.value = w.value AND w.rn > 3
So, finally:
delete from names n
using (select *,
row_number() over (partition by value >= 0 order by value desc) as rn
from names ) w
where n.value = w.value AND w.rn > 3
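Joining on value removes every row that shares a value with a row ranked beyond 3, so duplicates can cause more rows to be deleted than intended. A sketch of the id-based variant follows; it assumes a hypothetical unique id column on names, which the question's table does not have.

```sql
-- "id" is a hypothetical unique key; the question's table has only name and value.
DELETE FROM names n
USING (
    SELECT id,
           -- Rank rows separately within the non-negative and the negative group.
           row_number() OVER (PARTITION BY value >= 0 ORDER BY value DESC) AS rn
    FROM names
) w
WHERE n.id = w.id   -- match each row to its own rank
  AND w.rn > 3;     -- keep ranks 1 to 3 in each group
```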
If it's not a hard requirement to delete the other rows, you could instead select only the rows you're interested in:
WITH largest AS (
SELECT name, value
FROM names
ORDER BY value DESC
LIMIT 3),
smallest AS (
SELECT name, value
FROM names
ORDER BY value ASC
LIMIT 3)
SELECT * FROM largest
UNION
SELECT * FROM smallest
ORDER BY value DESC

postgresql, convert column into header without rowid

I am trying to turn the values of a column into headers and sum the amounts for each.
I saw some examples using crosstab, but I can't figure out how to make it work without a row id.
Is there another workaround to make this work?
My Table
currency| amount
RMB | 12
IDR | 30
RMB | 22
USD | 58
IDR | 30
Expected query
RMB_sum | IDR_sum | USD_sum
34 | 60 | 58
As you've stated you know in advance all the values in currency at the time of writing this query, you can simply use conditional aggregates. As you're on PostgreSQL 9.1 you can only accomplish this by mixing sum() with a case statement:
select
sum(case when currency = 'RMB' then amount else 0 end) as RMB_sum,
sum(case when currency = 'IDR' then amount else 0 end) as IDR_sum,
sum(case when currency = 'USD' then amount else 0 end) as USD_sum
from
__transactions
(Note the above uses implicit grouping - everything in my select statement is an aggregate function so there is no need to explicitly group the query)
If you were using PostgreSQL 9.4+ you could simplify the above with the filter directive:
select
sum(amount) filter(where currency = 'RMB') as RMB_sum,
sum(amount) filter(where currency = 'IDR') as IDR_sum,
sum(amount) filter(where currency = 'USD') as USD_sum
from
__transactions
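For contrast with the implicit grouping noted above, an explicit GROUP BY returns one row per currency instead of one column per currency. This sketch uses the same __transactions name as the queries above. It can be handy for checking the totals, or when the list of currencies is not known in advance.

```sql
-- One row per currency; no currency names need to be hard-coded.
SELECT currency,
       sum(amount) AS total
FROM __transactions
GROUP BY currency
ORDER BY currency;
```

With the five sample rows from the question this yields IDR 60, RMB 34 and USD 58.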

Postgres Average calculation ignores null

This is my postgres table
name | revenue
--------+---------
John | 100
Will | 100
Tom | 100
Susan | 100
Ben |
(5 rows)
Here, when I calculate the average of revenue, it returns 100, which is clearly not sum/count: 400/5 is 80. Is this behaviour by design, or am I missing the point?
I know I could change null to 0 and process as usual. But, given the default behaviour, is this the intentional and preferred way of calculating an average?
This is both intentional and perfectly logical. Remember that NULL means that the value is unknown.
It might, for instance, represent a value which will be filled in at some future date. If the future value turns out to be 0, the average will be 400 / 5 = 80, as you say; but if the future value turns out to be 200, the average value will be 600 / 5 = 120 instead. All we can know right now is that the average of known values is 400 / 4 = 100.
If you actually know that you have 0 revenue for this item, you should store 0 in that column. If you don't know what revenue you have for that item, you should exclude it from your calculations, which is what Postgres, following the SQL Standard, does for you.
If you can't fix the data, but it is in fact the case that all NULLs in this table should be treated as 0 (or as some other fixed value), you can use a COALESCE inside the aggregate:
SELECT AVG(COALESCE(revenue, 0)) as forced_average
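To see where the two numbers come from, the aggregates can be put side by side. This is a small sketch that assumes the table is called tbl (the question does not name it) and holds the five rows shown there.

```sql
SELECT
    avg(revenue)                     AS avg_of_known,  -- the NULL row is skipped
    count(revenue)                   AS known_rows,    -- counts non-NULL revenue only
    count(*)                         AS all_rows,      -- counts every row
    sum(revenue)::numeric / count(*) AS sum_over_rows  -- divides by all rows instead
FROM tbl;
```

For the question's data this gives 100, 4, 5 and 80: avg() divides the sum by the number of non-NULL values, not by the number of rows.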
You should force a 0 value for null revenues.
create table tbl (name varchar(10), revenue int);
insert into tbl values
('John', 100), ('Will', 100), ('Tom', 100), ('Susan', 100), ('Ben', null);
5 rows affected
select avg(case when revenue is null then 0 else revenue end) from tbl;
| avg |
| ------------------: |
| 80.0000000000000000 |
select avg(coalesce(revenue,0)) from tbl;
| avg |
| ------------------: |
| 80.0000000000000000 |

PostgreSQL non-overlapping ranges

I use a PostgreSQL database and have a cards table.
Each record (card) in this table has a card_drop_rate integer value.
For example:
id | card_name |card_drop_rate
-------------------------------
1 |card1 |34
2 |card2 |16
3 |card3 |54
The max drop rate is 34 + 16 + 54 = 104.
According to my application logic, I need to pick a random value between 0 and 104 and then retrieve the card whose range contains that number, for example:
random value: 71
card1 range: 0 - 34(0 + 34)
card2 range: 34 - 50(34 + 16)
card3 range: 50 - 104(50 + 54)
So, my card is card3 because 71 is placed in the range 50 - 104
What is the proper way to reflect this structure in PostgreSQL? I'll need to query this data often, so performance is the number one criterion for this solution.
The following query works fine:
SELECT
b.id,
b.card_drop_rate
FROM (SELECT a.id, sum(a.card_drop_rate) OVER(ORDER BY id) - a.card_drop_rate as rate, card_drop_rate FROM cards as a) b
WHERE b.rate < 299 ORDER BY id DESC LIMIT 1
You can do this using cumulative sums and random. The "+ 1"s might be throwing me off, but it is something like this:
with c as (
select c.*,
sum(card_drop_rate + 1) over (order by id) - card_drop_rate as threshhold
from cards c
),
r as (
select random() * (sum(card_drop_rate) + count(*) - 1) as which_card
from cards c
)
select c.*
from c cross join
r
where which_card >= threshhold
order by threshhold desc
limit 1;
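A variant without the "+ 1" adjustments might look like the sketch below. It is not the answerer's exact method: it takes the lower bound of each card's range to be the running total minus the card's own rate (0, 34 and 50 for the sample data, as in the question) and picks the card with the highest lower bound that does not exceed the random point. The names lower_bound and pick are illustrative.

```sql
WITH c AS (
    SELECT id,
           card_name,
           card_drop_rate,
           -- Lower bound of each card's range: running total minus its own rate.
           sum(card_drop_rate) OVER (ORDER BY id) - card_drop_rate AS lower_bound
    FROM cards
),
r AS (
    -- A single random point in [0, total drop rate).
    SELECT random() * sum(card_drop_rate) AS pick
    FROM cards
)
SELECT c.id, c.card_name, c.card_drop_rate
FROM c
CROSS JOIN r
WHERE r.pick >= c.lower_bound
ORDER BY c.lower_bound DESC   -- the closest lower bound at or below the pick wins
LIMIT 1;
```

Since the first lower bound is 0 and the pick is never negative, the query returns one row whenever cards is non-empty.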
For performance, I would simply take the cards and generate a new table with 106 slots. Assign the card value to the slots and build an index on the slot number. Then get a value using the query below (wrapping random() in a scalar subquery makes it evaluate once instead of once per row):
select s.*
from slots s
where s.slotid = (select floor(random() * 107)::int);
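One way the slots table could be built is sketched below. The table layout (slotid, card_id) and the index name are assumptions, not something given in the answer, and the lookup derives the slot count from the table instead of hard-coding it. LATERAL needs PostgreSQL 9.3 or later.

```sql
-- Hypothetical slots table: one row per unit of drop rate, numbered from 0.
CREATE TABLE slots AS
SELECT (row_number() OVER (ORDER BY c.id, g.n) - 1)::int AS slotid,
       c.id AS card_id
FROM cards c
CROSS JOIN LATERAL generate_series(1, c.card_drop_rate) AS g(n);

CREATE UNIQUE INDEX slots_slotid_idx ON slots (slotid);

-- Pick one slot. The scalar subquery is evaluated once per query,
-- so exactly one slot number is compared against the index.
SELECT s.card_id
FROM slots s
WHERE s.slotid = (SELECT floor(random() * (SELECT count(*) FROM slots))::int);
```

For the sample cards this produces 104 slots (0 to 33 for card1, 34 to 49 for card2, 50 to 103 for card3). The table has to be rebuilt whenever a drop rate changes.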

Show the count based on some condition

I've asked a question some days back. Here is that link.
Count() corresponding to max()
Now, with the same set of tables (SQL Fiddle), I would like to check a different condition.
While the first question was about a count related to the max of a status, this question is about showing the count based on the next status of every project.
Explanation
As you can see in the table user_approval, appr_prjt_id=1 has 3 different statuses, namely 10, 20 and 30. The next status will be 40 (with every approval the status increases by 10), and so on. So is it possible to show that there is a project whose status is waiting to become 40? It must be counted only under the column for status 40 in the output (not under statuses 10, 20, 30, etc.).
Desired Output:
          | 10 | 20 | 30 | 40
location1 |  0 |  0 |  0 |  1
Not sure what "the next status will be 40" means. But assuming that the status is increased by 10 with every approval, the following should work:
SELECT *
FROM user_projects pr
WHERE EXISTS (
SELECT * FROM user_approval ex
WHERE ex.appr_prjt_id = pr.proj_id
AND ex.appr_status = 30
)
AND NOT EXISTS (
SELECT * FROM user_approval nx
WHERE nx.appr_prjt_id = pr.proj_id
AND nx.appr_status >= 40
);
You can get the counts for each of the next status requirements with a query that looks more like:
select
sum(case when ua.appr_status = 10 then 1 else 0 end) as app_waiting_20,
sum(case when ua.appr_status = 20 then 1 else 0 end) as app_waiting_30,
sum(case when ua.appr_status = 30 then 1 else 0 end) as app_waiting_40
from
user_approval ua;
The nice thing about this solution is only one table scan, and you can add all kinds of other counts/sums in the query result as well.
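Note that the query above counts every approval row, so a project that has already passed 10, 20 and 30 is counted in all three columns. If only the current status of each project should count, one option is to reduce each project to its highest status first. The sketch below assumes user_approval has the appr_prjt_id and appr_status columns used in the queries above; latest and max_status are illustrative names, and the per-location breakdown from the desired output is left out because the location column is not shown here.

```sql
SELECT
    sum(CASE WHEN latest.max_status = 10 THEN 1 ELSE 0 END) AS app_waiting_20,
    sum(CASE WHEN latest.max_status = 20 THEN 1 ELSE 0 END) AS app_waiting_30,
    sum(CASE WHEN latest.max_status = 30 THEN 1 ELSE 0 END) AS app_waiting_40
FROM (
    -- Highest status reached so far by each project.
    SELECT appr_prjt_id,
           max(appr_status) AS max_status
    FROM user_approval
    GROUP BY appr_prjt_id
) AS latest;
```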
select * from user_approval where appr_status
= (select max(appr_status) from user_approval where appr_status < 40);
SQL Fiddle : - http://www.sqlfiddle.com/#!11/f5243/10