I have one layer that represents several land cover categories, each identified by a code.
I want to aggregate the number of objects per subset of codes (e.g. how many objects have a code from 111 to 113? How many from 114 to 222? etc.)
How can I do it from the attribute table?
I thought about the field calculator, but I couldn't come up with a working expression.
Can someone help?
The expression to use is:
count ("code_90", group_by:= "code_90" in (111,112,113))
This counts the features that contain a value of 111, 112 or 113 in the field code_90.
To automatically create a range of values (like from 114 to 222), use this expression:
count ("code_90", group_by:= array_contains (generate_series (114,222), "code_90" ))
This counts the features that contain a value from 114 to 222.
To count the number of features in several ranges at once, use a CASE expression:
case
when array_contains (generate_series (111,113), "code_90" ) then count ("code_90", group_by:= array_contains (generate_series (111,113), "code_90" ))
when array_contains (generate_series (114,122), "code_90" ) then count ("code_90", group_by:= array_contains (generate_series (114,122), "code_90" ))
when array_contains (generate_series (123,133), "code_90" ) then count ("code_90", group_by:= array_contains (generate_series (123,133), "code_90" ))
when array_contains (generate_series (134,144), "code_90" ) then count ("code_90", group_by:= array_contains (generate_series (134,144), "code_90" ))
end
This counts the features for the following 4 ranges: 111 to 113, 114 to 122, 123 to 133 and 134 to 144.
The same, but without manually defining a line for each range: define the ranges once inside the map() on the fourth line of the expression (here four ranges: 111 to 113 / 114 to 122 / 123 to 133 / 134 to 144) to get the correct count for all ranges:
array_max(
    with_variable(
        'list',
        'map(111,113,114,122,123,133,134,144)',
        array_foreach(
            map_akeys(eval(@list)),
            case
                when array_contains(generate_series(@element, map_get(eval(@list), @element)), "code_90")
                then count("code_90", group_by:= array_contains(generate_series(@element, map_get(eval(@list), @element)), "code_90"))
            end
        )
    )
)
I have two dataframes. The first one is a raw dataframe, so its item_value column holds all the item values. The other dataframe has columns named min, avg, and max, which hold the min, avg, and max values specified for each item in the first dataframe. I want to count the number of item values in the first dataframe based on the specified aggregate values in the second dataframe.
The first dataframe looks like this:
item_name  item_value
A          1.4
A          2.1
B          3.0
A          2.8
B          4.5
B          1.1
The second dataframe looks like this:
item_name  min  avg  max
A          1.1  2    2.7
B          2.1  3    4.0
I want to count the number of item values that are greater than the defined min, avg, and max values in the other dataframe.
So the result I want is:
item_name  min  avg  max
A          3    2    1
B          2    1    1
Any help would be much appreciated.
If you don't mind a SQL implementation, you can try:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
sql = """
select df2.item_name,
sum(case when df1.item_value > df2.min then 1 else 0 end) as min,
sum(case when df1.item_value > df2.avg then 1 else 0 end) as avg,
sum(case when df1.item_value > df2.max then 1 else 0 end) as max
from df2 join df1 on df2.item_name=df1.item_name
group by df2.item_name
"""
df = spark.sql(sql)
df.show()
The consecutive numbers are grouped by the query below, but I don't know how to obtain the maximum of each group of consecutive numbers.
with trans as (
    -- flag rows that start a new run of consecutive numbers
    select c1,
           case when lag(c1) over (order by c1) = c1 - 1 then 0 else 1 end as new
    from table1
), groups as (
    -- a running sum of the flags assigns a group number to each run
    select c1, sum(new) over (order by c1) as grpnum
    from trans
), ranges as (
    -- collapse each run to its lowest and highest value
    select grpnum, min(c1) as low, max(c1) as high
    from groups
    group by grpnum
), texts as (
    -- format each run as "low-high" (or a single number)
    select grpnum,
           case
               when low = high then low::text
               else low::text || '-' || high::text
           end as txt
    from ranges
)
select string_agg(txt, ',' order by grpnum) as number
from texts;
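If all you need is the maximum of each group rather than the formatted ranges, you can stop after the groups step and aggregate directly. A minimal variant of the query above, using the same table and column names:
with trans as (
    select c1,
           case when lag(c1) over (order by c1) = c1 - 1 then 0 else 1 end as new
    from table1
), groups as (
    select c1, sum(new) over (order by c1) as grpnum
    from trans
)
-- one row per run of consecutive numbers
select grpnum, max(c1) as group_max
from groups
group by grpnum
order by grpnum;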
In R, we can create a grouping variable with diff and cumsum, and then use tapply to get the max of the vector for each group:
grp <- cumsum(c(TRUE, diff(v1) > 1))
tapply(v1, grp, FUN = max)
# 1 2 3 4 5
# 3 8 12 15 20
data
v1 <- c(1, 2, 3, 6, 7, 8, 11, 12, 15, 18, 19, 20)
I'm trying to sum a window with a filter. I saw something similar to
sum(x) filter(condition) over (partition by...)
but it does not seem to work in T-SQL (SQL Server 2017).
Essentially, I want to sum the last 5 rows that have a condition on another column.
I've tried
sum(case when condition...) over (partition...)
and sum(cast(nullif(x))) over (partition...).
I've tried left joining the table with a where condition to filter out the condition.
All of the above add the last 5 values from the starting point of the current row that has the condition.
What I want is, from the current row, to add the last 5 values above it that meet the condition.
Date  Value  Condition  Result
1-1   10     1
1-2   11     1
1-3   12     1
1-4   13     1
1-5   14     0
1-6   15     1
1-7   16     0
1-8   17     0          sum(15+13+12+11+10)
1-9   18     1          sum(18+15+13+12+11)
1-10  19     1          sum(19+18+15+13+12)
In the above example, the condition I want is 1: the 0 rows are ignored, but the "window" is still 5 non-zero-condition values.
This can easily be achieved using a correlated subquery:
First, create and populate sample table (Please save us this step in your future questions):
DECLARE @T AS TABLE
(
[Date] Date,
[Value] int,
Condition bit
)
INSERT INTO @T ([Date], [Value], Condition) VALUES
('2019-01-01', 10, 1),
('2019-01-02', 11, 1),
('2019-01-03', 12, 1),
('2019-01-04', 13, 1),
('2019-01-05', 14, 0),
('2019-01-06', 15, 1),
('2019-01-07', 16, 0),
('2019-01-08', 17, 0),
('2019-01-09', 18, 1),
('2019-01-10', 19, 1)
The query:
SELECT [Date], [Value], Condition,
(
SELECT Sum([Value])
FROM
(
SELECT TOP 5 [Value]
FROM @T AS t1
WHERE Condition = 1
AND t1.[Date] <= t0.[Date]
-- If you want the sum to appear starting from a specific date, unremark the next row
--AND t0.[Date] > '2019-01-07'
ORDER BY [Date] DESC
) As t2
HAVING COUNT(*) = 5 -- there are at least 5 rows meeting the condition
) As Result
FROM @T AS t0
Results:
Date Value Condition Result
2019-01-01 10 1
2019-01-02 11 1
2019-01-03 12 1
2019-01-04 13 1
2019-01-05 14 0
2019-01-06 15 1 61
2019-01-07 16 0 61
2019-01-08 17 0 61
2019-01-09 18 1 69
2019-01-10 19 1 77
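The same correlated logic can also be written with OUTER APPLY, which some find easier to read and to extend. This is just an equivalent sketch of the query above, against the same @T table variable:
SELECT t0.[Date], t0.[Value], t0.Condition, r.Result
FROM @T AS t0
OUTER APPLY (
    SELECT SUM([Value]) AS Result
    FROM
    (
        SELECT TOP 5 [Value]
        FROM @T AS t1
        WHERE t1.Condition = 1
        AND t1.[Date] <= t0.[Date]
        ORDER BY [Date] DESC
    ) AS t2
    -- only produce a sum once 5 rows meet the condition
    HAVING COUNT(*) = 5
) AS r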
How do I find the distribution of credit cards by year opened and completed transactions, grouping the credit cards into three buckets: less than 10 transactions, between 10 and 30 transactions, and more than 30 transactions?
The first method I tried was the width_bucket function in PostgreSQL, but the documentation says it only creates equidistant buckets, which is not what I want in this case. Because of that, I turned to CASE statements. However, I'm not sure how to use a CASE statement with a GROUP BY.
This is the data I am working with:
table 1 - credit_cards table
credit_card_id
year_opened
table 2 - transactions table
transaction_id
credit_card_id - matches credit_cards.credit_card_id
transaction_status ("complete" or "incomplete")
This is what I have gotten so far:
SELECT
CASE WHEN transaction_count < 10 THEN “Less than 10”
WHEN transaction_count >= 10 and transaction_count < 30 THEN “10 <= transaction count < 30”
ELSE transaction_count>=30 THEN “Greater than or equal to 30”
END as buckets
count(*) as ct.transaction_count
FROM credit_cards c
INNER JOIN transactions t
ON c.credit_card_id = t.credit_card_id
WHERE t.status = “completed”
GROUP BY v.year_opened
GROUP BY buckets
ORDER BY buckets
Expected output
credit card count | year opened | transaction count bucket
23421 | 2002 | Less than 10
etc
You can specify the bin sizes in width_bucket by supplying a sorted array of the lower bound of each bin.
In your case, it would be array[10,30]: anything less than 10 gets bin 0, between 10 and 29 gets bin 1, and 30 or more gets bin 2.
WITH a AS (select generate_series(5,35) cnt)
SELECT cnt, width_bucket(cnt, array[10,30])
FROM a;
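Applied to the tables from the question, a sketch (assuming the column names given above; the array form of width_bucket requires PostgreSQL 9.5+) would first count completed transactions per card and then bucket those counts:
SELECT c.year_opened,
       width_bucket(t.cnt, array[10,30]) AS bucket,  -- 0, 1 or 2, as described above
       count(*) AS credit_card_count
FROM credit_cards c
JOIN (SELECT credit_card_id, count(*) AS cnt
      FROM transactions
      WHERE transaction_status = 'complete'
      GROUP BY credit_card_id) t
  ON t.credit_card_id = c.credit_card_id
GROUP BY c.year_opened, bucket
ORDER BY c.year_opened, bucket;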
To figure this out you need to count transactions per credit card in order to figure out the right bucket, then you need to count the credit cards per bucket per year. There are a couple of different ways to get the final result. One way is to first join up all your data and compute the first level of aggregate values. Then compute the final level of aggregate values:
with t1 as (
select year_opened
, c.credit_card_id
, case when count(*) < 10 then 'Less than 10'
when count(*) < 30 then 'Between [10 and 30)'
else 'Greater than or equal to 30'
end buckets
from credit_cards c
join transactions t
on t.credit_card_id = c.credit_card_id
where t.transaction_status = 'complete'
group by year_opened
, c.credit_card_id
)
select count(*) credit_card_count
, year_opened
, buckets
from t1
group by year_opened
, buckets;
However, it may be more performant to first calculate the first level of aggregate data on the transactions table before joining it to the credit_cards table:
select count(*) credit_card_count
, year_opened
, buckets
from credit_cards c
join (select credit_card_id
, case when count(*) < 10 then 'Less than 10'
when count(*) < 30 then 'Between [10 and 30)'
else 'Greater than or equal to 30'
end buckets
from transactions
where transaction_status = 'complete'
group by credit_card_id) t
on t.credit_card_id = c.credit_card_id
group by year_opened
, buckets;
If you prefer to unroll the above query and use Common Table Expressions, you can do that too (I find this easier to read/follow along):
with bkt as (
select credit_card_id
, case when count(*) < 10 then 'Less than 10'
when count(*) < 30 then 'Between [10 and 30)'
else 'Greater than or equal to 30'
end buckets
from transactions
where transaction_status = 'complete'
group by credit_card_id
)
select count(*) credit_card_count
, year_opened
, buckets
from credit_cards c
join bkt t
on t.credit_card_id = c.credit_card_id
group by year_opened
, buckets;
Not sure if this is what you are looking for.
WITH cte
AS (
SELECT c.year_opened
,c.credit_card_id
,count(*) AS transaction_count
FROM credit_cards c
INNER JOIN transactions t ON c.credit_card_id = t.credit_card_id
WHERE t.transaction_status = 'complete'
GROUP BY c.year_opened
,c.credit_card_id
)
SELECT cte.year_opened AS "year opened"
,SUM(CASE
WHEN transaction_count < 10
THEN 1
ELSE 0
END) AS "Less than 10"
,SUM(CASE
WHEN transaction_count >= 10
AND transaction_count < 30
THEN 1
ELSE 0
END) AS "10 <= transaction count < 30"
,SUM(CASE
WHEN transaction_count >= 30
THEN 1
ELSE 0
END) AS "Greater than or equal to 30"
FROM cte
GROUP BY cte.year_opened
The output would be as below:
year opened | Less than 10 | 10 <= transaction count < 30 | Greater than or equal to 30
2002 | 23421 | |
I'm trying to write a query that count only the rows that meet a condition.
For example, in MySQL I would write it like this:
SELECT
COUNT(IF(grade < 70, 1, NULL))
FROM
grades
ORDER BY
id DESC;
However, when I attempt to do that on Redshift, it returns the following error:
ERROR: function if(boolean, integer, "unknown") does not exist
Hint: No function matches the given name and argument types. You may need to add explicit type casts.
I checked the documentation for conditional statements, and I found
NULLIF(value1, value2)
but it only compares value1 and value2 and if such values are equal, it returns null.
I couldn't find a simple IF statement, and at first glance I couldn't find a way to do what I want to do.
I tried to use the CASE expression, but I'm not getting the results I want:
SELECT
CASE
WHEN grade < 70 THEN COUNT(rank)
ELSE COUNT(rank)
END
FROM
grades
This is the way I want to count things:
failed (grade < 70)
average (70 <= grade < 80)
good (80 <= grade < 90)
excellent (90 <= grade <= 100)
and this is how I expect to see the results:
+========+=========+======+===========+
| failed | average | good | excellent |
+========+=========+======+===========+
| 4 | 2 | 1 | 4 |
+========+=========+======+===========+
but I'm getting this:
+========+=========+======+===========+
| failed | average | good | excellent |
+========+=========+======+===========+
| 11 | 11 | 11 | 11 |
+========+=========+======+===========+
I hope someone can point me in the right direction!
If it helps, here's some sample info:
CREATE TABLE grades(
    grade integer DEFAULT 0
);
INSERT INTO grades(grade) VALUES (69), (50), (55), (60), (75), (70), (87), (100), (100), (98), (94);
First, the issue you're having here is that what you're saying is "If the grade is less than 70, the value of this case expression is count(rank). Otherwise, the value of this expression is count(rank)." So, in either case, you're always getting the same value.
SELECT
CASE
WHEN grade < 70 THEN COUNT(rank)
ELSE COUNT(rank)
END
FROM
grades
count() only counts non-null values, so typically the pattern you'll see to accomplish what you're trying is this:
SELECT
count(CASE WHEN grade < 70 THEN 1 END) as grade_less_than_70,
count(CASE WHEN grade >= 70 and grade < 80 THEN 1 END) as grade_between_70_and_80
FROM
grades
That way the case expression will only evaluate to 1 when the test expression is true and will be null otherwise. Then the count() will only count the non-null instances, i.e. when the test expression is true, which should give you what you need.
Edit: As a side note, notice that this is exactly the same as how you had originally written this using count(if(test, true-value, false-value)), only re-written as count(case when test then true-value end) (and null is the stand in false-value since an else wasn't supplied to the case).
Edit: Postgres 9.4 was released a few months after this original exchange. That version introduced aggregate filters, which can make scenarios like this look a little nicer and clearer. This answer still gets some occasional upvotes, so if you've stumbled upon it and are using a newer Postgres (i.e. 9.4+), you might want to consider this equivalent version:
SELECT
count(*) filter (where grade < 70) as grade_less_than_70,
count(*) filter (where grade >= 70 and grade < 80) as grade_between_70_and_80
FROM
grades
Another method:
SELECT
sum(CASE WHEN grade < 70 THEN 1 else 0 END) as grade_less_than_70,
sum(CASE WHEN grade >= 70 and grade < 80 THEN 1 else 0 END) as grade_between_70_and_80
FROM
grades
This works just fine in case you want to group the counts by a categorical column; see the example below.
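For instance, a sketch of the grouped form (class_name here is a hypothetical grouping column, not part of the original grades table):
SELECT
    class_name,  -- hypothetical categorical column, used for illustration only
    sum(CASE WHEN grade < 70 THEN 1 else 0 END) as grade_less_than_70,
    sum(CASE WHEN grade >= 70 and grade < 80 THEN 1 else 0 END) as grade_between_70_and_80
FROM
    grades
GROUP BY
    class_name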
The solution given by @yieldsfalsehood works perfectly:
SELECT
count(*) filter (where grade < 70) as grade_less_than_70,
count(*) filter (where grade >= 70 and grade < 80) as grade_between_70_and_80
FROM
grades
But since you talked about NULLIF(value1, value2), there's a way with nullif that gives the same result: nullif(condition, false) is null exactly when the condition is false, so count() only counts the rows where the condition holds:
select count(nullif(grade < 70, false)) as failed from grades;
Redshift only
For lazy typers, here's a "COUNTIF" sum integer-casting version built on top of @user1509107's answer:
SELECT
SUM((grade < 70)::INT) AS grade_less_than_70,
SUM((grade >= 70 AND grade < 80)::INT) AS grade_between_70_and_80
FROM
grades
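For completeness, the same casting pattern extended to all four buckets from the question; against the sample data above it returns the expected 4, 2, 1, 4:
SELECT
    SUM((grade < 70)::INT) AS failed,
    SUM((grade >= 70 AND grade < 80)::INT) AS average,
    SUM((grade >= 80 AND grade < 90)::INT) AS good,
    SUM((grade >= 90 AND grade <= 100)::INT) AS excellent
FROM
    grades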