Is there a way to use a value from an aggregate function in a having clause in Postgresql 9.2+?
For example, I would like to get each monkey_id whose second-highest number is greater than 123, along with that second-highest number. In the example below, I'd like to get (monkey_id 1, number 222).
monkey_id | number
------------------
1 | 222
1 | 333
2 | 0
2 | 444
SELECT
monkey_id,
(array_agg(number ORDER BY number desc))[2] as second
FROM monkey_numbers
GROUP BY monkey_id
HAVING second > 123
I get column "second" does not exist.
You will have to place the aggregate expression itself in the HAVING clause:
SELECT
monkey_id
FROM monkey_numbers
GROUP BY monkey_id
HAVING (array_agg(number ORDER BY number desc))[2] > 123
The explanation is that HAVING is evaluated before the SELECT list, so the alias second does not exist yet at that point.
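If you also want the second-highest number in the output, as in your original attempt, you can repeat the aggregate expression in both places. A sketch based on the query above:
SELECT
  monkey_id,
  (array_agg(number ORDER BY number DESC))[2] AS second
FROM monkey_numbers
GROUP BY monkey_id
HAVING (array_agg(number ORDER BY number DESC))[2] > 123;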
I am new to databases and postgres as such.
I have a table called names which has 2 columns name and value which gets updated every x seconds with new name value pairs. My requirement is to retain only 3 positive and 3 negative values at any point of time and delete the rest of the rows during each table update.
I use the following query to delete the old rows and retain the 3 positive and 3 negative values ordered by value.
delete from names
using (
    select *,
           row_number() over (partition by value > 0, value < 0 order by value desc) as rn
    from names
) w
where w.rn >= 3
I am skeptical about using a condition like value > 0 in a PARTITION BY clause. Is this approach correct?
For example, a table like this prior to the delete:
name | value
--------------
test | 10
test1 | 11
test1 | 12
test1 | 13
test4 | -1
test4 | -2
My table after the delete should look like:
name | value
--------------
test1 | 13
test1 | 12
test1 | 11
test4 | -1
test4 | -2
demo:db<>fiddle
This generally works as expected: value > 0 splits the rows into two partitions, all numbers > 0 and all numbers <= 0. ORDER BY value DESC then orders the rows within each partition as expected.
So, the only things I would change:
row_number() over (partition by value >= 0 order by value desc)
remove , value < 0 (why split the positive partition further into negative and non-negative? There are no negative numbers in the positive partition, and vice versa)
change value > 0 to value >= 0 so that 0 is grouped with the positive values, where it ranks last and is ignored for as long as possible
For deleting: if you want to keep the top 3 values in each direction:
change w.rn >= 3 to w.rn > 3 (so the 3rd element is kept as well)
connect the subquery to the table's rows. In real cases you should use an id column for that (see the sketch after the final query below); in your example you could use the value column: where n.value = w.value AND w.rn > 3
So, finally:
delete from names n
using (
    select *,
           row_number() over (partition by value >= 0 order by value desc) as rn
    from names
) w
where n.value = w.value AND w.rn > 3
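In real cases with a unique key, the join back is cleaner. A minimal sketch, assuming a hypothetical id primary key column on names:
delete from names n
using (
    -- the id column is an assumption; the table in the question has no such column
    select id,
           row_number() over (partition by value >= 0 order by value desc) as rn
    from names
) w
where n.id = w.id
  and w.rn > 3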
If it's not a hard requirement to delete the other rows, you could instead select only the rows you're interested in:
WITH largest AS (
    SELECT name, value
    FROM names
    ORDER BY value DESC
    LIMIT 3
), smallest AS (
    SELECT name, value
    FROM names
    ORDER BY value ASC
    LIMIT 3
)
SELECT * FROM largest
UNION
SELECT * FROM smallest
ORDER BY value DESC
I have a relation in a PostgreSQL database called 'processed_data' having the following schema:
Date -> date type, shop_id -> integer type, item_category_id -> integer type, sum_item_cnt_day -> real type.
The first 5 rows of the relation look like this:
date | shop_id | item_category_id | sum_item_cnt_day
------+-----------+--------------------+------------------
2014-12-29 | 49 | 3 | 4
2014-12-29 | 49 | 6 | 1
2014-12-29 | 49 | 7 | 1
2014-12-29 | 49 | 12 | 3
2014-12-29 | 49 | 16 | 1
Now, 'shop_id' covers 60 unique shops, numbered 0 to 59. Each shop sells items grouped by 'item_category_id', and 'sum_item_cnt_day' denotes the number of items sold by a shop for that item_category_id.
I am now trying to further aggregate the data so that the final result contains only the following columns:
date, shop_id, sum_item_cnt_day
That is, the data should be aggregated over all 'item_category_id' values per shop (denoted by 'shop_id'), calculating the sum of 'sum_item_cnt_day'.
When I try to execute the following SQL command-
select date, shop_id, sum(sum_item_cnt_day) from processed_data group by shop_id;
It gives the error-
ERROR: column "processed_data.date" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select date, shop_id, sum(sum_item_cnt_day) from processed_d...
^
Even the following SQL command-
select date, shop_id, sum(sum_item_cnt_day) from processed_data where date between '2013-01-01' and '2013-01-31' group by shop_id;
Gives the error-
ERROR: column "processed_data.date" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select date, shop_id, sum(sum_item_cnt_day) from processed_d...
^
Any suggestions as to what's going wrong and what I am missing?
Thanks!
The simplest fix, which is what I think you want, would be to just add date to the GROUP BY clause:
SELECT date, shop_id, SUM(sum_item_cnt_day)
FROM processed_data
GROUP BY date, shop_id;
If you really don't want sums taken for each shop on each day, but rather for each shop over all days, then you will have to think about which of the many dates you want to display, if any.
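For example, a sketch of the per-shop-over-all-days variant, which drops date from the output and instead reports the covered date range with MIN/MAX (assuming that is the aggregation you actually want):
-- one row per shop, summed over all days
SELECT shop_id,
       MIN(date) AS first_date,
       MAX(date) AS last_date,
       SUM(sum_item_cnt_day) AS total_items
FROM processed_data
GROUP BY shop_id;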
I have a query as such:
SELECT disease_name, COUNT(DISTINCT id)
FROM disease_table
GROUP BY disease_name
where each disease_name has an associated identifier, and a disease may occur multiple times for the same identifier.
This works, BUT it takes roughly 7s to run.
If I run this query:
SELECT disease_name, COUNT(disease_name)
FROM disease_table
GROUP BY disease_name
it takes 321ms, BUT duplicate rows (same disease with same id) are counted more than once.
Is there a more efficient way to achieve the results of the first query in about the same time as the second using only SQL?
Table:
disease_name | id
-------------|----
dis_1        | 123
dis_1        | 104
dis_1        | 104
dis_32       | 123
dis_12       | 123
dis_12       | 115
Expected:
disease_name | count
-------------|------
dis_1        | 2
dis_32       | 1
dis_12       | 2
where dis_1 has 3 entries but is only counted twice because two of those 3 entries have the same id
Try to add a proper index on disease_table, like this:
CREATE INDEX ON disease_table(disease_name, id);
See if that solves your issue.
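If the index alone is not enough, another approach that is sometimes faster is to deduplicate the (disease_name, id) pairs in a subquery first and then count plain rows. A sketch, not benchmarked against your data:
-- deduplicate first, then count rows per disease
SELECT disease_name, COUNT(*) AS count
FROM (
    SELECT DISTINCT disease_name, id
    FROM disease_table
) AS dedup
GROUP BY disease_name;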
A fair amount of material is available detailing methods utilising dense_rank() and the like to count distinct somethings per month. However, I've been unable to find anything that counts distincts per month while also removing/discounting any ids that have already been seen in prior months.
The data can be imagined like so:
id (int8 type) | observed time (timestamp utc)
------------------
1 | 2017-01-01
2 | 2017-01-02
1 | 2017-01-02
1 | 2017-02-02
2 | 2017-02-03
3 | 2017-02-04
1 | 2017-03-01
3 | 2017-03-01
4 | 2017-03-01
5 | 2017-03-02
The process of the count can be seen as:
1: in 2017-01 we saw devices 1 and 2 so the count is 2
2: in 2017-02 we saw devices 1, 2 and 3. We know already about devices 1 and 2, but not 3, so the count is 1
3: in 2017-03 we saw devices 1, 3, 4 and 5. We already know about 1 and 3, but not 4 or 5, so the count is 2.
with the desired output being something like:
observed time | count of new id
--------------------------
2017-01 | 2
2017-02 | 1
2017-03 | 2
Explicitly, I am looking to have a new table, with an aggregated month per row, with a count of how many new ids occur within that month that have not been seen at all before.
The IRL case allows devices to be seen more than once in a month, but this shouldn't impact the count. The id is stored as an integer (both positive and negative values), and time periods will be to the second in true timestamps. The size of the data set is also significant.
My initial attempt is along the lines of:
WITH records_months AS (
    SELECT *,
           date_trunc('month', observed_time) AS month_group
    FROM my_table
    WHERE observed_time > '2017-01-01'
), id_months AS (
    SELECT DISTINCT
           month_group,
           id
    FROM records_months
    GROUP BY month_group, id
)
SELECT *
FROM id_months
However, I'm stuck on the next part, i.e. counting the number of new ids that were not seen in prior months. I believe the solution might be a window function, but I'm having trouble working out which one, or how to apply it.
First thing I thought of. The idea is to
(innermost query) calculate the earliest month that each id was seen,
(next level up) join that back to the main my_table dataset, and then
(outer query) count distinct ids by month after nulling out the already-seen ids.
I tested it out and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (vs. a window function). Hopefully this is performant enough for your Redshift!
select observed_month,
       -- Null out the id if the observed_month that we're grouping by
       -- is NOT the earliest month that the id was seen.
       -- Then count distinct id
       count(distinct case when observed_month != earliest_month then null else id end) as num_new_ids
from (
    select t.id,
           date_trunc('month', t.observed_time) as observed_month,
           earliest.earliest_month
    from my_table t
    join (
        -- What's the earliest month an id was seen?
        select id,
               date_trunc('month', min(observed_time)) as earliest_month
        from my_table
        group by 1
    ) earliest
      on t.id = earliest.id
) sub
group by 1
order by 1;
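Since an id is only "new" in the month it is first observed, a shorter variant is to compute each id's earliest month once and then count ids per first month. A sketch against the same my_table, not tested on Redshift; note that months with zero new ids simply won't appear in the output:
select first_month as observed_month,
       count(*) as num_new_ids
from (
    -- earliest month each id was seen
    select id,
           date_trunc('month', min(observed_time)) as first_month
    from my_table
    group by id
) first_seen
group by first_month
order by first_month;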
(SSRS 2008)
I have a dataset with results looking like this:
FUNCTION | EMP-NMB
------------------
A | 100
A | 101
A | 103
B | 102
I want to display this data in my report in this way:
A | B
------------
100 | 102
101 |
103 |
I managed to display it this way:
A | B
------------
100 |
101 |
103 |
| 102
But that table becomes very large with more data.
The number of employees and functions can vary. For now I am using a Matrix, but I don't know how to configure it to work the way I want.
I think the problem is that you are probably using EMP-NMB as your Row Group grouping.
Since you want the report to display different employees on the same line, you need to group on something else. Unfortunately, there isn't anything suitable in the data you list, but you can add a ROW_NUMBER() to the query.
SELECT [FUNCTION], [EMP-NMB], ROW_NUMBER() OVER(PARTITION BY [FUNCTION] ORDER BY [EMP-NMB]) AS ROW_NUM
FROM ...
Then change the tablix Row Group's Group On property to use the new ROW_NUM field.