Retrieve additional columns on aggregation and date operator - PostgreSQL

I have the following PostgreSQL table structure, which gathers temperature records for every second:
+----+--------+-------------------------------+---------+
| id | value  | date                          | station |
+----+--------+-------------------------------+---------+
|  1 |      0 | 2017-08-22 14:01:09.314625+02 |       1 |
|  2 |      0 | 2017-08-22 14:01:09.347758+02 |       1 |
|  3 | 25.187 | 2017-08-22 14:01:10.315413+02 |       1 |
|  4 | 24.937 | 2017-08-22 14:01:10.322528+02 |       1 |
|  5 | 25.187 | 2017-08-22 14:01:11.347271+02 |       1 |
|  6 | 24.937 | 2017-08-22 14:01:11.355005+02 |       1 |
| 18 | 24.875 | 2017-08-22 14:01:17.35265+02  |       1 |
| 19 | 25.187 | 2017-08-22 14:01:18.34673+02  |       1 |
| 20 | 24.875 | 2017-08-22 14:01:18.355082+02 |       1 |
| 21 | 25.187 | 2017-08-22 14:01:19.361491+02 |       1 |
| 22 | 24.875 | 2017-08-22 14:01:19.371154+02 |       1 |
| 23 | 25.187 | 2017-08-22 14:01:20.354576+02 |       1 |
| 30 | 24.937 | 2017-08-22 14:01:23.372612+02 |       1 |
| 31 |      0 | 2017-08-22 15:58:53.576238+02 |       1 |
| 32 |      0 | 2017-08-22 15:58:53.590872+02 |       1 |
| 33 | 26.625 | 2017-08-22 15:58:54.59986+02  |       1 |
| 38 | 26.375 | 2017-08-22 15:58:56.593205+02 |       1 |
| 39 |      0 | 2017-08-21 15:59:40.181317+02 |       1 |
| 40 |      0 | 2017-08-21 15:59:40.190221+02 |       1 |
| 41 | 26.562 | 2017-08-21 15:59:41.182622+02 |       1 |
| 42 | 26.375 | 2017-08-21 15:59:41.18905+02  |       1 |
+----+--------+-------------------------------+---------+
I now want to retrieve the maximum value for every hour, along with the data associated with that entry (id, date). So I tried the following:
select max(value) as m, (date_trunc('hour', date)) as d
from temperature
where station='1'
group by (date_trunc('hour', date));
This works fine (fiddle), but I only get the columns m and d in the result. If I try to add the date or id columns to the SELECT list, I get the usual error: column "temperature.id" must appear in the GROUP BY clause or be used in an aggregate function.
I have already tried approaches such as the ones described here, unfortunately to no avail; for instance, I seem unable to perform a join on the date_trunc-generated columns.
The result I am aiming for is this:
+----+--------+-------------------------------+---------+
| id | value  | date                          | station |
+----+--------+-------------------------------+---------+
|  3 | 25.187 | 2017-08-22 14:01:10.315413+02 |       1 |
| 33 | 26.625 | 2017-08-22 15:58:54.59986+02  |       1 |
| 41 | 26.562 | 2017-08-21 15:59:41.182622+02 |       1 |
+----+--------+-------------------------------+---------+
It does not matter which record is retrieved if two or more entries have the same value.

Use DISTINCT ON:
select distinct on (date_trunc('hour', date)) *
from temperature
where station = '1'
order by date_trunc('hour', date), value desc
Fiddle
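For comparison, the same result can be produced with a window function, which also makes the tie-breaking explicit (an untested sketch using the same table and columns as above):

select id, value, date, station
from (
    select t.*,
           row_number() over (partition by date_trunc('hour', date)
                              order by value desc) as rn
    from temperature t
    where station = '1'
) s
where rn = 1;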

Related

PostgreSQL question: get data by the last date of each record and subtract a number of days from the last date

Please help me write this query; I'm at a dead end.
There are 2 tables:
“Trains”:
+----+---------+
| id | numbers |
+----+---------+
|  1 |     101 |
|  2 |     102 |
|  3 |     103 |
|  4 |     104 |
|  5 |     105 |
+----+---------+
“Passages”:
+----+--------------+-------+---------------------+
| id | train_number | speed | date_time           |
+----+--------------+-------+---------------------+
|  1 |          101 |    26 | 2021-11-10 16:26:30 |
|  2 |          101 |    28 | 2021-11-12 16:26:30 |
|  3 |          102 |    24 | 2021-11-14 16:26:30 |
|  4 |          103 |    27 | 2021-11-15 16:26:30 |
|  5 |          101 |    29 | 2021-11-16 16:26:30 |
+----+--------------+-------+---------------------+
The goal is to go through the train numbers from the Trains table and, for each one, take the latest passage date (date_time) from the Passages table, plus the number of passages within "the last date for each train" minus N days, i.e. date_time - interval 'N days'. I should get something like:
+----+--------+---------------------+----------------+
| id | train  | last_passage        | count_passages |
+----+--------+---------------------+----------------+
|  1 |    101 | 2021-11-10 16:26:30 |              2 |
|  2 |    102 | 2021-11-14 16:26:30 |              1 |
|  3 |    103 | 2021-11-15 16:26:30 |              1 |
|  4 |    104 | null                |              0 |
|  5 |    105 | null                |              0 |
+----+--------+---------------------+----------------+
PS: count_passages counts the passages within, for example, the last passage date minus 4 days.
I tried WHERE ... IN, but I can't build the correct query.
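A possible shape for such a query (an untested sketch, assuming the tables are named trains and passages, that N = 4 days as in the example, and that last_passage means the latest date_time per train):

SELECT t.id,
       t.numbers AS train,
       l.last_passage,
       coalesce(c.count_passages, 0) AS count_passages
FROM trains t
LEFT JOIN LATERAL (
    SELECT max(p.date_time) AS last_passage
    FROM passages p
    WHERE p.train_number = t.numbers
) l ON true
LEFT JOIN LATERAL (
    SELECT count(*) AS count_passages
    FROM passages p
    WHERE p.train_number = t.numbers
      AND p.date_time >= l.last_passage - interval '4 days'
) c ON true
ORDER BY t.id;

For trains with no passages, last_passage stays null and the count comes out as 0, matching the expected output.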

Serial Number in logical order without gaps

I'm trying to generate a serial number based on a few conditions.
My dataset:
+--------+------------+------------+---------+--------+
| Client | Start_Date | End_date   | Product | Ser_No |
+--------+------------+------------+---------+--------+
| 44     | 22-01-2018 | 31-12-2018 | A       |        |
+--------+------------+------------+---------+--------+
| 44     | 24-02-2018 | 01-01-2019 | B       |        |
+--------+------------+------------+---------+--------+
| 44     | 12-03-2018 | 01-01-2019 | C       |        |
+--------+------------+------------+---------+--------+
| 100    | 24-01-2018 | 30-11-2018 | A       |        |
+--------+------------+------------+---------+--------+
| 100    | 26-01-2018 | 15-12-2018 | D       |        |
+--------+------------+------------+---------+--------+
| 100    | 26-01-2018 | 01-02-2019 | E       |        |
+--------+------------+------------+---------+--------+
| 100    | 01-03-2018 | 31-01-2019 | F       |        |
+--------+------------+------------+---------+--------+
What I did to configure my serial number:
RANK() OVER(PARTITION BY Client ORDER BY Client, Start_date ASC)
It now generates a serial number for me, which looks like this:
+--------+------------+------------+---------+--------+
| Client | Start_Date | End_date   | Product | Ser_No |
+--------+------------+------------+---------+--------+
| 44     | 22-01-2018 | 31-12-2018 | A       | 1      |
+--------+------------+------------+---------+--------+
| 44     | 24-02-2018 | 01-01-2019 | B       | 2      |
+--------+------------+------------+---------+--------+
| 44     | 12-03-2018 | 01-01-2019 | C       | 3      |
+--------+------------+------------+---------+--------+
| 100    | 24-01-2018 | 30-11-2018 | A       | 1      |
+--------+------------+------------+---------+--------+
| 100    | 26-01-2018 | 15-12-2018 | D       | 2      |
+--------+------------+------------+---------+--------+
| 100    | 26-01-2018 | 01-02-2019 | E       | 2      |
+--------+------------+------------+---------+--------+
| 100    | 01-03-2018 | 31-01-2019 | F       | 4      |
+--------+------------+------------+---------+--------+
What goes wrong for my analysis is the last line: it gets serial number 4, but it has to be 3.
Can anyone help me generate it in this order?
Thanks in advance!
Extra
In addition to my question from yesterday, there is something extra I need: the Ser_No has to be the same when the Start_Date is the same, but it also has to stay the same when the following record has the same product (even when it has a different Start_Date).
So here is what I get right now (Ser_No) and what I expect (Ser_No New):
+--------+------------+------------+---------+--------+------------+
| Client | Start_Date | End_date   | Product | Ser_No | Ser_No New |
+--------+------------+------------+---------+--------+------------+
| 44     | 22-01-2018 | 31-12-2018 | A       | 1      | 1          |
+--------+------------+------------+---------+--------+------------+
| 44     | 24-02-2018 | 01-01-2019 | B       | 2      | 2          |
+--------+------------+------------+---------+--------+------------+
| 44     | 12-03-2018 | 01-01-2019 | C       | 2      | 2          |
+--------+------------+------------+---------+--------+------------+
| 100    | 24-01-2018 | 30-11-2018 | A       | 1      | 1          |
+--------+------------+------------+---------+--------+------------+
| 100    | 26-01-2018 | 15-12-2018 | D       | 2      | 2          |
+--------+------------+------------+---------+--------+------------+
| 100    | 26-01-2018 | 01-02-2019 | E       | 2      | 2          |
+--------+------------+------------+---------+--------+------------+
| 100    | 01-03-2018 | 31-01-2019 | F       | 3      | 3          |
+--------+------------+------------+---------+--------+------------+
| 100    | 11-04-2018 | 31-03-2019 | F       | 4      | 3          |
+--------+------------+------------+---------+--------+------------+
| 100    | 20-04-2018 | 31-01-2019 | G       | 5      | 4          |
+--------+------------+------------+---------+--------+------------+
| 100    | 21-04-2018 | 31-01-2019 | A       | 6      | 5          |
+--------+------------+------------+---------+--------+------------+
| 100    | 21-04-2018 | 31-01-2019 | B       | 6      | 5          |
+--------+------------+------------+---------+--------+------------+
| 100    | 01-05-2018 | 31-01-2019 | B       | 7      | 5          |
+--------+------------+------------+---------+--------+------------+
Any idea how to achieve this? I can't figure it out.
You need to use DENSE_RANK instead:
This function returns the rank of each row within a result set partition, with no gaps in the ranking values.
DENSE_RANK() OVER(PARTITION BY Client ORDER BY Start_date) AS Ser_no
Additionally, the Client in the ORDER BY has no effect, because it has the same value throughout each partition.
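Applied to the sample data, the full query might look like this (a sketch, assuming the table is named my_table; adjust to your schema):

SELECT Client, Start_Date, End_date, Product,
       DENSE_RANK() OVER (PARTITION BY Client ORDER BY Start_date) AS Ser_No
FROM my_table;

Note that this covers the original question; the extra product-based requirement would need additional logic, e.g. comparing each row's product with the previous one via LAG().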

Return unique grouped rows with the latest timestamp [duplicate]

This question already has answers here:
Select first row in each GROUP BY group?
(20 answers)
Closed 3 years ago.
At the moment I'm struggling with a problem that looks very easy.
Table content:
Primary keys: Timestamp, COL_A, COL_B, COL_C, COL_D
+------------------+-------+-------+-------+-------+--------+--------+
| Timestamp        | COL_A | COL_B | COL_C | COL_D | Data_A | Data_B |
+------------------+-------+-------+-------+-------+--------+--------+
| 31.07.2019 15:12 |     - |     - |     - |     - |      1 |      2 |
| 31.07.2019 15:32 |     1 |     1 |   100 |     1 |   5000 |     20 |
| 10.08.2019 09:33 |     - |     - |     - |     - |   1000 |      7 |
| 31.07.2019 15:38 |     1 |     1 |   100 |     1 |     33 |      5 |
| 06.08.2019 08:53 |     - |     - |     - |     - |      0 |      7 |
| 06.08.2019 09:08 |     - |     - |     - |     - |      0 |      7 |
| 06.08.2019 16:06 |     3 |     3 |     3 |     3 |      0 |     23 |
| 07.08.2019 10:43 |     - |     - |     - |     - |      0 |     42 |
| 07.08.2019 13:10 |     - |     - |     - |     - |      0 |     24 |
| 08.08.2019 07:19 |    11 |   111 |   111 |    12 |      0 |      2 |
| 08.08.2019 10:54 |  2334 | 65464 |   565 |    76 |   1000 |     19 |
| 08.08.2019 11:15 |   232 |   343 |   343 |    43 |      0 |      2 |
| 08.08.2019 11:30 |  2323 | rtttt |  3434 |    34 |      0 |      2 |
| 10.08.2019 14:47 |     - |     - |     - |     - |    123 |     23 |
+------------------+-------+-------+-------+-------+--------+--------+
Needed query output:
+------------------+-------+-------+-------+-------+--------+--------+
| Timestamp        | COL_A | COL_B | COL_C | COL_D | Data_A | Data_B |
+------------------+-------+-------+-------+-------+--------+--------+
| 31.07.2019 15:38 |     1 |     1 |   100 |     1 |     33 |      5 |
| 06.08.2019 16:06 |     3 |     3 |     3 |     3 |      0 |     23 |
| 08.08.2019 07:19 |    11 |   111 |   111 |    12 |      0 |      2 |
| 08.08.2019 10:54 |  2334 | 65464 |   565 |    76 |   1000 |     19 |
| 08.08.2019 11:15 |   232 |   343 |   343 |    43 |      0 |      2 |
| 08.08.2019 11:30 |  2323 | rtttt |  3434 |    34 |      0 |      2 |
| 10.08.2019 14:47 |     - |     - |     - |     - |    123 |     23 |
+------------------+-------+-------+-------+-------+--------+--------+
As you can see, I'm trying to get single rows for my primary keys, using the latest timestamp, which is also a primary key.
Currently, I tried a query like:
SELECT Timestamp, COL_A, COL_B, COL_C, COL_D, Data_A, Data_B
FROM XY op
WHERE Timestamp = (
    SELECT MAX(Timestamp) FROM XY AS tsRow
    WHERE op.COL_A = tsRow.COL_A
      AND op.COL_B = tsRow.COL_B
      AND op.COL_C = tsRow.COL_C
      AND op.COL_D = tsRow.COL_D
);
which gives me a result that looks fine at first glance.
Is there a better or safer way to get my preferred result?
demo:db<>fiddle
You can use the DISTINCT ON clause, which gives you the first record of each ordered group. Here your group is (COL_A, COL_B, COL_C, COL_D), ordered by the Timestamp column in descending order so that the most recent record comes first.
SELECT DISTINCT ON ("COL_A", "COL_B", "COL_C", "COL_D")
    *
FROM mytable
ORDER BY "COL_A", "COL_B", "COL_C", "COL_D", "Timestamp" DESC
If you want to get your expected order, you need a second ORDER BY after this operation:
SELECT *
FROM (
    SELECT DISTINCT ON ("COL_A", "COL_B", "COL_C", "COL_D")
        *
    FROM mytable
    ORDER BY "COL_A", "COL_B", "COL_C", "COL_D", "Timestamp" DESC
) s
ORDER BY "Timestamp"
Note: if the Timestamp column is part of the PK, are you sure you really need the four other columns in the PK as well? It seems that the Timestamp column is already unique.
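For comparison, a more portable formulation without DISTINCT ON uses a window function and should return the same rows (an untested sketch):

SELECT "Timestamp", "COL_A", "COL_B", "COL_C", "COL_D", "Data_A", "Data_B"
FROM (
    SELECT *,
           row_number() OVER (PARTITION BY "COL_A", "COL_B", "COL_C", "COL_D"
                              ORDER BY "Timestamp" DESC) AS rn
    FROM mytable
) s
WHERE rn = 1
ORDER BY "Timestamp";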

SQL query help: Calculate max of previous rows in the same query

For each row (where B = C = D = 1), I want to find the max of A among its previous rows (also where B = C = D = 1), excluding the row itself, after the rows are ordered chronologically.
Data in table looks like this:
+-------+-----+-----+-----+------+------+
|Grp id | B   | C   | D   | A    | time |
+-------+-----+-----+-----+------+------+
|   111 |   1 |   0 |   0 |   52 |    t |
|   111 |   1 |   1 |   1 |   33 |  t+1 |
|   111 |   0 |   1 |   0 |   34 |  t+2 |
|   111 |   1 |   1 |   1 |   22 |  t+3 |
|   111 |   0 |   0 |   0 |   12 |  t+4 |
|   222 |   1 |   1 |   1 |   16 |    t |
|   222 |   1 |   0 |   0 |   18 | t2+1 |
|   222 |   1 |   1 |   0 |   13 | t2+2 |
|   222 |   1 |   1 |   1 |   12 | t2+3 |
|   222 |   1 |   1 |   1 |   09 | t2+4 |
|   222 |   1 |   1 |   1 |   22 | t2+5 |
|   222 |   1 |   1 |   1 |   19 | t2+6 |
+-------+-----+-----+-----+------+------+
The table above is the result of the query below; it is obtained via left joins, which are necessary according to my project requirements.
SELECT "Grp id", B, C, D, A, time, xxx
FROM "DCR" dcr
LEFT JOIN "DCM" dcm ON dcr."Id" = dcm."DCRID"
LEFT JOIN "DC" dc ON dc."Id" = dcm."DCID"
ORDER BY dcr."time"
The Result column needs to be evaluated based on the formula mentioned above, and it needs to be calculated in the same pass, since only the previous rows may be considered. The xxx above needs to be replaced by a subquery or expression that produces the result.
And the result table should look like this:
+-------+-----+-----+-----+------+------+------+
|Grp id | B   | C   | D   | A    | time |Result|
+-------+-----+-----+-----+------+------+------+
|   111 |   1 |   0 |   0 |   52 |    t |    - |
|   111 |   1 |   1 |   1 |   33 |  t+1 |    - |
|   111 |   1 |   1 |   1 |   34 |  t+2 |   33 |
|   111 |   1 |   1 |   1 |   22 |  t+3 |   34 |
|   111 |   0 |   0 |   0 |   12 |  t+4 |    - |
|   222 |   1 |   1 |   1 |   16 |    t |    - |
|   222 |   1 |   0 |   0 |   18 | t2+1 |    - |
|   222 |   1 |   1 |   0 |   13 | t2+2 |    - |
|   222 |   1 |   1 |   1 |   12 | t2+3 |   16 |
|   222 |   1 |   1 |   1 |   09 | t2+4 |   16 |
|   222 |   1 |   1 |   1 |   22 | t2+5 |   16 |
|   222 |   1 |   1 |   1 |   19 | t2+6 |   22 |
+-------+-----+-----+-----+------+------+------+
The column could be computed with a window function:
CASE WHEN b = 1 AND c = 1 AND d = 1
     THEN max(a) FILTER (WHERE b = 1 AND c = 1 AND d = 1)
          OVER (PARTITION BY "grp id"
                ORDER BY time
                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
     ELSE NULL
END
I didn't test it.
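For illustration, embedded in a complete query it might look like this (also untested; joined_data stands in for the result of the joins above):

SELECT "grp id", b, c, d, a, time,
       CASE WHEN b = 1 AND c = 1 AND d = 1
            THEN max(a) FILTER (WHERE b = 1 AND c = 1 AND d = 1)
                 OVER (PARTITION BY "grp id"
                       ORDER BY time
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
       END AS result
FROM joined_data
ORDER BY "grp id", time;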

PostgreSQL aggregate function for each row across multiple unknown number of columns

I looked through similar questions like this one, but they seem to assume a fixed number of columns. I want to handle an input table whose number of columns I do not know in advance.
Question:
How can I calculate aggregate functions (e.g. avg() or sum()) for each row across several columns if the number of columns is not known in advance?
I have put the input table panel_stats_rnd as a CSV, and the DDL to create it, here.
For each row, I would like to calculate rnd_avg_parcelcount as the average of all columns c_1_avg_parcelcount, c_2_avg_parcelcount, ..., where the input table can have any number (say 100) of _avg_parcelcount columns. For the column rnd_sum_parcelcount I would like to calculate the sum() of all columns that start with c_ and end with _sum_parcelcount.
The table looks like this:
SELECT * FROM panel_stats_rnd;
gid | d | dist_from | dist_to | distlabel | rnd_avg_parcelcount | rnd_sum_parcelcount | rnd_avg_callcount | rnd_sum_callcount | rnd_avg_perccalled | called_avg_parcelcount | called_sum_parcelcount | called_avg_callcount | called_sum_callcount | called_avg_perccalled | c_1_avg_parcelcount | c_1_sum_parcelcount | c_1_avg_callcount | c_1_sum_callcount | c_1_avg_perccalled | c_2_avg_parcelcount | c_2_sum_parcelcount | c_2_avg_callcount | c_2_sum_callcount | c_2_avg_perccalled
-----+----+-----------+---------+-----------+---------------------+---------------------+-------------------+-------------------+--------------------+------------------------+------------------------+----------------------+----------------------+-----------------------+---------------------+---------------------+-------------------+-------------------+----------------------+---------------------+---------------------+-------------------+-------------------+----------------------
1 | 0 | 0 | 100 | 0-100 | | | | | | 119045 | 119045 | 119045 | 23 | 0.000193204250493511 | 119045 | 119045 | 119045 | 16 | 0.000134402956865051 | 119045 | 119045 | 119045 | 16 | 0.000134402956865051
2 | 1 | 100 | 200 | 100-200 | | | | | | 163140 | 163140 | 163140 | 22 | 0.000134853500061297 | 163140 | 163140 | 163140 | 17 | 0.000104204977320093 | 163140 | 163140 | 163140 | 18 | 0.000110334681868334
3 | 2 | 200 | 300 | 200-300 | | | | | | 135934 | 135934 | 135934 | 10 | 7.3565112481057e-05 | 135934 | 135934 | 135934 | 18 | 0.000132417202465903 | 135934 | 135934 | 135934 | 15 | 0.000110347668721585
4 | 3 | 300 | 400 | 300-400 | | | | | | 116874 | 116874 | 116874 | 13 | 0.000111230898232284 | 116874 | 116874 | 116874 | 11 | 9.41184523503944e-05 | 116874 | 116874 | 116874 | 18 | 0.000154012012937009
5 | 4 | 400 | 500 | 400-500 | | | | | | 93216 | 93216 | 93216 | 12 | 0.000128733264675592 | 93216 | 93216 | 93216 | 10 | 0.000107277720562993 | 93216 | 93216 | 93216 | 12 | 0.000128733264675592
6 | 5 | 500 | 600 | 500-600 | | | | | | 69992 | 69992 | 69992 | 7 | 0.0001000114298777 | 69992 | 69992 | 69992 | 10 | 0.000142873471253858 | 69992 | 69992 | 69992 | 7 | 0.0001000114298777
7 | 6 | 600 | 700 | 600-700 | | | | | | 50816 | 50816 | 50816 | 10 | 0.000196788413098237 | 50816 | 50816 | 50816 | 6 | 0.000118073047858942 | 50816 | 50816 | 50816 | 0 | 0
8 | 7 | 700 | 800 | 700-800 | | | | | | 34814 | 34814 | 34814 | 0 | 0 | 34814 | 34814 | 34814 | 6 | 0.000172344459125639 | 34814 | 34814 | 34814 | 4 | 0.000114896306083759
9 | 8 | 800 | 900 | 800-900 | | | | | | 23023 | 23023 | 23023 | 1 | 4.34348260435217e-05 | 23023 | 23023 | 23023 | 4 | 0.000173739304174087 | 23023 | 23023 | 23023 | 1 | 4.34348260435217e-05
10 | 9 | 900 | 1000 | 900-1000 | | | | | | 14215 | 14215 | 14215 | 1 | 7.03482237073514e-05 | 14215 | 14215 | 14215 | 1 | 7.03482237073514e-05 | 14215 | 14215 | 14215 | 5 | 0.000351741118536757
11 | 10 | 1000 | 5000 | 1000-5000 | | | | | | 23527 | 23527 | 23527 | 0 | 0 | 23527 | 23527 | 23527 | 0 | 0 | 23527 | 23527 | 23527 | 3 | 0.000127513070089684
(11 rows)
I tried the following for 2 columns (it works, but I'd rather not write it out for 100 columns; besides, the number of columns has to be a parameter):
SELECT d,c_1_avg_parcelcount,c_2_avg_parcelcount,
(SELECT avg(c) FROM (VALUES (c_1_avg_parcelcount) , (c_2_avg_parcelcount) ) T (c)) AS Avg_,
(SELECT sum(c) FROM (VALUES (c_1_avg_parcelcount) , (c_2_avg_parcelcount) ) T (c)) AS sum_
FROM panel_stats_rnd;
I also tried the following, but it doesn't work:
WITH cols AS (
select value(column_name) from information_schema.columns
where table_name = 'panel_stats_rnd'
AND column_name SIMILAR TO 'c_%avg_parcelcount'
AND column_name != 'called_avg_parcelcount'
)
SELECT *, (SELECT avg(Col) FROM cols V(Col) ) AS col_average
FROM panel_stats_rnd;
I am almost there but something is missing...
select
    *,
    (select avg(v::numeric)
     from json_each_text(row_to_json(panel_stats_rnd.*)) as j(k, v)
     where k like 'c\_%\_avg\_parcelcount') as rnd_avg_parcelcount,
    (select sum(v::numeric)
     from json_each_text(row_to_json(panel_stats_rnd.*)) as j(k, v)
     where k like 'c\_%\_sum\_parcelcount') as rnd_sum_parcelcount
from panel_stats_rnd;
Look at the documentation for the functions involved.
The underscores are escaped (\_) because, for the LIKE operator, an unescaped _ matches any single character; for example, select 'a' like '_' is true.
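On PostgreSQL 9.5 or later, the same idea can also be written with jsonb (a variant sketch, not from the original answer):

select
    *,
    (select avg(v::numeric)
     from jsonb_each_text(to_jsonb(panel_stats_rnd)) as j(k, v)
     where k like 'c\_%\_avg\_parcelcount') as rnd_avg_parcelcount
from panel_stats_rnd;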