I've got a load of time series data about a fleet of batteries stored in TimescaleDB, recording the 'state of charge' of each battery at each time. I don't have measurements of the in- and out-flow, only the instantaneous state of charge.
From this data, I want to find the change in state of charge at each time, which I will later bucket into consumption across hours (after doing some battery-specific maths).
I've written an SQL query which achieves my goal:
SELECT time, charge - LAG(charge) OVER (PARTITION BY batt_uid ORDER BY time) AS delta_soc FROM charge_data;
Putting that in a Postgres generated column:
ALTER TABLE charge_data ADD COLUMN delta_soc smallint GENERATED ALWAYS AS (charge - LAG(charge) OVER (ORDER BY time)) STORED;
Fails, as promised in the docs, because it references another row.
So, I (successfully) made a materialized view:
CREATE MATERIALIZED VIEW delta_soc AS
SELECT
    time,
    batt_uid,
    charge,
    charge - LAG(charge) OVER (PARTITION BY batt_uid ORDER BY time) AS delta_charge,
    EXTRACT(EPOCH FROM time - LAG(time) OVER (PARTITION BY batt_uid ORDER BY time)) AS delta_time
FROM charge_data
ORDER BY time;
But it would be nice to have this data in near-realtime. After all, it's a "simple" operation to just provide the change from the last value. So, I looked at Timescale's continuous aggregates. But, as the docs state, window functions are not allowed in a continuous aggregate, so the continuous aggregate is invalid.
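For illustration, a rejected attempt might look like the sketch below (the view name and bucket width are my assumptions). TimescaleDB's first()/last() aggregates can approximate a per-bucket change, though they ignore the change across bucket boundaries:
CREATE MATERIALIZED VIEW delta_soc_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    batt_uid,
    -- charge - LAG(charge) OVER (ORDER BY time)   -- rejected: window functions not allowed
    last(charge, time) - first(charge, time) AS delta_charge  -- in-bucket change only
FROM charge_data
GROUP BY bucket, batt_uid;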
Then, just throwing things at the wall and seeing what sticks, I wondered if I could reference the previous row during insertion:
INSERT INTO charge_data VALUES (..., ([$chargevalue]-LAG(charge) OVER (ORDER BY time)), ...);
This fails with:
HINT: There is a column named "charge" in table "mx_data", but it cannot be referenced from this part of the query.
I'm aware I could calculate the deltas:
- before insertion
- after insertion, by modifying each charge_data row with its delta
- in the SQL query
- in the querying program
But it seems much simpler and tidier to just have the DB calculate the values once at/around insertion, leading me to suspect I'm missing something. Is there any way to have charge[battery][n] - charge[battery][n-1] calculated and stored for every row in near-realtime in Timescale?
I think a BEFORE INSERT trigger would work fine here: you can set the delta while inserting, using the previous row for the same battery as the reference.
CREATE TABLE batteries (time timestamp NOT NULL, batt_uid varchar, charge int, delta int);
SELECT create_hypertable('batteries', 'time');

CREATE OR REPLACE FUNCTION update_delta() RETURNS trigger AS
$BODY$
DECLARE
    previous_charge integer;
BEGIN
    -- Fetch the most recent charge reading for this battery.
    SELECT charge
    INTO previous_charge
    FROM batteries
    WHERE batt_uid = NEW.batt_uid
    ORDER BY time DESC
    LIMIT 1;

    IF NEW.charge IS NOT NULL THEN
        IF previous_charge IS NOT NULL THEN
            NEW.delta = NEW.charge - previous_charge;
        ELSE
            -- First reading for this battery: no previous row to diff against.
            NEW.delta = 0;
        END IF;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER update_delta_on_insert
    BEFORE INSERT ON batteries
    FOR EACH ROW
    EXECUTE PROCEDURE update_delta();
Testing
INSERT INTO batteries VALUES
('2021-08-26 10:09:00'::timestamp, 'battery-1', 32),
('2021-08-26 10:09:01'::timestamp, 'battery-1', 34),
('2021-08-26 10:09:02'::timestamp, 'battery-1', 38);
INSERT INTO batteries VALUES
('2021-08-26 10:09:00'::timestamp, 'battery-2', 0),
('2021-08-26 10:09:01'::timestamp, 'battery-2', 4),
('2021-08-26 10:09:02'::timestamp, 'battery-2', 28),
('2021-08-26 10:09:03'::timestamp, 'battery-2', 32),
('2021-08-26 10:09:04'::timestamp, 'battery-2', 28);
Output from:
SELECT * FROM batteries;
┌─────────────────────┬───────────┬────────┬───────┐
│ time │ batt_uid │ charge │ delta │
├─────────────────────┼───────────┼────────┼───────┤
│ 2021-08-26 10:09:00 │ battery-1 │ 32 │ 0 │
│ 2021-08-26 10:09:01 │ battery-1 │ 34 │ 2 │
│ 2021-08-26 10:09:02 │ battery-1 │ 38 │ 4 │
│ 2021-08-26 10:09:00 │ battery-2 │ 0 │ 0 │
│ 2021-08-26 10:09:01 │ battery-2 │ 4 │ 4 │
│ 2021-08-26 10:09:02 │ battery-2 │ 28 │ 24 │
│ 2021-08-26 10:09:03 │ battery-2 │ 32 │ 4 │
│ 2021-08-26 10:09:04 │ battery-2 │ 28 │ -4 │
└─────────────────────┴───────────┴────────┴───────┘
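One practical note: the trigger looks up the latest row for the inserted battery on every insert, so an index covering that lookup is worth adding (a suggestion on my part; by default create_hypertable indexes only the time column):
-- Supports the trigger's "latest reading for this battery" lookup.
CREATE INDEX ON batteries (batt_uid, time DESC);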
I have 2 action tables, one specific and one general, each holding a status and its related actions.
In the first table, some rows (based on status) are missing.
I am trying to return a combined table that picks rows from the second table (default_actions) whenever the corresponding row is missing in the first one.
Job table: job_actions
Default table: default_actions
I am using the following set for testing:
CREATE TYPE STATUS AS ENUM('In Progress', 'Failed', 'Completed');
CREATE TYPE EXPIRATION_ACTION AS ENUM('Expire', 'Delete');
CREATE TYPE BASIC_ACTION AS (
    status STATUS,
    operation EXPIRATION_ACTION,
    expiration_time TIMESTAMP
);
CREATE TYPE ACTION AS (
    partition VARCHAR(40),
    job VARCHAR(48),
    b_action BASIC_ACTION
);
CREATE TABLE IF NOT EXISTS job_actions (
partition VARCHAR(40),
job VARCHAR(48),
status STATUS,
operation EXPIRATION_ACTION,
expiration_time TIMESTAMP
);
CREATE TABLE IF NOT EXISTS default_actions OF BASIC_ACTION;
INSERT INTO default_actions (
status,
operation,
expiration_time
)
VALUES ('In Progress', 'Expire', 'infinity'::timestamp),
('Failed', 'Expire', 'infinity'::timestamp),
('Completed', 'Expire', 'infinity'::timestamp);
INSERT INTO job_actions (
    partition,
    job,
    status,
    operation,
    expiration_time
)
VALUES
('part1', 'job1','Failed', 'Expire', NOW() + INTERVAL '1 hour'),
('part1', 'job2','In Progress', 'Expire', NOW() + INTERVAL '1 hour'),
('part1', 'job2','Failed', 'Expire', NOW() + INTERVAL '1 hour'),
('part1', 'job3','In Progress', 'Expire', NOW() + INTERVAL '1 hour'),
('part1', 'job3','Failed', 'Expire', NOW() + INTERVAL '1 hour');
I am trying to use something like
SELECT ja.partition, ja.job, ja.status, ja.operation, ja.expiration_time
FROM job_actions ja
WHERE NOT EXISTS (
SELECT da.status, da.operation, da.expiration_time
FROM default_actions da );
But at the moment, it returns an empty table.
The expected result is one row per (partition, job, status) combination, with the default action's expiration_time filled in wherever job_actions has no matching row.
Would anyone know what I am doing wrong?
Your NOT EXISTS subquery is not correlated with the outer row; since default_actions always contains rows, the condition is false for every row, which is why you get an empty result. To build the table you want: first, get all partitions and jobs from job_actions. Then cross join with default_actions to get all possible combinations. Left join that with job_actions and take the expiration_time from there unless it is NULL (no matching row was found).
Translated into SQL:
SELECT partition, job, status, operation,
coalesce(ja.expiration_time, da.expiration_time) AS expiration_time
FROM (SELECT DISTINCT partition, job
FROM job_actions) AS jobs
CROSS JOIN default_actions AS da
LEFT JOIN job_actions AS ja USING (partition, job, status, operation)
ORDER BY partition, job, status;
partition │ job │ status │ operation │ expiration_time
═══════════╪══════╪═════════════╪═══════════╪════════════════════════════
part1 │ job1 │ In Progress │ Expire │ infinity
part1 │ job1 │ Failed │ Expire │ 2021-06-18 14:57:23.912874
part1 │ job1 │ Completed │ Expire │ infinity
part1 │ job2 │ In Progress │ Expire │ 2021-06-18 14:57:23.912874
part1 │ job2 │ Failed │ Expire │ 2021-06-18 14:57:23.912874
part1 │ job2 │ Completed │ Expire │ infinity
part1 │ job3 │ In Progress │ Expire │ 2021-06-18 14:57:23.912874
part1 │ job3 │ Failed │ Expire │ 2021-06-18 14:57:23.912874
part1 │ job3 │ Completed │ Expire │ infinity
(9 rows)
How do I format the output of a query to display a value in numbers of millions (i.e. with 'million' appended, e.g. 1 million instead of 1000000) using psql?
Example:
SELECT city, population
FROM cities
WHERE state = 'California';
Actual Output:
city | population
---------------+--------------------------
Los Angeles | 3990456
San Diego | 1425976
San Jose | 1030119
Desired Output:
city | population
---------------+--------------------------
Los Angeles | 3.99 million
San Diego | 1.43 million
San Jose | 1.03 million
All I could find on the topic are the data type formatting functions for converting numbers/dates to strings and vice versa: https://www.docs4dev.com/docs/en/postgre-sql/11.2/reference/functions-formatting.html
Also, the to_char function does not seem to perform this sort of formatting: https://www.postgresqltutorial.com/postgresql-to_char/
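For example, to_char can add digit grouping, but nothing that renders a 'million' suffix (my illustration; G is the locale-dependent group separator):
SELECT to_char(3990456, 'FM9G999G999');  -- 3,990,456, still not "3.99 million"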
Using psql and PostgreSQL version 13 on macOS terminal.
There is no built-in functionality for this purpose. You have to write your own custom function:
CREATE OR REPLACE FUNCTION format_mil(n int)
RETURNS text AS $$
BEGIN
IF n > 500000 THEN
RETURN (n / 1000000.0)::numeric(10,2) || ' million';
ELSE
RETURN n::text;
END IF;
END;
$$ LANGUAGE plpgsql;
postgres=# select format_mil(3990456);
┌──────────────┐
│ format_mil │
╞══════════════╡
│ 3.99 million │
└──────────────┘
(1 row)
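Applied to the query from the question, it could look like this (a sketch assuming the cities table from the question, with its city and state columns):
SELECT city, format_mil(population) AS population
FROM cities
WHERE state = 'California';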
But a simple SQL expression can be good enough too:
CREATE TABLE cities (population int);
INSERT INTO cities VALUES(3990456);
INSERT INTO cities VALUES(1425976);
postgres=# SELECT CASE WHEN population > 500000 THEN
(population/1000000.0)::numeric(10,2) || ' million'
ELSE population::text END
FROM cities;
┌──────────────┐
│ population │
╞══════════════╡
│ 3.99 million │
│ 1.43 million │
└──────────────┘
(2 rows)
Hello, I am a beginner at SQL, especially PostgreSQL.
I have a table that looks something like this:
ID | Entity   | Startdate  | Enddate
---+----------+------------+-----------
 1 | Hospital | 2013-01-01 | 2013-01-31
 1 | Clinic   | 2013-02-01 | 2013-04-30
 1 | Hospital | 2013-05-01 | 2013-05-31
What I would like to do is, where the start and end date span more than a month, break the row out by month, so the above table would look like this:
ID | Entity   | Startdate  | Enddate
---+----------+------------+-----------
 1 | Hospital | 2013-01-01 | 2013-01-31
 1 | Clinic   | 2013-02-01 | 2013-02-28
 1 | Clinic   | 2013-03-01 | 2013-03-31
 1 | Clinic   | 2013-04-01 | 2013-04-30
 1 | Hospital | 2013-05-01 | 2013-05-31
Notice that rows 2, 3 and 4 have been broken down by month, with the ID and entity duplicated.
Any suggestions on how to run this in postgresql would be appreciated.
One way to do this is to create yourself an end_of_month function like this:
CREATE FUNCTION end_of_month(date)
RETURNS date AS
$BODY$
select (date_trunc('month', $1) + interval '1 month' - interval '1 day')::date;
$BODY$
LANGUAGE sql IMMUTABLE STRICT
COST 100;
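A quick sanity check of the function (my example):
SELECT end_of_month('2013-02-15'::date);  -- 2013-02-28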
Then you can have a string of UNIONS like this:
SELECT
  id,
  entity,
  startdate,
  least(end_of_month(startdate), enddate) AS enddate
FROM hospital
UNION
SELECT
  id,
  entity,
  startdate,
  least(end_of_month((startdate + interval '1 month')::date), enddate) AS enddate
FROM hospital
UNION
SELECT
  id,
  entity,
  startdate,
  least(end_of_month((startdate + interval '2 month')::date), enddate) AS enddate
FROM hospital
ORDER BY startdate, enddate
The problem with this approach is that you need as many UNIONs as the longest span requires!
The alternative is to use a cursor.
EDIT
Just thought of another (better) non-cursor solution. Create a table of month-end dates. Then you can simply do:
SELECT h.id,
       h.entity,
       h.startdate,
       least(h.enddate, m.enddate) AS enddate
FROM hospital h
INNER JOIN monthends m
   ON m.enddate > h.startdate AND m.enddate <= end_of_month(h.enddate)
ORDER BY startdate, enddate
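For completeness, the monthends table could be generated with the same end_of_month function (a sketch; the covered date range is an assumption):
-- One month-end date per month of 2013; widen the range as needed.
CREATE TABLE monthends AS
SELECT end_of_month(d::date) AS enddate
FROM generate_series(date '2013-01-01', date '2013-12-31', interval '1 month') AS d;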
Here is an example of how to clone rows based on their data:
-- Demo data begin
with t(i,x,y) as (values
(1, '2013-02-03'::date, '2013-04-27'::date),
(2, current_date, current_date))
-- Demo data end
select
*,
greatest(x, z)::date as x1, least(y, z + '1 month - 1 day'::interval)::date as y1
from
t,
generate_series(date_trunc('month', x)::date, date_trunc('month', y)::date, '1 month') as z;
┌───┬────────────┬────────────┬────────────────────────┬────────────┬────────────┐
│ i │ x │ y │ z │ x1 │ y1 │
╞═══╪════════════╪════════════╪════════════════════════╪════════════╪════════════╡
│ 1 │ 2013-02-03 │ 2013-04-27 │ 2013-02-01 00:00:00+02 │ 2013-02-03 │ 2013-02-28 │
│ 1 │ 2013-02-03 │ 2013-04-27 │ 2013-03-01 00:00:00+02 │ 2013-03-01 │ 2013-03-31 │
│ 1 │ 2013-02-03 │ 2013-04-27 │ 2013-04-01 00:00:00+03 │ 2013-04-01 │ 2013-04-27 │
│ 2 │ 2017-08-27 │ 2017-08-27 │ 2017-08-01 00:00:00+03 │ 2017-08-27 │ 2017-08-27 │
└───┴────────────┴────────────┴────────────────────────┴────────────┴────────────┘
Just remove the demo data block and replace t, x and y with your table/column names.
Explanation (each piece is illustrated below):
- least() and greatest() return the smallest and largest of their arguments, respectively.
- generate_series(v1, v2, d) returns a series of values starting at v1, not greater than v2, in steps of d.
- '1 month - 1 day'::interval is interval notation; <value>::<datatype> is an explicit cast, equivalent to the SQL-standard cast(<value> as <datatype>).
- date_trunc() truncates a date/timestamp value to the specified precision.
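A few one-liners demonstrating each piece (my examples):
SELECT least(1, 2), greatest(1, 2);                      -- 1, 2
SELECT generate_series(1, 5, 2);                         -- 1, 3, 5
SELECT date_trunc('month', timestamp '2013-02-03');      -- 2013-02-01 00:00:00
SELECT date '2013-02-01' + '1 month - 1 day'::interval;  -- 2013-02-28 00:00:00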
How do I get the median of the val column from table test, considering only values greater than 20?
id | val
---+--------
 1 |   5.43
 2 | 106.26
 3 |  14.00
 4 |  39.58
 5 |  27.00
In this case output would be median(27.00, 39.58, 106.26) = 39.58.
I am using PostgreSQL database.
Any help would be much appreciated.
From PostgreSQL 9.4 you can use ordered-set aggregates:
postgres=# SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY val)
FROM test WHERE val > 20;
┌─────────────────┐
│ percentile_cont │
╞═════════════════╡
│ 39.58 │
└─────────────────┘
(1 row)
or, with more modern syntax, a FILTER clause:
postgres=# SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY val)
FILTER (WHERE val > 20)
FROM test;
┌─────────────────┐
│ percentile_cont │
╞═════════════════╡
│ 39.58 │
└─────────────────┘
(1 row)
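One caveat worth knowing (my addition): with an even number of rows, percentile_cont interpolates between the two middle values, while percentile_disc always returns a value present in the data:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS cont,  -- 2.5 (interpolated)
       percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS disc   -- 2.0 (an actual value)
FROM (VALUES (1.0), (2.0), (3.0), (4.0)) AS t(v);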
I have a list of calls per user, sometimes separated by only minutes. Users can buy something in these calls or not.
When a user makes a call within 45 minutes of their last call, I need to treat it as part of the same call as the first one.
I need to get the final number of calls (aggregating calls separated by less than 45 minutes)
and the number of calls in which they bought something, per user.
So for example, I have a list like this:
buyer   timestamp        bougth_flag
tom     20150201  9:15   1
anna    20150201  9:25   0
tom     20150201 10:15   0
tom     20150201 10:45   1
tom     20150201 10:48   1
anna    20150201 11:50   0
tom     20150201 11:52   0
anna    20150201 11:54   0
The final table would be:
buyer   time_started     calls   articles_bought
tom     20150201  9:15   1       1
anna    20150201  9:25   1       0
tom     20150201 10:15   3       2
anna    20150201 11:50   2       0
tom     20150201 11:52   1       0
So, I need to merge rows separated by less than 45 minutes, still keeping users separate.
This is very easy to do with a loop, but I don't have loops or functions/procedures available in the PostgreSQL environment I am using.
Any ideas about how to do it?
Thank you
Since you do not know beforehand how long a "call" is going to be (you could have a call from some buyer every 30 minutes for the full day; see the comment to the question), you can only solve this with a recursive CTE. (Note that I changed your column 'timestamp' to 'ts', and spell the flag column bought_flag. Never use a keyword as a table or column name.)
WITH conversations AS (
WITH RECURSIVE calls AS (
SELECT buyer, ts, bought_flag, row_number() OVER (ORDER BY ts) AS conversation, 1::int AS calls
FROM (
SELECT buyer, ts, lag(ts) OVER (PARTITION BY buyer ORDER BY ts) AS lag, bought_flag
FROM list) sub
WHERE lag IS NULL OR ts - lag > interval '45 minutes'
UNION ALL
SELECT l.buyer, l.ts, l.bought_flag, c.conversation, c.calls + 1
FROM list l
JOIN calls c ON c.buyer = l.buyer AND l.ts > c.ts
WHERE l.ts - c.ts < interval '45 minutes'
)
SELECT buyer, ts, bought_flag, conversation, max(calls) AS calls
FROM calls
GROUP BY buyer, ts, bought_flag, conversation
order by conversation, ts
)
SELECT buyer, min(ts) AS time_started, max(calls) AS calls, sum(bought_flag) AS articles_bought
FROM conversations
GROUP BY buyer, conversation
ORDER BY time_started
A few words of explanation:
The starting term of the inner recursive CTE has a sub-query that gets the basic data from the table for every call, together with the time of the previous call. The main query in the starting term keeps only those rows where there is no previous call (lag IS NULL) or where the previous call is more than 45 minutes away. These are therefore the initial calls of what I term a "conversation" here. Each conversation gets an id, which is just the row number from the query, and a "calls" column to track the number of calls in the conversation.
In the recursive term successive calls in the same conversation are added, with the "calls" counter incremented.
When calls are very close together (such as 10:45 and 10:48 after 10:15), the later calls may be included multiple times with different call counters. Those duplicates (10:48) are collapsed in the outer CTE, which groups by call and conversation and keeps the highest counter.
In the main query, finally, the 'bought_flag' column is summed for every conversation of every buyer.
The big problem is that you need to group your results per 45 minutes, which makes it tricky. This query is a nice starting point, but it's not completely correct. It should help you get going though:
SELECT a.buyer,
MIN(a.timestamp),
COUNT(a),
COUNT(b),
SUM(a.bougth_flag),
SUM(b.bougth_flag)
FROM calls a
LEFT JOIN calls b ON (a.buyer = b.buyer
AND a.timestamp != b.timestamp
AND a.timestamp < b.timestamp
AND a.timestamp + '45 minutes'::INTERVAL > b.timestamp)
GROUP BY a.buyer,
DATE_TRUNC('hour', a.timestamp) ;
Results:
┌───────┬─────────────────────┬───────┬───────┬─────┬─────┐
│ buyer │ min │ count │ count │ sum │ sum │
├───────┼─────────────────────┼───────┼───────┼─────┼─────┤
│ tom │ 2015-02-01 11:52:00 │ 1 │ 0 │ 0 │ Ø │
│ anna │ 2015-02-01 11:50:00 │ 2 │ 1 │ 0 │ 0 │
│ anna │ 2015-02-01 09:25:00 │ 1 │ 0 │ 0 │ Ø │
│ tom │ 2015-02-01 09:15:00 │ 1 │ 0 │ 1 │ Ø │
│ tom │ 2015-02-01 10:15:00 │ 4 │ 3 │ 2 │ 3 │
└───────┴─────────────────────┴───────┴───────┴─────┴─────┘
Thanks Patrick for the notice about the original version.
You definitely need window functions here, but a recursive CTE is optional:
WITH start_points AS (
    SELECT tmp.*,
           -- distance to the next start point for the same buyer
           (lead(ts) OVER w) - ts AS start_point_lead
    FROM (
        SELECT t.*, ts - (lag(ts) OVER w) AS lag
        FROM test t
        WINDOW w AS (PARTITION BY buyer ORDER BY ts)
    ) tmp
    WHERE lag IS NULL OR lag > interval '45 minutes'
    WINDOW w AS (PARTITION BY buyer ORDER BY ts)
    ORDER BY ts
)
SELECT s.buyer, s.ts, count(*), sum(t.bougth_flag)
FROM start_points s
JOIN test t
  ON t.buyer = s.buyer
 AND t.ts >= s.ts
 AND (t.ts - s.ts < s.start_point_lead OR s.start_point_lead IS NULL)
GROUP BY s.buyer, s.ts
ORDER BY s.ts;
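For reference, the question's sample data can be loaded like this to try the query (my sketch; it keeps the table name test, the column ts, and the question's bougth_flag spelling, which the query above assumes):
CREATE TABLE test (buyer text, ts timestamp, bougth_flag int);
INSERT INTO test VALUES
    ('tom',  '2015-02-01 09:15', 1),
    ('anna', '2015-02-01 09:25', 0),
    ('tom',  '2015-02-01 10:15', 0),
    ('tom',  '2015-02-01 10:45', 1),
    ('tom',  '2015-02-01 10:48', 1),
    ('anna', '2015-02-01 11:50', 0),
    ('tom',  '2015-02-01 11:52', 0),
    ('anna', '2015-02-01 11:54', 0);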