Make postgresql timestamps unique - postgresql

I have a dataset with 6M+ rows including timestamps from about 2003 to the present. In 2014 the database was migrated to PostgreSQL and the timestamp column took over as the unique identifier, relying on the higher precision of timestamps; the original ID column was not migrated. About 300k of the timestamps are repeated at least once. I want to modify the timestamp column so that the values are unique by adding precision (all non-unique timestamps only go to the second).
I have this
ts                  | message
--------------------|---------------
2014-02-01 07:40:37 | message1
2014-02-01 07:40:37 | message2
I want this
ts                       | message
-------------------------|---------------
2014-02-01 07:40:37.0000 | message1
2014-02-01 07:40:37.0001 | message2

This should work, but I guess it will be horribly slow:
update the_table
set ts = ts + '1 millisecond'::interval * x.rn
from (
select ctid, row_number() over (order by ts) as rn
from the_table
) x
where the_table.ctid = x.ctid;
The column ctid is an internal unique identifier (actually the physical address of the row) maintained by Postgres.
You might want to add another where condition to pick out only those rows that need to be modified.
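For example, a sketch of that restriction, assuming (as the question states) that only whole-second values are duplicated and that no single second holds 1000 or more duplicates:
update the_table
set ts = ts + '1 millisecond'::interval * x.rn
from (
    select ctid,
           -- number the rows inside each duplicated second, starting at 0
           row_number() over (partition by ts order by ctid) - 1 as rn
    from the_table
    where ts = date_trunc('second', ts)  -- only whole-second rows can be duplicates here
) x
where the_table.ctid = x.ctid
  and x.rn > 0;  -- leave the first row of each second untouched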

One simple solution is to try to add a random interval to the timestamp:
update t
set ts = ts + random() * interval '1000000 microsecond'
where ts = date_trunc('second', ts)
The chance of a collision is very low. If one occurs, use @a_horse's answer above.
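Either way, a quick check afterwards is cheap; once the timestamps are unique this should return no rows (table name the_table as above):
-- list any timestamps that are still duplicated
select ts, count(*)
from the_table
group by ts
having count(*) > 1;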

Related

Values in rows as column names PostgreSQL

I have a very simple table like the one below:
Events:
Event_name | Event_time
-----------|------------
A          | 2022-02-10
B          | 2022-05-11
C          | 2022-07-17
D          | 2022-10-20
To a table like this are added new events, but we always take the event from the last X days (for example, 30 days), so the query result for this table is changeable.
I would like to transform the above table into this:
A          | B          | C          | D
-----------|------------|------------|-----------
2022-02-10 | 2022-05-11 | 2022-07-17 | 2022-10-20
In general, the number of columns won't be constant, but if that's not possible we can add a limit on the number of columns, for example 10.
I tried with crosstab, but I had to add the column names manually, which is not what I want, and it doesn't work with the CTE query:
WITH CTE AS (
SELECT DISTINCT
1 AS "Id",
event_time,
event_name,
ROW_NUMBER() OVER(ORDER BY event_time) AS nr
FROM
events
WHERE
event_time >= CURRENT_DATE - INTERVAL '31 days')
SELECT *
FROM
crosstab (
'SELECT id, event_name, event_time
FROM
CTE
WHERE
nr <= 10
ORDER BY
nr') AS ct(id int,
event_name text,
EventTime1 timestamp,
EventTime2 timestamp,
EventTime3 timestamp,
EventTime4 timestamp,
EventTime5 timestamp,
EventTime6 timestamp,
EventTime7 timestamp,
EventTime8 timestamp,
EventTime9 timestamp,
EventTime10 timestamp)
This query will be used as the data source in Tableau (data visualization and analysis software), so it would be great if it could be one query (without temp tables, adding new functions, etc.).
Thanks!
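For what it's worth, crosstab() (from the tablefunc extension) takes its source query as a text string, so a CTE cannot wrap the crosstab call itself, but it can be inlined inside that string. A sketch with illustrative output column names (it still does not make the column list dynamic, which crosstab cannot do on its own):
-- requires: CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT *
FROM crosstab($$
    WITH cte AS (
        SELECT 1 AS id,
               event_name,
               event_time,
               ROW_NUMBER() OVER (ORDER BY event_time) AS nr
        FROM events
        WHERE event_time >= CURRENT_DATE - INTERVAL '31 days'
    )
    SELECT id, event_name, event_time
    FROM cte
    WHERE nr <= 10
    ORDER BY nr
$$) AS ct(id int,
          event_1 timestamp, event_2 timestamp, event_3 timestamp,
          event_4 timestamp, event_5 timestamp, event_6 timestamp,
          event_7 timestamp, event_8 timestamp, event_9 timestamp,
          event_10 timestamp);
If a time window returns fewer than 10 events, the remaining output columns simply come back as NULL.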

How to get the timestamp associated with a percentile(x) value using TimescaleDB time_bucket

I need to find the percentile(50) value and its timestamp using the TimescaleDB time_bucket. Finding P50 is easy, but I don't know how to get the timestamp.
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
I think what you're looking for can be done by selecting the rows where int_val is equal to the median value in a LATERAL subquery (percentile_disc does ensure that there is a value exactly equal to that value; there may be more than one, and you can handle the more-than-one case in different ways, depending on what you want). Building on a previous answer and making it work a bit better, I think it would look something like this:
WITH p50 AS (
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
) SELECT p50.*, rmed.*
FROM p50, LATERAL (SELECT * FROM timeseries.raw r
-- copy over the same where clause from above so we're dealing with the same subset of data
WHERE timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
-- add a where clause on the median value
AND r.int_val = p50.medianVal
-- now add a where clause to account for the time bucket
AND r.timestamp_utc >= p50.interval_size
AND r.timestamp_utc < p50.interval_size + '120 sec'::interval
-- Can add an order by something desc limit 1 if you want to avoid ties
) rmed;
Note that this will do a second scan of the table. It should be reasonably efficient, especially if you have an index on that column, but it will cause another scan; there isn't a great way that I know of to do it without one.

Postgres join vs aggregation on very large partitioned tables

I have a large table with hundreds of millions of rows. Because it is so big, it is partitioned by date range first, and then each of those partitions is also partitioned by a period_id.
CREATE TABLE research.ranks
(
security_id integer NOT NULL,
period_id smallint NOT NULL,
classificationtype_id smallint NOT NULL,
dtz timestamp with time zone NOT NULL,
create_dt timestamp with time zone NOT NULL DEFAULT now(),
update_dt timestamp with time zone NOT NULL DEFAULT now(),
rank_1 smallint,
rank_2 smallint,
rank_3 smallint
)
PARTITION BY RANGE (dtz);
CREATE TABLE zpart.ranks_y1990 PARTITION OF research.ranks
FOR VALUES FROM ('1990-01-01 00:00:00+00') TO ('1991-01-01 00:00:00+00')
PARTITION BY LIST (period_id);
CREATE TABLE zpart.ranks_y1990p1 PARTITION OF zpart.ranks_y1990
FOR VALUES IN ('1');
Every year has a partition, and there are another dozen partitions for each year.
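The rest of the layout continues in the same pattern, for example (illustrative, not the exact DDL):
-- another period partition for 1990
CREATE TABLE zpart.ranks_y1990p2 PARTITION OF zpart.ranks_y1990
    FOR VALUES IN ('2');
-- and another year-range partition, itself list-partitioned by period_id
CREATE TABLE zpart.ranks_y1991 PARTITION OF research.ranks
    FOR VALUES FROM ('1991-01-01 00:00:00+00') TO ('1992-01-01 00:00:00+00')
    PARTITION BY LIST (period_id);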
I needed to see the result for security_ids side by side for different period_ids.
So the join I initially used was one like this:
select c1.security_id, c1.dtz,c1.rank_2 as rank_2_1, c9.rank_2 as rank_2_9
from research.ranks c1
left join research.ranks c9 on c1.dtz = c9.dtz and c1.security_id = c9.security_id and c9.period_id = 9
where c1.period_id = 1 and c1.dtz > now() - interval '10 years'
which was slow, but acceptable. I'll call this the JOIN version.
Then, we wanted to show two more period_ids and extended the above to add additional joins on the new period_ids.
This slowed down the join enough for us to look at a different solution.
We found that the following type of query runs about 6 or 7 times faster:
select c1.security_id, c1.dtz
,sum(case when c1.period_id=1 then c1.rank_2 end) as rank_2_1
,sum(case when c1.period_id=9 then c1.rank_2 end) as rank_2_9
,sum(case when c1.period_id=11 then c1.rank_2 end) as rank_2_11
,sum(case when c1.period_id=14 then c1.rank_2 end) as rank_2_14
from research.ranks c1
where c1.period_id in (1, 11, 14, 9) and c1.dtz > now() - interval '10 years'
group by c1.security_id, c1.dtz;
We can use the sum because the table has unique indexes so we know there will only ever be one record that is being "summed". I'll call this the SUM version.
The speed is so much better that I'm questioning half of the code I have written previously! Two questions:
Should I be trying to use the SUM version rather than the JOIN version everywhere or is the efficiency likely to be a factor of the specific structure and not likely to be as useful in other circumstances?
Is there a problem with the logic of the SUM version in cases that I haven't considered?
To be honest, I don't think your "join" version was ever a good idea anyway. You only have one (partitioned) table so there never was a need for any join.
SUM() is the way to go, but I would use SUM(...) FILTER (WHERE ...) instead of a CASE:
SELECT
security_id,
dtz,
SUM(rank_2) FILTER (WHERE period_id = 1) AS rank_2_1,
SUM(rank_2) FILTER (WHERE period_id = 9) AS rank_2_9,
SUM(rank_2) FILTER (WHERE period_id = 11) AS rank_2_11,
SUM(rank_2) FILTER (WHERE period_id = 14) AS rank_2_14
FROM
research.ranks
WHERE
period_id IN ( 1, 11, 14, 9 )
AND dtz > now( ) - INTERVAL '10 years'
GROUP BY
security_id,
dtz;

Postgres 9.3 count rows matching a column relative to row's timestamp

I've used WINDOW functions before but only when working with data that has a fixed cadence/interval. I am likely missing something simple in aggregation but I've never had a scenario where I'm not working with fixed intervals.
I have a table that records samples at arbitrary timestamps. A sample is only recorded when it is a delta from the previous sample, and the sample rate is completely irregular due to a large number of conditions. The table is very simple:
id (int)
happened_at (timestamp)
sensor_id (int)
new_value (float)
I'm trying to construct a query that will include a count of all of the samples before the happened_at of a given result row. So given an ultra simple 2 row sample data set:
id | happened_at      | sensor_id | new_value
1  | 2019-06-07:21:41 | 134679    | 123.331
2  | 2019-06-07:19:00 | 134679    | 100.009
I'd like the result set to look like this:
happened_at      | sensor_id | new_value | sample_count
2019-06-07:21:41 | 134679    | 123.331   | 2
2019-06-07:19:00 | 134679    | 100.009   | 1
I've tried:
SELECT *,
(SELECT count(sample_history.id) OVER (PARTITION BY score_history.sensor_id
ORDER BY sample_history.happened_at DESC))
FROM sensor_history
ORDER by happened_at DESC
and the, duh, not-going-to-work:
(SELECT count(*)
FROM sample_history
WHERE sample_history.happened_at <= sample_timestamp)
Insights greatly appreciated.
Get rid of the SELECT (sub-query) when using the window function.
SELECT *,
count(*) OVER (PARTITION BY sensor_id ORDER BY happened_at) AS sample_count
FROM sensor_history
ORDER BY happened_at DESC

Postgres: find the minimum difference between timestamp rows

I have users, they run and I have a timestamp record of each lap.
I'd like to sort users by fastest lap.
The idea is to use the "lag" function of Postgres, with a "GROUP BY user_id".
I tried something, but I don't know "lag" well; moreover, I can't get a window function (lag) to work with an aggregate function (min).
http://sqlfiddle.com/#!15/ec9dc/6
Do you have an idea ?
Thanks
If you want to order the users by their lap durations, you have to partition the rows by the column user_id. Otherwise the lap times of different users will be mixed.
I've taken your example code and modified it a bit: I've removed one timestamp, which was inserted twice, and I introduced a view which contains the window functions rank() and lag(). The former is used to calculate the lap number per user and the latter to determine the preceding timestamp for the current one.
BEGIN;
CREATE TABLE lap
(
user_id text,
timestamp timestamp without time zone
);
INSERT INTO lap VALUES
('a', '2015-08-20 16:14:30.568'),
('a', '2015-08-20 16:16:13.06'),
('b', '2015-08-20 16:16:18.06'),
('b', '2015-08-20 16:16:25.63'),
('b', '2015-08-20 16:17:10.568'),
('a', '2015-08-20 16:17:25.63'),
--('a', '2015-08-20 16:17:34.087'), -- Timestamp was inserted twice.
('a', '2015-08-20 16:17:34.087');
CREATE OR REPLACE VIEW user_lap_duration AS
SELECT
user_id,
-1+rank() OVER w AS lap_nr,
lag(timestamp) OVER w AS timestamp_prev,
timestamp,
timestamp - lag(timestamp)
OVER w
AS lap_duration
FROM
lap
WINDOW w AS (PARTITION BY user_id ORDER BY timestamp);
SELECT
user_id,
lap_nr,
lap_duration
FROM
user_lap_duration
WHERE
lap_nr <> 0
ORDER BY
lap_duration;
ROLLBACK;
Running above code yields the following output.
user_id | lap_nr | lap_duration
---------+--------+--------------
b | 1 | 00:00:07.57
a | 3 | 00:00:08.457
b | 2 | 00:00:44.938
a | 2 | 00:01:12.57
a | 1 | 00:01:42.492
(5 rows)
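To get back to the original goal (one row per user, ordered by their fastest lap), the view can simply be aggregated; a minimal sketch:
SELECT
    user_id,
    min(lap_duration) AS fastest_lap
FROM
    user_lap_duration
WHERE
    lap_nr <> 0        -- skip the row before each user's first completed lap
GROUP BY
    user_id
ORDER BY
    fastest_lap;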