Will I get the benefits of a hypertable if I join a hypertable with a normal (non-hyper) table in TimescaleDB - PostgreSQL

I have to fetch records from two tables: one is a hypertable and the other is a normal table.
The hypertable's primary key (a UUID, not a timestamptz column) is used as a foreign key in the second, normal table.
The hypertable has a one-to-many relationship with the normal table.
Will I still get all the benefits of the hypertable if I select records by joining these tables?
I am using a PostgreSQL database with TimescaleDB.
Below are the CREATE TABLE statements. demography_person is the hypertable and emotions_person is the normal table:
CREATE TABLE public.demography_person
(
    start_timestamp timestamp with time zone NOT NULL,
    end_timestamp timestamp with time zone,
    demography_person_id character varying NOT NULL,
    device_id bigint,
    age_actual numeric,
    age_band integer,
    gender integer,
    dwell_time_in_millis bigint,
    customer_id bigint NOT NULL
);

SELECT create_hypertable('demography_person', 'start_timestamp');

CREATE TABLE public.emotions_person
(
    emotion_start_timestamp timestamp with time zone NOT NULL,
    demography_person_id character varying NOT NULL,
    count integer,
    emotion integer,
    emotion_percentage numeric
);
The SELECT SQL query looks like this:
SELECT * FROM crosstab
(
$$
SELECT * FROM ( select to_char(dur,'HH24') as duration , dur as time_for_sorting from
generate_series(
timestamp '2019-04-01 00:00:00',
timestamp '2020-03-09 23:59:59' ,
interval '1 hour'
) as dur ) d
LEFT JOIN (
select to_char(
start_timestamp ,
'HH24'
)
as duration,
emotion,count(*) as count from demography_person dp INNER JOIN (
select distinct ON (demography_person_id) demography_person_id, emotion_start_timestamp,count,emotion,emotion_percentage,
(CASE emotion when 4 THEN 1 when 6 THEN 2 when 1 THEN 3 WHEN 3 THEN 4 WHEN 2 THEN 5 when 7 THEN 6 when 5 THEN 7 ELSE 8 END )
as emotion_key_for_sorting from emotions_person where demography_person_id in (select demography_person_id from demography_person where start_timestamp >= '2019-04-01 00:00:00'
AND start_timestamp <= '2020-03-09 23:59:59' AND device_id IN ( 2052,2692,1797,2695,1928,2697,2698,1931,2574,2575,2706,1942,1944,2713,1821,2719,2720,2721,2722,2723,2596,2725,2217,2603,1852,2750,1726,1727,2754,2757,1990,2759,2760,2376,2761,2762,2257,2777,2394,2651,2652,1761,2658,1762,2659,2788,2022,2791,2666,1770,2026,2028,2797,2675,1780,2549 ))
order by demography_person_id asc,emotion_percentage desc, emotion_key_for_sorting asc
) ep ON
ep.demography_person_id = dp.demography_person_id
WHERE start_timestamp >= '2019-04-01 00:00:00'
AND start_timestamp <= '2020-03-09 23:59:59' AND device_id IN ( 2052,2692,1797,2695,1928,2697,2698,1931,2574,2575,2706,1942,1944,2713,1821,2719,2720,2721,2722,2723,2596,2725,2217,2603,1852,2750,1726,1727,2754,2757,1990,2759,2760,2376,2761,2762,2257,2777,2394,2651,2652,1761,2658,1762,2659,2788,2022,2791,2666,1770,2026,2028,2797,2675,1780,2549 ) AND gender IN ( 1,2 )
group by 1,2 ORDER BY 1,2 ASC
) t USING (duration) GROUP BY 1,2,3,4 ORDER BY time_for_sorting;
$$ ,
$$
select emotion from (
values ('1'), ('2'), ('3'),('4'), ('5'), ('6'),('7'), ('8')
) t(emotion)
$$
) AS ct
(
duration text,
time_for_sorting timestamp,
ANGER bigInt,
DISGUSTING bigInt,
FEAR bigInt,
HAPPY bigInt,
NEUTRAL bigInt,
SAD bigInt,
SURPRISE bigInt,
NO_DETECTION bigInt
);

I don't fully understand the question and see two interpretations of it:
Will I benefit from using TimescaleDB and a hypertable just to improve this query?
Can I join a hypertable and a normal table, and how can I make the above query perform better?
If you just need to execute a complex query over a large dataset, PostgreSQL can do a good job if you provide indexes. TimescaleDB provides benefits for time-series workflows, especially when the workflow includes in-order data ingestion, time-related queries, time-series operators and/or TimescaleDB-specific functionality such as continuous aggregates and compression, i.e., not just a single query. TimescaleDB is designed for large volumes of time-series data. I hope this clarifies the first question.
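For example, if you later need hourly rollups of this data, a continuous aggregate is the kind of TimescaleDB-specific feature meant here. A minimal sketch against the question's demography_person table (the view name and bucket width are arbitrary; the syntax is for recent TimescaleDB versions):
-- Hourly rollup maintained incrementally by TimescaleDB (illustrative only)
CREATE MATERIALIZED VIEW demography_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', start_timestamp) AS bucket,
       device_id,
       count(*) AS persons_seen
FROM demography_person
GROUP BY 1, 2;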
In TimescaleDB it is very common to join a hypertable, which stores time-series data, with a normal table, which contains metadata about the time-series data. TimescaleDB implements constraint exclusion to improve query performance. However, it might not be applied in some cases due to uncommon query expressions or overly complex queries.
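Constraint exclusion applies when the filter sits directly on the hypertable's time column, so a join like the following simplified sketch (using the question's tables and date range) can still skip chunks outside the requested period:
EXPLAIN
SELECT demography_person_id, ep.emotion
FROM demography_person dp
JOIN emotions_person ep USING (demography_person_id)
WHERE dp.start_timestamp >= '2019-04-01'
  AND dp.start_timestamp < '2020-03-10';
-- The plan should show only the chunks covering this time range being scanned.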
The query in the question is very complex, so I suggest running EXPLAIN ANALYZE on it to see whether the query planner misses some optimisations.
I see that the query generates data (the generate_series part), and I doubt much can be done to produce a good plan for that. This is my biggest concern for getting good performance. It would be great if you could explain the motivation for generating data inside the query.
Another issue I see is the nested query demography_person_id in (select demography_person_id from demography_person ... in a WHERE condition, while the outer query already takes part in an inner join with the same table as the nested query. I expect it can be rewritten without the nested subquery, utilising the inner join instead; see the sketch below.
I doubt that TimescaleDB or PostgreSQL can do much to execute this query efficiently as written. The query requires manual rewriting.
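One possible direction for that rewrite (only a sketch, untested; the device list is shortened for readability and should be replaced with the full list): fetch the top emotion per person with a LATERAL subquery instead of the IN (...) pre-filter, so emotions_person is probed once per matching demography_person row:
SELECT to_char(dp.start_timestamp, 'HH24') AS duration,
       ep.emotion,
       count(*) AS count
FROM demography_person dp
JOIN LATERAL (
    -- pick the single "strongest" emotion for this person,
    -- mirroring the DISTINCT ON / ORDER BY of the original subquery
    SELECT e.emotion
    FROM emotions_person e
    WHERE e.demography_person_id = dp.demography_person_id
    ORDER BY e.emotion_percentage DESC,
             CASE e.emotion WHEN 4 THEN 1 WHEN 6 THEN 2 WHEN 1 THEN 3 WHEN 3 THEN 4
                            WHEN 2 THEN 5 WHEN 7 THEN 6 WHEN 5 THEN 7 ELSE 8 END
    LIMIT 1
) ep ON true
WHERE dp.start_timestamp >= '2019-04-01 00:00:00'
  AND dp.start_timestamp <= '2020-03-09 23:59:59'
  AND dp.device_id IN (2052, 2692)   -- shortened; use the full device list
  AND dp.gender IN (1, 2)
GROUP BY 1, 2
ORDER BY 1, 2;
An index on emotions_person (demography_person_id, emotion_percentage DESC) would support the LATERAL lookup.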

Related

Most efficient psql query to find count of records with date greater than date of target record

Given table changelogs:
create table changelogs
(
created_at timestamp,
user_action varchar(255) default ''::character varying not null,
id serial
);
How can I most efficiently count all records newer than the latest record with a given user_action?
Something like this works:
select count(*)
from changelogs
where created_at > (select max(created_at)
                    from changelogs
                    where user_action in ('target', 'target2'))
Indexes on (created_at) and (user_action, created_at) should make the given query reasonably efficient, with the caveat that if you end up needing to count most of the table, nothing is going to be very efficient.
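For reference, a sketch of those two indexes (the index names are arbitrary):
CREATE INDEX changelogs_created_at_idx ON changelogs (created_at);
CREATE INDEX changelogs_user_action_created_at_idx ON changelogs (user_action, created_at);
The second index lets the subquery find max(created_at) for the target actions cheaply; the first supports the outer range count.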
If that is not good enough, provide EXPLAIN (ANALYZE, BUFFERS) output for further ideas.

How to correctly GROUP BY on jdbc sources

I have a Kafka stream with user_id and want to produce another stream with user_id and the number of records for that user in a JDBC table.
Following is how I tried to achieve this (I'm new to Flink, so please correct me if that's not how things are supposed to be done). The issue is that Flink ignores all updates to the JDBC table after the job has started.
As far as I understand, the answer to this is to use lookup joins, but Flink complains that lookup joins are not supported on temporal views. I also tried to use versioned views without much success.
What would be the correct approach to achieve what I want?
CREATE TABLE kafka_stream (
user_id STRING,
event_time TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE jdbc_table (
user_id STRING,
checked_at TIMESTAMP,
PRIMARY KEY(user_id) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
-- ...
)
-- NEXT SQL --
CREATE TEMPORARY VIEW checks_counts AS
SELECT user_id, count(*) as num_checks
FROM jdbc_table
GROUP BY user_id
-- NEXT SQL --
INSERT INTO output_kafka_stream
SELECT
kafka_stream.user_id,
checks_counts.num_checks
FROM kafka_stream
LEFT JOIN checks_counts ON kafka_stream.user_id = checks_counts.user_id

Create pivot table with dynamic column names

I am creating a pivot table which represents crash values for a particular year. Currently, I am hard-coding the column names to create the pivot table. Is there any way to make the column names dynamic? The years are stored inside an array:
{2018,2017,2016 ..... 2008}
with crash as (
--- pivot table generated for total fatality ---
SELECT *
FROM crosstab('SELECT b.id, b.state_code, a.year, count(case when a.type = ''Fatal'' then a.type end) as fatality
FROM '||state_code_input||'_all as a, (select * from source_grid_repository where state_code = '''||upper(state_code_input)||''') as b
where st_contains(b.geom,a.geom)
group by b.id, b.state_code, a.year
order by b.id, a.year',$$VALUES ('2018'),('2017'),('2016'),('2015'),('2014'),('2013'),('2012'),('2011'),('2010'),('2009'),('2008') $$)
AS pivot_table(id integer, state_code varchar, fat_2018 bigint, fat_2017 bigint, fat_2016 bigint, fat_2015 bigint, fat_2014 bigint, fat_2013 bigint, fat_2012 bigint, fat_2011 bigint, fat_2010 bigint, fat_2009 bigint, fat_2008 bigint)
)
In the above code, fat_2018, fat_2017, fat_2016 etc. are hard-coded. I need the years after fat_ to be dynamic.
This question has been asked many times, and there are decent (even dynamic) solutions. While crosstab() is available in recent versions of Postgres, not everyone has sufficient privileges to install the prerequisite extension.
One such solution involves a temporary type (temp table) created by an anonymous function and JSON expansion of the resultant type.
See also: DB FIDDLE (UK): https://dbfiddle.uk/Sn7iO4zL
How to pivot or crosstab in postgresql without writing a function?
It is not possible. PostgreSQL has a strict type system. The result is a table (relation), and the format of this table (columns, column names, column types) must be defined before query execution (at planning time). So you cannot write any query for Postgres that returns a dynamic number of columns.
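A common workaround, when the set of years truly cannot be fixed in advance, is to keep the output shape static by folding the per-year counts into a single jsonb column. A sketch only, not the fiddle's exact approach; the_state_all stands in for the '||state_code_input||'_all table from the question and the other names follow it:
SELECT id,
       state_code,
       jsonb_object_agg('fat_' || year, fatality) AS fatalities_by_year
FROM (
    SELECT b.id, b.state_code, a.year,
           count(*) FILTER (WHERE a.type = 'Fatal') AS fatality
    FROM the_state_all a
    JOIN source_grid_repository b ON st_contains(b.geom, a.geom)
    GROUP BY b.id, b.state_code, a.year
) yearly
GROUP BY id, state_code;
The caller then reads fatalities_by_year->>'fat_2018' and so on, so no column list has to be known at planning time.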

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this part of the code I detect whether a vehicle stopped for at least 5 minutes.
It works, but with a large amount of data it starts to get slow.
I did a lot of tests and I'm sure that my problem is in the NOT EXISTS block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
SELECT
*
FROM
messages m
WHERE
m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where
vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
)
I can't figure out how to build the condition in a more performant way.
Edit, day 2:
I added a CTE like this, to be used in the main query:
WITH messagesx as (
SELECT
vehicleid,
messagedate
FROM
messages
WHERE
speedeffective > 5
)
and now it works better. I think I'm just missing a little detail.
Typically, a NOT EXISTS can slow down your query because it may scan the table for each of the outer rows. Try to incorporate the same functionality within a join (I'm trying to rewrite the query here without knowing the table, so I might make a mistake):
SELECT
    m1.*
FROM
    messages m1
    LEFT JOIN messages m2
        ON  m2.vehicleid = m1.vehicleid
        AND m2.speedeffective > 5                                   -- the condition from the NOT EXISTS subquery
        AND m2.messagedate >  m1.messagedate
        AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
    m1.speedeffective > 0
    and m1.next_speedeffective = 0
    and m2.vehicleid IS NULL
Take note that the NOT EXISTS is rewritten as a non-match of the join condition (m2.vehicleid IS NULL).
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and reading about NOT IN, NOT EXISTS and LEFT JOIN (where the joined side is NULL):
For PostgreSQL, NOT EXISTS and LEFT JOIN ... IS NULL are both planned as an anti-join and work the same way. (This is the reason why @CountZukula's answer performs almost the same as mine.)
The problem was the kind of operation the planner chose: nested loop or hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table and the same query now runs much faster.
So, after the VACUUM, PostgreSQL can choose a better plan.
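For reference, the maintenance step itself, plus a quick way to check when the table was last vacuumed/analyzed (standard PostgreSQL; table name from the question):
VACUUM ANALYZE messages;
-- see when (auto)vacuum and (auto)analyze last touched the table
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'messages';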

busy table performance optimization

I have a PostgreSQL table storing data from a table-like form:
id SERIAL,
item_id INTEGER ,
date BIGINT,
column_id INTEGER,
row_id INTEGER,
value TEXT,
some_flags INTEGER,
The issue is that we have 5000+ entries per day and the information needs to be kept for years.
So I end up with a huge table which is busy for the top 1000-5000 rows,
with lots of SELECT, UPDATE and DELETE queries, but the old content is rarely used (only in statistics) and is almost never changed.
The question is: how can I boost the performance for the daily work (the top 5000 entries out of 50 million)?
There are simple indexes on almost all columns, but nothing fancy.
Splitting the table is not possible for now; I'm looking more for index optimisation.
The advice in the comments from dezso and Jack is good. If you want the simplest option, this is how you implement the partial index:
create table t ("date" bigint, archive boolean default false);
insert into t ("date")
select generate_series(
extract(epoch from current_timestamp - interval '5 year')::bigint,
extract(epoch from current_timestamp)::bigint,
5)
;
create index the_date_partial_index on t ("date")
where not archive
;
To avoid having to change all queries to add the index condition, rename the table:
alter table t rename to t_table;
And create a view with the old name that includes the index condition:
create view t as
select *
from t_table
where not archive
;
explain
select *
from t
;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Index Scan using the_date_partial_index on t_table (cost=0.00..385514.41 rows=86559 width=9)
Then each day you archive older rows:
update t_table
set archive = true
where
"date" < extract(epoch from current_timestamp - interval '1 week')
and
not archive
;
The not archive condition is there to avoid updating millions of already archived rows.