postgres - estimate index size for timestamp column

I have a Postgres table, ENTRIES, with a 'made_at' column of type timestamp without time zone.
The table has a composite btree index on that column together with another column (USER_ID, a foreign key):
btree (user_id, date_trunc('day'::text, made_at))
As you can see, the timestamp is truncated to the day. The total size of the index constructed this way is 130 MB -- there are 4,000,000 rows in the ENTRIES table.
QUESTION: How do I estimate the size of the index if I wanted the time resolved down to the second? Basically, truncate the timestamp to the second rather than to the day (which should be easy to do, I hope).

Interesting question! According to my investigation, they will be the same size.
My intuition told me that there should be no difference between the sizes of your two indexes: timestamp types in PostgreSQL have a fixed size (8 bytes), and I supposed the truncate function simply zeroes out the appropriate number of least significant time bits. Still, I figured I had better support my guess with some facts.
I spun up a free dev database on Heroku PostgreSQL and generated a table with 4M random timestamps, truncated to both day and second values, as follows:
test_db=> SELECT * INTO ts_test FROM
          (SELECT id,
                  ts,
                  date_trunc('day', ts)    AS trunc_day,
                  date_trunc('second', ts) AS trunc_s
           FROM (SELECT generate_series(1, 4000000) AS id,
                        now() - '1 year'::interval * round(random() * 1000) AS ts) AS sub)
          AS subq;
SELECT 4000000
test_db=> create index ix_day_trunc on ts_test (id, trunc_day);
CREATE INDEX
test_db=> create index ix_second_trunc on ts_test (id, trunc_s);
CREATE INDEX
test_db=> \d ts_test
Table "public.ts_test"
Column | Type | Modifiers
-----------+--------------------------+-----------
id | integer |
ts | timestamp with time zone |
trunc_day | timestamp with time zone |
trunc_s | timestamp with time zone |
Indexes:
"ix_day_trunc" btree (id, trunc_day)
"ix_second_trunc" btree (id, trunc_s)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_day_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_second_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)
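As an extra sanity check (not part of the original session), pg_column_size() confirms that truncation changes the value but not the storage size -- each of these should report 8 bytes, since timestamp is a fixed-size type:
-- all three report 8: date_trunc() does not shrink a timestamp
SELECT pg_column_size(now()::timestamp)                       AS full_ts,
       pg_column_size(date_trunc('day', now()::timestamp))    AS day_ts,
       pg_column_size(date_trunc('second', now()::timestamp)) AS second_ts;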

Related

SQL partition by query optimization

I have the prices table below, and I want to obtain a last_30_days price array and a last_year_price from it, as shown below.
CREATE TABLE prices
(
    id integer NOT NULL,
    "time" timestamp without time zone NOT NULL,
    close double precision NOT NULL,
    CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
select id, time,
       first_value(close) over (partition by id order by time
                                range between '1 year' preceding and current row) as prev_year_close,
       array_agg(p.close) over (partition by id order by time
                                rows between 30 preceding and current row) as prices_30
from prices p
However, I want to place a WHERE clause on the prices table so I get the last_30_days price array and last_year_price only for some specific rows, e.g. where time is within the last week (so the query runs only over the last week's values as opposed to the entire table).
But a WHERE clause pre-filters the rows, and the window partition then runs only on those filtered rows, which produces wrong results. It gives a result like:
time, id, last_30_days
-1day, X, [A,B,C,D,E, F,G]
-2day, X, [A,B,C,D,E,F]
-3day, X, [A,B,C,D,E]
-4day, X, [A,B,C,D]
-5day, X, [A,B,C]
-6day, X, [A,B]
-7day, X, [A]
How do I fix this so that the window partition always takes 30 values irrespective of the WHERE condition, without having to run the query on the entire table and then select a subset of rows with a WHERE clause? My prices table is huge, and running the query over the entire table is very expensive.
EDIT
CREATE TABLE prices
(
    id integer NOT NULL,
    "time" timestamp without time zone NOT NULL,
    close double precision NOT NULL,
    prev_30 double precision[],
    prev_year double precision,
    CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
Use a subquery:
SELECT *
FROM (select id, time,
             first_value(close) over (partition by id order by time
                                      range between '1 year' preceding and current row) as prev_year_close,
             array_agg(p.close) over (partition by id order by time
                                      rows between 30 preceding and current row) as prices_30
      from prices p
      WHERE time >= current_timestamp - INTERVAL '1 year 1 week') AS q
WHERE time >= current_timestamp - INTERVAL '1 week';
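The design point here is that the inner WHERE must reach back by the width of the widest window frame ('1 year') plus the span you actually want ('1 week'), so every row that survives the outer filter still sees its complete frame; the outer WHERE then discards the extra history rows.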

date at time zone related syntax and semantic differences

Question: How is query 1 "semantically" different from query 2?
Background:
I need to extract data from a table in a DB that is at my local time zone (AT TIME ZONE 'America/New_York').
The table has data for various time zones, such as 'America/Los_Angeles', 'America/North_Dakota/New_Salem', and so on.
(Postgres stores the table data for the various time zones in my local time zone.)
So, every time I retrieve data for a location other than my local one, I convert it to the relevant time zone for evaluation purposes.
Query 1:
test_db=# select count(id) from click_tb where date::date AT TIME ZONE 'America/Los_Angeles' = '2017-05-22'::date AT TIME ZONE 'America/Los_Angeles';
count
-------
1001
(1 row)
Query 2:
test_db=# select count(id) from click_tb where (date AT TIME ZONE 'America/Los_Angeles')::date = '2017-05-22'::date;
count
-------
5
(1 row)
Table structure:
test_db=# \d+ click_tb
                                        Table "public.click_tb"
 Column |           Type           |                       Modifiers                        | Storage | Stats target | Description
--------+--------------------------+--------------------------------------------------------+---------+--------------+-------------
 id     | integer                  | not null default nextval('click_tb_id_seq'::regclass) | plain   |              |
 date   | timestamp with time zone |                                                        | plain   |              |
Indexes:
    "click_tb_id" UNIQUE CONSTRAINT, btree (id)
    "click_tb_date_index" btree (date)
Query 1 and query 2 do not produce consistent results.
As per my tests, query 3 below semantically addresses my requirement.
Your critical feedback is welcome.
Query 3:
test_db=# select count(id) from click_tb where ((date AT TIME ZONE 'America/Los_Angeles')::timestamp with time zone)::date = '2017-05-22'::date;
Do not convert the timestamp field. Instead, do a range query. Since your data is already using a timestamp with time zone type, just set the time zone of your query accordingly.
set TimeZone = 'America/Los_Angeles';
select count(id) from click_tb
where date >= '2017-01-02'
and date < '2017-01-03';
Note how this uses a half-open interval between the two dates (each taken at the start of day in the set time zone). If you want to compute the second date from your first date, then:
set TimeZone = 'America/Los_Angeles';
select count(id) from click_tb
where date >= '2017-01-02'
and date < (timestamp with time zone '2017-01-02' + interval '1 day');
This properly handles daylight saving time, and it keeps the query sargable (the index on date can still be used).

Get Data From Postgres Table At every nth interval

Below is my table; I am inserting data from my Windows .NET application at a 1-second interval. I want to write a query to fetch data from the table at every nth interval, for example every 5 seconds. Below is the query I am using, but it is not giving the required result. Please help me.
CREATE TABLE table_1
(
    timestamp_col timestamp without time zone,
    value_1 bigint,
    value_2 bigint
)
This is the query I am using:
select timestamp_col, value_1, value_2
from (
    select timestamp_col, value_1, value_2,
           INTERVAL '5 Seconds' * (row_number() OVER (ORDER BY timestamp_col) - 1)
               + timestamp_col as r
    from table_1
) as dt
where r = 1
Use the date_part() function with the modulo operator:
select timestamp_col, value_1, value_2
from table_1
where date_part('second', timestamp_col)::int % 5 = 0
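This works because 5 divides evenly into 60, so the matching seconds (0, 5, ..., 55) recur identically in every minute. For an interval that does not divide 60 (say, every 7 seconds), a variant of the same idea based on the Unix epoch may be more robust (a sketch, not part of the original answer):
-- keep rows whose epoch second is a multiple of 7,
-- independent of minute boundaries
select timestamp_col, value_1, value_2
from table_1
where date_part('epoch', timestamp_col)::bigint % 7 = 0;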

Postgresql: using 'with clause' to iterate over a range of dates

I have a database table that contains a start visdate and an end visdate. If a date is within this range, the asset is marked available. Assets belong to a user. My query takes in a date range (a start and an end date). I need to return data so that, for the given date range, it queries the database and returns a count of the assets available on each day of that range.
I know there are a few examples out there; I was wondering whether it's possible to execute this as just a query / common table expression rather than using a function or a temporary table. I'm also finding it quite complicated because the assets table does not contain a single date on which an asset is available: I'm querying a range of dates against a visibility window. What is the best way to do this? Should I just run a separate query for each day in the date range I'm given?
Asset Table
StartvisDate Timestamp
EndvisDate Timestamp
ID int
User Table
ID
User & Asset Join table
UserID
AssetID
Date     | Number of Assets Available | User
---------+----------------------------+---------
11/11/14 | 5                          | UK
12/11/14 | 6                          | Greece
13/11/14 | 4                          | America
14/11/14 | 0                          | Italy
You need to use a set-returning function to generate the needed rows. See this related question:
SQL/Postgres datetime division / normalizing
Example query to get you started:
with data as (
    select id, start_date, end_date
    from (values
        (1, '2014-12-02 14:12:00+00'::timestamptz, '2014-12-03 06:45:00+00'::timestamptz),
        (2, '2014-12-05 15:25:00+00'::timestamptz, '2014-12-05 07:29:00+00'::timestamptz)
    ) as rows (id, start_date, end_date)
)
select data.id,
       count(data.id)
from data
join generate_series(
         date_trunc('day', data.start_date),
         date_trunc('day', data.end_date),
         '1 day'
     ) as days (d)
  on days.d >= date_trunc('day', data.start_date)
 and days.d <= date_trunc('day', data.end_date)
group by data.id
id | count
----+-------
1 | 2
2 | 1
(2 rows)
You'll want to convert it to using ranges instead, and adapt it to your own schema and data, but it's basically the same kind of query as the one you want.
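For instance, adapted to the asset schema sketched in the question (a sketch under assumed names -- a table asset(id, startvisdate, endvisdate) and a hard-coded date range standing in for the query parameters):
-- count the assets available on each day of the requested range;
-- days on which no asset is available report a count of 0
select s.d::date as day, count(a.id) as assets_available
from generate_series(date '2014-11-11', date '2014-11-14', interval '1 day') as s(d)
left join asset a
  on s.d::date between a.startvisdate::date and a.endvisdate::date
group by s.d
order by s.d;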

adding missing date in a table in PostgreSQL

I have a table that contains data for every day in 2002, but it has some missing dates -- namely, 354 records for 2002 (instead of 365). For my calculations, I need to have the missing dates present in the table, with NULL values:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | 65.6 | 2002-05-09 |
| 103 | 75.9 | 2002-05-10 |
+-----+------------+------------+
you see that 2002-05-08 is missing. I want my final table to be like:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | | 2002-05-08 |
| 103 | 65.6 | 2002-05-09 |
| 104 | 75.9 | 2002-05-10 |
+-----+------------+------------+
Is there a way to do that in PostgreSQL?
It doesn't matter if I have the result just as a query result (not necessarily an updated table)
date is a reserved word in standard SQL and the name of a data type in PostgreSQL. PostgreSQL allows it as an identifier, but that doesn't make it a good idea. I use thedate as the column name instead.
Don't rely on the absence of gaps in a surrogate ID. That's almost always a bad idea. Treat such an ID as a unique number without meaning, even if it seems to carry certain other attributes most of the time.
In this particular case, as @Clodoaldo commented, thedate seems to be a perfect primary key and the column id is just cruft - which I removed:
CREATE TEMP TABLE tbl (thedate date PRIMARY KEY, rainfall numeric);
INSERT INTO tbl(thedate, rainfall) VALUES
('2002-05-06', 110.2)
, ('2002-05-07', 56.6)
, ('2002-05-09', 65.6)
, ('2002-05-10', 75.9);
Query
Full table by query:
SELECT x.thedate, t.rainfall -- rainfall automatically NULL for missing rows
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
LEFT JOIN tbl t USING (thedate)
ORDER BY x.thedate
Similar to what @a_horse_with_no_name posted, but simplified and ignoring the pruned id.
Fills in gaps between the first and last date found in the table. If there can be leading / lagging gaps, extend accordingly. You can use date_trunc() like @Clodoaldo demonstrated - but his query, as originally posted, suffered from syntax errors and can be simpler.
INSERT missing rows
The fastest and most readable way to do it is a NOT EXISTS anti-semi-join.
INSERT INTO tbl (thedate, rainfall)
SELECT x.thedate, NULL
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t.thedate = x.thedate)
Just do an outer join against a query that returns all dates in 2002:
with all_dates as (
select date '2002-01-01' + i as date_col
from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select row_number() over (order by ad.date_col) as id,
t.rainfall,
ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;
This will not change your table, it will just produce the result as desired.
Note that the generated id column will not contain the same values as the ID column in your table as it is merely a counter in the result set.
You could also replace the row_number() function with extract(doy from ad.date_col)
To fill the gaps (this will not reorder the IDs):
insert into t (rainfall, "date")
select null, s.d::date
from t
right join generate_series(
    (select date_trunc('year', min("date")) from t),
    (select max("date") from t),
    '1 day'
) s(d) on t."date" = s.d::date
where t."date" is null;
You have to fully re-create your table, as indexes would have to change.
A better way to do it is to use your preferred DB interface language: make a loop that ignores the ID and puts the values into a new table with new serialized IDs.
for day in (whole needed calendar):
    value = select rainfall from oldbrokentable where date = day
    insert into newcleanedtable (date, rainfall, id) values (day, value, serialized)
(That's not real code! Just conceptual, to be adapted to your preferred scripting language.)
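For what it's worth, the same idea can also be expressed as a single set-based SQL statement, avoiding the loop entirely (a sketch, assuming the old table is named t with columns "date" and rainfall; newcleanedtable is hypothetical):
-- build a gap-free copy with freshly serialized IDs;
-- rainfall stays NULL for dates missing from the old table
CREATE TABLE newcleanedtable AS
SELECT row_number() OVER (ORDER BY s.d)::int AS id,
       t.rainfall,
       s.d::date AS "date"
FROM generate_series((SELECT min("date") FROM t),
                     (SELECT max("date") FROM t),
                     interval '1 day') AS s(d)
LEFT JOIN t ON t."date" = s.d::date;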