
How do you add a compound / composite index on a PostgreSQL table with TimescaleDB installed?

Following https://docs.timescale.com/latest/using-timescaledb/schema-management, you can add a compound / composite index to TimescaleDB by simply doing:
CREATE INDEX ON conditions (time DESC, cityid)
WHERE cityid IS NOT NULL;
time is the timestamp column (the one TimescaleDB uses as the time partitioning column).
cityid is a city identifier we might often want to query for (as a second criterion after the time dimension).
This can be done before or after converting the table to a hypertable.
The WHERE cityid IS NOT NULL clause creates a partial index, which avoids bloating the index when the cityid column is often NULL. Use it by default unless you often search for missing data (cityid IS NULL).
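For instance, a time-bounded lookup for one city can use this index directly. A minimal sketch, assuming the conditions hypertable also has a temperature column (as in the TimescaleDB docs example):
-- The leading time DESC column matches the range predicate and the
-- sort order; cityid is filtered within the index:
SELECT time, temperature
FROM conditions
WHERE cityid = 42
  AND time > NOW() - INTERVAL '7 days'
ORDER BY time DESC;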

Related

PostgreSQL Unique constraint and compound index

I have a table with a unique constraint on two fields, I also use this as an index for faster performance. I want to query a third field as part of this index but I don't want the third field to be part of the unique constraint. i.e. I don't want a new composite index just for the third field as it's quite large.
Is there a way to do this in Postgres? I presently create the unique constraint and get the index created for free, can I specify the three-field composite index and tell the unique constraint to use this index, and Postgres will figure out it can use this index as a UC?
You can use the INCLUDE option (available since PostgreSQL 11):
create unique index on the_table (column_1, column_2)
include (column_3);
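If you also want a named constraint in the catalog (not just the index), you can attach the index to one afterwards. A sketch with an assumed index name:
-- name the index explicitly so it can be promoted to a constraint
create unique index the_table_c1_c2_key on the_table (column_1, column_2)
include (column_3);
-- promote it; the constraint takes ownership of the index
alter table the_table add constraint the_table_c1_c2_key unique
using index the_table_c1_c2_key;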

Index required for basic joins on foreign key that references a primary key

I have a question about a fundamental aspect of PostgreSQL.
Suppose I have two tables along the lines of the following:
create table source_data_property (
    source_data_property_id integer primary key generated by default as identity,
    property_name text not null
);
create table source_data_value (
    source_data_value_id integer primary key generated by default as identity,
    source_data_property_id integer not null references source_data_property,
    data_value numeric not null
);
Suppose I write a very simple query that just performs a basic join:
select
    sdp.source_data_property_id,
    sdp.property_name,
    sdv.source_data_value_id,
    sdv.data_value
from source_data_property as sdp
join source_data_value as sdv using (source_data_property_id);
For optimal query performance, is it necessary to add an index on the source_data_property_id column in the source_data_value table? My original thought was no, because the source_data_property_id is already indexed in the source_data_property table, but after thinking about it a bit I'm not so sure.
For optimal query performance, is it necessary to add an index on the source_data_property_id column in the source_data_value table?
In general yes, make indexes for your foreign keys. However...
A very small table won't get any advantage from indexes and Postgres will do a seq scan instead.
Similarly, it depends on what sort of queries you're doing. In your example you're fetching every row in source_data_property, which also fetches every row in source_data_value. For a full-table read like that, using an index is slower, so Postgres will do a seq scan instead.
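For the more selective queries where the index does pay off, a minimal sketch (the literal 42 is a placeholder value):
-- index the foreign key column for selective lookups and joins
create index on source_data_value (source_data_property_id);
-- check which plan Postgres actually chooses for a given query
explain
select *
from source_data_value
where source_data_property_id = 42;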

Cannot create primary key using already created index

I have a table ideas with columns idea_id, element_id and element_value.
Initially, I had created a composite primary key (ideas_pkey) using all three columns, but I started facing size-limit issues with the index associated with the primary key, as the element_value column can hold huge values.
Hence, I created another unique index hashing the column with possibly large values:
CREATE UNIQUE INDEX ideas_pindex ON public.ideas USING btree (idea_id, element_id, md5(element_value));
Now I deleted the initial primary key ideas_pkey and wanted to recreate it using this newly created index, like so:
alter table ideas add constraint ideas_pkey PRIMARY KEY ("idea_id", "element_id", "element_value") USING INDEX ideas_pindex;
But this fails with the following error
ERROR: syntax error at or near "ideas_pindex"
LINE 2: ...a_id", "element_id", "element_value") USING INDEX ideas_...
^
SQL state: 42601
Character: 209
What am I doing wrong?
A primary key index can't be a functional (expression) index. You can instead just keep the unique index on your table, or create another column storing the md5() of your larger column and use it in the PK.
That being said, there is also another error in your query: if you want to specify an index name, you can't specify the PK columns (they are derived from the underlying index). And if you want to specify the PK columns, you can't specify the index name/definition, as the index will be created automatically. See the docs.
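A sketch of both points; the element_value_md5 column in the second statement is hypothetical:
-- Correct USING INDEX syntax names only the index, not the columns
-- (still rejected here, since ideas_pindex is an expression index):
ALTER TABLE ideas ADD CONSTRAINT ideas_pkey PRIMARY KEY USING INDEX ideas_pindex;
-- Workable alternative: materialize the hash in a plain column and
-- build the PK on that instead:
ALTER TABLE ideas ADD COLUMN element_value_md5 text;
UPDATE ideas SET element_value_md5 = md5(element_value);
ALTER TABLE ideas ALTER COLUMN element_value_md5 SET NOT NULL;
ALTER TABLE ideas ADD CONSTRAINT ideas_pkey PRIMARY KEY (idea_id, element_id, element_value_md5);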

Converting PostgreSQL table to TimescaleDB hypertable

I have a PostgreSQL table which I am trying to convert to a TimescaleDB hypertable.
The table looks as follows:
CREATE TABLE public.data
(
    event_time timestamp with time zone NOT NULL,
    pair_id integer NOT NULL,
    entry_id bigint NOT NULL,
    event_data int NOT NULL,
    CONSTRAINT con1 UNIQUE (pair_id, entry_id),
    CONSTRAINT pair_id_fkey FOREIGN KEY (pair_id)
        REFERENCES public.pairs (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)
When I attempt to convert this table to a TimescaleDB hypertable using the following command:
SELECT create_hypertable(
    'data',
    'event_time',
    chunk_time_interval => INTERVAL '1 hour',
    migrate_data => TRUE
);
I get the Error: ERROR: cannot create a unique index without the column "event_time" (used in partitioning)
Question 1: From this post How to convert a simple postgresql table to hypertable or timescale db table using created_at for indexing, my understanding is that this is because I have specified a unique constraint (con1) which does not contain the column I am partitioning by, event_time. Is that correct?
Question 2: How should I change my table or hypertable to be able to convert this? I have added some data on how I plan to use the data and the structure of the data below.
Data Properties and usage:
There can be multiple entries with the same event_time - those entries would have entry_id's which are in sequence
This means that if I have 2 entries (event_time 2021-05-18::10:16, id 105, <some_data>) and (event_time 2021-05-18::10:16, id 107, <some_data>) then the entry with id 106 would also have event_time 2021-05-18::10:16
The entry_id is not generated by me and I use the unique constraint con1 to ensure that I am not inserting duplicate data
I will query the data mainly on event_time e.g. to create plots and perform other analysis
At this point the database contains around 4.6 billion rows but should contain many more soon
I would like to take advantage of TimescaleDB's speed and good compression
I don't care too much about insert performance
Solutions I have been considering:
Pack all the events which have the same timestamp into an array somehow and keep them in one row. I think this would have downsides on compression and provide less flexibility for querying the data. Also I would probably end up having to unpack the data on each query.
Remove the unique constraint con1 - then how do I ensure that I don't add the same row twice?
Expand unique constraint con1 to include event_time - wouldn't that decrease performance somewhat, while also opening up the error where I accidentally insert 2 rows with the same entry_id and pair_id but different event_time? (I doubt this is likely to happen, though)
You understand correctly that UNIQUE (pair_id, entry_id) doesn't allow creating a hypertable from the table, since unique constraints need to include the partition key, i.e., event_time in your case.
I don't follow how the first option, where records with the same timestamp are packed into a single record, would help with uniqueness.
Removing the unique constraint will allow creating the hypertable, but as you mentioned, you lose the ability to enforce uniqueness.
Adding the time column, e.g., UNIQUE (pair_id, entry_id, event_time), is quite a common approach, but it allows inserting duplicates with different timestamps, as you mentioned. It will also perform worse than option 2 during inserts. You can replace the index on event_time (which you need, since you query on this column, and which is created automatically by TimescaleDB) with the unique index, so you save a little, e.g.:
CREATE UNIQUE INDEX indx ON data (event_time, pair_id, entry_id);
A fourth option is to manually create a unique constraint on each chunk table. This guarantees uniqueness within a chunk, but it will still be possible to have duplicates in different chunks. The main drawback is that you will need to figure out how to create it whenever a new chunk is created.
Unique constraints without the partition key are not supported in TimescaleDB, since checking them would require accessing all existing chunks, which would kill performance (or it would require a global index, which can be large). I don't think unique constraints are a common case for time-series data, as they are usually related to artificially generated counter-based identifiers.
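Putting option 3 into practice end to end, a minimal sketch against the table from the question:
-- Swap the two-column unique constraint for one that includes the
-- partition column, then convert:
ALTER TABLE public.data DROP CONSTRAINT con1;
ALTER TABLE public.data
    ADD CONSTRAINT con1 UNIQUE (pair_id, entry_id, event_time);
SELECT create_hypertable(
    'data',
    'event_time',
    chunk_time_interval => INTERVAL '1 hour',
    migrate_data => TRUE
);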

Index to query sorted values in keyed time range

Suppose I have key/value/timerange tuples, e.g.:
CREATE TABLE historical_values(
    key TEXT,
    value NUMERIC,
    from_time TIMESTAMPTZ,
    to_time TIMESTAMPTZ
)
and would like to be able to efficiently query values (sorted descending) for a specific key and time, e.g.:
SELECT value
FROM historical_values
WHERE key = [KEY]
  AND from_time <= [TIME]
  AND to_time >= [TIME]
ORDER BY value DESC
What kind of index/types should I use to get the best lookup performance? I suspect my solution will involve a tstzrange and a gist index, but I'm not sure how to make that play well with the key matching and value ordering requirements.
Edit: Here's some more information about usage.
Ideally uses features available in Postgres v9.6.
Relation will contain approx. 1k keys and 5m values per key. Values are large integers (up to 32 bytes), mostly unique. Time ranges between few hours to a couple years. Time horizon is 5 years. No NULL values allowed, but some time ranges are open-ended (could either use NULL or a time far into the future for to_time).
The primary key is the key and time range (as there is only one historical value for a time range, per key).
Common operations are a) updating to_time to "close" a historical value, and b) inserting a new value with from_time = NOW.
All values may be queried. Partitioning is an option.
DB design
For a big table like that ("1k keys and 5m values per key"), I would suggest optimizing storage like this:
CREATE TABLE hist_keys (
   key_id serial PRIMARY KEY
 , key text NOT NULL UNIQUE
);
CREATE TABLE hist_values (
   hist_value_id bigserial PRIMARY KEY -- optional, see below!
 , key_id int NOT NULL REFERENCES hist_keys
 , value numeric
 , from_time timestamptz NOT NULL
 , to_time timestamptz NOT NULL
 , CONSTRAINT range_valid CHECK (from_time <= to_time) -- or < ?
);
This also helps index performance.
And consider partitioning: list-partitioning on key_id, maybe even with sub-partitioning (range partitioning this time) on from_time. Read the manual here.
With one partition per key_id, (and constraint exclusion enabled!) Postgres would only look at the small partition (and index) for the given key, instead of the whole big table. Major win.
But I would strongly suggest to upgrade to at least Postgres 10 first, which added "declarative partitioning". Makes managing partitions a lot easier.
Better yet, skip forward to Postgres 11 (currently beta), which adds major improvements for partitioning (incl. performance improvements). Most notably for your goal of the best lookup performance, quoting the chapter on partitioning in the release notes:
Allow faster partition elimination during query processing (Amit Langote, David Rowley, Dilip Kumar)
This speeds access to partitioned tables with many partitions.
Allow partition elimination during query execution (David Rowley, Beena Emerson)
Previously partition elimination could only happen at planning time,
meaning many joins and prepared queries could not use partition elimination.
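A minimal sketch of the suggested layout with declarative partitioning (Postgres 10+ syntax); partition names are illustrative, and note that in Postgres 10 indexes must be created on each partition (Postgres 11 can create them on the parent):
CREATE TABLE hist_values (
   hist_value_id bigserial
 , key_id int NOT NULL
 , value numeric
 , from_time timestamptz NOT NULL
 , to_time timestamptz NOT NULL
) PARTITION BY LIST (key_id);
-- one partition per key_id, ~ 1000 in total:
CREATE TABLE hist_values_1 PARTITION OF hist_values FOR VALUES IN (1);
CREATE TABLE hist_values_2 PARTITION OF hist_values FOR VALUES IN (2);
-- ...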
Index
From the perspective of the value column, the small subset of selected rows is arbitrary for every new query. I don't expect you'll find a useful way to support ORDER BY value DESC with an index. I'd concentrate on the other columns. Maybe add value as last column to each index if you can get index-only scans out of it (possible for btree and GiST).
Without partitioning:
CREATE UNIQUE INDEX hist_btree_idx ON hist_values (key_id, from_time, to_time DESC);
UNIQUE is optional, but see below.
Note the importance of opposing sort orders for from_time and to_time. See (closely related!):
Optimizing queries on a range of timestamps (two columns)
This is almost the same index as the one implementing your PK on (key_id, from_time, to_time). Unfortunately, we cannot use it as PK index. Quoting the manual:
Also, it must be a b-tree index with default sort ordering.
So I added a bigserial as surrogate primary key in my suggested table design above and NOT NULL constraints plus the UNIQUE index to enforce your uniqueness rule.
In Postgres 10 or later consider an IDENTITY column instead:
Auto increment table column
You might even do without a PK constraint in this exceptional case to avoid duplicating the index and keep the table at minimum size. Depends on the complete situation. You may need it for FK constraints or similar. See:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
A GiST index like you already suspected may be even faster. I suggest to keep your original timestamptz columns in the table (16 bytes instead of 32 bytes for a tstzrange) and add key_id after installing the additional module btree_gist:
CREATE INDEX hist_gist_idx ON hist_values
USING GiST (key_id, tstzrange(from_time, to_time, '[]'));
The expression tstzrange(from_time, to_time, '[]') constructs a range including upper and lower bound. Read the manual here.
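btree_gist ships with the standard contrib modules; enabling it is a one-liner (assuming a role allowed to create extensions):
-- adds GiST operator classes for plain scalar types like integer,
-- so key_id can lead the mixed GiST index above:
CREATE EXTENSION IF NOT EXISTS btree_gist;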
Your query needs to match the index:
SELECT value
FROM hist_values
WHERE key_id = [KEY_ID]
  AND tstzrange(from_time, to_time, '[]') @> [TIME]::timestamptz
ORDER BY value DESC;
It's equivalent to your original (with key replaced by key_id from the suggested design).
@> being the "range contains element" operator.
With list-partitioning on key_id
With a separate table for each key_id, we can omit key_id from the index, improving size and performance - especially for the GiST index - for which we then also don't need the additional module btree_gist. Results in ~ 1000 partitions and the corresponding indexes:
CREATE INDEX hist999_gist_idx ON hist_values USING GiST (tstzrange(from_time, to_time, '[]'));
Related:
Store the day of the week and time?