Indexing on Timescaledb - postgresql

I am testing some queries on the PostgreSQL extension TimescaleDB.
The table is called timestampdb, and I run queries on it that look like this:
select id13 from timestampdb where timestamp1 >= '2010-01-01 00:05:00' and timestamp1 <= '2011-01-01 00:05:00';
select avg(id13)::numeric(10,2) from timestampdb where timestamp1 >= '2015-01-01 00:05:00' and timestamp1 <= '2015-01-01 10:30:00';
When I create the hypertable, I do this:
select create_hypertable('timestampdb', 'timestamp1');
Now I want to create an index on id13.
Should I try something like this:
call create_hypertable('timestampdb', 'timestamp1'), import the data into the table, and then create an index on timestampdb (id13)
or something like this:
create the table timestampdb, then call create_hypertable('timestampdb', 'timestamp1'), import the data, and then CREATE INDEX ON timestampdb (timestamp1, id13)
What is the correct way to do this?

You can create the index without the time dimension column, since you don't require it to be unique. The time dimension column has to be included only when an index is UNIQUE or backs a PRIMARY KEY, because TimescaleDB partitions a hypertable into chunks on the time dimension column, which is timestamp1 in the question. If the partitioning key includes space dimension columns in addition to time, they need to be included too.
So in your case the following should be sufficient after the migration to a hypertable:
create index on timestampdb(id13);
The question contains two queries, and neither of them needs an index on id13. Creating the index on id13 is worthwhile only if you expect queries other than those in the question, ones that filter or join on the id13 column.
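As a rough sketch of that order of operations (not part of the original answer), the two statements can be generated from Python. The helper name hypertable_setup_sql and the identifier check are hypothetical; the SQL itself is what is discussed above. The data import step is left out because the question does not say how the data is loaded.

```python
import re


def hypertable_setup_sql(table, time_column, index_column):
    """Return the statements in the order discussed above:
    convert the table to a hypertable first, then add the secondary index."""
    for name in (table, time_column, index_column):
        # Hypothetical safety check: only plain identifiers are accepted.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        # create_hypertable also creates the default index on the time column.
        f"SELECT create_hypertable('{table}', '{time_column}');",
        # Non-unique index, so the time column does not have to be part of it.
        f"CREATE INDEX ON {table} ({index_column});",
    ]
```

Calling hypertable_setup_sql("timestampdb", "timestamp1", "id13") returns the two statements for the table in the question; they could then be passed one by one to a database cursor.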

Related

Converting PostgreSQL table to TimescaleDB hypertable

I have a PostgreSQL table which I am trying to convert to a TimescaleDB hypertable.
The table looks as follows:
CREATE TABLE public.data
(
event_time timestamp with time zone NOT NULL,
pair_id integer NOT NULL,
entry_id bigint NOT NULL,
event_data int NOT NULL,
CONSTRAINT con1 UNIQUE (pair_id, entry_id ),
CONSTRAINT pair_id_fkey FOREIGN KEY (pair_id)
REFERENCES public.pairs (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
When I attempt to convert this table to a TimescaleDB hypertable using the following command:
SELECT create_hypertable(
'data',
'event_time',
chunk_time_interval => INTERVAL '1 hour',
migrate_data => TRUE
);
I get the Error: ERROR: cannot create a unique index without the column "event_time" (used in partitioning)
Question 1: From the post "How to convert a simple postgresql table to hypertable or timescale db table using created_at for indexing", my understanding is that this is because I have specified a unique constraint (con1) which does not contain the column I am partitioning by, event_time. Is that correct?
Question 2: How should I change my table or hypertable to be able to convert this? I have added some details below on how I plan to use the data and on its structure.
Data Properties and usage:
There can be multiple entries with the same event_time - those entries would have entry_id's which are in sequence
This means that if I have 2 entries (event_time 2021-05-18::10:16, id 105, <some_data>) and (event_time 2021-05-18::10:16, id 107, <some_data>) then the entry with id 106 would also have event_time 2021-05-18::10:16
The entry_id is not generated by me and I use the unique constraint con1 to ensure that I am not inserting duplicate data
I will query the data mainly on event_time e.g. to create plots and perform other analysis
At this point the database contains around 4.6 Billion rows but should contain many more soon
I would like to take advantage of TimescaleDB's speed and good compression
I don't care too much about insert performance
Solutions I have been considering:
Pack all the events which have the same timestamp into an array somehow and keep them in one row. I think this would hurt compression and provide less flexibility when querying the data. Also, I would probably end up having to unpack the data on each query.
Remove the unique constraint con1 - then how do I ensure that I don't add the same row twice?
Expand the unique constraint con1 to include event_time - would that not decrease performance somewhat, while at the same time opening up the error where I accidentally insert 2 rows with the same entry_id and pair_id but different event_time? (I doubt this is likely to happen, though.)
You understand correctly that UNIQUE (pair_id, entry_id) prevents creating a hypertable from the table, since unique constraints need to include the partition key, i.e., event_time in your case.
I don't follow how the first option, where records with the same timestamp are packed into a single record, would help with uniqueness.
Removing the unique constraint will allow you to create the hypertable, but as you mentioned, you will lose the ability to enforce the constraint.
Adding the time column, e.g., UNIQUE (pair_id, entry_id, event_time), is quite a common approach, but it allows inserting duplicates with different timestamps, as you mentioned. It will perform worse than option 2 during inserts. You can replace the index on event_time (which you need, since you query on this column, and which TimescaleDB creates automatically) with the unique index, so you save a little, e.g.,
CREATE UNIQUE INDEX indx ON data (event_time, pair_id, entry_id);
Another option is to manually create a unique constraint on each chunk table. This guarantees uniqueness within a chunk, but duplicates are still possible across different chunks. The main drawback is that you need to figure out how to create the constraint whenever a new chunk is created.
Unique constraints without the partition key are not supported in TimescaleDB, since enforcing them would require accessing all existing chunks to check uniqueness, which would kill performance (or it would require a global index, which can be large). I don't think it is common for time series data to have unique constraints, as they are usually related to artificially generated counter-based identifiers.
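To make the caveat about a unique key on (event_time, pair_id, entry_id) concrete, here is a small self-contained sketch. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs without a server; this is an assumption for illustration and it is not TimescaleDB. A composite unique index behaves the same way in this respect: only rows that match on all three columns are rejected. The function name and the sample values are made up.

```python
import sqlite3


def demo_composite_unique():
    # In-memory SQLite database used only as a stand-in (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data ("
        "event_time TEXT NOT NULL, pair_id INTEGER NOT NULL, "
        "entry_id INTEGER NOT NULL, event_data INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX indx ON data (event_time, pair_id, entry_id)"
    )
    insert = "INSERT INTO data VALUES (?, ?, ?, ?)"
    conn.execute(insert, ("2021-05-18 10:16", 1, 105, 42))

    # Same event_time, pair_id and entry_id: the unique index rejects it.
    try:
        conn.execute(insert, ("2021-05-18 10:16", 1, 105, 42))
        exact_duplicate_rejected = False
    except sqlite3.IntegrityError:
        exact_duplicate_rejected = True

    # Same pair_id and entry_id but a different event_time: accepted.
    conn.execute(insert, ("2021-05-18 10:17", 1, 105, 42))

    copies = conn.execute(
        "SELECT COUNT(*) FROM data WHERE pair_id = 1 AND entry_id = 105"
    ).fetchone()[0]
    conn.close()
    return exact_duplicate_rejected, copies
```

The exact duplicate is rejected, while the row that differs only in event_time is stored, so the table ends up with two rows for the same (pair_id, entry_id). That is the trade-off of the third option.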

Postgres timescale hypertable: is separate index necessary?

I want to create a hypertable in postgres timescale.
What I do is CREATE TABLE then CREATE INDEX and finally SELECT CREATE_HYPERTABLE.
My question: is CREATE INDEX necessary, helpful or problematic for a high performance of the hypertable?
In short: you don't need to create any indexes, as TimescaleDB creates an index on the time dimension by default. Depending on your usage, you might want to create indexes to speed up select queries, and it is good to create them after creating the hypertable.
In more detail:
Creating a hypertable with the create_hypertable function replaces the original PostgreSQL table with a new table. Thus it is better to create the hypertable and then create the index. It also works to create the index first and then call create_hypertable; in that case the existing indexes are recreated on the hypertable. It is important to remember that unique indexes and primary keys need to include the time dimension column. And note that create_hypertable creates an index on the time dimension column by default.
In general, the considerations for creating indexes are the same as with PostgreSQL: there are tradeoffs. Indexes introduce overhead during data ingestion but can improve select queries significantly. I suggest checking the best practices for using indexes in TimescaleDB and the blog post about using composite indexes for time-series queries.

What is the fastest way to add indices and an ID primary key column to a large table?

I have about 300 tables in my Postgres (PostgreSQL 9.6.5-1) database. The tables are large, each with about 6 million records. To insert the records, I created the tables without any indexes as I have found it is substantially faster to insert without any. I did not add an ID column (primary key, auto increment, unique), either.
I now need to add indices to each table, as well as a new ID column.
To do this, I use the following commands:
CREATE INDEX IF NOT EXISTS some_table_1_index ON some_table_1 (latitude, longitude, measurement_time, level, speed, altitude);
ALTER TABLE some_table_1 ADD COLUMN id SERIAL PRIMARY KEY;
I have found that it takes between 30 and 90 seconds per command...meaning that it would take 7h30 to do all my tables (assuming a worst-case scenario of 90s per command).
Is there a faster way to alter all my tables?
I am using Python and psycopg2, if that makes any difference.
First, your command doesn't create four indices. It creates two indices, of which the first is a composite index (which may not be exactly what you want, because column order affects whether the planner will choose to use the index).
Second, are you executing the CREATE commands serially? Could you run all 300 create commands in parallel?
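Pseudocode, since I don't know Python well:
from concurrent.futures import ThreadPoolExecutor
table_list = ['table1', 'table2', 'table3', ...]
create_sql = 'CREATE INDEX...[0]...'
# execute_in_thread(table) is a placeholder: run create_sql for one table on its own connection
with ThreadPoolExecutor() as pool:
    list(pool.map(execute_in_thread, table_list))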
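A more complete sketch of the parallel approach might look like the following. It is only an illustration under some assumptions: connect is a hypothetical factory you pass in that returns a new DB-API connection (for example lambda: psycopg2.connect(dsn)), the table names come from a trusted list because they are formatted straight into the SQL, and max_workers=4 is an arbitrary starting point. Whether this is actually faster depends on how much CPU and disk bandwidth the server has to spare.

```python
from concurrent.futures import ThreadPoolExecutor

# The two statements from the question, with the table name as a placeholder.
INDEX_SQL = (
    "CREATE INDEX IF NOT EXISTS {t}_index ON {t} "
    "(latitude, longitude, measurement_time, level, speed, altitude)"
)
ID_SQL = "ALTER TABLE {t} ADD COLUMN id SERIAL PRIMARY KEY"


def alter_one_table(connect, table):
    """Run both statements for one table on its own connection."""
    conn = connect()  # one connection per task; connections are not shared
    try:
        cur = conn.cursor()
        cur.execute(INDEX_SQL.format(t=table))
        cur.execute(ID_SQL.format(t=table))
        conn.commit()
    finally:
        conn.close()
    return table


def alter_all_tables(connect, tables, max_workers=4):
    """Process the tables concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda t: alter_one_table(connect, t), tables))
```

Each task opens its own connection because a single connection executes its statements one after another, so sharing one would serialize the work again.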

Understanding indexes and performance as they relate to indexed column and non-indexed column data in the same row

I have some tables that are around 100 columns wide. I haven't normalized them because to put it back together would require almost 3 dozen joins and am not sure it would perform any better... haven't tested it yet (I will) so can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand them, the locations of the values in that column are kept in memory so that a value is found quickly in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed, is that excessive and really adding any benefit? Is it sufficient to just index itemnumber and the index will know that expdate, or anything else that is queried for that item, is on the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or is individually sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint. So you don't have to create that index explicitly (but if it is a multicolumn index, it sometimes is useful to create another index on any but the first column).
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slow.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.
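The point about the SELECT list versus the WHERE clause can be tried out with a small sketch. Note the assumptions: it uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in so that it runs without a server, the description column and the index name idx_items_expdate are made up, and PostgreSQL's planner and its EXPLAIN output differ in the details. The principle it shows is the same.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def compare_plans():
    # In-memory SQLite database used only as a stand-in (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "itemnumber INTEGER PRIMARY KEY, expdate TEXT, description TEXT)"
    )
    conn.execute("CREATE INDEX idx_items_expdate ON items (expdate)")
    columns = "itemnumber, expdate, description"

    # expdate appears only in the SELECT list: every row is read anyway.
    no_filter = query_plan(conn, f"SELECT {columns} FROM items")
    # expdate appears in the WHERE clause: the index can narrow the search.
    with_filter = query_plan(
        conn, f"SELECT {columns} FROM items WHERE expdate = '2024-01-01'"
    )
    conn.close()
    return no_filter, with_filter
```

In the first plan the index on expdate is not mentioned, because reading every row gains nothing from it. In the second plan the index shows up, because the condition on expdate lets the engine jump to the matching rows.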

Remove Duplicate rows from a large table - PostgreSQL

I want to remove duplicates from a large table that has about 1 million rows and grows every hour. It has no unique id and about 575 columns, which are sparsely filled.
The table is 'like' a log table where new entries are appended every hour without a unique timestamp.
The duplicates are only about 1-3%, but I want to remove them anyway ;) Any ideas?
I tried the ctid column (as here), but it's very slow.
The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.
Example:
CREATE INDEX index_name ON tablename (md5((tablename.*)::text));
This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone because their cast-to-text value is session-dependent).
Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:
SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;
You may also use this index to avoid duplicate rows in the future rather than periodically de-duplicating them, by making it UNIQUE (duplicate rows would then be rejected at INSERT or UPDATE time).
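To actually remove the duplicates found by that self-join, one possible approach is a DELETE with the same join condition. The following is a sketch rather than something from the answer above: the helper names are hypothetical, the extra comparison of the full row text is my own addition to guard against md5 collisions, and cursor is assumed to be any DB-API cursor connected to PostgreSQL (for example from psycopg2).

```python
import re


def delete_duplicates_sql(table):
    """Build a DELETE that keeps the lowest-ctid copy of each identical row."""
    # Hypothetical safety check: only plain identifiers are accepted.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    return (
        f"DELETE FROM {table} t1 "
        f"USING {table} t2 "
        # Same hash comparison as in the SELECT above.
        "WHERE md5((t1.*)::text) = md5((t2.*)::text) "
        # Added assumption: also compare the row text to rule out collisions.
        "AND (t1.*)::text = (t2.*)::text "
        # A row is deleted if an identical row with a lower ctid exists.
        "AND t1.ctid > t2.ctid"
    )


def delete_duplicates(cursor, table):
    """Run the DELETE and return the row count reported by the driver."""
    cursor.execute(delete_duplicates_sql(table))
    return cursor.rowcount
```

It is worth running the SELECT version first and checking a few of the reported pairs by hand before deleting anything.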