PostgreSQL - Optimizing creation of several indexes on one big table

What's the best way to create multiple indexes on one very large table (hundreds of GB)?
Currently I have to execute queries like this:
create index A on myBigTable using btree (a);
create index B on myBigTable using btree (b);
create index C on myBigTable using btree (c);
....
It takes a very long time, because PG has to read all the data from the table several times.
While the index on column A is being built, it also prevents me from creating the index on column B at the same time.
So, is there a way to reduce the time (and resource consumption) needed to create all the indexes on the same table?
If it's not really possible to optimize this currently, is it planned for the near future?
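One commonly suggested workaround (a sketch, assuming plain CREATE INDEX rather than CREATE INDEX CONCURRENTLY): a plain CREATE INDEX takes only a SHARE lock, and SHARE locks do not conflict with one another, so the builds can be issued from separate sessions and run in parallel instead of back to back. Raising maintenance_work_mem in each session also speeds up the sort phase of each build.

-- session 1 (assumption: adjust the memory setting to your available RAM)
set maintenance_work_mem = '2GB';
create index A on myBigTable using btree (a);

-- session 2, started at the same time from a second connection:
-- set maintenance_work_mem = '2GB';
-- create index B on myBigTable using btree (b);

-- session 3, likewise:
-- set maintenance_work_mem = '2GB';
-- create index C on myBigTable using btree (c);

Each build still scans the whole table, but the scans happen concurrently rather than one after another.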

Related

Postgres index creation time on temp table

I have a statement trigger that executes a function after some rows are updated.
From the trigger's transition table (joined with some other tables) I need to create a temporary table, which I will query later.
The thing is, the query on this temp table is pretty fast, but as the number of rows increases it gets slower. So I was trying to add an index.
My question is: how can I measure the impact of the index creation? If the number of rows is not high, the index is not used, so I would need to know whether creating the index is expensive.
For bigger tables, the cost of creating a B-tree index is dominated by the cost of sorting: O(n log n).
For small tables the cost is low, so I wouldn't worry about creating the index even for the small tables where it is not used.
You shouldn't forget to ANALYZE the temporary table, however, because autovacuum doesn't do that for temporary tables.
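A minimal sketch of that workflow (the names are hypothetical, not from the question; "changed_rows" stands for the trigger's transition table):

-- inside the statement trigger function
create temp table tmp_work as
    select c.id, o.payload
    from changed_rows c
    join other_table o on o.id = c.id;  -- placeholder join target
create index on tmp_work (id);  -- O(n log n); cheap while tmp_work is small
analyze tmp_work;               -- manual ANALYZE: autovacuum skips temp tables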

Indexing on Timescaledb

I am testing some queries on the PostgreSQL extension TimescaleDB.
The table is called timestampdb and I run queries on it that look like this:
select id13 from timestampdb where timestamp1 >= '2010-01-01 00:05:00' and timestamp1 <= '2011-01-01 00:05:00';
select avg(id13)::numeric(10,2) from timestampdb where timestamp1 >= '2015-01-01 00:05:00' and timestamp1 <= '2015-01-01 10:30:00';
When I create a hypertable I do this:
select create_hypertable('timestampdb', 'timestamp1');
The thing is that now I want to create an index on id13.
Should I try something like this:
select create_hypertable('timestampdb', 'timestamp1'), import the data, and then create index on timestampdb (id13)
or something like this:
create table timestampdb, then select create_hypertable('timestampdb', 'timestamp1'), import the data, and then CREATE INDEX ON timestampdb (timestamp1, id13)
What is the correct way to do this?
You can create an index without the time dimension column, since you don't require it to be unique. Including the time dimension column in an index is only needed if the index is UNIQUE or a PRIMARY KEY, since TimescaleDB partitions a hypertable into chunks on the time dimension column, which is timestamp1 in the question. If the partitioning key includes space dimension columns in addition to time, they will need to be included too.
So in your case the following should be sufficient after the migration to hypertable:
create index on timestampdb(id13);
The question contains two queries, and neither of them needs an index on id13. Creating the index on id13 will only be valuable if you expect queries different from those in the question, ones that contain a condition or join on the id13 column.
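Putting the recommended order together (a sketch; the column types are assumptions, since the question doesn't show the table definition):

create table timestampdb (timestamp1 timestamptz not null, id13 numeric);
select create_hypertable('timestampdb', 'timestamp1');
-- import the data here, e.g. with COPY
create index on timestampdb (id13);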

Postgres timescale hypertable: is separate index necessary?

I want to create a hypertable in postgres timescale.
What I do is CREATE TABLE then CREATE INDEX and finally SELECT CREATE_HYPERTABLE.
My question: is CREATE INDEX necessary, helpful or problematic for a high performance of the hypertable?
In short: no indexes need to be created, as TimescaleDB will create an index on the time dimension by default. Depending on your usage you might want to create indexes to speed up select queries, and it is good to create them after creating the hypertable.
In more details:
Creating a hypertable with the create_hypertable function replaces the original PostgreSQL table with a new table. Thus it is better to create the hypertable first and then create the index. It also works to create the index first and then call create_hypertable; in that case the existing indexes will be recreated on the hypertable. It is important to remember that unique indexes and primary keys need to include the time dimension column. And note that create_hypertable will create an index on the time dimension column by default.
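For example, a primary key on a hypertable must cover the time dimension column (a sketch with hypothetical table and column names):

create table metrics (
    time      timestamptz not null,
    device_id int         not null,
    value     double precision,
    primary key (time, device_id)   -- must include the time dimension
);
select create_hypertable('metrics', 'time');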
In general, the considerations for creating indexes are similar to those for plain PostgreSQL: there are tradeoffs in using indexes. Indexes introduce overhead during data ingestion but can improve select queries significantly. I suggest checking the best practices for using indexes in TimescaleDB and the blog post about using composite indexes for time-series queries.
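As an illustration of such a composite index (reusing the hypothetical metrics table above; the column choice is an assumption, not from the question):

-- serves queries that filter on one device over a time range, e.g.
-- where device_id = 7 and time > now() - interval '1 day'
create index on metrics (device_id, time desc);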

Postgres not very fast at finding unique values in table with about 1.3 billion rows

So I have a (logged) table with two columns A, B, containing text.
They basically contain the same type of information, it's just two columns because of where the data came from.
I wanted to have a table of all unique values (so I made the column the primary key), not caring which column a value came from. But when I asked Postgres to do
insert into new_table(value) select A from old_table on conflict (value) do nothing;
(and later the same thing for column B)
it used 1 CPU core and read from my SSD at only about 5 MB/s. I stopped it after a couple of hours.
I suspected that it might be because the b-tree is slow, so I added a hash index on the only attribute of my new table. But it still maxes out 1 core and reads from the SSD at only 5 MB/s. My Java program can build a hashset of this data at at least 150 MB/s, so Postgres should be way faster than 5 MB/s, right? I've ANALYZEd my old table and made my new table unlogged for faster inserts, and yet it still uses 1 core and reads extremely slowly.
How can I fix this?
EDIT: This is the EXPLAIN output for the above query. It seems Postgres is using the b-tree it created for the primary key instead of my (much faster, isn't it??) hash index.
Insert on users (cost=0.00..28648717.24 rows=1340108416 width=14)
Conflict Resolution: NOTHING
Conflict Arbiter Indexes: users_pkey
-> Seq Scan on games (cost=0.00..28648717.24 rows=1340108416 width=14)
The ON CONFLICT mechanism is primarily for resolving concurrency-induced conflicts. You can use it in a "static" case like this, but other methods will be more efficient.
Just insert only distinct values in the first place:
insert into new_table(value)
select A from old_table
union
select B from old_table;
For increased performance, don't add the primary key until after the table is populated. And set work_mem to the largest value you credibly can.
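Assembled into one script (a sketch; the unlogged table and the work_mem value are assumptions based on the question and the advice above):

set work_mem = '2GB';                       -- assumption: size to your RAM
create unlogged table new_table (value text);
insert into new_table(value)
select A from old_table
union                                        -- UNION (not UNION ALL) de-duplicates
select B from old_table;
alter table new_table add primary key (value);  -- build the index once, at the end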
My Java program can build a hashset of this data at at least 150 MB/s,
That is working with the hashset entirely in memory. PostgreSQL indexes are disk-based structures. They do benefit from caching, but that only goes so far and depends on hardware and settings you haven't told us about.
It seems Postgres is using the b-tree it created for the primary key instead of my (much faster, isn't it??) hash index.
It can only use the index that defines the constraint, which is the btree index; hash indexes cannot support primary key constraints. You could define an EXCLUDE constraint using the hash index, but that would just make it slower still. And in general, hash indexes are not "much faster" than btree indexes in PostgreSQL.
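For completeness, an EXCLUDE constraint over a hash index would look like this (a sketch of the construct mentioned above, not a recommendation):

-- enforces uniqueness of "value" through the hash index's = operator
alter table new_table
    add constraint value_unique exclude using hash (value with =);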

PostgreSQL insert into too many referenced table

I have a simple table with a primary key ID and other columns that are not so interesting. There is not much data in it, just several thousand records. However, there are too many constraints on this table: it is referenced by many other tables (more than 200) with foreign keys to the ID column. Every time I try to insert something into it, all the constraints are checked, and each insert takes around 2-3 seconds to complete. There are btree indexes on all tables, so the query planner uses index scans as well as sequential ones.
My question: is there any option or technique I can apply to speed up these inserts? I tried disabling sequential scans and using index scans only, but this didn't help. Partitioning wouldn't help either, I think. So please advise.
The PostgreSQL version is 10.0.
Thank you!
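One way to check where those 2-3 seconds actually go (a hedged diagnostic sketch; the table name and value are placeholders): EXPLAIN ANALYZE executes the insert and, if constraint triggers are the bottleneck, reports a "Trigger for constraint ..." line with the time spent in each one. Wrapping it in a transaction lets you roll the test insert back.

begin;
explain analyze
insert into mytable (id) values (12345);  -- placeholder table and value
-- slow constraint triggers show up in the output as:
-- Trigger for constraint <name>: time=... calls=1
rollback;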