How do I measure whether an index is bloated in PostgreSQL 13? I'm specifically wondering about the primary key index of a table.
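One way, assuming you can install the pgstattuple extension (it ships with PostgreSQL's contrib modules), is to inspect the index directly; the index name below is a placeholder for your actual primary key index:

```sql
-- pgstatindex reports B-tree statistics: a low avg_leaf_density
-- (far below the ~90% fillfactor default) and high leaf_fragmentation
-- suggest the index is bloated
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('mytable_pkey');  -- replace with your PK index name
```

If the index turns out to be bloated, REINDEX (or REINDEX CONCURRENTLY on PostgreSQL 12+) rebuilds it compactly.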
I have a decently large PostgreSQL table with a few billion rows. However, the table could be partitioned by one column (type).
Should we prefer:
An index with two columns
create nonclustered index ix_index1 on table1(type, string_urn_id)
or a conditional index
create nonclustered index ix_index1_alternative on table1(string_urn_id) WHERE type = 'type1'
create nonclustered index ix_index1_alternative2 on table1(string_urn_id) WHERE type = 'type2'
create nonclustered index ix_index1_alternative3 on table1(string_urn_id) WHERE type = 'type3'
....
There is no statement create nonclustered index in PostgreSQL.
What is better depends on the definition of "better". From a maintenance perspective, the single index is better, because you won't have to create a new index whenever you add a new type.
From a performance perspective, only a benchmark with realistic data can tell. Planning time will increase with many indexes, but query performance may be a tad better.
If you partition the table, query performance will decrease, but you can make do with a single partitioned index on string_urn_id.
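In PostgreSQL syntax (there is no NONCLUSTERED keyword), the two options from the question would look roughly like this; table and column names are taken from the question:

```sql
-- Option 1: one composite index covering all types
CREATE INDEX ix_index1 ON table1 (type, string_urn_id);

-- Option 2: one partial index per type value
CREATE INDEX ix_index1_alternative  ON table1 (string_urn_id) WHERE type = 'type1';
CREATE INDEX ix_index1_alternative2 ON table1 (string_urn_id) WHERE type = 'type2';
```

The partial indexes are only usable when the query's WHERE clause matches the index predicate, which is another maintenance point in favor of the single composite index.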
So I have a (logged) table with two columns A, B, containing text.
They basically contain the same type of information, it's just two columns because of where the data came from.
I wanted a table of all unique values (so I made the column the primary key), not caring which source column a value came from. But when I asked Postgres to do
insert into new_table(value) select A from old_table on conflict (value) do nothing; (and later on same thing for column B)
it used one CPU core and read from my SSD at only about 5 MB/s. I stopped it after a couple of hours.
I suspected the B-tree might be the slow part, so I added a hash index on the only column in my new table. But it still maxes out one core and reads from the SSD at only 5 MB/s. My Java program can build a hash set from this data at at least 150 MB/s, so Postgres should be way faster than 5 MB/s, right? I've analyzed my old table and made my new table unlogged for faster inserts, yet it still uses one core and reads extremely slowly.
How to fix this?
EDIT: This is the EXPLAIN output for the above query. It seems Postgres is using the B-tree it created for the primary key instead of my (much faster, isn't it??) hash index.
Insert on users (cost=0.00..28648717.24 rows=1340108416 width=14)
Conflict Resolution: NOTHING
Conflict Arbiter Indexes: users_pkey
-> Seq Scan on games (cost=0.00..28648717.24 rows=1340108416 width=14)
The ON CONFLICT mechanism is primarily for resolving concurrency-induced conflicts. You can use it in a "static" case like this, but other methods will be more efficient.
Just insert only distinct values in the first place:
insert into new_table(value)
select A from old_table union
select B from old_table
For increased performance, don't add the primary key until after the table is populated. And set work_mem to the largest value you credibly can.
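Put together, a sketch of the whole load (names from the question; the work_mem value is an assumption to adjust for your machine):

```sql
SET work_mem = '2GB';  -- as large as you credibly can

CREATE UNLOGGED TABLE new_table (value text);

INSERT INTO new_table (value)
SELECT a FROM old_table
UNION            -- UNION (not UNION ALL) removes duplicates
SELECT b FROM old_table;

-- add the constraint only after the data is in place
ALTER TABLE new_table ADD PRIMARY KEY (value);
```

The UNION deduplicates with a sort or hash aggregate over the whole input, which is where a large work_mem pays off, and building the primary key index once at the end is much cheaper than maintaining it row by row during the insert.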
My java program can hashset that at at least 150 MB/s,
That is working with the hashset entirely in memory. PostgreSQL indexes are disk-based structures. They do benefit from caching, but that only goes so far and depends on hardware and settings you haven't told us about.
Seems like postgres is using the b-tree it created for the primary key instead of my (much faster, isn't it??) Hash index.
It can only use the index which defines the constraint, which is the btree index, as hash indexes cannot support primary key constraints. You could define an EXCLUDE constraint using the hash index, but that would just make it slower yet. And in general, hash indexes are not "much faster" than btree indexes in PostgreSQL.
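For completeness, the EXCLUDE variant mentioned above would look something like this (table and constraint names are placeholders), though as noted it would likely be slower still than the plain B-tree primary key:

```sql
-- an exclusion constraint on equality, enforced via a hash index;
-- semantically this rejects duplicate values, like UNIQUE
ALTER TABLE new_table
  ADD CONSTRAINT new_table_value_excl
  EXCLUDE USING hash (value WITH =);
```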
I was reading up on the BRIN index within PostgreSQL, and it seems to be beneficial to many of the tables we use.
That said, it applies nicely to a column which is already the primary key, in which case adding a separate index would negate part of the benefit of the index, which is space savings.
A PK is implicitly indexed, is it not? On that note, can it be done using a BRIN instead of a Btree, assuming the Btree is also implicit?
I tried this, and as expected it did not work:
create table foo (
id integer,
constraint foo_pk primary key using BRIN (id)
)
So, two questions:
Can a BRIN index be used on a PK?
If not, will the planner pick the more appropriate of the two if I have both a PK and a separate BRIN index (if performance means more to me than space)?
And it's of course possible that my understanding of this is incomplete, in which case I would appreciate any enlightenment.
Primary keys are a logical combination of NOT NULL and UNIQUE, therefore only an index type that supports uniqueness can be used.
From the PostgreSQL documentation (currently version 13):
Only B-tree currently supports unique indexes.
I'm not so sure BRIN would be faster than B-tree. It's a lot more space-efficient, but the fact that it's lossy and requires a secondary verification pass erodes any potential speed advantages. Once you are locked into having your B-tree primary key index, there's not much point to making a secondary overlapping BRIN index.
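As for the second question: yes, if both indexes exist, the planner chooses between them per query based on estimated cost. A minimal sketch (hypothetical table):

```sql
CREATE TABLE foo (
    id integer PRIMARY KEY       -- backed by a B-tree, as required
);

-- a secondary, overlapping BRIN index on the same column
CREATE INDEX foo_id_brin ON foo USING brin (id);

-- the planner picks foo_pkey or foo_id_brin by cost;
-- EXPLAIN shows which one a given query actually uses
EXPLAIN SELECT * FROM foo WHERE id BETWEEN 100 AND 200;
```

BRIN only wins when the column values correlate strongly with physical row order, which is worth checking before paying for the extra index.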
I have a doubt: if my table doesn't have any constraint like a primary key, foreign key, unique key, etc., can I create a clustered index on the table, and can a clustered index contain duplicate records?
My second question is: where exactly should we use a non-clustered index, and when is it useful and beneficial to create one on a table?
My third question is: how can we create 249 non-clustered indexes on a table? Does that mean creating a non-clustered index on 249 columns?
Can anyone help me clear up my confusion about this?
First, the definition of a clustered index is that it determines the physical ordering of the data on disk. Every time you insert into that table, the new record is placed on disk in order of its value in the clustered index column. Because it dictates the physical location on disk, the clustered index is (A) the most rapidly accessible path to the table's rows but (B) limited to one per table.
Which column (or columns) you use as the clustered index depends on the data itself and how it is used. The primary key is typically the clustered index, especially if it is sequential (e.g. an integer that increments automatically with each insert); this gives the fastest insert/update/delete performance. If you are more interested in reads (select * from table), you may want to cluster on a date column, as most queries have a date in the where clause, the group by clause, or both.
Second, clustered indexes (at least in the databases I know) need not be unique; they CAN contain duplicates. Constraining the column to be unique is a separate matter. If the clustered index is the primary key, its uniqueness comes from being a primary key.
Third, I can't quite follow your question concerning 249 columns. A non-clustered index is basically a tool for accelerating queries at the expense of extra disk space, and it's hard to think of a case where an index on every column is necessary. If you want a quick rule of thumb...
Write a query using your table.
If a column is required to do a join, index it.
If a column is used in a where clause, index it.
Remember all the indexes are doing for you is speeding up your queries. If queries run fast, don't worry about them.
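As a sketch of those rules (hypothetical orders/customers tables):

```sql
-- the query we want to speed up
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id  -- join column
WHERE o.order_date >= '2021-01-01';                -- where-clause column

-- index the join column and the where-clause column
CREATE INDEX ix_orders_customer_id ON orders (customer_id);
CREATE INDEX ix_orders_order_date  ON orders (order_date);
```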
This is just a thumbnail sketch of a large topic. There are tons of more informative/comprehensive resources on this matter, and some depend on the database system ... just google it.