I have a decently large postgres table with a few billion rows.
However, the table could be partitioned by one column (type).
Should we prefer:
An index with two columns
create nonclustered index ix_index1 on table1(type, string_urn_id)
or conditional (partial) indexes
create nonclustered index ix_index1_alternative on table1(string_urn_id) WHERE type = 'type1'
create nonclustered index ix_index1_alternative2 on table1(string_urn_id) WHERE type = 'type2'
create nonclustered index ix_index1_alternative3 on table1(string_urn_id) WHERE type = 'type3'
....
There is no statement create nonclustered index in PostgreSQL.
What is better depends on the definition of "better". From a maintenance perspective, the single index is better, because you won't have to create a new index whenever you add a new type.
From a performance perspective, only a benchmark with realistic data can tell. Planning time will increase with many indexes, but query performance may be a tad better.
If you partition the table, query performance will decrease, but you can make do with a single partitioned index on string_urn_id.
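For reference, the PostgreSQL spelling of the two candidates looks roughly like this (there is no NONCLUSTERED keyword, and the second form is what Postgres calls a partial index; the index names are placeholders):
-- one multi-column index covering all types
create index ix_type_urn on table1 (type, string_urn_id);
-- or one partial index per type value
create index ix_urn_type1 on table1 (string_urn_id) where type = 'type1';
create index ix_urn_type2 on table1 (string_urn_id) where type = 'type2';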
I have a question about a fundamental aspect of PostgreSQL.
Suppose I have two tables along the lines of the following:
create table source_data_property (
    source_data_property_id integer primary key generated by default as identity,
    property_name text not null
);
create table source_data_value (
    source_data_value_id integer primary key generated by default as identity,
    source_data_property_id integer not null references source_data_property,
    data_value numeric not null
);
Suppose I write a very simple query that just performs a basic join:
select
    sdp.source_data_property_id,
    sdp.property_name,
    sdv.source_data_value_id,
    sdv.data_value
from source_data_property as sdp
join source_data_value as sdv using (source_data_property_id)
;
For optimal query performance, is it necessary to add an index on the source_data_property_id column in the source_data_value table? My original thought was no, because the source_data_property_id is already indexed in the source_data_property table, but after thinking about it a bit I'm not so sure.
For optimal query performance, is it necessary to add an index on the source_data_property_id column in the source_data_value table?
In general yes, make indexes for your foreign keys. However...
A very small table won't get any advantage from indexes and Postgres will do a seq scan instead.
Similarly, it depends on what sort of queries you're doing. In your example you're fetching every row in source_data_property, which will also fetch every row in source_data_value. Using an index would be slower there, so Postgres will do a seq scan instead.
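A minimal sketch of such a foreign-key index; the literal 42 below is just a hypothetical property id standing in for a selective lookup:
-- index the foreign key column on the referencing side
create index on source_data_value (source_data_property_id);
-- with a selective filter, the planner can use that index instead of a seq scan
select sdv.source_data_value_id, sdv.data_value
from source_data_value as sdv
where sdv.source_data_property_id = 42;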
I want to create a hypertable in Postgres with TimescaleDB.
What I do is CREATE TABLE then CREATE INDEX and finally SELECT CREATE_HYPERTABLE.
My question: is CREATE INDEX necessary, helpful, or problematic for high performance of the hypertable?
In short: no indexes need to be created, as TimescaleDB will create an index on the time dimension by default. Depending on your usage you might want to create indexes to speed up select queries, and it is good to create them after creating the hypertable.
In more detail:
Creating a hypertable with the create_hypertable function replaces the original PostgreSQL table with a new table. Thus it is better to create the hypertable first and then create the index. It also works to create the index first and then call create_hypertable; in that case the existing indexes will be recreated on the hypertable. It is important to remember that unique indexes and primary keys need to include the time dimension column. And note that create_hypertable will create an index on the time dimension column by default.
In general, the considerations for creating indexes are similar to plain PostgreSQL: there are tradeoffs in using indexes. Indexes introduce overhead during data ingestion but can improve select queries significantly. I suggest checking the best practices for using indexes in TimescaleDB and the blog post about using composite indexes for time-series queries.
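A minimal sketch of that order of operations, using a hypothetical conditions table with time as the time dimension column:
create table conditions (
    time        timestamptz not null,
    device_id   text not null,
    temperature double precision
);
-- turns the plain table into a hypertable and creates the default index on time
select create_hypertable('conditions', 'time');
-- optional composite index to speed up per-device time-range queries
create index on conditions (device_id, time desc);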
I have two tables:
CREATE TABLE soils (
    sample_id TEXT PRIMARY KEY,
    project_id TEXT,
    technician_id TEXT
);
CREATE INDEX soils_idx
ON soils
USING btree
(sample_id COLLATE pg_catalog."default");
CREATE TABLE assays (
    sample_id TEXT PRIMARY KEY,
    mo_ppm NUMERIC
);
CREATE INDEX assays_idx
ON assays
USING btree
(sample_id COLLATE pg_catalog."default");
Each table contains about a half million records, and, in reality, about 20 additional columns each, of type TEXT (omitted in the DDL posted above to save time here).
When I perform the query:
EXPLAIN SELECT
s.sample_id, s.project_id, s.technician_id, a.mo_ppm
FROM
soils AS s INNER JOIN assays AS a ON s.sample_id = a.sample_id
I get 2 SEQ SCANs, rather than a lookup to the index. Is that expected behaviour?
Since you have no WHERE conditions, you effectively read both tables in full. It's cheaper to run sequential scans and not involve any indexes at all.
Try:
EXPLAIN
SELECT s.sample_id, s.project_id, s.technician_id, a.mo_ppm
FROM soils s
JOIN assays a USING (sample_id)
WHERE <some condition that returns few rows>;
... and an index matching the WHERE condition should be used.
You don't need to define an index on a PRIMARY KEY column. A PK constraint is implemented with a unique index automatically. Your additional index is redundant and of no use.
An index on a foreign key column would be a good idea, but there isn't one in your example, which looks odd. It looks like the two tables could be combined into one; probably just over-simplification for the test case.
Finally, for big tables, I would consider using a simple integer primary key instead of text, possibly a serial column. That's typically faster.
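Concretely, since each PRIMARY KEY already maintains its own unique btree index on sample_id, the extra indexes from the question can simply be dropped:
drop index soils_idx;
drop index assays_idx;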
Yes, that's expected behaviour. On the other hand, it depends on your random_page_cost, seq_page_cost and effective_cache_size settings. Your query doesn't have a WHERE clause, hence it might be faster to read everything sequentially. You can try to penalise sequential scans:
set enable_seqscan = off;
explain analyse <your query>;
and then compare plan/cost/IO wait (it is not possible to truly disable seq scans; they just get a prohibitively high cost estimate).
If you have an SSD and a WHERE clause in your query, you can lower random_page_cost to 1.5..2.5 to encourage PG to use an index.
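A rough sketch of that tuning; the cost value and the sample_id literal are only illustrative:
-- bring random I/O cost closer to sequential cost, as appropriate for SSDs
set random_page_cost = 1.5;
explain analyse
select s.sample_id, s.project_id, s.technician_id, a.mo_ppm
from soils s
join assays a using (sample_id)
where s.sample_id = 'ABC123';  -- hypothetical selective predicate
reset random_page_cost;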
I have a table with millions of rows in PostgreSQL. One row can be represented by either eight int4 columns or sixteen int2 columns.
I want to have one multicolumn (btree) index on this table: create index on mytable(c1,c2,c3,c4,....c8);
I wonder which is the better solution (for performance purposes): one multicolumn index with eight (int4 type) columns, or one multicolumn index with sixteen (int2 type) columns.
In other words:
create index on mytable (c_int4_1, c_int4_2, ... c_int4_8);
vs.
create index on mytable (c_int2_1,c_int2_2...c_int2_16);
Whichever most naturally matches the use of the data. Any gains from a more efficient btree will be lost again when forcing the data into another format.
I have a doubt: if my table doesn't have any constraint like a primary key, foreign key, or unique key, can I still create a clustered index on the table, and can a clustered index have duplicate records?
My second question: where exactly should we use a non-clustered index, and when is it useful and beneficial to create one on a table?
My third question: how can we create 249 non-clustered indexes on a table? Does that mean creating a non-clustered index on 249 columns?
Can anyone help me clear up my confusion about this?
First, the definition of a clustered index is that it is the physical ordering of the data on disk. Every time you do an insert into that table, the new record is placed on the physical disk in order, based on its value in the clustered index column. Because it is the physical location on disk, (A) it is the most rapidly accessible column in the table, but (B) only a single clustered index can be defined per table.
Which column (or columns) you use as the clustered index depends on the data itself and its use. Primary keys are typically the clustered index, especially if the primary key is sequential (e.g. an integer that increments automatically with each insert). This will provide the fastest insert/update/delete functionality. If you are more interested in performing reads (select * from table), you may want to cluster on a date column, as most queries have a date in the WHERE clause, the GROUP BY clause, or both.
Second, clustered indexes (at least in the DBs I know) need not be unique (they CAN have duplicates). Constraining the column to be unique is a separate matter. If the clustered index is a primary key, its uniqueness is a function of being a primary key.
Third, I can't follow your question concerning 249 columns. A non-clustered index is basically a tool for accelerating queries at the expense of extra disk space. It's hard to think of a case where creating an index on each column is necessary. If you want a quick rule of thumb...
Write a query using your table.
If a column is required to do a join, index it.
If a column is used in a WHERE clause, index it.
Remember all the indexes are doing for you is speeding up your queries. If queries run fast, don't worry about them.
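As a rough illustration of that rule of thumb, with a hypothetical orders/customers schema (plain CREATE INDEX syntax):
-- index the join column
create index ix_orders_customer_id on orders (customer_id);
-- index the column used in the WHERE clause
create index ix_orders_order_date on orders (order_date);
-- a query that benefits from both indexes
select o.order_id, c.customer_name
from orders as o
join customers as c on c.customer_id = o.customer_id
where o.order_date >= '2024-01-01';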
This is just a thumbnail sketch of a large topic. There are tons of more informative/comprehensive resources on this matter, and some depend on the database system ... just google it.