Why does ORDER BY on multiple columns cause a table scan - postgresql

I have a question about ORDER BY in PostgreSQL.
In my case, I have one table with two columns (id, first_name). The primary key is id, and I also created an index on the column first_name.
CREATE TABLE STUDENT (
ID UUID NOT NULL,
FIRST_NAME VARCHAR(255) NULL,
CONSTRAINT Student_PK PRIMARY KEY (ID)
);
CREATE INDEX INDEX_NAME ON STUDENT (FIRST_NAME);
When I execute the query below, it triggers an index scan:
explain SELECT id, first_name
FROM public.student
order by first_name asc
limit 1
offset 0
//Index Scan using index_name on student (cost=0.14..50.25 rows=140 width=532)
When I order by id, it still triggers an index scan:
explain SELECT id, first_name
FROM public.student
order by id asc
limit 1
offset 0
//Index Scan using student_pk on student (cost=0.14..50.25 rows=140 width=532)
My question is: when I order by id, first_name, why does it trigger a seq scan?
explain SELECT id, first_name
FROM public.student
order by id asc, first_name asc
limit 1
offset 0
//Seq Scan on student (cost=0.00..11.40 rows=140 width=532)
I've looked through a lot of PostgreSQL documentation, but I can't find any explanation of this. Can someone explain this behavior?
Thank you~

This is because you are using a PostgreSQL version older than 13.
If the planner uses the index on (id) alone, it needs to "read ahead" in the index to find any ties, and if it finds any, it has to break them by looking at first_name and re-sorting those rows. This was implemented in v13 (as "Incremental Sort"), but not before that.
Of course, there can't be any ties on id in the first place, since it is the primary key, but the planner does not make use of this knowledge.
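Since id is unique, adding first_name to the ORDER BY cannot change the result, so the simplest workaround on older versions is to order by id alone. The following is a minimal sketch of that equivalence, using Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example is runnable without a server; the sample rows are invented for illustration):

```python
import sqlite3
import uuid

def build_demo():
    # In-memory SQLite stand-in for the STUDENT table from the question.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student (id TEXT NOT NULL PRIMARY KEY, first_name TEXT)"
    )
    conn.execute("CREATE INDEX index_name ON student (first_name)")
    names = ["Carol", "Alice", "Bob", "Alice", None]  # made-up sample data
    conn.executemany(
        "INSERT INTO student (id, first_name) VALUES (?, ?)",
        [(str(uuid.uuid4()), name) for name in names],
    )
    return conn

def ordered_ids(conn, order_by):
    # order_by is a hard-coded literal in this sketch, never user input.
    sql = f"SELECT id FROM student ORDER BY {order_by}"
    return [row[0] for row in conn.execute(sql)]

conn = build_demo()
by_id = ordered_ids(conn, "id ASC")
by_id_then_name = ordered_ids(conn, "id ASC, first_name ASC")
# id is unique, so first_name never gets a chance to break a tie.
print(by_id == by_id_then_name)
```

This only shows that the two orderings return the same rows; it says nothing about which plan PostgreSQL picks, which still has to be checked with EXPLAIN on the real table.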

You explicitly created a secondary index on FIRST_NAME, and Postgres already created an index on id when it created the table, because that column is the primary key (a primary key is always backed by a unique index).
When you use:
order by id, first_name
You are requesting a two-tier sort of your data. Neither the id index nor the FIRST_NAME index alone can satisfy this sort. In fact, using either index would require lookups back into the main table to fetch the values of the other column. As a result, Postgres chooses not to use any index at all and just scans the table. Note that your current result might change depending on the size of the table and the type of the data.

Related

Replacing two columns (first name, last name) with an auto-increment id

I have a time-series location data table containing the following columns (time, first_name, last_name, loc_lat, loc_long) with the first three columns as the primary key. The table has more than 1M rows.
I notice that first_name and last_name duplicate quite often. There are only 100 combinations in 1M rows. Therefore, to save disk space, I am thinking about creating a separate people table with columns (id, first_name, last_name) where (first_name, last_name) is a unique constraint, in order to simplify the time-series location table to be (time, person_id, loc_lat, loc_long) where person_id is a foreign key for the people table.
I want to first create a new table from my existing 1M row table to test if there is indeed meaningful disk space save with this change. I feel like this task is quite doable but cannot find a concrete way to do so yet. Any suggestions?
That's a basic step of database normalization.
If you can afford to do so, it will be faster to write a new table that exchanges full names for IDs than to alter the schema of the existing table and update all rows. Basically:
BEGIN; -- wrap in single transaction (optional, but safer)
CREATE TABLE people (
people_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, first_name text NOT NULL
, last_name text NOT NULL
, CONSTRAINT full_name_uni UNIQUE (first_name, last_name)
);
INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name
FROM tbl
ORDER BY 1, 2; -- optional
ALTER TABLE tbl RENAME TO tbl_old; -- free up org. table name
CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM tbl_old t
JOIN people p USING (first_name, last_name);
-- ORDER BY ??
ALTER TABLE tbl ADD CONSTRAINT people_id_fk FOREIGN KEY (people_id) REFERENCES people(people_id);
-- make sure the new table is complete. indexes? constraints?
-- Finally:
DROP TABLE tbl_old;
COMMIT;
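Before the final DROP TABLE, it is worth confirming that the join lost no rows (it would silently drop any row whose first_name or last_name is NULL). The steps above can be sketched and checked on a toy dataset as follows; this uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the identity column is replaced by INTEGER PRIMARY KEY, and the sample rows are invented:

```python
import sqlite3

def normalize(conn):
    # Same steps as the script above, in SQLite-compatible syntax.
    conn.executescript("""
        CREATE TABLE people (
            people_id  INTEGER PRIMARY KEY,  -- stand-in for an identity column
            first_name TEXT NOT NULL,
            last_name  TEXT NOT NULL,
            UNIQUE (first_name, last_name)
        );
        INSERT INTO people (first_name, last_name)
        SELECT DISTINCT first_name, last_name FROM tbl;

        ALTER TABLE tbl RENAME TO tbl_old;

        CREATE TABLE tbl AS
        SELECT t.time, p.people_id, t.loc_lat, t.loc_long
        FROM tbl_old t
        JOIN people p USING (first_name, last_name);
    """)
    old_count = conn.execute("SELECT count(*) FROM tbl_old").fetchone()[0]
    new_count = conn.execute("SELECT count(*) FROM tbl").fetchone()[0]
    return old_count, new_count

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl (
        time TEXT, first_name TEXT, last_name TEXT,
        loc_lat REAL, loc_long REAL,
        PRIMARY KEY (time, first_name, last_name)
    )
""")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        ("2020-01-01 10:00", "Ann", "Lee", 1.0, 2.0),
        ("2020-01-01 11:00", "Ann", "Lee", 1.1, 2.1),
        ("2020-01-01 10:00", "Bob", "Ray", 3.0, 4.0),
        ("2020-01-01 12:00", "Bob", "Ray", 3.1, 4.1),
    ],
)
old_count, new_count = normalize(conn)
people_count = conn.execute("SELECT count(*) FROM people").fetchone()[0]
# Only drop tbl_old once the row counts match.
print(old_count, new_count, people_count)
```

For the disk-space question itself, PostgreSQL's pg_total_relation_size() can be compared for tbl_old and for the new tables before anything is dropped.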
Related:
Best way to populate a new column in a large table?
Add new column without table lock?
Updating database rows without locking the table in PostgreSQL 9.2
DISTINCT is simple. But for only 100 distinct full names - and with the right index support! - there are more sophisticated, (much) faster ways. See:
Optimize GROUP BY query to retrieve latest row per user

Use index to speed up query using values from different tables

I have a table products, a table orders and a table orderProducts.
Products have a name as a PK (apple, banana, mango) and a price.
orders have a created_at date and an id as a PK.
orderProducts connects orders and products, so they have a product_name and an order_id. Now I would like to show all orders for a given product that happened in the last 24 hours.
I use the following query:
SELECT
orders.id,
orders.created_at,
products.name,
products.price
FROM
orderProducts
JOIN products ON
products.name=orderProducts.product
JOIN orders ON
orders.id=orderProducts."order"
WHERE
products.name='banana'
AND
orders.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY
orders.created_at
This works, but I would like to optimize this query with an index. This index would need to be ordered first by the product name, so it can be filtered, and then by the created_at of the order in descending order, so it can select only the ones from the last 24 hours.
The problem is that, from what I have seen, an index can only be created on a single table, without the possibility of joining another table's values into it. Since two individual indexes do not solve this problem either, I was wondering if there is an alternative way to optimize this particular query.
Here are the table scripts:
CREATE TABLE products
(
name text PRIMARY KEY,
price integer
);
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE orderProducts
(
product text REFERENCES products(name),
"order" integer REFERENCES orders(id)
);
First of all, please do not put indexes everywhere: that leads to slower write operations.
As proposed by @Laurenz Albe: do not guess, check.
Other than that, note that you already know the product name, and the price is repeated on every row, so you could query it once. Whether two queries would be faster than a single one in your case is something to check.
Please read the docs. I would try this index:
create index orders_id_created_at on orders(created_at desc, id);
Normally id should go first, since it is unique; here, however, the system should be able to use the index for both predicates (the WHERE filter and the join). I am just guessing here.
For orderProducts I would like to see an index on both columns, although for this query only one should be needed. In practice you go either from products to orders or the other way around, and both paths are possible, which is why I wrote about indexing both columns. I would use two separate indexes:
create index orderproducts_product_id on orderproducts (product) include ("order");
create index orderproducts_order_id on orderproducts ("order") include (product);
That probably does not change much, but the idea is to answer the query from the index alone, without touching the table itself.
These rules are important in terms of performance:
An integer index is faster than a string index, so try to make primary keys integers, because joins between tables use the primary keys too.
If WHERE clauses always filter on the same two fields, create a single index covering both fields.
Foreign keys are not indexed automatically; you must create an index on foreign-key columns manually.
So, the recommended table scripts are:
CREATE TABLE products
(
id serial primary key,
name text,
price integer
);
CREATE UNIQUE INDEX products_name_idx ON products USING btree (name);
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX orders_created_at_idx ON orders USING btree (created_at);
CREATE TABLE orderProducts
(
product_id integer REFERENCES products(id),
order_id integer REFERENCES orders(id)
);
CREATE INDEX orderproducts_product_id_idx ON orderproducts USING btree (product_id, order_id);
---- OR ----
CREATE INDEX orderproducts_product_id ON orderproducts (product_id);
CREATE INDEX orderproducts_order_id ON orderproducts (order_id);
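With integer keys, the original query has to join through products.id instead of the product name. A minimal sketch of the adapted query, assuming the schema above and using Python's built-in sqlite3 module as a stand-in for PostgreSQL (so SERIAL becomes INTEGER PRIMARY KEY, timestamps are stored as text, and the sample rows are invented):

```python
import sqlite3

SCHEMA = """
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT UNIQUE, price INTEGER);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    created_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX orders_created_at_idx ON orders (created_at);
CREATE TABLE orderProducts (
    product_id INTEGER REFERENCES products(id),
    order_id   INTEGER REFERENCES orders(id)
);
CREATE INDEX orderproducts_product_id_idx ON orderProducts (product_id, order_id);
"""

# In PostgreSQL the time filter would read:
#   o.created_at >= NOW() - INTERVAL '24 HOURS'
QUERY = """
SELECT o.id, o.created_at, p.name, p.price
FROM products p
JOIN orderProducts op ON op.product_id = p.id
JOIN orders o ON o.id = op.order_id
WHERE p.name = ?
  AND o.created_at >= datetime('now', '-24 hours')
ORDER BY o.created_at
"""

def recent_orders(conn, product_name):
    # Orders from the last 24 hours that contain the given product.
    return conn.execute(QUERY, (product_name,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO products (id, name, price) VALUES (?, ?, ?)",
    [(1, "apple", 30), (2, "banana", 50)],
)
conn.execute("INSERT INTO orders (id, created_at) VALUES (1, datetime('now', '-1 hours'))")
conn.execute("INSERT INTO orders (id, created_at) VALUES (2, datetime('now', '-2 days'))")
conn.execute("INSERT INTO orders (id, created_at) VALUES (3, datetime('now', '-3 hours'))")
conn.executemany(
    "INSERT INTO orderProducts (product_id, order_id) VALUES (?, ?)",
    [(2, 1), (2, 2), (2, 3), (1, 3)],
)
banana_ids = [row[0] for row in recent_orders(conn, "banana")]
apple_ids = [row[0] for row in recent_orders(conn, "apple")]
print(banana_ids, apple_ids)
```

Whether PostgreSQL actually uses these indexes for the query depends on table sizes and statistics, so the plan should still be verified with EXPLAIN (ANALYZE) on real data.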

Delete all records that violate new unique constraint

I have a table that has the following fields
----------------------------------
| id | user_id | doc_id |
----------------------------------
I want to create a new unique constraint to make sure that there are no repeated (user_id, doc_id) records. In other words, a user can only be linked to a doc once. That is simple enough.
ALTER TABLE mytable
ADD CONSTRAINT uniquectm_const UNIQUE (user_id, doc_id);
The issue is that I have records that currently violate that constraint. I was wondering if there is an easy way to query for those records, or to tell Postgres to just delete anything that violates the constraint.
Identifying records that violate your new key:
SELECT *
FROM
(
SELECT id, user_id, doc_id
, COUNT(*) OVER (PARTITION BY user_id, doc_id) as unique_check
FROM mytable
) AS counted
WHERE unique_check > 1;
Then you can figure out from those duplicates, which should be deleted and perform the delete.
To my knowledge there is no other way to perform this since any automated "Delete any duplicates" command would leave the database engine to decide which of the two-or-more duplicate records to get rid of.
If the entire record is a duplicate (all columns match), then you could just create a new table with your new unique constraint and do an INSERT INTO newtable SELECT DISTINCT * FROM oldtable, but I'm betting that isn't the case.
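Once you have decided which duplicate should survive, the delete itself is short. The sketch below assumes one possible rule, keeping the row with the lowest id in each (user_id, doc_id) group; that rule is an illustration, not something the question specifies. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the constraint is added as a unique index rather than with ALTER TABLE, and the sample rows are invented:

```python
import sqlite3

def dedupe_and_constrain(conn):
    # Assumed rule: keep the smallest id per (user_id, doc_id) group.
    deleted = conn.execute("""
        DELETE FROM mytable
        WHERE id NOT IN (
            SELECT MIN(id) FROM mytable GROUP BY user_id, doc_id
        )
    """).rowcount
    # SQLite cannot add a constraint via ALTER TABLE; a unique index
    # enforces the same rule. In PostgreSQL, use the ALTER TABLE above.
    conn.execute(
        "CREATE UNIQUE INDEX uniquectm_const ON mytable (user_id, doc_id)"
    )
    return deleted

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, user_id INTEGER, doc_id INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable (id, user_id, doc_id) VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 101), (4, 11, 100), (5, 10, 100)],
)
deleted = dedupe_and_constrain(conn)
remaining = conn.execute(
    "SELECT id, user_id, doc_id FROM mytable ORDER BY id"
).fetchall()
print(deleted, remaining)
```

The DELETE statement itself is plain SQL and should work unchanged in PostgreSQL; run it inside a transaction and inspect the result before committing.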

Index on foreign keys

I'm just trying to better understand indexes.
On pg 106 of 70-461 - Querying Microsoft Sql Server 2012,
it says that when a primary key or unique constraint is created, SQL Server automatically creates a unique index.
But no indexes are created for foreign keys.
Therefore, to make joins more efficient, is it best to just create a nonclustered index on the foreign keys?
Not sure what part is the question.
An index is used to enforce a unique constraint.
A FK by nature does not require an index.
But if the FK has an index the query optimizer will often use it in the join.
In this query docMVEnum1.valueID is a FK with an index.
The query optimizer used that index.
Even with the index it was still the most expensive part of the query.
select docMVEnum1.sID, docEnum1.value
from docMVEnum1
join docEnum1
on docEnum1.valueID = docMVEnum1.valueID
Also by nature a FK is often used in a where clause.
Indexes are not free.
They improve select but slow down insert and update.
No, you don't need to create an index for the foreign keys; there is no guarantee that it will make joins more efficient.
The indexes for unique and PK constraints are created to enforce uniqueness, which also keeps the uniqueness check on INSERT and UPDATE cheap.
When you query with a JOIN, the optimizer will use zero or one index to seek / scan the table.
Let's say that you have a couple of tables like
-- Lookup table first, so the foreign key below can reference it
CREATE TABLE LookupTable
(
ID int PRIMARY KEY,
Description varchar(max)
);
CREATE TABLE MyTable
(
ID int PRIMARY KEY,
Description varchar(max),
ColumnFK int REFERENCES LookupTable (ID)
);
SELECT MyTable.ID, MyTable.Description, MyTable.ColumnFK, LookupTable.Description
FROM MyTable
INNER JOIN LookupTable
ON LookupTable.ID = MyTable.ColumnFK
WHERE MyTable.ID BETWEEN 5 AND 10000;
Most probably the query optimizer will use an index scan to find all the relevant IDs in MyTable, and then pick up the columns ColumnFK and Description from MyTable.
If you were thinking of adding the FK to the unique or PK index, consider what happens if you end up with many FKs in the same table.
Note that I intentionally added MyTable.Description to the query and made it varchar(max) to show that such a query has to reach the table data anyway.

Should this PostgreSQL query use the indexes?

I have two tables:
CREATE TABLE soils (
sample_id TEXT PRIMARY KEY,
project_id TEXT,
technician_id TEXT
);
CREATE INDEX soils_idx
ON soils
USING btree
(sample_id COLLATE pg_catalog."default");
CREATE TABLE assays (
sample_id TEXT PRIMARY KEY,
mo_ppm NUMERIC
);
CREATE INDEX assays_idx
ON assays
USING btree
(sample_id COLLATE pg_catalog."default");
Each table contains about a half million records, and, in reality, about 20 additional columns each, of type TEXT (omitted in the DDL posted above to save time here).
When I perform the query:
EXPLAIN SELECT
s.sample_id, s.project_id, s.technician_id, a.mo_ppm
FROM
soils AS s INNER JOIN assays AS a ON s.sample_id = a.sample_id
I get 2 SEQ SCANs, rather than a lookup to the index. Is that expected behaviour?
Since you have no WHERE conditions, you effectively read the whole table. It's cheaper to run sequential scans and not involve any indexes at all.
Try:
EXPLAIN
SELECT s.sample_id, s.project_id, s.technician_id, a.mo_ppm
FROM soils s
JOIN assays a USING (sample_id)
WHERE <some condition that returns few rows>;
... and an index matching the WHERE condition should be used.
You don't need to define an index on a PRIMARY KEY column. A PK constraint is implemented with a unique index automatically. Your additional index is redundant and of no use.
An index on a foreign key column would be a good idea, but there isn't one in your example, which looks odd. Like the two tables could be combined into one. Probably just over-simplification for the test case.
Finally, for big tables, I would consider using a simple integer primary key instead of text, possibly a serial column. That's typically faster.
Yes, that's expected behaviour. It also depends on your random_page_cost, seq_page_cost and effective_cache_size settings. Your query doesn't have a WHERE clause, hence it might be faster to read everything sequentially. You can try to penalise sequential scans:
set enable_seqscan = off;
explain analyse <your query>;
and then compare the plan, cost and I/O wait (it is not possible to disable seq scans completely, but they get a very high cost -- ~1e7 (or 1e8)).
If you have an SSD and a WHERE clause in your query, you can lower random_page_cost to 1.5..2.5 to encourage PG to use the index.