SQL Non Clustered Index Design - tsql

I have a large table T with a composite primary key:
sID int (PK)
pos int (PK)
wID int (PK)
The query self joins T on:
T2.sID = T1.sID
T2.pos = T1.pos + 1
T1.wID = 'alpha'
T2.wID = 'beta'
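A sketch of what that self join presumably looks like (the select list and aliases are assumptions; the wID literals come from the description above):
SELECT T1.sID, T1.pos
FROM T AS T1
JOIN T AS T2
  ON T2.sID = T1.sID
 AND T2.pos = T1.pos + 1
WHERE T1.wID = 'alpha'
  AND T2.wID = 'beta';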
I added an index on (wID, sID) and sure enough it helped (a lot).
In the execution plan I noticed the output columns were sID, pos, so I changed the index to (wID, sID, pos).
What surprised me is that the query plan stayed the same and there was no improvement in execution time.
So I changed the index to just (wID).
Same query plan and same execution time.
Clearly I will go with the smaller index, since it gives the same performance.
My question is: why is the smaller index just as effective?
Is it because that extra data is the PK?

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
Also, I have a table that uses this composite type as an array of mydata_t:
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t element in data_list whose user_id is 100000, but I don't know which array element that is.
So I first have to search for the element whose user_id equals 100000; that's my problem, I don't know how to write that query. In short, I want to update the value of the array element whose user_id is 100000 (and where the id of tbl is, for example, 1). What would the query be?
Something like this (I know it's wrong!):
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of one of the tbl.data elements to 'XXXX' where the element's user_id is equal to 11.
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current value of value, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current value, then the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
You should keep in mind, though, that there is an awful lot of work going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish and you cannot use an index on it. Furthermore, an update actually inserts a new row in the underlying file on disk, and if your array has more than a few entries this involves substantial work. It gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. This all happens behind the scenes, so it will work, but at a performance penalty.
Even though array_replace sounds like changes are made in place (and they indeed are in memory), the UPDATE command will write a completely new tuple to disk. So if you have 4,000 array elements, that means at least 40kB of data has to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written back to disk after the update. A real performance killer.
As #klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
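A sketch of what that separate data_list table might look like (the column names come from the composite type above; the foreign key, primary key, and extra index are assumptions):
CREATE TABLE data_list (
id integer NOT NULL REFERENCES tbl (id),
user_id integer NOT NULL,
value character(4),
PRIMARY KEY (id, user_id)
);
-- One of the "appropriate indexes" mentioned above, if you often search by user_id alone:
CREATE INDEX data_list_user_id_idx ON data_list (user_id);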

Optimizing a query that compares a table to itself with millions of rows

I could use some help optimizing a query that compares rows in a single table with millions of entries. Here's the table's definition:
CREATE TABLE IF NOT EXISTS data.row_check (
id uuid NOT NULL DEFAULT NULL,
version int8 NOT NULL DEFAULT NULL,
row_hash int8 NOT NULL DEFAULT NULL,
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_check_pkey
PRIMARY KEY (id, version)
);
I'm reworking our push code and have a test bed with millions of records across about 20 tables. I run my tests, get the row counts, and can spot when some of my insert code has changed. The next step is to checksum each row, and then compare the rows for differences between versions of my code. Something like this:
-- Run my test of "version 0" of the push code, the base code I'm refactoring.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 0, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
truncate table record_changes_log;
-- Run my test of "version 1" of the push code, the new code I'm validating.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 1, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
That gets two rows in row_check for every row in record_changes_log, or any other table I'm checking. For the two runs of record_changes_log, I end up with more than 8.6M rows in row_check. They look like this:
id version row_hash table_name
e6218751-ab78-4942-9734-f017839703f6 0 -142492569 record_changes_log
6c0a4111-2f52-4b8b-bfb6-e608087ea9c1 0 -1917959999 record_changes_log
7fac6424-9469-4d98-b887-cd04fee5377d 0 -323725113 record_changes_log
1935590c-8d22-4baf-85ba-00b563022983 0 -1428730186 record_changes_log
2e5488b6-5b97-4755-8a46-6a46317c1ae2 0 -1631086027 record_changes_log
7a645ffd-31c5-4000-ab66-a565e6dad7e0 0 1857654119 record_changes_log
I asked yesterday for some help on the comparison query, and it led to this:
select v0.table_name,
v0.id,
v0.row_hash as v0,
v1.row_hash as v1
from row_check v0
join row_check v1 on v0.id = v1.id and
v0.version = 0 and
v1.version = 1 and
v0.row_hash <> v1.row_hash
That works, but now I'm hoping to optimize it a bit. As an experiment, I clustered the data on version and then built a BRIN index, like this:
drop index if exists row_check_version_btree;
create index row_check_version_btree
on row_check
using btree(version);
cluster row_check using row_check_version_btree;
drop index row_check_version_btree; -- Eh? I want to see how the BRIN performs.
drop index if exists row_check_version_brin;
create index row_check_version_brin
on row_check
using brin(row_hash);
vacuum analyze row_check;
I ran the query through explain analyze and get this:
Merge Join (cost=1.12..559750.04 rows=4437567 width=51) (actual time=1511.987..14884.045 rows=10 loops=1)
Output: v0.table_name, v0.id, v0.row_hash, v1.row_hash
Inner Unique: true
Merge Cond: (v0.id = v1.id)
Join Filter: (v0.row_hash <> v1.row_hash)
Rows Removed by Join Filter: 4318290
Buffers: shared hit=8679005 read=42511
-> Index Scan using row_check_pkey on ascendco.row_check v0 (cost=0.56..239156.79 rows=4252416 width=43) (actual time=0.032..5548.180 rows=4318300 loops=1)
Output: v0.id, v0.version, v0.row_hash, v0.table_name
Index Cond: (v0.version = 0)
Buffers: shared hit=4360752
-> Index Scan using row_check_pkey on ascendco.row_check v1 (cost=0.56..240475.33 rows=4384270 width=24) (actual time=0.031..6070.790 rows=4318300 loops=1)
Output: v1.id, v1.version, v1.row_hash, v1.table_name
Index Cond: (v1.version = 1)
Buffers: shared hit=4318253 read=42511
Planning Time: 1.073 ms
Execution Time: 14884.121 ms
...which I could not really make sense of, so I ran it again with JSON output and fed the results into this wonderful plan visualizer:
http://tatiyants.com/pev/#/plans
The tips there are right: the top node estimate is bad. The result is 10 rows, but the estimate is for roughly 4.4 million rows.
I'm hoping to learn more about optimizing this kind of thing, and this query seems like a good opportunity. I have a lot of notions about what might help:
-- CREATE STATISTICS?
-- Rework the query to move the where comparison?
-- Use a better index? I did try a GIN index and a straight B-tree on version, but neither was superior.
-- Rework the row_check format to move the two hashes into the same row instead of splitting them over two rows, compare on insert/update, flag non-matches, and add a partial index for the non-matching values.
Granted, it's funny to even try to index something where there are only two values (0 and 1 in the case above), so there's that. In fact, is there any sort of clever trick for Booleans? I'll always be comparing two versions, so "old" and "new", which I can express in whatever way works best. I understand that Postgres only builds bitmaps internally at search/merge (?) time and that it does not have a bitmap index type. Would there be some kind of INTERSECT that might help? I don't know how Postgres implements set math operators internally.
Thanks for any suggestions on how to rethink this data or the query to make it faster for comparisons involving millions, or tens of millions, of rows.
I'm going to add an answer to my own question, but am still interested in what anyone else has to say. In the process of writing out my original question, I realized that maybe a redesign is in order. This hinges on my plan to only ever compare two versions at a time. That's a good solution here, but there are other cases where it wouldn't work. Anyway, here's a slightly different table design that folds the two results into a single row:
DROP TABLE IF EXISTS data.row_compare;
CREATE TABLE IF NOT EXISTS data.row_compare (
id uuid NOT NULL DEFAULT NULL,
hash_1 int8, -- Want NULL to defer calculating hash comparison until after both hashes are entered.
hash_2 int8, -- Ditto
hashes_match boolean, -- Likewise
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_compare_pkey
PRIMARY KEY (id)
);
The following partial index should, hopefully, be very small, as I shouldn't have any non-matching entries:
CREATE INDEX row_compare_fail ON row_compare (hashes_match)
WHERE hashes_match = false;
The trigger function below does the column calculation once hash_1 and hash_2 are both provided:
-- Run this as a BEFORE INSERT or UPDATE ROW trigger.
CREATE OR REPLACE FUNCTION data.on_upsert_row_compare()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.hash_1 IS NULL OR
NEW.hash_2 IS NULL THEN
RETURN NEW; -- Don't do the comparison until both hashes have been populated.
ELSE -- Do the comparison. The point of this is to avoid constantly thrashing the partial index.
NEW.hashes_match := NEW.hash_1 = NEW.hash_2;
RETURN NEW; -- important!
END IF;
END;
$BODY$
LANGUAGE plpgsql;
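The function still needs to be wired up as a trigger; a minimal sketch (the trigger name is an assumption):
CREATE TRIGGER row_compare_upsert
BEFORE INSERT OR UPDATE ON data.row_compare
FOR EACH ROW
EXECUTE FUNCTION data.on_upsert_row_compare(); -- use EXECUTE PROCEDURE on PostgreSQL 10 and older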
This now adds 4.3M rows instead of 8.6M rows:
-- Add the first set of results and build out the row_compare records.
INSERT INTO row_compare (id,hash_1,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_1 = EXCLUDED.hash_1,
table_name = EXCLUDED.table_name;
-- I'll truncate the record_changes_log and push my sample data again here.
-- Add the second set of results and update the row compare records.
-- This time, the hash is going into the hash_2 field for comparison
INSERT INTO row_compare (id,hash_2,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_2 = EXCLUDED.hash_2,
table_name = EXCLUDED.table_name;
And now the results are simple to find:
select * from row_compare where hashes_match = false;
This changes the query time from around 17 seconds to around 24 milliseconds.

Select query became very very very slow in postgresql

I have one table which contains 133,072,194 records, and I am trying to execute
SELECT COUNT(test)
FROM mytable
WHERE test = false
but it is taking Execution time: 128320.712 ms.
I already have an index on the test column. Could you please let me know what I can optimize or change so that my query becomes faster?
Because of this, my other select query is also not working.
If there are many rows where test is FALSE, you won't be able to get an exact result faster than with a sequential scan, which is slow for big tables.
If you have only few rows that satisfy the condition, you should create a partial index:
CREATE INDEX mytable_notest_ind ON mytable(id) WHERE NOT test;
(assuming that id is the primary key) and keep mytable autovacuumed often enough that you get an index only scan.
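With that partial index in place and the visibility map kept current by autovacuum, the exact count of the rare rows can come from an index-only scan; a minimal sketch:
SELECT count(*)
FROM mytable
WHERE NOT test; -- matches the partial index predicate, so only the small index is read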
But usually exact results for queries like this are not required.
You could calculate an estimated count from the table statistics with a query like this:
SELECT t.reltuples
* (1 - s.null_frac)
* mcv.freq AS count_false
FROM pg_stats AS s
CROSS JOIN LATERAL unnest(s.most_common_vals::text::boolean[],
s.most_common_freqs) AS mcv(val, freq)
JOIN pg_class AS t
ON s.tablename = t.relname
AND s.schemaname = t.relnamespace::regnamespace::text
WHERE s.tablename = 'mytable'
AND s.attname = 'test'
AND mcv.val = FALSE;
That would be very fast.
See my blog post for more considerations about the speed of SELECT count(*).

Postgres using wrong index when querying a view of indexed expressions?

When I run the following script with Postgres 9.3 (with enable_seqscan set to off), I expect the final query to make use of the "forms_string" partial index, but it instead uses the "forms_int" index, which doesn't make sense.
When I test this with real code that has JSON functions and indexes for more types, it consistently seems to use whichever index was created last, for every query.
Adding more unrelated rows, so that the rows relevant to the partial index are only a small percentage of the total rows in the table, results in a "bitmap heap scan", but the plan still mentions the same incorrect index.
Any idea how I can get it to use the correct index?
CREATE EXTENSION IF NOT EXISTS plv8;
CREATE OR REPLACE FUNCTION
json_string(data json, key text) RETURNS TEXT AS $$
var ret = data,
keys = key.split('.'),
len = keys.length;
for (var i = 0; i < len; ++i) {
if (ret) {
ret = ret[keys[i]]
};
}
if (typeof ret === "undefined") {
ret = null;
} else if (ret) {
ret = ret.toString();
}
return ret;
$$ LANGUAGE plv8 IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION
json_int(data json, key text) RETURNS INT AS $$
var ret = data,
keys = key.split('.'),
len = keys.length;
for (var i = 0; i < len; ++i) {
if (ret) {
ret = ret[keys[i]]
}
}
if (typeof ret === "undefined") {
ret = null;
} else {
ret = parseInt(ret, 10);
if (isNaN(ret)) {
ret = null;
}
}
return ret;
$$ LANGUAGE plv8 IMMUTABLE STRICT;
CREATE TABLE form_types (
id SERIAL NOT NULL,
name VARCHAR(200),
PRIMARY KEY (id)
);
CREATE TABLE tenants (
id SERIAL NOT NULL,
name VARCHAR(200),
PRIMARY KEY (id)
);
CREATE TABLE forms (
id SERIAL NOT NULL,
tenant_id INTEGER,
type_id INTEGER,
data JSON,
PRIMARY KEY (id),
FOREIGN KEY(tenant_id) REFERENCES tenants (id),
FOREIGN KEY(type_id) REFERENCES form_types (id)
);
CREATE INDEX ix_forms_type_id ON forms (type_id);
CREATE INDEX ix_forms_tenant_id ON forms (tenant_id);
INSERT INTO tenants (name) VALUES ('mike'), ('bob');
INSERT INTO form_types (name) VALUES ('type 1'), ('type 2');
INSERT INTO forms (tenant_id, type_id, data) VALUES
(1, 1, '{"string": "unicorns", "int": 1}'),
(1, 1, '{"string": "pythons", "int": 2}'),
(1, 1, '{"string": "pythons", "int": 8}'),
(1, 1, '{"string": "penguins"}');
CREATE OR REPLACE VIEW foo AS
SELECT forms.id AS forms_id,
json_string(forms.data, 'string') AS "data.string",
json_int(forms.data, 'int') AS "data.int"
FROM forms
WHERE forms.tenant_id = 1 AND forms.type_id = 1;
CREATE INDEX "forms_string" ON forms (json_string(data, 'string'))
WHERE tenant_id = 1 AND type_id = 1;
CREATE INDEX "forms_int" ON forms (json_int(data, 'int'))
WHERE tenant_id = 1 AND type_id = 1;
EXPLAIN ANALYZE VERBOSE SELECT "data.string" from foo;
Outputs:
Index Scan using forms_int on public.forms
(cost=0.13..8.40 rows=1 width=32) (actual time=0.085..0.239 rows=20 loops=1)
Output: json_string(forms.data, 'string'::text)
Total runtime: 0.282 ms
Without enable_seqscan=off:
Seq Scan on public.forms (cost=0.00..1.31 rows=1 width=32) (actual time=0.080..0.277 rows=28 loops=1)
Output: json_string(forms.data, 'string'::text)
Filter: ((forms.tenant_id = 1) AND (forms.type_id = 1))
Total runtime: 0.327 ms
\d forms prints
Table "public.forms"
Column | Type | Modifiers
-----------+---------+----------------------------------------------------
id | integer | not null default nextval('forms_id_seq'::regclass)
tenant_id | integer |
type_id | integer |
data | json |
Indexes:
"forms_pkey" PRIMARY KEY, btree (id)
"forms_int" btree (json_int(data, 'int'::text)) WHERE tenant_id = 1 AND type_id = 1
"forms_string" btree (json_string(data, 'string'::text)) WHERE tenant_id = 1 AND type_id = 1
"ix_forms_tenant_id" btree (tenant_id)
"ix_forms_type_id" btree (type_id)
Foreign-key constraints:
"forms_tenant_id_fkey" FOREIGN KEY (tenant_id) REFERENCES tenants(id)
"forms_type_id_fkey" FOREIGN KEY (type_id) REFERENCES form_types(id)
Index vs seqscan, costs
Looks like your random_page_cost is too high compared to the real performance of your machine. Random I/O is faster (costs less) than Pg thinks it does, so it's choosing a slightly less ideal plan.
That's why the cost estimate for the indexscan is (cost=0.13..8.40 rows=1 width=32) and for the seqscan it's slightly lower at (cost=0.00..1.31 rows=1 width=32).
Lower random_page_cost - try SET random_page_cost = 2 and re-run the query.
To learn more, read the documentation on PostgreSQL query planning, parameters, and tuning, and the relevant wiki pages.
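A minimal way to test that suggestion (the database name in the last line is a placeholder, and persisting the setting is optional):
SET random_page_cost = 2; -- session-level, easy to experiment with
EXPLAIN ANALYZE VERBOSE SELECT "data.string" FROM foo;
-- If plans improve across your workload, the setting can be persisted, e.g. per database:
-- ALTER DATABASE mydb SET random_page_cost = 2; -- 'mydb' is a placeholder name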
Index selection
PostgreSQL appears to be picking an index scan on forms_int instead of forms_string because it'll be a more compact, smaller index, and both indexes exactly match the search criteria for the view: tenant_id = 1 AND type_id = 1.
If you disable or drop forms_int it'll probably use forms_string and go slightly slower.
The key thing to understand is that while the index does contain the value of interest, PostgreSQL isn't actually using it. It's scanning the index without an index condition, since every tuple in the index matches, to get tuples from the heap. It's then extracting the value from those heap tuples and outputting them.
This can be demonstrated with an expression-index on a constant:
CREATE INDEX "forms_novalue" ON forms((true)) WHERE tenant_id = 1 AND type_id = 1;
PostgreSQL is quite likely to select this index for the query:
regress=# EXPLAIN ANALYZE VERBOSE SELECT "data.string" from foo;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Index Scan using forms_novalue on public.forms (cost=0.13..13.21 rows=4 width=32) (actual time=0.190..0.310 rows=4 loops=1)
Output: json_string(forms.data, 'string'::text)
Total runtime: 0.346 ms
(3 rows)
All the indexes are the same size because they're all so tiny they fit in the minimum allocation:
regress=# SELECT x.idxname, pg_relation_size(x.idxname) FROM (VALUES ('forms_novalue'),('forms_int'),('forms_string')) x(idxname);
idxname | pg_relation_size
---------------+------------------
forms_novalue | 16384
forms_int | 16384
forms_string | 16384
(3 rows)
but the stats for novalue will be somewhat more attractive due to a narrower row width.
Index scan vs index-only scan
It sounds like what you really expect is an index-only scan, where Pg never touches the table's heap and only uses the tuples in the index itself.
I would expect that this query's requirements could be satisfied with forms_string, but I can't get Pg to pick an index-only scan plan for it.
It's not immediately clear to me why Pg is not using an index-only scan here, as it should be a candidate, but it doesn't seem to be able to plan one. If I force enable_indexscan = off, it'll pick an inferior bitmap index scan plan instead, and if I also disable enable_bitmapscan it'll fall back to a max-cost-estimate seqscan. This is true even after a VACUUM of the table(s) of interest.
That means it must not be generated as a candidate path in the query planner - Pg doesn't know how to use an index-only scan for this query, or thinks it cannot do so for some reason.
It isn't an issue with view introspection, as an expanded view query is the same.
Your table has insufficient data in it. In short, Postgres won't use an index when the table fits in a single disk page. Ever. When your table contains a few hundred or thousand rows, it'll become too big to fit, and then you'll see Postgres begin to use index scans when relevant.
Another point to consider is that you need to analyze your tables after a large import. Without accurate stats on your actual data, Postgres may end up dismissing some index scans as too expensive, when in fact they'd be cheap.
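For example, after a bulk load one might refresh the statistics (and the visibility map) before judging plans:
VACUUM ANALYZE forms; -- ANALYZE alone is enough if you only need fresh statistics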
Lastly, there are cases when it is cheaper to not use an index. In essence, whenever Postgres is about to visit most disk pages repeatedly and in a random order to retrieve a large number of rows, it'll seriously consider the cost of visiting most (bitmap index) or all (seq scan) disk pages once sequentially and filtering invalid rows out. The latter wins if you're selecting enough rows.

Is it possible to optimize a SELECT COUNT(*) query using a filtered index as a hint to achieve constant speed?

I'd like to count all the Orders that are not urgent and whose order status = 1 (shipped).
This should be a very simple query to optimize. I'd like to put a simple filtered index on the Orders table to cover this query and make it a constant time/O(1) operation. However, when I look at the query plan, it looks like it's using an Index Scan, which doesn't make sense. Ideally, this query should just return the number of items in the index.
The table look like this (simplified to get to the essence):
CREATE TABLE [dbo].[Orders](
[Id] [int] IDENTITY(1,1) NOT NULL,
[IsUrgent] [bit] NOT NULL,
[Status] [tinyint] NOT NULL,
CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED ( [Id] ASC )
)
I've created this filtered index:
CREATE INDEX IX_Orders_ShippedNonUrgent ON Orders(Id) WHERE IsUrgent = 0 AND Status = 1;
Now, when I do this query:
SELECT COUNT(*) FROM Orders WHERE IsUrgent = 0 AND Status = 1
I see that the query plan is using IX_Orders_ShippedNonUrgent, but it's doing an Index Scan and performing around 200 reads across the ~150,000 rows in Orders.
Is it possible to always have this query run in constant time assuming the filtered index is kept up to date? Ideally, it should only perform 1 read to get the size of the index.
If I switch to a non-filtered index like this:
CREATE INDEX IX_Orders_IsUrgentStatus ON Orders(IsUrgent, Status);
The query plan uses an Index Seek, but still performs many more reads than should be necessary to answer this simple query.
UPDATE
I'm able to do this
SELECT TOP 1 rows FROM sys.partitions p
INNER JOIN sys.indexes i
ON i.name = 'IX_Orders_ShippedNonUrgent'
AND i.object_id = p.object_id
AND i.index_id = p.index_id
and get the result in 9 reads, but it seems like there should be a much easier and less brittle way that lets me keep the simple COUNT(*) query.
It seems like what I want isn't possible. The best answer was left in the comments by Nikola Markovinović, which is to forget about the filtered index and use an indexed view instead:
CREATE VIEW [dbo].vw_Orders_TotalShippedNonUrgent WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS TotalOrders
FROM dbo.Orders WHERE IsUrgent = 0 AND Status = 1;
with
CREATE UNIQUE CLUSTERED INDEX IX_vw_Orders_TotalShippedNonUrgent ON vw_Orders_TotalShippedNonUrgent(TotalOrders);
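The summary query then reads the single row from the view; something like this (the NOEXPAND hint is an assumption about your edition - it's required where indexed views aren't matched automatically):
SELECT TotalOrders
FROM dbo.vw_Orders_TotalShippedNonUrgent WITH (NOEXPAND);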
This forces me to create a view and its index for each summary statistic that I want, as well as rewriting the query to hit the view instead of the base table, but it is fast at only 2 reads.
I'll leave this question open for awhile in case anyone has a simpler approach that's just as fast.