I'm trying to index a large JSONB column based on a text field (with an ISO date string). This index works fine using = but is ignored if I use a > condition.
create table test_table (
id text NOT null primary key,
data jsonb,
text_test text
);
Then I add a bunch of data to the jsonb column. To make sure my JSON is valid, I also copy the value I'm interested in from the JSONB column into a separate text column, so I can test against that too.
update test_table set text_test = (data->>'dueDate');
A quick sample shows it's good ISO formatted date strings:
select text_test, (data->>'dueDate') from test_table limit 1;
-- 2020-08-07T11:59:00 2020-08-07T11:59:00
I add btree indexes to both the JSONB and the text_test copy column. I tried adding one with explicit '::text' casting, as well as one with 'text_pattern_ops'.
create index test_table_duedate_iso on test_table using btree(text_test);
create index test_table_duedate_iso_jsonb on test_table using btree((data->>'dueDate'));
create index test_table_duedate_iso_jsonb_cast on test_table using btree(((data->>'dueDate')::text));
create index test_table_duedate_iso_jsonb_cast_pattern on test_table using btree(((data->>'dueDate')::text) text_pattern_ops);
Now if I query an exact value, explain shows it using the 'cast' version of the index. Good.
explain select * from test_table where (data->>'dueDate') = '2020-08-07T11:59:00';
"-> Bitmap Index Scan on test_table_duedate_iso_jsonb_cast (cost=0.00..10.37 rows=261 width=0)"
But if I try it with a >, it does a very slow full scan.
explain analyze select count(*) from test_table where (data->>'dueDate') > '2020-04-14';
--Aggregate (cost=10037.94..10037.95 rows=1 width=8) (actual time=1070.808..1070.813 rows=1 loops=1)
-- -> Seq Scan on test_table (cost=0.00..9994.42 rows=17409 width=0) (actual time=0.069..1057.258 rows=2930 loops=1)
-- Filter: ((data ->> 'dueDate'::text) > '2020-04-14'::text)
-- Rows Removed by Filter: 49298
--Planning Time: 0.252 ms
--Execution Time: 1070.874 ms
So, just to check my sanity, I run the same query against the text_test column, and it uses its index as desired:
explain analyze select count(*) from test_table where text_test > '2020-04-14';
--Aggregate (cost=6037.02..6037.03 rows=1 width=8) (actual time=19.979..19.984 rows=1 loops=1)
-- -> Bitmap Heap Scan on test_table (cost=77.76..6030.14 rows=2754 width=0) (actual time=1.354..11.007 rows=2930 loops=1)
-- Recheck Cond: (text_test > '2020-04-14'::text)
-- Heap Blocks: exact=455
-- -> Bitmap Index Scan on test_table_duedate_iso (cost=0.00..77.07 rows=2754 width=0) (actual time=1.215..1.217 rows=2930 loops=1)
-- Index Cond: (text_test > '2020-04-14'::text)
--Planning Time: 0.145 ms
--Execution Time: 20.041 ms
I have also tested indexing a numerical field within the JSON, and that works properly: it uses its index for range queries. So it's either something about the text field or something I'm doing wrong with it.
PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14), 64-bit
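One thing worth checking, offered as a hedged guess rather than a confirmed diagnosis: the sequential scan's estimate of rows=17409 is almost exactly one third of the 52228 rows in the table (2930 + 49298). One third is the planner's default selectivity for an inequality when it has no statistics for the expression. PostgreSQL gathers statistics for an indexed expression only when the table is analyzed after the index exists. A sketch of the check, assuming the table and index names from above:

```sql
-- Collect statistics for the table, including the expressions of its expression indexes.
analyze test_table;

-- Expression statistics are listed in pg_stats under the index name, not the table name.
select tablename, attname, n_distinct
from pg_stats
where tablename = 'test_table_duedate_iso_jsonb';

-- Re-run the range query and compare the row estimate with the earlier rows=17409.
explain analyze select count(*) from test_table where (data->>'dueDate') > '2020-04-14';
```

If the estimate drops close to the actual 2930 rows, the planner has what it needs to consider the expression index for the range condition.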
Related
I have a stored procedure on PostgreSQL like this:
create or replace procedure new_emp_sp (f_name varchar, l_name varchar, age integer, threshold integer, dept varchar)
language plpgsql
as $$
declare
new_emp_count integer;
begin
INSERT INTO employees (id, first_name, last_name, age)
VALUES (nextval('emp_id_seq'),
random_string(10),
random_string(20),
age);
-- qualify the column: a bare "age" is ambiguous between the parameter and the column
select count(*) into new_emp_count from employees where employees.age > threshold;
update dept_employees set emp_count = new_emp_count where id = dept;
end; $$
I have enabled the auto_explain module and set auto_explain.log_min_duration to 0 so that it logs everything.
I have an issue with the update statement in the procedure. From the auto_explain logs I see that it is not using the primary key index to update the table:
-> Seq Scan on dept_employees (cost=0.00..1.05 rows=1 width=14) (actual time=0.005..0.006 rows=1 loops=1)
Filter: ((id)::text = 'ABC'::text)
Rows Removed by Filter: 3
This worked as expected until a couple of hours ago and I used to get a log like this:
-> Index Scan using dept_employees_pkey on dept_employees (cost=0.15..8.17 rows=1 width=48) (actual time=0.010..0.011 rows=1 loops=1)
Index Cond: ((id)::text = 'ABC'::text)
Without the procedure, if I run the statement standalone like this:
explain analyze update dept_employees set emp_count = 123 where id = 'ABC';
The statement correctly uses the primary key index:
Update on dept_employees (cost=0.15..8.17 rows=1 width=128) (actual time=0.049..0.049 rows=0 loops=1)
-> Index Scan using dept_employees_pkey on dept_employees (cost=0.15..8.17 rows=1 width=128) (actual time=0.035..0.036 rows=1 loops=1)
Index Cond: ((id)::text = 'ABC'::text)
I can't figure out what has gone wrong especially because it worked perfectly just a couple of hours ago.
It is faster to scan N rows sequentially than to scan N rows using an index. So for small tables Postgres may decide that a sequential scan is faster than an index scan.
PL/pgSQL can cache prepared statements and execution plans, so you're probably getting a cached execution plan from when the table was smaller.
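The Seq Scan estimate in the question can be reproduced by hand. This assumes the default cost settings (seq_page_cost = 1.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025) and a table of 4 rows (1 returned plus 3 removed by the filter) stored on a single page. The sketch below shows the arithmetic and two things to try. Whether a stale cached plan is really involved is an assumption, not something the logs confirm.

```sql
-- Planner's view of the table size (relpages and reltuples are refreshed by ANALYZE/VACUUM).
select relpages, reltuples from pg_class where relname = 'dept_employees';

-- Sequential scan cost with default settings, for 1 page and 4 rows:
--   1 * seq_page_cost + 4 * cpu_tuple_cost + 4 * cpu_operator_cost
select 1 * 1.0 + 4 * 0.01 + 4 * 0.0025 as seq_scan_cost;  -- evaluates to 1.05, as in the log

-- Refresh statistics and drop this session's cached plans so the next call is planned afresh.
analyze dept_employees;
discard plans;
```

With a total cost of 1.05 against 8.17 for the index scan, the sequential scan is the cheaper plan for a table this small.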
I have a table named "k3_order" with jsonb column "json_delivery".
Example content of that column is:
{
"delivery_cost": "11.99",
"packageNumbers": [
"0000000596034Q"
]
}
I've created an index on json_delivery->'packageNumbers':
CREATE INDEX test_idx ON k3_order USING gin ((json_delivery->'packageNumbers'));
Now I use these two SQL queries:
SELECT id, delivery_method_id
FROM k3_order
WHERE jsonb_exists (json_delivery->'packageNumbers', '0000000596034Q');
SELECT id, delivery_method_id
FROM k3_order
WHERE json_delivery->'packageNumbers' ? '0000000596034Q';
The second is faster and uses the index, but the first does not.
Is there any way to create an index in PostgreSQL 10.4 so that query 1 uses it?
Is this even possible in PostgreSQL 10.4 or newer versions?
EXPLAIN ANALYZE SELECT id, delivery_method_id
FROM k3_order
WHERE jsonb_exists (json_delivery->'packageNumbers', '0000000596034Q');
produces:
Seq Scan on k3_order (cost=0.00..117058.10 rows=216847 width=8) (actual time=162.001..569.863 rows=1 loops=1)
Filter: jsonb_exists((json_delivery -> 'packageNumbers'::text), '0000000596034Q'::text)
Rows Removed by Filter: 650539
Planning time: 0.748 ms
Execution time: 569.886 ms
EXPLAIN ANALYZE SELECT id, delivery_method_id
FROM k3_order
WHERE json_delivery->'packageNumbers' ? '0000000596034Q';
produces:
Bitmap Heap Scan on k3_order (cost=21.04..2479.03 rows=651 width=8) (actual time=0.022..0.022 rows=1 loops=1)
Recheck Cond: ((json_delivery -> 'packageNumbers'::text) ? '0000000596034Q'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on test_idx (cost=0.00..20.88 rows=651 width=0) (actual time=0.016..0.016 rows=1 loops=1)
Index Cond: ((json_delivery -> 'packageNumbers'::text) ? '0000000596034Q'::text)
Planning time: 0.182 ms
Execution time: 0.050 ms
Indexes can only be used by queries in the following cases:
1. the WHERE condition contains an expression of the form <indexed expression> <operator> <constant>, where
   - an index has been created on <indexed expression>
   - <operator> is an operator in the operator family of the operator class of the index
   - <constant> is an expression that stays constant for the duration of the index scan
2. the ORDER BY clause has the same or the exact opposite ordering as the index definition, and the index access method supports sorting (from v13 on, an index can also be used if it contains the starting columns of the ORDER BY clause)
3. the PostgreSQL version is v12 or higher, and the WHERE condition contains an expression of the form bool_func(...), where the function returns boolean and has a planner support function.
Now json_delivery->'packageNumbers' ? '0000000596034Q' satisfies the first condition, so an index scan can be used.
jsonb_exists(json_delivery->'packageNumbers', '0000000596034Q') could only use an index if there were a planner support function for jsonb_exists, but there is none:
SELECT prosupport FROM pg_proc
WHERE proname = 'jsonb_exists';
prosupport
════════════
-
(1 row)
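As a cross-check of the point above, the system catalogs show which operator is implemented by jsonb_exists. This is a small sketch using only standard catalog tables. The GIN operator class knows about the operator, not the function behind it, so writing the condition with the operator is the way to get the index scan.

```sql
-- Find the operator(s) whose implementation function is jsonb_exists.
SELECT oprname, oprleft::regtype, oprright::regtype
FROM pg_operator
WHERE oprcode = 'jsonb_exists'::regproc;

-- Equivalent of query 1, written with that operator so that test_idx can be used.
SELECT id, delivery_method_id
FROM k3_order
WHERE json_delivery->'packageNumbers' ? '0000000596034Q';
```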
Table definition:
CREATE TABLE schema.mylogoperation (
id_mylogoperation serial,
data DATE,
myschema VARCHAR(255),
column_var_2 VARCHAR(255),
"user" VARCHAR(255), -- user is a reserved word and must be quoted
action TEXT,
column_var_1 TEXT,
log_old VARCHAR,
log_new VARCHAR,
constraint pk_mylogoperation primary key (id_mylogoperation)
)
WITH (oids = false);
12 million rows
I tried to explain analyze:
explain analyze
SELECT
column_var_1,
column_var_2,
column_var_3,
"user",
action,
data,
log_old,
log_new
FROM schema.mylogoperation
WHERE
myschema = 'schema'
AND column_var_2 IN ('mydata1', 'mydata2', 'mydata3')
AND log_old <> log_new
AND column_var_1 LIKE 'mydata%';
indexes ( pk_mylogoperation only)
QUERY PLAN
Seq Scan on myschema (cost=0.00..713948.14 rows=660 width=222) (actual time=380.308..4467.364 rows=48 loops=1)
Filter: (((log_old)::text <> (log_new)::text) AND (column_var_1 ~~ 'mydata%'::text) AND ((schema)::text = 'schema'::text) AND ((column_var_2)::text = ANY ('{mydata1,mydata2,mydata3}'::text[])))
Rows Removed by Filter: 12525296
Total runtime: 4467.425 ms
Then I tried to create an index for better performance:
CREATE INDEX idx_mylogoperation_1 ON schema.mylogoperation (myschema, column_var_2);
reindex table schema.mylogoperation;
analyze schema.mylogoperation;
pk_mylogoperation + idx_mylogoperation_1
QUERY PLAN
Index Scan using idx_mylogoperation_qry1 on mylogoperation (cost=0.56..589836.84 rows=658 width=223) (actual time=331.679..4997.507 rows=48 loops=1)
Index Cond: (((myschema)::text = 'schema'::text) AND ((column_var_2)::text = ANY ('{mydata1,mydata2,mydata3}'::text[])))
Filter: (((log_old)::text <> (log_new)::text) AND (column_var_1 ~~ 'mydata%'::text))
Rows Removed by Filter: 7441986
Total runtime: 4997.580 ms
Then I tried creating another index for better performance:
CREATE INDEX idx_mylogoperation_2 ON schema.mylogoperation USING gin (column_var_1 gin_trgm_ops);
reindex table schema.mylogoperation;
analyze schema.mylogoperation;
pk_mylogoperation + idx_mylogoperation_1 + idx_mylogoperation_2
QUERY PLAN
Bitmap Heap Scan on idx_mylogoperation_var_1 (cost=1398.58..2765.08 rows=663 width=222) (actual time=5303.481..5303.906 rows=48 loops=1)
Recheck Cond: (column_var_1 ~~ 'mydata%'::text)
Filter: (((log_old)::text <> (log_new)::text) AND ((myschema)::text = 'schema'::text) AND ((column_var_2)::text = ANY ('{mydata1,mydata2,mydata3}'::text[])))
Rows Removed by Filter: 248
-> Bitmap Index Scan on idx_mylogoperation_var_1 (cost=0.00..1398.41 rows=1215 width=0) (actual time=5303.203..5303.203 rows=296 loops=1)
Index Cond: (column_var_1 ~~ 'mydata%'::text)
Total runtime: 5303.950 ms
The question
The estimated cost decreased, but the execution time stayed practically the same. Why?
Notes:
I do not want to make changes to the select operation, just in the database structure.
This test was performed on a server that is in use. Was creating these indexes worthwhile, or should I not use them?
I am using Postgres 9.3.22 on Linux 64-bit Red Hat.
This index:
CREATE INDEX idx_mylogoperation_1 ON schema.mylogoperation (myschema, column_var_2);
didn't help because the relevant portion of your where clause matched ~2/3 of the table. The index didn't narrow down the results very much, but the filter did:
Filter: (((log_old)::text <> (log_new)::text) AND (column_var_1 ~~ 'mydata%'::text))
Rows Removed by Filter: 7441986
I'm not sure which of those two things in the filter removed more, but you could try a partial index like:
CREATE INDEX idx_mylogoperation_1 ON schema.mylogoperation (myschema, column_var_2) WHERE log_old <> log_new;
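Since the query also filters on column_var_1 LIKE 'mydata%', another option to try is a partial B-tree index that appends column_var_1 with the text_pattern_ops operator class. That operator class lets a B-tree serve left-anchored LIKE patterns regardless of collation. This is a sketch under assumptions: the index name is made up, and whether it beats the plans above depends on how selective the prefix is in the real data.

```sql
-- Hypothetical index name; equality columns first, then the prefix-searched column.
CREATE INDEX idx_mylogoperation_partial
    ON schema.mylogoperation (myschema, column_var_2, column_var_1 text_pattern_ops)
    WHERE log_old <> log_new;

-- Refresh statistics so the planner can cost the new index sensibly.
ANALYZE schema.mylogoperation;
```

The query does not need to change: its WHERE clause already contains log_old <> log_new, which is what allows the planner to consider the partial index.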
A query on a table with a trigram index does not use the index if the search string contains mixed case or if I use ILIKE.
I'm not sure what I have missed. Any ideas?
(I'm using PostgreSQL 9.6.2)
CREATE TABLE public.tbltest (
"tbltestId" int NOT null ,
"mystring1" text,
"mystring2" character varying,
CONSTRAINT "tbltest_pkey" PRIMARY KEY ("tbltestId")
);
insert into tbltest ("tbltestId","mystring1", "mystring2")
select x.id, x.id || ' Test', x.id || ' Test' from generate_series(1,100000) AS x(id);
CREATE EXTENSION pg_trgm;
CREATE INDEX tbltest_idx1 ON tbltest using gin ("mystring1" gin_trgm_ops);
CREATE INDEX tbltest_idx2 ON tbltest using gin ("mystring2" gin_trgm_ops);
Using lower case text in the query works, and uses the index
explain analyse
select * from tbltest
where "mystring2" Like '%test%';
QUERY PLAN |
-----------------------------------------------------------------------------------------------------------------------------|
Bitmap Heap Scan on tbltest (cost=20.08..56.68 rows=10 width=24) (actual time=29.846..29.846 rows=0 loops=1) |
Recheck Cond: ((mystring2)::text ~~ '%test%'::text) |
Rows Removed by Index Recheck: 100000 |
Heap Blocks: exact=726 |
-> Bitmap Index Scan on tbltest_idx2 (cost=0.00..20.07 rows=10 width=0) (actual time=12.709..12.709 rows=100000 loops=1) |
Index Cond: ((mystring2)::text ~~ '%test%'::text) |
Planning time: 0.086 ms |
Execution time: 29.875 ms |
LIKE does not use the index if I use mixed case in the search string
explain analyse
select * from tbltest
where "mystring2" Like '%Test%';
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------|
Seq Scan on tbltest (cost=0.00..1976.00 rows=99990 width=24) (actual time=0.011..33.376 rows=100000 loops=1) |
Filter: ((mystring2)::text ~~ '%Test%'::text) |
Planning time: 0.083 ms |
Execution time: 51.259 ms |
ILike does not use the index either
explain analyse
select * from tbltest
where "mystring2" ILike '%Test%';
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------|
Seq Scan on tbltest (cost=0.00..1976.00 rows=99990 width=24) (actual time=0.012..87.038 rows=100000 loops=1) |
Filter: ((mystring2)::text ~~* '%Test%'::text) |
Planning time: 0.134 ms |
Execution time: 105.757 ms |
PostgreSQL does not use the index in the last two queries because that is the best way to process the query, not because it cannot use it.
In your EXPLAIN output you can see that the first query returns zero rows (actual ... rows=0), while the other two queries return every single row in the table (actual ... rows=100000).
The PostgreSQL optimizer's estimates reflect that situation accurately.
Since it has to access most of the rows of the table anyway, PostgreSQL knows that it will be able to get the result much cheaper if it scans the table sequentially than by using the more complicated index access method.
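To confirm that the index is usable for these patterns, one can either search for something selective or temporarily discourage sequential scans in the current session. A minimal sketch, assuming the test data generated above (the search string is just an illustrative value that matches a single row):

```sql
-- A selective mixed-case pattern: few rows match, so the trigram index becomes attractive.
explain analyse
select * from tbltest
where "mystring2" ILike '%99999 Test%';

-- Diagnostic only: make sequential scans look expensive for this session,
-- then check whether the planner falls back to the trigram index.
set enable_seqscan = off;
explain analyse
select * from tbltest
where "mystring2" Like '%Test%';
reset enable_seqscan;
```

The enable_seqscan setting is meant for experiments like this one, not as a permanent fix.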
My use case is that I need to do a text search on one field and then order by another column that is unrelated to the text search, but I can't seem to create an index that handles both.
Create table:
create table file (
id bigint,
path character varying(2048),
peers bigint,
text_search tsvector
);
Some indices to test:
create index idx_file_text_search_1 on file using gin (text_search);
create index idx_file_text_search_2 on file using gin (peers, text_search);
create index idx_file_peers on file using btree (peers desc);
Here is my main query:
explain analyze
select *
from file_fast
where text_search @@ to_tsquery('whatever')
order by peers desc
limit 10;
Yet it's only using the peers index:
Limit (cost=0.43..20870.27 rows=10 width=316) (actual time=2507.304..9016.220 rows=10 loops=1)
-> Index Scan using idx_file_peers on file (cost=0.43..18286146.09 rows=8762 width=316) (actual time=2507.301..9016.205 rows=10 loops=1)
Filter: (text_search @@ to_tsquery('ole'::text))
Rows Removed by Filter: 497504
Planning time: 0.399 ms
Execution time: 9016.265 ms
(6 rows)
And when I try it without the order by, it appears to use the text search index:
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=104.15..143.54 rows=10 width=316) (actual time=76.949..76.977 rows=10 loops=1)
-> Bitmap Heap Scan on file (cost=104.15..34612.36 rows=8762 width=316) (actual time=76.946..76.970 rows=10 loops=1)
Recheck Cond: (text_search @@ to_tsquery('ole'::text))
Heap Blocks: exact=10
-> Bitmap Index Scan on idx_file_text_search_1 (cost=0.00..101.96 rows=8762 width=0) (actual time=76.802..76.802 rows=515 loops=1)
Index Cond: (text_search @@ to_tsquery('ole'::text))
Planning time: 0.376 ms
Execution time: 175.775 ms
(8 rows)
Does Postgres really lack an index that can handle a text search and a sort on another field?
I don't know whether the index can be improved, but if the second query is the faster one, maybe you can split the query:
with cte as (
select *
from file_fast
where text_search @@ to_tsquery('whatever')
)
SELECT *
FROM cte
order by peers desc
limit 10;
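One caveat with this approach: from PostgreSQL 12 on, a CTE that is referenced once and has no side effects is inlined into the outer query, so it no longer acts as an optimization fence. A sketch of the same idea for newer versions, assuming the table and column names from the question, uses AS MATERIALIZED to keep the text search separate from the sort:

```sql
-- PostgreSQL 12+: MATERIALIZED forces the CTE to be evaluated on its own,
-- so the text search runs first and only the matching rows are sorted.
with cte as materialized (
    select *
    from file_fast
    where text_search @@ to_tsquery('whatever')
)
select *
from cte
order by peers desc
limit 10;
```

On versions before 12 the plain CTE already behaves this way, and the MATERIALIZED keyword is not accepted.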