How to access internal representation of JSONb? - postgresql

In big-data queries the intermediate "cast to text" is a performance bottleneck... The good binary information is already there, in the JSONB datatype: how can it be rescued?
Typical "select where" example:
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE (j->>'flag1')::boolean AND NOT((j->>'flag2')::boolean)
This "casting to text" is a big performance loss. Ideally there would be a mechanism to convert directly from JSONB to boolean, something like
WHERE (j->'flag1')::magic_boolean AND NOT((j->'flag2')::magic_boolean)
PS: is it possible using C++? Could a CREATE CAST backed by a C++ implementation solve this problem?

The feature is implemented in Postgres 11:
E.4.3.4. Data Types
[...]
Add casts from JSONB scalars to numeric and boolean data types (Anastasia Lubennikova)
Db<>Fiddle.
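With that cast in place, the query from the question can skip the text round-trip entirely. A minimal sketch against the example data (Postgres 11 or later; this shows the shape of it, not a benchmarked claim):
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE (j->'flag1')::boolean AND NOT ((j->'flag2')::boolean);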

TL;DR
Performance-wise it's best to use #> with an appropriate index covering all JSON attributes including type conversions (to avoid type conversions when accessing the index): https://dbfiddle.uk/?rdbms=postgres_11&fiddle=4da77576874651f4d2cf801142ae34d2
CREATE INDEX idx_flags_btree_jsonb ON t ((j#>'{flag1}'), (j#>'{flag2}'));
Times (all selecting the same 5,195 rows out of 1,000,000):
->>::boolean     | ~75 ms
->::boolean      | ~55 ms
@> (GIN index)   | ~80 ms
#> (btree index) | ~40 ms
Scalability:
Interestingly, a local test with 40M rows (all cached in memory, no I/O latency here) revealed the following (best) numbers out of 10 runs (excluding the first and last run) for each query:
->>::boolean     |  222.333 ms
->::boolean      |  268.002 ms
@> (GIN index)   | 1644.605 ms
#> (btree index) |  207.230 ms
So, in fact, the new cast seems to slow things down on larger data sets (which I suspect is due to the fact that it still converts to text before converting to boolean but within a wrapper, not directly).
We can also see that the @> operator using the GIN index doesn't scale very well here, which is expected, as it is much more generic than the other special-purpose indexes and hence needs to do a lot more under the hood.
However, in case these special purpose btree indexes cannot be put in place or I/O becomes a bottleneck, then the GIN index will be superior as it consumes only a fraction of the space on disk (and also in memory), increasing the chance of an index buffer hit.
But that depends on a lot of factors and needs to be decided with all accessing applications understood.
Details:
Preferably use the @> containment operator with a single GIN index, as it saves a lot of special-purpose indexes:
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE j @> '{"flag1":true, "flag2":false}'::jsonb;
...which gives a plan like:
QUERY PLAN
-----------------------------------------------------------
CTE Scan on t (cost=0.01..0.03 rows=1 width=32)
Filter: (j @> '{"flag1": true, "flag2": false}'::jsonb)
CTE t
-> Result (cost=0.00..0.01 rows=1 width=64)
(4 rows)
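The single GIN index that backs this containment search is not shown in the snippet; a sketch of what it could look like (index names are illustrative, and jsonb_path_ops is an optional, more compact operator class that still supports @>):
CREATE INDEX idx_t_j_gin ON t USING gin (j);
-- or, smaller and limited to containment (@>) queries:
CREATE INDEX idx_t_j_gin_path ON t USING gin (j jsonb_path_ops);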
As an alternative (if you can afford creating special-purpose indexes and the resulting write penalty), use the #> operator instead of -> or ->> and thereby skip any performance-costly type conversions, e.g.
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE j#>'{flag1}' = 'true'::jsonb AND j#>'{flag2}' = 'false'::jsonb;
...resulting in a plan like:
QUERY PLAN
--------------------------------------------------------------------------------------------------------
CTE Scan on t (cost=0.01..0.04 rows=1 width=32)
Filter: (((j #> '{flag1}'::text[]) = 'true'::jsonb) AND ((j #> '{flag2}'::text[]) = 'false'::jsonb))
CTE t
-> Result (cost=0.00..0.01 rows=1 width=64)
(4 rows)
So, no more implicit type conversion here (only for the given constants, but that's a one-time operation, not for every row).

Related

What is the proper postgresql index for listing all distinct json array values?

I have the following query
select distinct c1::text
from (select json_array_elements((value::jsonb -> 'boundaries')::json) as c1 from geoinfo) t1;
And I get this query plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------
HashAggregate (cost=912918.87..912921.87 rows=200 width=32)
Group Key: (t1.c1)::text
-> Subquery Scan on t1 (cost=1000.00..906769.25 rows=2459849 width=32)
-> Gather (cost=1000.00..869871.51 rows=2459849 width=32)
Workers Planned: 2
-> ProjectSet (cost=0.00..622886.61 rows=102493700 width=32)
-> Parallel Seq Scan on geoinfo (cost=0.00..89919.37 rows=1024937 width=222)
There are ~500 rows returned from a table with 2.5 Million rows.
What index can I create that will cause this query to execute much faster?
I tried the somewhat obvious, and it didn't work:
# create index gin_boundaries_array on geoinfo using gin (json_array_elements((value::jsonb -> 'boundaries')::json));
ERROR: set-returning functions are not allowed in index expressions
LINE 1: ... index gin_boundaries_array on geoinfo using gin (json_array...
+1000 on Bergi's comment. JSON is a fantastic interchange format. But for searchable storage? It's an edge case that's become mainstream. That doesn't mean it's always ill-advised, but when you start having to do joins against nested/embedded elements, spending a lot of mental bandwidth (and syntactic sugar) to get things done, etc., it's worth asking: "is the convenience of stuffing things into JSON outweighed by the cost and hassle of seeing inside the objects and getting things out?"
Specifically in Postgres, you can index JSON elements, but the planner cannot maintain useful statistics in the way that it can on full columns. (As I understand it, I haven't tested this out...I use JSON(B) for raw storage of JSON and search by other columns.)
As you may already know, Postgres has a lot of handy utilities for dealing with JSON. I have one JSONB field that I expand using jsonb_to_recordset and a cross join. If you're in the mood, it would be worth setting yourself up a test database. Unpack the JSON into a real table, and then try your queries against that. See for yourself if it's a better fit for your needs or not.
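For the boundaries question above, the "unpack into a real table" route could look roughly like this (the table and column names come from the question, but the text element type and the index are assumptions for illustration):
CREATE TABLE geoinfo_boundaries AS
SELECT jsonb_array_elements_text(value::jsonb -> 'boundaries') AS boundary
FROM geoinfo;

CREATE INDEX ON geoinfo_boundaries (boundary);

-- The distinct list then comes from a plain, analyzable, indexable column:
SELECT DISTINCT boundary FROM geoinfo_boundaries;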

PostgreSQL concat an int value with a numeric field

I would like to do something like :
update cli_pe set nm_cli_espace_client = 9 || nm_cli_espace_client
where nm_cli_pe = 7006488
nm_cli_espace_client is numeric(0,8).
How can I do this simply?
UPDATE cli_pe
SET nm_cli_espace_client = '9' || CAST(nm_cli_espace_client AS text)
WHERE nm_cli_pe = 7006488;
If you need a resulting number, then CAST back the result:
... SET nm_cli_espace_client =
CAST('9' || CAST(nm_cli_espace_client AS text) AS integer)
Your question is identical to this one: concat two int values in postgresql
Prefixing a digit onto a number can be done by textual concatenation, as @vyegorov has demonstrated, but it's also easily done as a mathematical operation.
Edit: Easily, but really not quickly in PostgreSQL. Even though this approach is much faster in languages like C and Java for integer types, it seems to be quite a bit slower in SQL under PostgreSQL when working with an int, and immensely slower for NUMERIC - probably function call overhead for ints, and the general slowness of NUMERIC maths for the latter. In any case, you're better off concatenating as strings per @vyegorov's answer.
I'll preserve the rest of the answer for a laugh; it's a really good example of why you should test multiple approaches to a problem.
A mathematical approach is:
nm_cli_espace_client = 9*pow(10,1+floor(log(nm_cli_espace_client))) + nm_cli_espace_client;
There's probably a smarter way than that. What it basically does is produce the power of 10 with one more digit than the number being prefixed, multiply the prefix digit by it, and add the original number.
regress=# SELECT (pow(10,1+floor(log( 7006488 ))) * 9) + 7006488;
?column?
---------------------------
97006488.0000000000000000
(1 row)
I'd love to have the smarter way to do this mathematically pointed out to me; I'm sure there is one.
Anyhow, it turns out you don't want to do this in PostgreSQL. With integer types it's a bit slower:
regress=# explain analyze select ('9'||x::text)::int from testtab;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..26925.00 rows=1000000 width=4) (actual time=0.026..293.120 rows=1000000 loops=1)
Total runtime: 319.920 ms
(2 rows)
vs
regress=# explain analyze select 9*pow(10,1+floor(log(x)))+x from testtab;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..34425.00 rows=1000000 width=4) (actual time=0.053..443.134 rows=1000000 loops=1)
Total runtime: 470.587 ms
(2 rows)
With NUMERIC types the text approach doesn't change much:
regress=# explain analyze select ('9'||x::text)::int from testtab;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..17467.69 rows=579675 width=32) (actual time=0.039..368.980 rows=1000000 loops=1)
Total runtime: 396.376 ms
(2 rows)
but the maths based approach is horrifyingly slow even if I convert the NUMERIC to int after taking the logarithm. Clearly taking the logarithm of a NUMERIC is slow - not surprising, but I didn't expect it to be this slow:
regress=# explain analyze select 9*pow(10,1+floor(log(x)::int))+x from testtab;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..18916.88 rows=579675 width=32) (actual time=0.253..86740.383 rows=1000000 loops=1)
Total runtime: 86778.511 ms
(2 rows)
The short version: Do it by text concatenation. In (PostgreSQL) SQL the maths-based approach is more than 200 times slower for NUMERIC, and about 50% slower on integers. It's a handy trick for languages where integer maths is cheap and string manipulation / memory allocation is expensive, though.
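For reference, the test table behind those timings isn't shown; a setup along these lines reproduces its shape (the column name x is taken from the queries above, everything else is an assumption; the NUMERIC runs presumably used the same data in a numeric column):
-- values kept >= 1 so that log() is always defined
CREATE TABLE testtab AS
SELECT (random() * 1000000)::int + 1 AS x
FROM generate_series(1, 1000000);
ANALYZE testtab;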

How to influence the Postgres Query Analyzer when dealing with text search and geospatial data

I have a quite serious performance issue with the following statement that I can't fix myself.
Given Situation
I have a PostgreSQL 8.4 database with PostGIS 1.4 installed
I have a geospatial table with ~9 million entries. This table has a (PostGIS) geometry column and a tsvector column
I have a GiST index on the geometry column and an index on the vname column
The table is ANALYZE'd
I want to execute a to_tsquery text search within a subset of these geometries that should give me all affected IDs back.
The search area will limit the 9 million rows to approximately 100,000, and the result set of the ts_query inside this area will most likely contain 0..1000 entries.
Problem
The query planner decides to do a Bitmap Index Scan on vname first, and then aggregates and filters on the geometry (and the other conditions I have in this statement).
Query Analyzer output:
Aggregate (cost=12.35..12.62 rows=1 width=510) (actual time=5.616..5.616 rows=1 loops=1)
-> Bitmap Heap Scan on mxgeom g (cost=8.33..12.35 rows=1 width=510) (actual time=5.567..5.567 rows=0 loops=1)
Recheck Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
Filter: (active AND (geom && '0107000020E6100000010000000103000000010000000B0000002AFFFF5FD15B1E404AE254774BA8494096FBFF3F4CC11E40F37563BAA9A74940490200206BEC1E40466F209648A949404DF6FF1F53311F400C9623C206B2494024EBFF1F4F711F404C87835954BD4940C00000B0E7CA1E4071551679E0BD4940AD02004038991E40D35CC68418BE49408EF9FF5F297C1E404F8CFFCB5BBB4940A600006015541E40FAE6468054B8494015040060A33E1E4032E568902DAE49402AFFFF5FD15B1E404AE254774BA84940'::geometry) AND (mandator_id = ANY ('{257,1}'::bigint[])))
-> Bitmap Index Scan on gis_vname_idx (cost=0.00..8.33 rows=1 width=0) (actual time=5.566..5.566 rows=0 loops=1)
Index Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
which causes a LOT of I/O - AFAIK it would be smarter to limit by the geometry first, and do the vname search afterwards.
Attempted Solutions
To achieve the desired behaviour I tried the following:
I put the geometry condition (geom && AREA) into a subselect -> did not change the execution plan
I created a temporary view with the desired area subset -> Did not change the execution plan
I created a temporary table of the desired area -> Takes 4~6 seconds to create, so that made it even worse.
Btw, sorry for not posting the actual query: I think my boss would really be mad at me if I did; also, I'm looking more for theoretical pointers than for someone to fix my actual query. Please ask if you need further clarification.
EDIT
Richard had a very good point: you can achieve the desired behaviour of the query planner with the WITH construct. The bad thing is that this temporary table (or CTE) messes up the vname index, thus making the query return nothing in some cases.
I was able to fix this by creating a new vname on the fly with to_tsvector(), but this is (too) costly - around 300-500 ms per query.
My Solution
I ditched the vname search and went with a simple LIKE '%query_string%' (10-20 ms/query), but this is only fast in my given environment. YMMV.
There have been some improvements in statistics handling for tsvector (and I think PostGIS too, but I don't use it). If you've got the time, it might be worth trying again with a 9.1 release and see what that does for you.
However, for this single query you might want to look at the WITH construct.
http://www.postgresql.org/docs/8.4/static/queries-with.html
If you put the geometry part as the WITH clause, it will be evaluated first (guaranteed) and then that result set will be filtered by the following SELECT. It might end up slower though; you won't know until you try.
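A sketch of that shape, reusing the table and column names visible in the plan above (the search polygon and the id column are placeholders, not the asker's real schema); on 8.4 the CTE acts as an optimization fence, so the geometry condition runs before the text search:
WITH area_subset AS (
    SELECT id, vname
    FROM mxgeom
    WHERE active
      AND geom && ST_GeomFromText(
            'POLYGON((7.5 51.3, 7.6 51.3, 7.6 51.4, 7.5 51.4, 7.5 51.3))', 4326)
      AND mandator_id = ANY ('{257,1}'::bigint[])
)
SELECT id
FROM area_subset
WHERE vname @@ to_tsquery('hemer & hauptstrasse:*');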
It might be that an adjustment to work_mem would help too - you can do this per-session ("SET work_mem = ...") but be careful with setting it too high - concurrent queries can quickly burn through all your RAM.
http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
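For example (the value is purely illustrative; size it to your RAM and expected concurrency):
SET work_mem = '64MB';            -- for the rest of the session
SET LOCAL work_mem = '64MB';      -- or only for the current transaction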

Postgresql index on xpath expression gives no speed up

We are trying to create OEBS-analog functionality in PostgreSQL. Let's say we have a form constructor and need to store form results in the database (e.g. email bodies). In Oracle you could use a table with ~150 columns (and some mapping stored elsewhere) to store each field in a separate column. But in contrast to Oracle, we would like to store the whole form in a PostgreSQL xml field.
The example of the tree is
<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<object_id>2</object_id>
<pack_form_id>23</pack_form_id>
<prod_form_id>34</prod_form_id>
</row>
We would like to search through this field.
The test table contains 400k rows, and the following select executes in 90 seconds:
select *
from params
where (xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int=34
So I created this index:
create index prod_form_idx
ON params using btree(
((xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int)
);
And it made no difference. Still 90 seconds of execution. The EXPLAIN plan shows this:
Bitmap Heap Scan on params (cost=40.29..6366.44 rows=2063 width=292)
Recheck Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
-> Bitmap Index Scan on prod_form_idx (cost=0.00..39.78 rows=2063 width=0)
Index Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
I am not a great plan interpreter, so I suppose this means the index is being used. The question is: where's all the speed? And what can I do in order to optimize this kind of query?
Well, at least the index is used. You get a bitmap index scan instead of a normal index scan though, which means the xpath() function will be called lots of times.
Let's do a little check:
CREATE TABLE foo ( id serial primary key, x xml, h hstore );
insert into foo (x,h) select XMLPARSE( CONTENT '<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<object_id>2</object_id>
<pack_form_id>' || n || '</pack_form_id>
<prod_form_id>34</prod_form_id>
</row>' ),
('object_id=>2,prod_form_id=>34,pack_form_id=>'||n)::hstore
FROM generate_series( 1,100000 ) n;
test=> EXPLAIN ANALYZE SELECT count(*) FROM foo;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate (cost=4821.00..4821.01 rows=1 width=0) (actual time=24.694..24.694 rows=1 loops=1)
-> Seq Scan on foo (cost=0.00..4571.00 rows=100000 width=0) (actual time=0.006..13.996 rows=100000 loops=1)
Total runtime: 24.730 ms
test=> explain analyze select * from foo where (h->'pack_form_id')='123';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..5571.00 rows=500 width=68) (actual time=0.075..48.763 rows=1 loops=1)
Filter: ((h -> 'pack_form_id'::text) = '123'::text)
Total runtime: 36.808 ms
test=> explain analyze select * from foo where ((xpath('//pack_form_id/text()'::text, x))[1]::text) = '123';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..5071.00 rows=500 width=68) (actual time=4.271..3368.838 rows=1 loops=1)
Filter: (((xpath('//pack_form_id/text()'::text, x, '{}'::text[]))[1])::text = '123'::text)
Total runtime: 3368.865 ms
As we can see,
scanning the whole table with count(*) takes 25 ms
extracting one key/value from a hstore adds a small extra cost, about 0.12 µs/row
extracting one key/value from a xml using xpath adds a huge cost, about 33 µs/row
Conclusions :
xml is slow (but everyone knows that)
if you want to put a flexible key/value store in a column, use hstore
Also since your xml data is pretty big it will be toasted (compressed and stored out of the main table). This makes the rows in the main table much smaller, hence more rows per page, which reduces the efficiency of bitmap scans since all rows on a page have to be rechecked.
You can fix this though. For some reason the xpath() function (which is very slow, since it handles xml) has the same cost (1 unit) as say, the integer operator "+"...
update pg_proc set procost=1000 where proname='xpath';
You may need to tweak the cost value. When given the right info, the planner knows xpath is slow and will avoid a bitmap index scan, using an index scan instead, which doesn't need to recheck the condition for all rows on a page.
Note that this does not solve the row estimates problem. Since you can't ANALYZE the inside of the xml (or hstore) you get default estimates for the number of rows (here, 500). So, the planner may be completely wrong and choose a catastrophic plan if some joins are involved. The only solution to this is to use proper columns.
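A hedged sketch of that last "proper columns" suggestion for the question above (the new column and index names are made up): extract the value once into a real column, which can then be indexed and ANALYZEd like any other column.
ALTER TABLE params ADD COLUMN prod_form_id integer;

UPDATE params
SET prod_form_id = ((xpath('//prod_form_id/text()', xmlvalue))[1])::text::integer;

CREATE INDEX params_prod_form_id_idx ON params (prod_form_id);
ANALYZE params;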

Can PostgreSQL index array columns?

I can't find a definite answer to this question in the documentation. If a column is an array type, will all the entered values be individually indexed?
I created a simple table with one int[] column, and put a unique index on it. I noticed that I couldn't add the same array of ints, which leads me to believe the index is a composite of the array items, not an index of each item.
INSERT INTO "Test"."Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test"."Test" VALUES ('{10, 20, 30}');
SELECT * FROM "Test"."Test" WHERE 20 = ANY ("Column1");
Is the index helping this query?
Yes, you can index an array, but you have to use the array operators and the GIN index type.
Example:
CREATE TABLE "Test"("Column1" int[]);
INSERT INTO "Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test" VALUES ('{10, 20, 30}');
CREATE INDEX idx_test on "Test" USING GIN ("Column1");
-- To enforce index usage because we have only 2 records for this test...
SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT * FROM "Test" WHERE "Column1" #> ARRAY[20];
Result:
Bitmap Heap Scan on "Test" (cost=4.26..8.27 rows=1 width=32) (actual time=0.014..0.015 rows=2 loops=1)
Recheck Cond: ("Column1" #> '{20}'::integer[])
-> Bitmap Index Scan on idx_test (cost=0.00..4.26 rows=1 width=0) (actual time=0.009..0.009 rows=2 loops=1)
Index Cond: ("Column1" #> '{20}'::integer[])
Total runtime: 0.062 ms
Note
it appears that in many cases the gin__int_ops option is required
create index <index_name> on <table_name> using GIN (<column> gin__int_ops)
I have not yet seen a case where it would work with the && and @> operators without the gin__int_ops option
@Tregoreg raised a question in the comment to his offered bounty:
I didn't find the current answers working. Using GIN index on
array-typed column does not increase the performance of ANY()
operator. Is there really no solution?
@Frank's accepted answer tells you to use array operators, which is still correct for Postgres 11. The manual:
... the standard distribution of PostgreSQL includes a GIN operator
class for arrays, which supports indexed queries using these
operators:
<@
@>
=
&&
The complete list of built-in operator classes for GIN indexes in the standard distribution is here.
In Postgres indexes are bound to operators (which are implemented for certain types), not data types alone or functions or anything else. That's a heritage from the original Berkeley design of Postgres and very hard to change now. And it's generally working just fine. Here is a thread on pgsql-bugs with Tom Lane commenting on this.
Some PostGIS functions (like ST_DWithin()) seem to violate this principle, but that is not so. Those functions are rewritten internally to use the respective operators.
The indexed expression must be to the left of the operator. For most operators (including all of the above) the query planner can achieve this by flipping operands if you place the indexed expression to the right - given that a COMMUTATOR has been defined. The ANY construct can be used in combination with various operators and is not an operator itself. When used as constant = ANY (array_expression) only indexes supporting the = operator on array elements would qualify and we would need a commutator for = ANY(). GIN indexes are out.
Postgres is not currently smart enough to derive a GIN-indexable expression from it. For starters, constant = ANY (array_expression) is not completely equivalent to array_expression @> ARRAY[constant]. Array operators return an error if any NULL elements are involved, while the ANY construct can deal with NULL on either side. And there are different results for data type mismatches.
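In practice that means rewriting the predicate yourself rather than hoping the planner will derive it. Against the example table and GIN index from the accepted answer above (a sketch, not part of the original answer):
SELECT * FROM "Test" WHERE 20 = ANY ("Column1");    -- ANY construct: the GIN index cannot be used
SELECT * FROM "Test" WHERE "Column1" @> ARRAY[20];  -- containment: the GIN index can be used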
Related answers:
Check if value exists in Postgres array
Index for finding an element in a JSON array
SQLAlchemy: how to filter on PgArray column types?
Can IS DISTINCT FROM be combined with ANY or ALL somehow?
Asides
While working with integer arrays (int4, not int2 or int8) without NULL values (like your example implies), consider the additional module intarray, which provides specialized, faster operators and index support. See:
How to create an index for elements of an array in PostgreSQL?
Compare arrays for equality, ignoring order of elements
As for the UNIQUE constraint in your question that went unanswered: That's implemented with a btree index on the whole array value (like you suspected) and does not help with the search for elements at all. Details:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
It's now possible to index the individual array elements. For example:
CREATE TABLE test (foo int[]);
INSERT INTO test VALUES ('{1,2,3}');
INSERT INTO test VALUES ('{4,5,6}');
CREATE INDEX test_index on test ((foo[1]));
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * from test WHERE foo[1]=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using test_index on test (cost=0.00..8.27 rows=1 width=32) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (foo[1] = 1)
Total runtime: 0.112 ms
(3 rows)
This works on at least Postgres 9.2.1. Note that you need to build a separate index for each array position; in my example I only indexed the first element.