PostgreSQL concat an int value with a numeric field - postgresql

I would like to do something like :
update cli_pe set nm_cli_espace_client = 9 || nm_cli_espace_client
where nm_cli_pe = 7006488
nm_cli_espace_client is numeric(8,0).
How can I do this easily?

UPDATE cli_pe
SET nm_cli_espace_client = '9' || CAST(nm_cli_espace_client AS text)
WHERE nm_cli_pe = 7006488;
If you need a resulting number, then CAST back the result:
... SET nm_cli_espace_client =
CAST('9' || CAST(nm_cli_espace_client AS text) AS integer)
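Putting it together for the column in the question, the whole statement might look like this (a sketch; it casts back to numeric rather than integer, since the target column is numeric):
-- Assumes nm_cli_espace_client can hold the extra leading digit.
UPDATE cli_pe
SET nm_cli_espace_client = CAST('9' || CAST(nm_cli_espace_client AS text) AS numeric)
WHERE nm_cli_pe = 7006488;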
Your question is identical to this one: concat two int values in postgresql

Prefixing a digit onto a number can be done by textual concatenation, as @vyegorov has demonstrated, but it's also easily done as a mathematical operation.
Edit: Easily, but really not quickly in PostgreSQL. Even though this approach is much faster in languages like C and Java for integer types, it turns out to be quite a bit slower in SQL under PostgreSQL when working with an int, and immensely slower for NUMERIC. Probably function-call overhead in the int case, and the general slowness of numeric maths in the NUMERIC case. In any case, you're better off concatenating as strings per @vyegorov's answer.
I'll preserve the rest of the answer for a laugh; it's a really good example of why you should test multiple approaches to a problem.
A mathematical approach is:
nm_cli_espace_client = 9*pow(10,1+floor(log(nm_cli_espace_client))) + nm_cli_espace_client;
There's probably a smarter way than that. What it basically does is produce the power of 10 with one more digit than the number being prefixed, multiply the prefix digit by it, and add the original value.
regress=# SELECT (pow(10,1+floor(log( 7006488 ))) * 9) + 7006488;
?column?
---------------------------
97006488.0000000000000000
(1 row)
I'd love to have the smarter way to do this mathematically pointed out to me; I'm sure there is one.
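The timings below are against a test table called testtab; its definition isn't shown, but a hypothetical setup along these lines reproduces the shape of the test:
-- One million random positive integers in column x (assumed layout).
CREATE TABLE testtab AS
SELECT (1 + floor(random() * 9999999))::int AS x
FROM generate_series(1, 1000000);
ANALYZE testtab;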
Anyhow, it turns out you don't want to do this in PostgreSQL. With integer types it's a bit slower:
regress=# explain analyze select ('9'||x::text)::int from testtab;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..26925.00 rows=1000000 width=4) (actual time=0.026..293.120 rows=1000000 loops=1)
Total runtime: 319.920 ms
(2 rows)
vs
regress=# explain analyze select 9*pow(10,1+floor(log(x)))+x from testtab;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..34425.00 rows=1000000 width=4) (actual time=0.053..443.134 rows=1000000 loops=1)
Total runtime: 470.587 ms
(2 rows)
With NUMERIC types the text approach doesn't change much:
regress=# explain analyze select ('9'||x::text)::int from testtab;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..17467.69 rows=579675 width=32) (actual time=0.039..368.980 rows=1000000 loops=1)
Total runtime: 396.376 ms
(2 rows)
but the maths based approach is horrifyingly slow even if I convert the NUMERIC to int after taking the logarithm. Clearly taking the logarithm of a NUMERIC is slow - not surprising, but I didn't expect it to be this slow:
regress=# explain analyze select 9*pow(10,1+floor(log(x)::int))+x from testtab;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on testtab (cost=0.00..18916.88 rows=579675 width=32) (actual time=0.253..86740.383 rows=1000000 loops=1)
Total runtime: 86778.511 ms
(2 rows)
The short version: Do it by text concatenation. In (PostgreSQL) SQL the maths-based approach is more than 200 times slower for NUMERIC, and about 50% slower on integers. It's a handy trick for languages where integer maths is cheap and string manipulation / memory allocation is expensive, though.

Why doesn't PostgreSQL use all columns in a multi-column index?

I am using the extension
CREATE EXTENSION btree_gin;
I have an index that looks like this...
create index boundaries2 on rets USING GIN(source, isonlastsync, status, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder);
Before I started messing with it, the index looked like this, with the same results that I'm about to share, so it seems minor variations in the index definition don't make a difference:
create index boundaries on rets USING GIN((geoinfo::jsonb->'boundaries'), source, status, isonlastsync, ctcvalidto, searchablePrice, ctcSortOrder);
I give pgsql 11 this query:
explain analyze select id from rets where ((geoinfo::jsonb->'boundaries' ?| array['High School: Torrey Pines']) AND source='SDMLS'
AND searchablePrice>=800000 AND searchablePrice<=1200000 AND YrBlt>=2000 AND EstSF>=2300
AND Beds>=3 AND FB>=2 AND ctcSortOrder>'2019-07-05 16:02:54 UTC' AND Status IN ('ACTIVE','BACK ON MARKET')
AND ctcvalidto='9999-12-31 23:59:59 UTC' AND isonlastsync='true') order by LstDate desc, ctcSortOrder desc LIMIT 3000;
with result...
Limit (cost=120.06..120.06 rows=1 width=23) (actual time=472.849..472.850 rows=1 loops=1)
-> Sort (cost=120.06..120.06 rows=1 width=23) (actual time=472.847..472.848 rows=1 loops=1)
Sort Key: lstdate DESC, ctcsortorder DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on rets (cost=116.00..120.05 rows=1 width=23) (actual time=472.748..472.841 rows=1 loops=1)
Recheck Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Rows Removed by Index Recheck: 93
Filter: (isonlastsync AND (yrblt >= 2000) AND (estsf >= 2300) AND (beds >= 3) AND (fb >= 2) AND (status = ANY ('{ACTIVE,"BACK ON MARKET"}'::text[])))
Rows Removed by Filter: 10
Heap Blocks: exact=102
-> Bitmap Index Scan on boundaries2 (cost=0.00..116.00 rows=1 width=0) (actual time=471.762..471.762 rows=104 loops=1)
Index Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Planning Time: 0.333 ms
Execution Time: 474.311 ms
(14 rows)
The Question
Why are the columns status and isonlastsync not used by the Bitmap Index Scan on boundaries2?
It can skip those columns if it predicts that filtering them out afterwards will be faster. This is usually the case when the cardinality of a column is very low and you will fetch a large enough portion of all rows; that is true for booleans like isonlastsync and usually true for status columns with just a few distinct values.
Rows Removed by Filter: 10 is very little to filter out, either because your table does not hold a large number of rows or because most of them fit the condition you specified for those two columns. You might try generating more data in that table or selecting rows with a rare status.
I suggest creating partial indexes (with a WHERE condition), at least for the boolean column, and removing those two columns to make this index a bit more lightweight.
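A partial index along those lines might look like this (a sketch; the predicate is lifted from the query in the question and should match whatever is common to all your searches):
-- Hypothetical partial index; drops the two low-cardinality columns from the key
-- and moves them into the WHERE predicate instead.
CREATE INDEX boundaries2_partial ON rets USING GIN
    (source, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder)
WHERE isonlastsync AND status IN ('ACTIVE', 'BACK ON MARKET');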
I cannot tell you why, but I can help you optimize the query.
You should not use a multi-column GIN index, but a GIN index on only the jsonb expression and a b-tree index on the other columns.
The order of columns matters: put the ones used in an equality condition first, with the most selective at the beginning. Next put the column with the most selective inequality or IN condition. For the following columns the order doesn't matter, as they will only act as filters in the index scan.
Make sure that the indexes are cached in RAM.
I'd expect that to be faster.
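A sketch of that layout, using the column names from the question (the exact b-tree column order should follow the rules above and your data's selectivity):
-- GIN index on just the jsonb expression:
CREATE INDEX rets_boundaries_gin ON rets USING GIN ((geoinfo::jsonb->'boundaries'));
-- B-tree index on the scalar filter columns, equality columns first (assumed order):
CREATE INDEX rets_filters_btree ON rets (source, ctcvalidto, status, searchablePrice, ctcSortOrder);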
I think you're asking yourself the wrong question. As Lukasz answered already, PostgreSQL may find it inefficient to check all columns in the index. The problem here is that your index is too big on disk.
Probably by trying to make this SQL faster you added as many columns as possible to the index, but it is backfiring.
The trick is to realize how much data PostgreSQL has to read to find your records. If your index contains too much data, it will have to read a lot. Also, be aware that low cardinality columns don't play well with BTree and common index types; generally you want to avoid indexing them.
To keep the index as small as possible while keeping lookups quick, you have to find the column with the highest cardinality, or better, the column that will return the fewest rows for your query. My guess is ctcSortOrder. This will be the first column of your index.
Then look at which column, after filtering by the first one, has the highest cardinality or will filter out the most rows. I have no idea about your data, but source looks like a good candidate.
Try to avoid jsonb searches unless they are the primary source of the cardinality, and keep the index as a Btree. BTree is several times faster.
And like Lukasz suggested, look at partial indexes. For example add "WHERE Status IN ('ACTIVE','BACK ON MARKET') AND isonlastsync='true'" as these two may be common to all your searches.
Bottom line is, having a simpler, smaller index is faster than having all columns indexed. And the order of the columns does matter a lot. Stick with BTree unless there is a good reason (lots of cardinality in non-btree compatible types).
If your table is huge (>10M rows) consider table partitioning, for example by ctcSortOrder. But I don't think this is your case.

How to access internal representation of JSONb?

In big-data queries the intermediary "CAST to text" is a performance bottleneck... The binary information is already there in the JSONb datatype: how can it be rescued?
Typical "select where" example:
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE (j->>'flag1')::boolean AND NOT((j->>'flag2')::boolean)
The "casting to text" is a big loss of performance. The ideal would be a mechanism to go directly from JSONb to Boolean, something like
WHERE (j->'flag1')::magic_boolean AND NOT((j->'flag2')::magic_boolean)
PS: is it possible using C++? Could a CREATE CAST implemented in C++ solve this problem?
The feature is implemented in Postgres 11:
E.4.3.4. Data Types
[...]
Add casts from JSONB scalars to numeric and boolean data types (Anastasia Lubennikova)
Db<>Fiddle.
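With that cast available, the filter from the question can be written without the text detour (a minimal sketch, assuming Postgres 11 or later):
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE (j->'flag1')::boolean AND NOT((j->'flag2')::boolean);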
TL;DR
Performance-wise it's best to use #> with an appropriate index covering all JSON attributes including type conversions (to avoid type conversions when accessing the index): https://dbfiddle.uk/?rdbms=postgres_11&fiddle=4da77576874651f4d2cf801142ae34d2
CREATE INDEX idx_flags_btree_jsonb ON t ((j#>'{flag1}'), (j#>'{flag2}'));
Times (all selecting the same 5,195 rows out of 1,000,000):
->>::boolean | ~75 ms
->::boolean | ~55 ms
@> (GIN) | ~80 ms
#> (B-tree) | ~40 ms
Scalability:
Interestingly, a local test with 40M rows (all cached in memory, no I/O latency here) revealed the following (best) numbers out of 10 runs (excluding the first and last run) for each query:
->>::boolean | 222.333 ms
->::boolean | 268.002 ms
@> (GIN) | 1644.605 ms
#> (B-tree) | 207.230 ms
So, in fact, the new cast seems to slow things down on larger data sets (which I suspect is due to the fact that it still converts to text before converting to boolean but within a wrapper, not directly).
We can also see that the @> operator using the GIN index doesn't scale very well here, which is expected, as it is much more generic than the other special-purpose indexes and hence needs to do a lot more under the hood.
However, in case these special purpose btree indexes cannot be put in place or I/O becomes a bottleneck, then the GIN index will be superior as it consumes only a fraction of the space on disk (and also in memory), increasing the chance of an index buffer hit.
But that depends on a lot of factors and needs to be decided with all accessing applications understood.
Details:
Preferably use the @> containment operator with a single GIN index, as it saves a lot of special-purpose indexes:
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE j @> '{"flag1":true, "flag2":false}'::jsonb;
...which gives a plan like:
QUERY PLAN
-----------------------------------------------------------
CTE Scan on t (cost=0.01..0.03 rows=1 width=32)
Filter: (j @> '{"flag1": true, "flag2": false}'::jsonb)
CTE t
-> Result (cost=0.00..0.01 rows=1 width=64)
(4 rows)
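For reference, the single GIN index backing this containment approach could be created along these lines (a sketch; the default jsonb_ops operator class is assumed):
CREATE INDEX t_j_gin ON t USING GIN (j);
-- jsonb_path_ops is smaller and faster if only @> is needed:
-- CREATE INDEX t_j_gin_path ON t USING GIN (j jsonb_path_ops);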
As an alternative (if you can afford to create special-purpose indexes and take the resulting write penalty), use the #> operator instead of -> or ->>, thereby skipping the performance-costly type conversions, e.g.
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE j#>'{flag1}' = 'true'::jsonb AND j#>'{flag2}' = 'false'::jsonb;
...resulting in a plan like:
QUERY PLAN
--------------------------------------------------------------------------------------------------------
CTE Scan on t (cost=0.01..0.04 rows=1 width=32)
Filter: (((j #> '{flag1}'::text[]) = 'true'::jsonb) AND ((j #> '{flag2}'::text[]) = 'false'::jsonb))
CTE t
-> Result (cost=0.00..0.01 rows=1 width=64)
(4 rows)
So, no more implicit type conversion here (only for the given constants, but that's a one-time operation, not for every row).

How does PostgreSQL execute a query?

Can anyone explain why PostgreSQL works like this:
If I execute this query
SELECT
*
FROM project_archive_doc as PAD, project_archive_doc as PAD2
WHERE
PAD.id = PAD2.id
it will be a simple join and the EXPLAIN output will look like this:
Hash Join (cost=6.85..13.91 rows=171 width=150)
Hash Cond: (pad.id = pad2.id)
-> Seq Scan on project_archive_doc pad (cost=0.00..4.71 rows=171 width=75)
-> Hash (cost=4.71..4.71 rows=171 width=75)
-> Seq Scan on project_archive_doc pad2 (cost=0.00..4.71 rows=171 width=75)
But if I execute this query:
SELECT *
FROM project_archive_doc as PAD
WHERE
PAD.id = (
SELECT PAD2.id
FROM project_archive_doc as PAD2
WHERE
PAD2.project_id = PAD.project_id
ORDER BY PAD2.created_at
LIMIT 1)
there will be no joins and the EXPLAIN output looks like this:
Seq Scan on project_archive_doc pad (cost=0.00..886.22 rows=1 width=75)
Filter: (id = (SubPlan 1))
SubPlan 1
-> Limit (cost=5.15..5.15 rows=1 width=8)
-> Sort (cost=5.15..5.15 rows=1 width=8)
Sort Key: pad2.created_at
-> Seq Scan on project_archive_doc pad2 (cost=0.00..5.14 rows=1 width=8)
Filter: (project_id = pad.project_id)
Why is this so, and is there any documentation or articles about it?
Without table definitions and data it's hard to be specific for this case. In general, PostgreSQL is like most SQL databases in that it doesn't treat SQL as a step-by-step program for how to execute a query. It's more like a description of what you want the results to be and a hint about how you want the database to produce those results.
PostgreSQL is free to actually execute the query however it can most efficiently do so, so long as it produces the results you want.
Often it has several choices about how to produce a particular result. It will choose between them based on cost estimates.
It can also "understand" that several different ways of writing a particular query are equivalent, and transform one into another where it's more efficient. For example, it can transform an IN (SELECT ...) into a join, because it can prove they're equivalent.
However, sometimes apparently small changes to a query fundamentally change its meaning, and limit what optimisations/transformations PostgreSQL can make. Adding a LIMIT or OFFSET inside a subquery prevents PostgreSQL from flattening it, i.e. combining it with the outer query by transforming it into a join. It also prevents PostgreSQL from moving WHERE clause entries between the subquery and outer query, because that'd change the meaning of the query. Without a LIMIT or OFFSET clause, it can do both these things because they don't change the query's meaning.
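For instance, the second query from the question can be expressed without a per-row subplan, e.g. with DISTINCT ON, which leaves the planner free to pick a join strategy (a sketch, assuming id is unique; not tested against the original schema):
SELECT pad.*
FROM project_archive_doc AS pad
JOIN (
    -- earliest document per project, by created_at (id assumed unique)
    SELECT DISTINCT ON (project_id) id
    FROM project_archive_doc
    ORDER BY project_id, created_at
) AS first_doc ON first_doc.id = pad.id;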
There's some info on the planner here.

speeding up wildcard text lookups

I have a simple table in Postgres with a bit over 8 million rows. The column of interest holds short text strings, typically one or more words, total length less than 100 characters. It is set as 'character varying (100)'. The column is indexed. A simple lookup like the one below takes > 3000 ms.
SELECT a, b, c FROM t WHERE a LIKE '?%'
Yes, for now, the need is to simply find the rows where "a" starts with the entered text. I want to bring the speed of look up down to under 100 ms (the appearance of instantaneous). Suggestions? Seems to me that full text search won't help here as my column of text is too short, but I would be happy to try that if worthwhile.
Oh, btw I also loaded the exact same data in mongodb and indexed column "a". Loading the data in mongodb was amazingly quick (mongodb++). Both mongodb and Postgres are pretty much instantaneous when doing exact lookups. But, Postgres actually shines when doing trailing wildcard searches as above, consistently taking about 1/3 as long as mongodb. I would be happy to pursue mongodb if I could speed that up as this is only a readonly operation.
Update: First, a couple of EXPLAIN ANALYZE outputs
EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a LIKE 'abcd%'
"Seq Scan on t (cost=0.00..282075.55 rows=802 width=40)
(actual time=1220.132..1220.132 rows=0 loops=1)"
" Filter: ((a)::text ~~ 'abcd%'::text)"
"Total runtime: 1220.153 ms"
I actually want to compare Lower(a) with the search term which is always at least 4 characters long, so
EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE Lower(a) LIKE 'abcd%'
"Seq Scan on t (cost=0.00..302680.04 rows=40612 width=40)
(actual time=4.681..3321.387 rows=788 loops=1)"
" Filter: (lower((a)::text) ~~ 'abcd%'::text)"
"Total runtime: 3321.504 ms"
So I created an index
CREATE INDEX idx_t ON t USING btree (Lower(Substring(a, 1, 4) ));
"Seq Scan on t (cost=0.00..302680.04 rows=40612 width=40)
(actual time=3243.841..3243.841 rows=0 loops=1)"
" Filter: (lower((a)::text) = 'abcd%'::text)"
"Total runtime: 3243.860 ms"
Seems the only time an index is being used is when I am looking for an exact match
EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a = 'abcd'
"Index Scan using idx_t on geonames (cost=0.00..57.89 rows=13 width=40)
(actual time=40.831..40.923 rows=17 loops=1)"
" Index Cond: ((ascii_name)::text = 'Abcd'::text)"
"Total runtime: 40.940 ms"
Found a solution by implementing an index with varchar_pattern_ops, and am now looking for even quicker lookups.
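That index presumably looks something like this (a sketch; the column is character varying, so varchar_pattern_ops is the matching operator class):
-- Supports left-anchored patterns such as: WHERE a LIKE 'abcd%'
CREATE INDEX idx_t_a_pattern ON t (a varchar_pattern_ops);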
The PostgreSQL query planner is smart, but not an AI. To make it use an index on an expression use the exact same form of expression in the query.
With an index like this:
CREATE INDEX t_a_lower_idx ON t (lower(substring(a, 1, 4)));
Or simpler in PostgreSQL 9.1:
CREATE INDEX t_a_lower_idx ON t (lower(left(a, 4)));
Use this query:
SELECT * FROM t WHERE lower(left(a, 4)) = 'abcd';
Which is 100% functionally equivalent to:
SELECT * FROM t WHERE lower(a) LIKE 'abcd%'
Or:
SELECT * FROM t WHERE a ILIKE 'abcd%'
But not:
SELECT * FROM t WHERE a LIKE 'abcd%'
This is a functionally different query and you need a different index:
CREATE INDEX t_a_idx ON t (substring(a, 1, 4));
Or simpler with PostgreSQL 9.1:
CREATE INDEX t_a_idx ON t (left(a, 4));
And use this query:
SELECT * FROM t WHERE left(a, 4) = 'abcd';
Left-anchored search terms of variable length, case insensitive. Index:
Edit: Almost forgot: If you run your db with any other locale than the default 'C', you need to specify the operator class explicitly - text_pattern_ops in my example:
CREATE INDEX t_a_lower_idx
ON t (lower(left(a, <insert_max_length>)) text_pattern_ops);
Query:
SELECT * FROM t WHERE lower(left(a, <insert_max_length>)) ~~ 'abcdef%';
This can utilize the index and is almost as fast as the variant with a fixed length.
You may be interested in this post on dba.SE with more details about pattern matching, especially the last part about the operators ~>=~ and ~<~.
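For context, those are the text-pattern comparison operators the planner uses when it rewrites a left-anchored LIKE against a *_pattern_ops index into a range scan, roughly like this (a sketch):
-- Roughly what the planner does with: WHERE lower(left(a, 4)) LIKE 'abcd%'
SELECT * FROM t
WHERE lower(left(a, 4)) ~>=~ 'abcd'
  AND lower(left(a, 4)) ~<~ 'abce';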
It is clearly documented that a regular expression search does not use any indexes, for a variety of implementation reasons. The only possible way of using indexes with regular expressions is limited to a prefix search like a*.

Postgresql index on xpath expression gives no speed up

We are trying to create OEBS-analog functionality in PostgreSQL. Let's say we have a form constructor and need to store form results in the database (e.g. email bodies). In Oracle you could use a table with ~150 columns (and some mapping stored elsewhere) to store each field in a separate column. In contrast to Oracle, we would like to store the whole form in a PostgreSQL xml field.
The example of the tree is
<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<object_id>2</object_id>
<pack_form_id>23</pack_form_id>
<prod_form_id>34</prod_form_id>
</row>
We would like to search through this field.
The test table contains 400k rows and the following select executes in 90 seconds:
select *
from params
where (xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int=34
So I created this index:
create index prod_form_idx
ON params using btree(
((xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int)
);
And it made no difference: still 90 seconds of execution. The EXPLAIN plan shows this:
Bitmap Heap Scan on params (cost=40.29..6366.44 rows=2063 width=292)
Recheck Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
-> Bitmap Index Scan on prod_form_idx (cost=0.00..39.78 rows=2063 width=0)
Index Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
I am not a great plan interpreter, so I suppose this means the index is being used. The question is: where's all the speed? And what can I do to optimize this kind of query?
Well, at least the index is used. You get a bitmap index scan instead of a normal index scan though, which means the xpath() function will be called lots of times.
Let's do a little check:
CREATE TABLE foo ( id serial primary key, x xml, h hstore );
insert into foo (x,h) select XMLPARSE( CONTENT '<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<object_id>2</object_id>
<pack_form_id>' || n || '</pack_form_id>
<prod_form_id>34</prod_form_id>
</row>' ),
('object_id=>2,prod_form_id=>34,pack_form_id=>'||n)::hstore
FROM generate_series( 1,100000 ) n;
test=> EXPLAIN ANALYZE SELECT count(*) FROM foo;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate (cost=4821.00..4821.01 rows=1 width=0) (actual time=24.694..24.694 rows=1 loops=1)
-> Seq Scan on foo (cost=0.00..4571.00 rows=100000 width=0) (actual time=0.006..13.996 rows=100000 loops=1)
Total runtime: 24.730 ms
test=> explain analyze select * from foo where (h->'pack_form_id')='123';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..5571.00 rows=500 width=68) (actual time=0.075..48.763 rows=1 loops=1)
Filter: ((h -> 'pack_form_id'::text) = '123'::text)
Total runtime: 36.808 ms
test=> explain analyze select * from foo where ((xpath('//pack_form_id/text()'::text, x))[1]::text) = '123';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..5071.00 rows=500 width=68) (actual time=4.271..3368.838 rows=1 loops=1)
Filter: (((xpath('//pack_form_id/text()'::text, x, '{}'::text[]))[1])::text = '123'::text)
Total runtime: 3368.865 ms
As we can see,
scanning the whole table with count(*) takes 25 ms
extracting one key/value from an hstore adds a small extra cost, about 0.12 µs/row
extracting one key/value from xml using xpath adds a huge cost, about 33 µs/row
Conclusions:
xml is slow (but everyone knows that)
if you want to put a flexible key/value store in a column, use hstore
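If you go the hstore route, an expression index makes that lookup indexable as well (a sketch):
-- B-tree expression index on the single key used in the test query above.
CREATE INDEX foo_pack_form_idx ON foo ((h -> 'pack_form_id'));
-- After which: SELECT * FROM foo WHERE (h -> 'pack_form_id') = '123'; can use an index scan.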
Also, since your xml data is pretty big, it will be TOASTed (compressed and stored out of the main table). This makes the rows in the main table much smaller, hence more rows per page, which reduces the efficiency of bitmap scans since all rows on a page have to be rechecked.
You can fix this though. For some reason the xpath() function (which is very slow, since it handles xml) has the same cost (1 unit) as say, the integer operator "+"...
update pg_proc set procost=1000 where proname='xpath';
You may need to tweak the cost value. When given the right info, the planner knows xpath is slow and will avoid a bitmap index scan, using an index scan instead, which doesn't need rechecking the condition for all rows on a page.
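For what it's worth, ALTER FUNCTION ... COST should achieve the same thing without touching pg_proc directly (a sketch covering both xpath signatures):
ALTER FUNCTION xpath(text, xml) COST 1000;
ALTER FUNCTION xpath(text, xml, text[]) COST 1000;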
Note that this does not solve the row estimates problem. Since you can't ANALYZE the inside of the xml (or hstore) you get default estimates for the number of rows (here, 500). So, the planner may be completely wrong and choose a catastrophic plan if some joins are involved. The only solution to this is to use proper columns.