PostgreSQL hierarchical nested set huge database - postgresql

I have a database that must store thousands of scenarios (each scenario with a single unix_timestamp value). Each scenario has 1,800,000 records organized in a Nested Set structure.
The general table structure is given by:
table_skeleton:
- unix_timestamp integer
- lft integer
- rgt integer
- value
Usually, my SELECTs retrieve all nested values within a specific scenario, for example:
SELECT * FROM table_skeleton WHERE unix_timestamp = 123 AND lft >= 10 AND rgt <= 53
So I partitioned the table into a master table and child tables grouped by date, for example:
table_skeleton_201303 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
and
table_skeleton_201304 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
I also created an index on each child table matching the usual searches I expect, for example:
Create Index idx_201303
on table_skeleton_201303
using btree(unix_timestamp, lft, rgt)
It improved the retrieval, but it still takes about 1 minute for each select.
I assumed this was because the index was too big to always fit in memory, so I tried creating a partial index for each timestamp, for example:
Create Index idx_201303_1362981600
on table_skeleton_201303
using btree(lft, rgt)
WHERE unix_timestamp = 1362981600
And indeed the second type of index is much, much smaller than the general one. However, when I run EXPLAIN ANALYZE for the SELECT shown above, the query planner ignores my new partial index and keeps using the giant old one.
Is there a reason for that?
Is there any new approach to optimize such type of huge nested set hierarchical database?

When you filter a table by field_a > x and field_b > y, an index on (field_a, field_b) will (or rather may, depending on the data distribution and the percentage of rows with field_a > x, as per the collected statistics) only be used for "field_a > x"; field_b > y is then checked sequentially on the rows found.
In a case like that, two indexes, one per field, could each be used and their results combined, the internal equivalent of:
SELECT *
FROM table t
JOIN (
  SELECT id FROM table WHERE field_a > x) ta ON (ta.id = t.id)
JOIN (
  SELECT id FROM table WHERE field_b > y) tb ON (tb.id = t.id);
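In concrete terms, that means two single-column indexes rather than one composite index; a minimal sketch with the generic names used above:
CREATE INDEX ON table USING btree (field_a);
CREATE INDEX ON table USING btree (field_b);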
There is a chance you could benefit from a GiST index, treating your lft and rgt fields as points:
CREATE INDEX ON table USING GIST (unix_timestamp, point(lft, rgt));
SELECT * FROM table
WHERE unix_timestamp = 123 AND
point(lft,rgt) <@ box(point(10,'-inf'), point('inf',53));
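One caveat with the GiST approach: combining a plain integer column such as unix_timestamp with a geometric expression in a single GiST index requires the btree_gist extension, which provides GiST operator classes for scalar types. A sketch against one of the child tables from the question (the index name is made up):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX idx_201303_gist
ON table_skeleton_201303
USING gist (unix_timestamp, point(lft, rgt));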

Related

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
Also, I have a table that uses this composite type as an array of mydata_t.
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t in data_list, where mydata_t.user_id is 100000
But I don't know which array element's user_id is equal to 100000
So I have to search first to find the element whose user_id is equal to 100000 ... that's my problem ... I don't know how to write that query. In fact, I want to update the value of the array element whose user_id is equal to 100000 (and where the id of tbl is, for example, 1). What would my query be?
Something like this (I know it's wrong !!!)
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of one of the tbl.data elements to 'XXXX' where the element's user_id is equal to 11.
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the existing value of value, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the existing value of value then the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
You should keep in mind, though, that there is an awful lot of work going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish and you cannot use an index on this. Furthermore, updates actually insert a new row in the underlying file on disk, and if your array has more than a few entries this will involve substantial work. This gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. All of this happens behind the scenes, so it will work, but at a performance penalty.
Even though array_replace sounds like changes are made in place (and they indeed are in memory), the UPDATE command will write a completely new tuple to disk. So if you have 4,000 array elements, that means at least 40kB of data has to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table of its own (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
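A minimal sketch of that normalized layout, assuming data_list keeps a reference back to tbl (names are only suggestions):
CREATE TABLE data_list
(
id integer NOT NULL REFERENCES tbl (id), -- parent row in tbl
user_id integer NOT NULL,
value character(4),
PRIMARY KEY (id, user_id)
);
The primary key above already provides the index needed for the UPDATE, since it leads with (id, user_id).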

Fast new row insertion if a value of a column depends on previous value in existing row

I have a table cusers with a primary key:
primary key(uid, lid, cnt)
And I try to insert some values into the table:
insert into cusers (uid, lid, cnt, dyn, ts)
values
(A, B, C, (
select C - cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 1
), now())
on conflict do nothing
Quite often (with a probability of 98%) a row cannot be inserted into cusers because it violates the primary key constraint, so the expensive select query would not need to be executed at all. But as far as I can see, PostgreSQL first evaluates the select query for the dyn column and only then rejects the row because of the (uid, lid, cnt) violation.
What is the best way to insert rows quickly in such a situation?
Another explanation
I have a system where one row depends on another. Here is an example:
(x, x, 2, 2, <timestamp>)
(x, x, 5, 3, <timestamp>)
Two columns contain an absolute value (2 and 5) and a relative value (2, and 5 - 2). Each time I insert a new row it should:
avoid same rows (see primary key constraint)
if the new row differs, it should compute the difference and put it into the dyn column (so I take the last inserted row for the user according to the timestamp and subtract the values).
Another solution I've found is to use returning uid, lid, ts for inserts and get the user ids that were actually inserted - this is how I know they differ from existing rows. Then I update the inserted values:
update cusers
set dyn = (
select max(cnt) - min(cnt)
from (
select cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 2) t
)
where uid = A and lid = B and ts = TS
But it is not a fast approach either, as it has to scan the ts column to find the two last inserted rows for each user. I need a fast insert query, as I insert millions of rows at a time (but I do not write duplicates).
What can the solution be? Maybe I need a new index for this? Thanks in advance.
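For reference, the correlated lookups above (order by ts desc limit 1 / limit 2 per user) would at least need an index matching that ordering; something along these lines (the index name is arbitrary):
create index idx_cusers_uid_lid_ts
on cusers (uid, lid, ts desc);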

Query table by a value in the second dimension of a two dimensional array column

WHAT I HAVE
I have a table with the following definition:
CREATE TABLE "Highlights"
(
id uuid,
chunks numeric[][]
)
WHAT I NEED TO DO
I need to query the data in the table using the following predicate:
... WHERE id = 'some uuid' and chunks[????????][1] > 10 and chunks[????????][3] < 20
What should I put instead of [????????] in order to scan all items in the first dimension of the array?
Notes
I'm not entirely sure that chunks[][1] is even close to what I need.
All I need is to test a row, whether its chunks column contains a two dimensional array, that has in any of its tuples some specific values.
Maybe there's a better alternative, but this might do - you just go over the first dimension of each array and test your condition:
select *
from highlights as h
where
exists (
select
from generate_series(1, array_length(h.chunks, 1)) as tt(i)
where
-- your condition goes here
h.chunks[tt.i][1] > 10 and h.chunks[tt.i][3] < 20
)
db<>fiddle demo
Update: as @arie-r pointed out, it'd be better to use the generate_subscripts function:
select *
from highlights as h
where
exists (
select *
from generate_subscripts(h.chunks, 1) as tt(i)
where
h.chunks[tt.i][3] = 6
)
db<>fiddle demo

Use sum function in calculated column

Is it possible to use a sum function in a calculated column?
If yes, I would like to create a calculated column that calculates the sum of a column in the same table where the date is smaller than the date of this entry. Is this possible?
And lastly, would this optimize repeated calls on this value over the view exemplified below?
SELECT ProductGroup, SalesDate, (
SELECT SUM(Sales)
FROM SomeList
WHERE (ProductGroup= KVU.ProductGroup) AND (SalesDate<= KVU.SalesDate)) AS cumulated
FROM SomeList AS KVU
Is it possible to use a sum function in a calculated column?
Yes, it's possible using a scalar valued function (scalar UDF) for your computed column, but this would be a disaster. Using scalar UDFs for computed columns destroys performance. Adding a scalar UDF that accesses data (which would be required here) makes things even worse.
It sounds to me like you just need a good ol' fashioned index to speed things up. First some sample data:
IF OBJECT_ID('dbo.somelist','U') IS NOT NULL DROP TABLE dbo.somelist;
GO
CREATE TABLE dbo.somelist
(
ProductGroup INT NOT NULL,
[Month] TINYINT NOT NULL CHECK ([Month] <= 12),
Sales DECIMAL(10,2) NOT NULL
);
INSERT dbo.somelist
VALUES (1,1,22),(2,1,45),(2,1,25),(2,1,19),(1,2,100),(1,2,200),(2,2,50.55);
and the correct index:
CREATE NONCLUSTERED INDEX nc_somelist ON dbo.somelist(ProductGroup,[Month])
INCLUDE (Sales);
With this index in place this query would be extremely efficient:
SELECT s.ProductGroup, s.[Month], SUM(s.Sales)
FROM dbo.somelist AS s
GROUP BY s.ProductGroup, s.[Month];
If you needed to get a COUNT by month & product group you could create an indexed view like so:
CREATE VIEW dbo.vw_somelist WITH SCHEMABINDING AS
SELECT s.ProductGroup, s.[Month], TotalSales = COUNT_BIG(*)
FROM dbo.somelist AS s
GROUP BY s.ProductGroup, s.[Month];
GO
CREATE UNIQUE CLUSTERED INDEX uq_cl__vw_somelist ON dbo.vw_somelist(ProductGroup, [Month]);
Once that indexed view was in place your COUNTs would be pre-aggregated. You cannot, however, include SUM over a nullable expression in an indexed view.
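As an aside, the cumulated value from the question's view can also be expressed with a windowed SUM (available in SQL Server 2012+ and in PostgreSQL); a sketch using the question's column names:
SELECT ProductGroup,
SalesDate,
SUM(Sales) OVER (PARTITION BY ProductGroup ORDER BY SalesDate) AS cumulated
FROM SomeList;
With ORDER BY present, the default RANGE frame includes ties on SalesDate, which matches the <= comparison in the original subquery.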

Retrieving multiple values from a large jsonb field faster (postgresql 9.4)

tl;dr
Using PSQL 9.4, is there a way to retrieve multiple values from a jsonb field, such as you would with the imaginary function:
jsonb_extract_path(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key'])
With the hope of speeding up the otherwise almost linear time required to select multiple values (1 value = 300ms, 2 values = 450ms, 3 values = 600ms)
Background
I have the following jsonb table:
CREATE TABLE "public"."analysis" (
"date" date NOT NULL,
"name" character varying (10) NOT NULL,
"country" character (3) NOT NULL,
"x" jsonb,
PRIMARY KEY(date,name)
);
With roughly 100 000 rows, where each row has a jsonb dictionary with 90+ keys and corresponding values. I'm trying to write an SQL query to select a few (< 10) keys+values in a fairly quick way (< 500 ms)
Index and querying: 190ms
I started by adding an index:
CREATE INDEX ON analysis USING GIN (x);
This makes querying based on values in the "x" dictionary fast, such as this:
SELECT date, name, country FROM analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
This takes ~190 ms (acceptable for us)
Retrieving dictionary values
However, once I start adding keys to return in the SELECT part, execution time rises almost linearly:
1 value: 300ms
select jsonb_extract_path(x, 'a_dictionary_key') from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
Takes 366ms (+175ms)
select x#>'{a_dictionary_key}' as gear_down_altitude from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100 ;
Takes 300ms (+110ms)
3 values: 600ms
select jsonb_extract_path(x, 'a_dictionary_key'), jsonb_extract_path(x, 'a_second_dictionary_key'), jsonb_extract_path(x, 'a_third_dictionary_key') from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
Takes 600ms (+410, or +100 for each value selected)
select x#>'{a_dictionary_key}' as a_dictionary_key, x#>'{a_second_dictionary_key}' as a_second_dictionary_key, x#>'{a_third_dictionary_key}' as a_third_dictionary_key from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100 ;
Takes 600ms (+410, or +100 for each value selected)
Retrieving more values faster
Is there a way to retrieve multiple values from a jsonb field, such as you would with the imaginary function:
jsonb_extract_path(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key'])
Which could possibly speed up these lookups. It can return them either as columns, as a list/array, or even as a json object.
Retrieving an array using PL/Python
Just for the heck of it I made a custom function using PL/Python, but that was much slower (5s+), possibly due to json.loads:
CREATE OR REPLACE FUNCTION retrieve_objects(data jsonb, k VARCHAR[])
RETURNS TEXT[] AS $$
if not data:
return []
import simplejson as json
j = json.loads(data)
l = []
for i in k:
l.append(j[i])
return l
$$ LANGUAGE plpython2u;
# Usage:
# select retrieve_objects(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key']) from analysis where date > '2014-01-01' and date < '2014-05-01'
Update 2015-05-21
I re-implemented the table using hstore with a GIN index and the performance is almost identical to using jsonb, i.e. not helpful in my case.
You're using the #> operator, which looks like it performs a path search. Have you tried a normal -> lookup? Like:
select json_column->'json_field1'
, json_column->'json_field2'
It would be interesting to see what happened if you used a temporary table. Like:
create temporary table tmp_doclist (doc jsonb)
;
insert into tmp_doclist
(doc)
select x
from analysis
where ... your conditions here ...
;
select doc->'col1'
, doc->'col2'
, doc->'col3'
from tmp_doclist
;
This is hard to test without the data.
Create a custom type
create type my_query_result_type as (
a_dictionary_key float,
a_second_dictionary_key float
)
And your query
select (json_populate_record(null::my_query_result_type, x::json)).* from analysis;
You should be able to use a temporary table instead of a type, which will be created at runtime, making your query dynamic.
But first check whether this helps from the performance point of view.
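If the cast to json in that query turns out to be costly, 9.4 also ships a jsonb counterpart, jsonb_populate_record, which reads the jsonb column directly; a sketch reusing the type defined above:
select (jsonb_populate_record(null::my_query_result_type, x)).*
from analysis;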