Can (aggregate) functions be used to define a column? - postgresql

Assume a table like this one:
a | b | total
--|---|------
1 | 2 | 3
4 | 7 | 11
…
CREATE TEMPORARY TABLE summedup (
a double precision DEFAULT 0
, b double precision DEFAULT 0
--, total double precision
);
INSERT INTO summedup (a, b) VALUES (1, 2);
INSERT INTO summedup (a, b) VALUES (4, 7);
SELECT a, b, a + b as total FROM summedup;
It's easy to sum up the first two columns on SELECT.
But does Postgres (9.6) also support the ability to define total as the sum of the other two columns? If so:
What is the syntax?
What is this type of operation called? (Aggregates typically sum up cells over multiple rows, not columns.)

What you are looking for is typically called a "computed column".
Postgres 9.6 does not support that (Postgres 12 - to be released in Q4 2019 - will).
But for such a simple sum, I wouldn't bother storing redundant information.
If you don't want to repeat the expression, create a view.
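For reference, in Postgres 12 the column could be declared as a generated column, roughly like this (not available in 9.6):
CREATE TEMPORARY TABLE summedup (
a double precision DEFAULT 0
, b double precision DEFAULT 0
, total double precision GENERATED ALWAYS AS (a + b) STORED  -- computed and stored on every write
);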

I think what you want is a View.
CREATE VIEW table_with_sum AS
SELECT id, a, b, a + b as total FROM summedup;
then you can query the view for the sum.
SELECT total FROM table_with_sum where id=5;
The view does not store the sum for each row; the total column is computed every time you query the view. If your goal is to make your query more efficient, this will not help.
There is another way: add the column to the table and create triggers for insert and update that recompute the total column every time a row is modified.
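A minimal sketch of that trigger approach (the function and trigger names are made up):
ALTER TABLE summedup ADD COLUMN total double precision;

CREATE FUNCTION set_total() RETURNS trigger AS $$
BEGIN
  NEW.total := NEW.a + NEW.b;  -- recompute the sum on every insert/update
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER summedup_total
BEFORE INSERT OR UPDATE ON summedup
FOR EACH ROW EXECUTE PROCEDURE set_total();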

Related

PostGIS: Query z and m dimensions (linestringzm)

Question
I have a system with multiple linestringzm, where the data is structured the following way: [x, y, speed:int, time:int]. The data is structured this way to be able to use ST_SimplifyVW on the x, y and z dimensions, but I still want to be able to query the linestring based on the m dimension e.g. get all linestrings between a time interval.
Is this possible with PostGIS or am I structuring the data incorrectly for my use case?
Example
z = speed e.g. km/h
m = Unix epoch time
CREATE TABLE t (id int NOT NULL, geom geometry(LineStringZM,4326), CONSTRAINT t_pkey PRIMARY KEY (id));
INSERT INTO t VALUES (1, 'SRID=4326;LINESTRING ZM(30 10 5 1620980688, 30 15 10 1618388688, 30 20 15 1615710288, 30 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (2, 'SRID=4326;LINESTRING ZM(50 10 5 1620980688, 50 15 10 1618388688, 50 20 15 1615710288, 50 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (3, 'SRID=4326;LINESTRING ZM(20 10 5 1620980688, 20 15 10 1618388688, 20 20 15 1615710288, 20 25 20 1620980688)'::geometry);
Use case A: Simplify the geometry based on x, y and z
This can be accomplished by e.g. ST_SimplifyVW which keep the m dimension after simplification.
Use case B: Query geometry based on the m dimension
I have a set of linestringzm which I want to query based on my time dimension (m). The result is either the full geometry, if every m is between e.g. 1618388000 and 1618388700, or the part of the geometry which satisfies the predicate. What is the most efficient way to query the data?
If you want to check every single point of your LineString you can ST_DumpPoints it and get the M dimension with ST_M. After that, extract the subset of points with matching M values and rebuild a LineString by applying ST_MakeLine with a GROUP BY:
WITH j AS (
  SELECT id, geom, (ST_DumpPoints(geom)).geom AS p
  FROM t
)
SELECT id, ST_AsText(ST_MakeLine(p))
FROM j
WHERE ST_M(p) BETWEEN 1618388000 AND 1618388700
GROUP BY id;
Demo: db<>fiddle
Note: Depending on your table and LineString sizes this query may become pretty slow, as values are parsed at query time and therefore aren't indexed. Imho a more elegant alternative would be ..
.. 1) to create a tstzrange column
ALTER TABLE t ADD COLUMN line_interval tstzrange;
.. 2) to properly index it
CREATE INDEX idx_t_line_interval ON t USING gist (line_interval);
.. and 3) to populate it with the time of geom's first and last points:
UPDATE t SET line_interval =
tstzrange(
to_timestamp(ST_M(ST_PointN(geom,1))),
to_timestamp(ST_M(ST_PointN(geom,ST_NPoints(geom)))));
After that you can speed things up by checking whether the indexed column overlaps with a given interval. This will significantly improve query time:
SELECT * FROM t
WHERE line_interval && tstzrange(
to_timestamp(1618138148),
to_timestamp(1618388700));
Demo: db<>fiddle
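If you also need just the part of each geometry that satisfies the predicate, you could combine the indexed range check with the per-point extraction shown above; a sketch (untested, same table t and interval as before):
WITH candidates AS (
  SELECT id, geom
  FROM t
  WHERE line_interval && tstzrange(to_timestamp(1618388000), to_timestamp(1618388700))
)
SELECT c.id, ST_AsText(ST_MakeLine(dp.geom ORDER BY dp.path)) AS part
FROM candidates c
CROSS JOIN LATERAL ST_DumpPoints(c.geom) AS dp
WHERE ST_M(dp.geom) BETWEEN 1618388000 AND 1618388700
GROUP BY c.id;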
Further reading:
ST_M
ST_PointN
ST_NPoints
PostgreSQL Built-in Range Types

Can PostgreSQL LAG() function refer to itself?

I've just discovered the LAG() function in PostgreSQL and I've been experimenting to see what it can achieve. I thought I might calculate a factorial with it, so I wrote
SELECT i, i * lag(factorial, 1, 1) OVER (ORDER BY i, 1) as factorial FROM generate_series(1, 10) as i;
But the online IDE complains: 42703 column "factorial" does not exist.
Is there any way I can access the result of previous LAG call?
You can't refer to the column recursively in its definition.
However, you can express the factorial calculation as:
SELECT i, EXP(SUM(LN(i)) OVER w)::int factorial
FROM generate_series(1, 10) i
WINDOW w AS (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
-- outputs:
i | factorial
----+-----------
1 | 1
2 | 2
3 | 6
4 | 24
5 | 120
6 | 720
7 | 5040
8 | 40320
9 | 362880
10 | 3628800
(10 rows)
PostgreSQL does support an advanced SQL feature called a recursive query, which can also be used to express the factorial table recursively:
WITH RECURSIVE series AS (
  SELECT i FROM generate_series(1, 10) i
)
, rec AS (
  SELECT i, 1 factorial FROM series WHERE i = 1
  UNION ALL
  SELECT series.i, series.i * rec.factorial
  FROM series
  JOIN rec ON series.i = rec.i + 1
)
SELECT *
FROM rec;
what EXP(SUM(LN(i)) OVER w) does:
This exploits the mathematical identities that:
[1]: log(a * b * c) = log (a) + log (b) + log (c)
[2]: exp (log a) = a
[combining 1&2]: exp(log a + log b + log c) = a * b * c
SQL does not have an aggregate multiply operation, so to emulate one we first take the log of each value, then use the sum aggregate function to get the log of the values' product, and finally invert that with the exponentiation step.
This works as long as the values being multiplied are positive, since log is undefined for 0 and negative numbers. If you have zeros or negative numbers, the trick is: if any value is 0, the whole product is 0; otherwise, if the count of negative values is even the result is positive, else it is negative. Alternatively, you could move to the complex plane and use the identity Log(-r) = ln(r) + iπ for r > 0.
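A hedged sketch of that zero/negative handling, using made-up sample data. The NULLIF keeps LN from being called on zero, since the window aggregates are evaluated for every row regardless of which CASE branch is returned:
SELECT x,
       CASE
         WHEN BOOL_OR(x = 0) OVER w THEN 0                           -- any zero makes the product 0
         WHEN (SUM((x < 0)::int) OVER w) % 2 = 0
           THEN  EXP(SUM(LN(ABS(NULLIF(x, 0)))) OVER w)              -- even count of negatives: positive
         ELSE  -EXP(SUM(LN(ABS(NULLIF(x, 0)))) OVER w)               -- odd count of negatives: negative
       END AS running_product
FROM (VALUES (-3), (-2), (5), (0), (4)) AS t(x)
WINDOW w AS (ORDER BY x ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);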
what ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW does
This declares an expanding window frame that includes all preceding rows, and the current row.
e.g.
when i equals 1 the values in this window frame are {1}
when i equals 2 the values in this window frame are {1,2}
when i equals 3 the values in this window frame are {1,2,3}
what is a recursive query
A recursive query lets you express recursive logic using SQL. Recursive queries are often used to generate parent-child relationships from relational data (think manager-report, or product classification hierarchy), but they can generally be used to query any tree like structure.
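For instance, a minimal sketch of walking such a parent-child hierarchy (the employees table and its columns are hypothetical):
WITH RECURSIVE subordinates AS (
  SELECT id, manager_id, name
  FROM employees
  WHERE id = 1                      -- start from the root manager
  UNION ALL
  SELECT e.id, e.manager_id, e.name
  FROM employees e
  JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;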
Here is a SO answer I wrote a while back that illustrates and explains some of the capabilities of recursive queries.
There are also a tonne of useful tutorials on recursive queries. It is a very powerful SQL-language feature and solves a type of problem that is very difficult to do without recursion.
Hope this gives you more insight into what the code does. Happy learning!

Best way to maintain an ordered list in PostgreSQL?

Say I have a table called list, where there are items like these (the ids are random uuids):
id rank text
--- ----- -----
x 0 Hello
x 1 World
x 2 Foo
x 3 Bar
x 4 Baz
I want to maintain the property that rank column always goes from 0 to n-1 (n being the number of rows)---if a client asks to insert an item with rank = 3, then the pg server should push the current 3 and 4 to 4 and 5, respectively:
id rank text
--- ----- -----
x 0 Hello
x 1 World
x 2 Foo
x 3 New Item!
x 4 Bar
x 5 Baz
My current strategy is to have a dedicated insertion function add_item(item) that scans through the table, filters out items with rank equal to or greater than that of the item being inserted, and increments those ranks by one. However, I think this approach will run into all sorts of problems, like race conditions.
Is there a more standard practice or more robust approach?
Note: The rank column is completely independent of the rest of the columns, and insertion is not the only operation I need to support. Think of it as the back-end of a sortable to-do list, where the user can add/delete/reorder the items on the fly.
Doing verbatim what you suggest might be difficult or not possible at all, but I can suggest a workaround. Maintain a new column ts which stores the time a record is inserted. Then insert the current time along with the rest of the record, i.e.
id rank text ts
--- ----- ----- --------------------
x 0 Hello 2017-12-01 12:34:23
x 1 World 2017-12-03 04:20:01
x 2 Foo ...
x 3 New Item! 2017-12-12 11:26:32
x 3 Bar 2017-12-10 14:05:43
x 4 Baz ...
Now we can easily generate the ordering you want via a query:
SELECT id, rank, text,
ROW_NUMBER() OVER (ORDER BY rank, ts DESC) new_rank
FROM yourTable;
This would generate 0 to 5 ranks in the above sample table. The basic idea is to just use the already existing rank column, but to let the timestamp break the tie in ordering should the same rank appear more than once.
You can wrap it up in a function if you think it's worth it:
t=# with u as (
update r set rank = rank + 1 where rank >= 3
)
insert into r values('x',3,'New val!')
;
INSERT 0 1
the result:
t=# select * from r;
id | rank | text
----+------+----------
x | 0 | Hello
x | 1 | World
x | 2 | Foo
x | 3 | New val!
x | 4 | Bar
x | 5 | Baz
(6 rows)
Also worth mentioning: you might run into concurrency ("race condition") problems on highly loaded systems. The code above is just a sample.
You can have a “computed rank” which is a double precision and a “displayed rank” which is an integer that is computed using the row_number window function on output.
When a row is inserted that should rank between two rows, compute the new rank as the arithmetic mean of the two ranks.
The advantage is that you don't have to update existing rows.
The downside is that you have to calculate the displayed ranks before you can insert a new row, so that you know where to insert it.
This solution (like all others) is subject to race conditions.
To deal with these, you can either use table locks or serializable transactions.
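A minimal sketch of that idea (column names are made up; assume list has a double precision computed_rank column and that id has a default):
-- the displayed 0-based rank is derived on output from the ordering of computed_rank
SELECT id, text,
       ROW_NUMBER() OVER (ORDER BY computed_rank) - 1 AS display_rank
FROM list;

-- to insert between the neighbors whose computed ranks are, say, 2.0 and 3.0,
-- store their arithmetic mean; no existing rows need to be updated
INSERT INTO list (computed_rank, text)
VALUES ((2.0 + 3.0) / 2, 'New Item!');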
The only way to prevent a race condition would be to lock the table
https://www.postgresql.org/docs/current/sql-lock.html
Of course this would slow you down if there are lots of updates and inserts.
If you can somehow limit the scope of your updates then you can do a SELECT ... FOR UPDATE on that scope. For example, if the records have a parent_id you can do a SELECT FOR UPDATE on the parent record first; any other insert that does the same SELECT FOR UPDATE would have to wait until your transaction is done.
https://www.postgresql.org/docs/current/explicit-locking.html
Read the section on advisory locks to see if you can use those in your application. They are not enforced by the system so you'll need to be careful of how you write your application.
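A rough sketch of that approach, using a transaction-scoped advisory lock keyed on some identifier of the list being reordered (the key 42 is just a stand-in):
BEGIN;
SELECT pg_advisory_xact_lock(42);   -- concurrent re-rank operations on the same key queue up here
UPDATE list SET rank = rank + 1 WHERE rank >= 3;
INSERT INTO list (id, rank, text) VALUES ('x', 3, 'New Item!');
COMMIT;                             -- the advisory lock is released automatically at commit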

PostgreSQL hierarchical nested set huge database

I have a database that must store thousands of scenarios (each scenario with a single unix_timestamp value). Each scenario has 1,800,000 records organized in a nested set structure.
The general table structure is given by:
table_skeleton:
- unix_timestamp integer
- lft integer
- rgt integer
- value
Usually, my SELECTs retrieve all nested values within a specific scenario, for example:
SELECT * FROM table_skeleton WHERE unix_timestamp = 123 AND lft >= 10 AND rgt <= 53
So I hierarchically divided my table into a master table with child tables grouped by date, for example:
table_skeleton_201303 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
and
table_skeleton_201304 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
I also created an index on each child table matching the usual searches I expect, for example:
Create Index idx_201303
on table_skeleton_201303
using btree(unix_timestamp, lft, rgt)
It improved the retrieval, but it still takes about 1 minute for each select.
I imagined this was because the index was too big to always fit in memory, so I tried to create a partial index for each timestamp, for example:
Create Index idx_201303_1362981600
on table_skeleton_201303
using btree(lft, rgt)
WHERE unix_timestamp = 1362981600
And in fact the second type of index is much, much, much smaller than the general one. However, when I run EXPLAIN ANALYZE for the SELECT I showed earlier, the query planner ignores my new partial index and keeps using the giant old one.
Is there a reason for that?
Is there any new approach to optimize such type of huge nested set hierarchical database?
When you filter a table by field_a > x AND field_b > y, an index on (field_a, field_b) will (actually just may, depending on the distribution and the percentage of rows with field_a > x, as per the statistics collected) only be used for "field_a > x"; the condition "field_b > y" is then checked sequentially against each row that scan returns.
In the case above, two indexes, one for each field, could both be used and their results joined, the internal equivalent of:
SELECT *
FROM table t
JOIN (SELECT id FROM table WHERE field_a > x) ta ON (ta.id = t.id)
JOIN (SELECT id FROM table WHERE field_b > y) tb ON (tb.id = t.id);
There is a chance you could benefit from a GiST index, treating your lft and rgt fields as points:
CREATE INDEX ON table USING GIST (unix_timestamp, point(lft, rgt));
SELECT * FROM table
WHERE unix_timestamp = 123 AND
      point(lft, rgt) <@ box(point(10, '-inf'), point('inf', 53));
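Note that including a plain integer column like unix_timestamp in a GiST index normally requires the btree_gist extension; something along these lines (the index name is made up):
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets btree-typed columns participate in GiST indexes
CREATE INDEX idx_201303_gist ON table_skeleton_201303
USING GIST (unix_timestamp, point(lft, rgt));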

Postgresql Select all columns and column names with a specific value for a row

I have a table with many (1000+) columns and many rows (~1M). The columns hold either the value 1 or NULL.
I want to be able to retrieve, for a specific row (user), the column names that have a value of 1.
Since there are many columns in the table, spelling them all out would yield an extremely long query.
You're doing something SQL is quite bad at - dynamic access to columns, or treating a row as a set. It'd be nice if this were easier, but it doesn't work well with SQL's typed nature and the concept of a relation. Working with your data set in its current form is going to be frustrating; consider storing an array, json, or hstore of values instead.
Actually, for this particular data model, you could probably use a bitfield. See bit(n) and bit varying(n).
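If that direction appeals, a tiny sketch (table and column names are made up):
CREATE TABLE user_flags (id serial PRIMARY KEY, flags bit varying(1000));
INSERT INTO user_flags (flags) VALUES (B'1010');
-- get_bit counts from 0 at the leftmost position; this finds users with flag 2 set
SELECT id FROM user_flags WHERE get_bit(flags, 2) = 1;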
It's still possible to make a working query with your current model using PostgreSQL extensions, though.
Given sample:
CREATE TABLE blah (id serial primary key, a integer, b integer, c integer);
INSERT INTO blah(a,b,c) VALUES (NULL, NULL, 1), (1, NULL, 1), (NULL, NULL, NULL), (1, 1, 1);
I would unpivot each row into a key/value set using hstore (or, in newer PostgreSQL versions, the json functions). SQL itself provides no way to dynamically access columns, so we have to use an extension. So:
SELECT id, hs FROM blah, LATERAL hstore(blah) hs;
then extract the hstores to sets:
SELECT id, k, v FROM blah, LATERAL each(hstore(blah)) kv(k,v);
... at which point you can filter for values matching the criteria. Note that all columns have been converted to text, so you may want to cast them back:
SELECT id, k FROM blah, LATERAL each(hstore(blah)) kv(k,v) WHERE v::integer = 1;
You also need to exclude id from matching, so:
regress=> SELECT id, k FROM blah, LATERAL each(hstore(blah)) kv(k,v) WHERE v::integer = 1 AND
k <> 'id';
id | k
----+---
1 | c
2 | a
2 | c
4 | a
4 | b
4 | c
(6 rows)
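For reference, a roughly equivalent query using the built-in JSON functions instead of the hstore extension (PostgreSQL 9.3+); note the values come back as text here too:
SELECT id, k
FROM blah, LATERAL json_each_text(row_to_json(blah)) kv(k, v)
WHERE k <> 'id' AND v = '1';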