Update intermediate result - postgresql

EDIT
As requested a little background of what I want to achieve. I have a table that I want to query but I don't want to change the table itself. Next the result of the SELECT query (what I called the 'intermediate table') needs to be cleaned a bit. For example certain cells of certain rows need to be swapped and some strings need to be trimmed. Of course this could all be done as postprocessing in, e.g., Python, but I was hoping to do all of this with one query statement.
Being new to Postgresql I want to update the intermediate table that results from a SELECT statement. So I basically want to edit the resulting table from a SELECT statement in one query. I'd like to prevent having to store the intermediate result.
I've tried the following 'with clause':
with result as (
select
a
from
b
)
update result as r
set
a = 'd'
...but that results in ERROR: relation "result" does not exist, while the following does work:
with result as (
select
a
from
b
)
select
*
from
result
As I said, I'm new to Postgresql so it is entirely possible that I'm using the wrong approach.

Depending on the complexity of the transformations you want to perform, you might be able to munge it into the SELECT, which would let you get away with a single query:
WITH foo AS (SELECT lower(name), freq, cumfreq, rank, vec FROM names WHERE name LIKE 'G%')
SELECT ... FROM foo WHERE ...
Or, for more or less unlimited manipulation options, you could create a temp table that will disappear at the end of the current transaction. That doesn't get the job done in a single query, but it does get it all done on the SQL server, which might still be worthwhile.
db=# BEGIN;
BEGIN
db=# CREATE TEMP TABLE foo ON COMMIT DROP AS SELECT * FROM names WHERE name LIKE 'G%';
SELECT 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-----------------------
GREEN | 0.183 | 11.403 | 35 | 'KRN':1 'green':1
GONZALEZ | 0.166 | 11.915 | 38 | 'KNSL':1 'gonzalez':1
GRAY | 0.106 | 15.921 | 69 | 'KR':1 'gray':1
GONZALES | 0.087 | 18.318 | 94 | 'KNSL':1 'gonzales':1
GRIFFIN | 0.084 | 18.659 | 98 | 'KRFN':1 'griffin':1
(5 rows)
db=# UPDATE foo SET name = lower(name);
UPDATE 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
--------+-------+---------+-------+---------------------
grube | 0.002 | 67.691 | 7333 | 'KRP':1 'grube':1
gasper | 0.001 | 69.999 | 9027 | 'KSPR':1 'gasper':1
gori | 0.000 | 81.360 | 28946 | 'KR':1 'gori':1
goeltz | 0.000 | 85.471 | 47269 | 'KLTS':1 'goeltz':1
gani | 0.000 | 86.202 | 51743 | 'KN':1 'gani':1
(5 rows)
db=# COMMIT;
COMMIT
db=# SELECT * FROM foo;
ERROR: relation "foo" does not exist

Related

UPDATE from temp table picking the "last" row per group

Suppose there is a table with data:
+----+-------+
| id | value |
+----+-------+
| 1 | 0 |
| 2 | 0 |
+----+-------+
I need to do a bulk update. And use COPY FROM STDIN for fast insert to temp table without constraints and so it can contains duplicate values in id column
Temp table to update from:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
+----+-------+
If I simply run a query like with:
UPDATE test target SET value = source.value FROM tmp_test source WHERE target.id = source.id;
I got wrong results:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
+----+-------+
I need the target table to contain the values that appeared last in the temporary table.
What is the most effective way to do this, given that the target table may contain millions of records, and the temporary table may contain tens of thousands?**
Assuming you want to take the value from the row that was inserted last into the temp table, physically, you can (ab-)use the system column ctid, signifying the physical location:
UPDATE test AS target
SET value = source.value
FROM (
SELECT DISTINCT ON (id)
id, value
FROM tmp_test
ORDER BY id, ctid DESC
) source
WHERE target.id = source.id
AND target.value <> source.value; -- skip empty updates
About DISTINCT ON:
Select first row in each GROUP BY group?
This builds on a implementation detail, and is not backed up by the SQL standard. If some insert method should not write rows in sequence (like future "parallel" INSERT), it breaks. Currently, it should work. About ctid:
How do I decompose ctid into page and row numbers?
If you want a safe way, you need to add some user column to signify the order of rows, like a serial column. But do your really care? Your tiebreaker seems rather arbitrary. See:
Temporary sequence within a SELECT
AND target.value <> source.value
skips empty updates - assuming both columns are NOT NULL. Else, use:
AND target.value IS DISTINCT FROM source.value
See:
How do I (or can I) SELECT DISTINCT on multiple columns?

Postgres - Same query returns different results when run again and again

While experimenting PostgreSQL for my new project, I have experienced the following query execution behavior. I would like to know why the same query returns different values when run again and again? What is the rational behind this behavior?
phi=# SELECT P.id, p.resource ->> 'birthDate' BD FROM recorditems P WHERE P.resource #> '{"resourceType":"Patient", "gender":"male"}' AND To_date(P.resource ->> 'birthDate', 'YYYY-MM-DD') > '1975-01-01'::date limit 10;
id | bd
--------+------------
363661 | 1990-03-08
363752 | 2006-02-28
364971 | 2017-10-21
365330 | 1996-11-25
367793 | 2007-10-02
369002 | 2006-09-04
369172 | 1983-09-10
369256 | 2001-05-19
369670 | 1992-03-21
372082 | 2011-07-27
(10 rows)
Time: 15.085 ms
phi=# SELECT P.id, p.resource ->> 'birthDate' BD FROM recorditems P WHERE P.resource #> '{"resourceType":"Patient", "gender":"male"}' AND To_date(P.resource ->> 'birthDate', 'YYYY-MM-DD') > '1975-01-01'::date limit 10;
id | bd
--------+------------
372082 | 2011-07-27
372645 | 1988-11-02
373528 | 1984-07-11
376213 | 1982-01-03
377386 | 1995-01-20
377531 | 2002-02-11
377717 | 1991-11-15
378372 | 2018-09-27
378483 | 2009-01-11
378743 | 1996-02-27
(10 rows)
Time: 18.163 ms
phi=# SELECT P.id, p.resource ->> 'birthDate' BD FROM recorditems P WHERE P.resource #> '{"resourceType":"Patient", "gender":"male"}' AND To_date(P.resource ->> 'birthDate', 'YYYY-MM-DD') > '1975-01-01'::date limit 10;
id | bd
--------+------------
378743 | 1996-02-27
382517 | 1992-01-14
387866 | 1985-07-03
388180 | 1976-11-01
388627 | 1996-07-10
396668 | 1979-03-29
396754 | 2013-05-16
397054 | 1998-01-05
401771 | 1983-11-28
401891 | 2019-03-01
(10 rows)
Time: 44.394 ms
You are not sorting your results. A LIMIT without an ORDER BY will return (seemingly) random results as the rows of a table are not sorted.
If you want to get consistent results using LIMIT you have to use an ORDER BY as well.
When you have multiple sequential scans going against the same table at more or less the same time (loosely defined), PostgreSQL tries to synchronize them so that they can all benefit from the same warmed-up cache. If you turn off this feature with set synchronize_seqscans TO off, you will probably get more predictable results. But your original sin here is expecting an order, when you didn't request one.

Given a row representing a path, union a total column

Say I have a table like the following table that represents a path from 1 -> 2 -> 3 -> 4 -> 5:
+------+----+--------+
| from | to | weight |
+------+----+--------+
| a | b | 1 |
| b | c | 2 |
| c | d | 1 |
| d | e | 1 |
| e | f | 3 |
+------+----+--------+
Each row knows where it came from and where it is going
I would like to union a total row that takes the starting name, ending name, and a total weight like so:
+------+----+--------+
| from | to | weight |
+------+----+--------+
| a | f | 8 |
+------+----+--------+
The first table is a result of a CTE expression, and I can easily get the total of the previous query with SUM, but I'm unable to get the LAST_VALUE to work in a similar way to:
WITH RECURSIVE cte AS (
...
)
SELECT *
FROM cte
UNION ALL
SELECT 'total', FIRST_VALUE(from), LAST_VALUE(to), SUM(weight)
FROM cte
The FIRST_VALUE and LAST_VALUE functions require OVER clauses which seem to add unnecessary complications to what I would expect, so I think I am going the wrong direction with that. Any ideas on how to achieve this?
So I made a strange solution that:
Selects the first from value (partitioned by TRUE)
Selects the last to value (partitioned by TRUE again)
Cross joins the sum of all weights, limited to 1
WITH RECURSIVE cte AS (
...
)
SELECT *
FROM cte
UNION ALL (
SELECT FIRST_VALUE(from) OVER (PARTITION BY TRUE), LAST_VALUE(to) OVER (PARTITION BY TRUE), total
FROM cte
CROSS JOIN (
SELECT SUM(weight) as total
FROM cte
) tmp
LIMIT 1
);
Is it hacky? Yes. Does it work? Also yes. I'm sure there are better solutions, and I would love to hear them.

Column-wise autocomplete

I have a table in a PostgreSQL database with four columns that contain increasingly more detailed information (think state->city->street->number), along with a column where everything is concatenated according to some simple formatting rules. Example:
| kommun | trakt | block | enhet | beteckning |
| Mora | Gislövs Läge | 9 | 16 | Mora Gislövs Läge 9:16 |
| Mora | Gisslaved | * | 8 | Mora Gisslaved 8 |
| Mora | Gisslaved | * | 9 | Mora Gisslaved 9 |
| Lilla Edet | Sanda | GA | 1 | Lilla Edet Sanda GA:1 |
A web service uses this table to implement a word-wise autocomplete, where the user gets input suggestions as they drill down. An input of mora gis will result in
["Mora Gislövs", "Mora Gisslaved"]
Currently, this is done by splitting the concatenated column by word in this query:
select distinct trim(substring(beteckning from '(^(\S+\s?){NUMPARTS})')) as bet
from beteckning_ac
where upper(beteckning) like upper('mora gis%')
order by bet
Where NUMPARTS is the number of words in the input - 2 in this case.
Now I want the autocomplete to be done column-wise rather than word-wise, so mora gis would now result in this instead:
["Mora Gislövs Läge", "Mora Gisslaved"]
Since the first two columns can contain an arbitrary number of words, I can no longer use the input to determine how many columns to include in the response. Is there a way to do this, or have I maybe gone about this autocomplete business all wrong?
CREATE OR REPLACE FUNCTION get_auto(text)
--$1 is here your input
RETURNS setof text
LANGUAGE plpgsql
AS $function$
declare
NUMPARTS int := array_length(regexp_split_to_array($1,' '), 1);
begin
return query
select
case
when (NUMPARTS = 1) then kommun
when (NUMPARTS = 2) then kommun||' '||trakt
when (NUMPARTS = 3) then kommun||' '||trakt||' '||block
when (NUMPARTS = 4) then kommun||' '||trakt||' '||block||' '||enhet
--alter if you want to
end
from
auto_complete --your tablename here
where
beteckning like $1||'%';
end;
$function$;

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my datatables in pictorial form:
The same in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use cubes to be able to do a display by pagination in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there were no records in sales for Saucer 12 under Store S3, I displayed 0 instead of null or none.
I want to be able to do sort by store, say descending order for, S3.
The cells indicate the SUM total of that particular product spent in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store','Product'],
order=[("Product.name","asc"), ("Store.name","desc"), ("total_products_sale", "desc")])
I didn't get what I want.
I got it like this:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination and if the products not sold in that store it won't show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more. I realised that what I want is called dicing as I needed to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
, COALESCE(s1, 0) AS s1 -- 1. ... displayed 0 instead of null
, COALESCE(s2, 0) AS s2
, COALESCE(s3, 0) AS s3
, COALESCE(s4, 0) AS s4
, COALESCE(s5, 0) AS s5
FROM crosstab(
'SELECT s.product_id, p.name, s.store_id, s.sum_amount
FROM product p
JOIN (
SELECT product_id, store_id
, sum(amount) AS sum_amount -- 3. SUM total of product spent in store
FROM sales
GROUP BY product_id, store_id
) s ON p.id = s.product_id
ORDER BY s.product_id, s.store_id;'
, 'VALUES (1),(2),(3),(4),(5)' -- desired store_id's
) AS ct (product_id int, product text -- "extra" column
, s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER BY s3 DESC; -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including with product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table