Postgres with clause to substring large char column into rows

Issue: the code I am writing has a limitation: it won't read in columns larger than 4K bytes.
Want: turn a single row into multiple rows with a new max length, with an ordinal to keep them in order.
I have done this previously in DB2 using the WITH clause and a recursive query, and I am trying to convert that code to work with Postgres. I don't have enough Postgres experience to know whether there is a better way to do this.
To start, I have this table:
Create table test_long_text (
test_id numeric(12,0) NOT NULL,
long_text varchar(64000)
)
Here is the query I have been trying to write, so far without success:
WITH RECURSIVE rec(test_id, len, ord, pos) as (
select test_id, octet_length(long_text), 1, 1 from test_long_text
union all
select test_id, len, ord+1, pos+4096 from rec where len >=4096)
select A.test_id, ord, substr(long_text, pos, 4096) from test_long_text A
inner join rec using (test_id)
order by A.test_id, ord
Currently, I either get the error "negative substring length not allowed" or the query hangs indefinitely.
Expected results: the text split into chunks of at most 4096 bytes (pretend ABC stands for a longer string).
+----+-----+------+
| ID | ORD | TEXT |
+----+-----+------+
| 1  | 1   | ABC  |
| 2  | 1   | ABC  |
| 2  | 2   | DEF  |
| 3  | 1   | ABC  |
| 3  | 2   | DEF  |
| 3  | 3   | GHI  |
+----+-----+------+

This example shows how to split the values of a text column into 3-character parts:
with t(x) as (values('1234567890abcdef'::text), ('qwertyuiop'))
select *, substring(x, f.p, 3)
from t, generate_series(1, length(x), 3) with ordinality as f(p, i);
┌──────────────────┬────┬───┬───────────┐
│ x                │ p  │ i │ substring │
├──────────────────┼────┼───┼───────────┤
│ 1234567890abcdef │  1 │ 1 │ 123       │
│ 1234567890abcdef │  4 │ 2 │ 456       │
│ 1234567890abcdef │  7 │ 3 │ 789       │
│ 1234567890abcdef │ 10 │ 4 │ 0ab       │
│ 1234567890abcdef │ 13 │ 5 │ cde       │
│ 1234567890abcdef │ 16 │ 6 │ f         │
│ qwertyuiop       │  1 │ 1 │ qwe       │
│ qwertyuiop       │  4 │ 2 │ rty       │
│ qwertyuiop       │  7 │ 3 │ uio       │
│ qwertyuiop       │ 10 │ 4 │ p         │
└──────────────────┴────┴───┴───────────┘
You can simply adapt it to your data.
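For instance, adapted to the test_long_text table with 4096-character chunks, it might look like this (a sketch; note that substr() and length() count characters, not bytes, so it matches the 4096-byte limit only for single-byte encodings):

SELECT t.test_id,
       f.ord,
       substr(t.long_text, f.pos, 4096) AS text_chunk  -- text_chunk is an arbitrary alias
FROM test_long_text AS t,
     generate_series(1, length(t.long_text), 4096) WITH ORDINALITY AS f(pos, ord)
ORDER BY t.test_id, f.ord;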

Related

Polars, sum column based on other column values in `groupby`

I want to calculate the sum of a column in a groupby based on the values of another column. Pretty much what pl.Expr.value_counts does (see example), but I want to apply a function (e.g. sum) to a specific column, in this case the Price column.
I know that I could do the groupby on Weather + Windy and then aggregate, but I can't do that, since I have plenty of other aggregations I need to compute on only the Weather groupby.
import polars as pl

df = pl.DataFrame(
    data={
        "Weather": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Price": [1, 2, 3, 4, 5, 6, 7, 8],
        "Windy": ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    }
)
I can get the number of counts per windy day with value_counts:
df_agg = (
    df.groupby("Weather")
    .agg([
        pl.col("Windy")
        .value_counts()
        .alias("Price")
    ])
)
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",2}, {"N",2}] │
│ Rain ┆ [{"Y",2}, {"N",2}] │
└─────────┴────────────────────┘
I would like to do something like this:
df_agg = (
    df.groupby("Weather")
    .agg([
        pl.col("Windy")
        .custom_fun_on_other_col("Price", sum)
        .alias("Price")
    ])
)
and this is the result I want:
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",6},{"N",14}] │
│ Rain ┆ [{"Y",4},{"N",12}] │
└─────────┴────────────────────┘
(Using polars version 15.15)
Inside a groupby context, you could combine .repeat_by().flatten() with .value_counts():
df.groupby("Weather").agg(
pl.col("Windy").repeat_by("Price").flatten().value_counts()
.alias("Price")
)
shape: (2, 2)
┌─────────┬─────────────────────┐
│ Weather ┆ Price               │
│ ---     ┆ ---                 │
│ str     ┆ list[struct[2]]     │
╞═════════╪═════════════════════╡
│ Sun     ┆ [{"N",14}, {"Y",6}] │
├─────────┼─────────────────────┤
│ Rain    ┆ [{"Y",4}, {"N",12}] │
└─────────┴─────────────────────┘
Do you know about Window functions?
df.with_columns(
    pl.sum("Price").over(["Weather", "Windy"]).alias("sum")
)
shape: (8, 4)
┌─────────┬───────┬───────┬─────┐
│ Weather ┆ Price ┆ Windy ┆ sum │
│ ---     ┆ ---   ┆ ---   ┆ --- │
│ str     ┆ i64   ┆ str   ┆ i64 │
╞═════════╪═══════╪═══════╪═════╡
│ Rain    ┆ 1     ┆ Y     ┆ 4   │
├─────────┼───────┼───────┼─────┤
│ Sun     ┆ 2     ┆ Y     ┆ 6   │
├─────────┼───────┼───────┼─────┤
│ Rain    ┆ 3     ┆ Y     ┆ 4   │
├─────────┼───────┼───────┼─────┤
│ Sun     ┆ 4     ┆ Y     ┆ 6   │
├─────────┼───────┼───────┼─────┤
│ Rain    ┆ 5     ┆ N     ┆ 12  │
├─────────┼───────┼───────┼─────┤
│ Sun     ┆ 6     ┆ N     ┆ 14  │
├─────────┼───────┼───────┼─────┤
│ Rain    ┆ 7     ┆ N     ┆ 12  │
├─────────┼───────┼───────┼─────┤
│ Sun     ┆ 8     ┆ N     ┆ 14  │
└─────────┴───────┴───────┴─────┘
You could also create the struct if desired:
pl.struct(["Windy", pl.sum("Price").over(["Weather", "Windy"])])
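Wrapped in a full expression, that might look like this (a sketch; windy_price is just an illustrative alias):

df.with_columns(
    pl.struct(["Windy", pl.sum("Price").over(["Weather", "Windy"])])
    .alias("windy_price")  # struct of Windy and the (Weather, Windy) group sum of Price
)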
For instance, you can create a temporary dataframe and then join it with the main dataframe.
tmp = (
    df.groupby(["Weather", "Windy"]).agg(pl.col("Price").sum())
    .select([pl.col("Weather"), pl.struct(["Windy", "Price"])])
    .groupby("Weather").agg(pl.list("Windy"))
)
df.groupby("Weather").agg([
    # your other aggregations ...
]).join(tmp, on="Weather")
┌─────────┬─────────────────────┐
│ Weather ┆ Windy               │
│ ---     ┆ ---                 │
│ str     ┆ list[struct[2]]     │
╞═════════╪═════════════════════╡
│ Rain    ┆ [{"Y",4}, {"N",12}] │
│ Sun     ┆ [{"N",14}, {"Y",6}] │
└─────────┴─────────────────────┘

How to create a recursive cte query that will push parent ids and grandparent ids into an array

I have a PostgreSQL table that I am trying to query. Here is my setup: the table, the values I am inserting, and the skeleton of my CTE:
BEGIN;
CREATE TABLE section (
id SERIAL PRIMARY KEY,
parent_id INTEGER REFERENCES section(id) DEFERRABLE,
name TEXT NOT NULL UNIQUE );
SET CONSTRAINTS ALL DEFERRED;
INSERT INTO section VALUES (1, NULL, 'animal');
INSERT INTO section VALUES (2, NULL, 'mineral');
INSERT INTO section VALUES (3, NULL, 'vegetable');
INSERT INTO section VALUES (4, 1, 'dog');
INSERT INTO section VALUES (5, 1, 'cat');
INSERT INTO section VALUES (6, 4, 'doberman');
INSERT INTO section VALUES (7, 4, 'dachshund');
INSERT INTO section VALUES (8, 3, 'carrot');
INSERT INTO section VALUES (9, 3, 'lettuce');
INSERT INTO section VALUES (10, 11, 'paradox1');
INSERT INTO section VALUES (11, 10, 'paradox2');
SELECT setval('section_id_seq', (select max(id) from section));
WITH RECURSIVE last_run(parent_id, id_list, name_list) AS (
???
SELECT id_list, name_list
FROM last_run ???
WHERE ORDER BY id_list;
ROLLBACK;
I know that a recursive query is the best possible way, but I am not sure exactly how to implement it. What exactly goes in the ???
What I'm trying to get is the table below:
 id_list | name_list
---------+------------------------
 {1}     | animal
 {2}     | mineral
 {3}     | vegetable
 {4,1}   | dog, animal
 {5,1}   | cat, animal
 {6,4,1} | doberman, dog, animal
 {7,4,1} | dachshund, dog, animal
 {8,3}   | carrot, vegetable
 {9,3}   | lettuce, vegetable
 {10,11} | paradox1, paradox2
 {11,10} | paradox2, paradox1
You can use several recursive CTEs in a single query: one for the valid tree and another one for the paradoxes:
with recursive
cte as (
select *, array[id] as ids, array[name] as names
from section
where parent_id is null
union all
select s.*, s.id||c.ids, s.name||c.names
from section as s join cte as c on (s.parent_id = c.id)),
paradoxes as (
select *, array[id] as ids, array[name] as names
from section
where id not in (select id from cte)
union all
select s.*, s.id||p.ids, s.name||p.names
from section as s join paradoxes as p on (s.parent_id = p.id)
where s.id <> all(p.ids) -- To break loops
)
select * from cte
union all
select * from paradoxes;
Result:
┌────┬───────────┬───────────┬─────────┬────────────────────────┐
│ id │ parent_id │ name      │ ids     │ names                  │
├────┼───────────┼───────────┼─────────┼────────────────────────┤
│  1 │      ░░░░ │ animal    │ {1}     │ {animal}               │
│  2 │      ░░░░ │ mineral   │ {2}     │ {mineral}              │
│  3 │      ░░░░ │ vegetable │ {3}     │ {vegetable}            │
│  4 │         1 │ dog       │ {4,1}   │ {dog,animal}           │
│  5 │         1 │ cat       │ {5,1}   │ {cat,animal}           │
│  8 │         3 │ carrot    │ {8,3}   │ {carrot,vegetable}     │
│  9 │         3 │ lettuce   │ {9,3}   │ {lettuce,vegetable}    │
│  6 │         4 │ doberman  │ {6,4,1} │ {doberman,dog,animal}  │
│  7 │         4 │ dachshund │ {7,4,1} │ {dachshund,dog,animal} │
│ 10 │        11 │ paradox1  │ {10}    │ {paradox1}             │
│ 11 │        10 │ paradox2  │ {11}    │ {paradox2}             │
│ 11 │        10 │ paradox2  │ {11,10} │ {paradox2,paradox1}    │
│ 10 │        11 │ paradox1  │ {10,11} │ {paradox1,paradox2}    │
└────┴───────────┴───────────┴─────────┴────────────────────────┘
As you can see, the result includes two unwanted rows: {10}, {paradox1} and {11}, {paradox2}. It is up to you how to filter them out.
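One possible filter, as a sketch: replace the final two SELECTs with a DISTINCT ON that keeps only the longest path per id in the paradoxes branch, which drops the single-element seed rows:

select * from cte
union all
select *
from (select distinct on (id) *
      from paradoxes
      order by id, cardinality(ids) desc) as p;  -- keep the longest path per id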
And it is not clear what the desired result would be if yet another row were appended, for instance INSERT INTO section VALUES (12, 10, 'paradox3');

PostgreSQL query from multiple tables

Question about querying PostgreSQL.
It's simple to do with a stored procedure or inside any programming language, but my question is whether it is possible to do with one SELECT statement. This is an example of two tables.
Thank you.
You would normally do the "group headers" in some kind of reporting tool. However, it can all be done in pure SQL, if you wish.
You start with the data in a "standard JOIN":
-- The basic query
SELECT
table_1.table_1_id, phone, name, some_data,
row_number() over(partition by table_1.table_1_id order by some_data)
FROM
table_1
JOIN table_2 ON table_2.table_1_id = table_1.table_1_id ;
That will produce the following table:
table_1_id | phone | name | some_data | row_number
---------: | :---- | :--- | :-------- | ---------:
         1 | 502   | aa   | a         |          1
         1 | 502   | aa   | b         |          2
         1 | 502   | aa   | j         |          3
         1 | 502   | aa   | n         |          4
         2 | 268   | bb   | a         |          1
         2 | 268   | bb   | s         |          2
         2 | 268   | bb   | y         |          3
         5 | 984   | ee   | a         |          1
         5 | 984   | ee   | n         |          2
         5 | 984   | ee   | w         |          3
If you want to add some header rows and also not show the values of phone and name on every row, we need to add these data to the query. This is done by making a second, specific SELECT for the headers, and combining it with the first one via UNION ALL.
That is, we will use:
-- This query produces the 'headers'
SELECT DISTINCT
table_1.table_1_id,
phone,
name,
'' AS some_data,
true as is_header -- this marks this row as a 'header'
FROM
table_1
JOIN table_2 ON table_2.table_1_id = table_1.table_1_id
UNION ALL
-- This query produces the actual data
SELECT
table_1.table_1_id,
phone,
name,
some_data,
false as is_header
FROM
table_1
JOIN table_2 ON table_2.table_1_id = table_1.table_1_id ;
Now, we make this a subquery, and add some extra logic to decide which data needs to be shown, and which needs to be hidden (actually, shown as ''):
-- A few tricks to do the formatting
SELECT
-- Here we decide which information to show, which not
case when is_header then cast(table_1_id as text) else '' end AS table_1_id,
case when is_header then phone else '' end AS phone,
case when is_header then name else '' end as name,
case when is_header then '' else some_data end as some_data
FROM
(-- This query produces the 'headers'
SELECT DISTINCT
table_1.table_1_id,
phone,
name,
'' AS some_data,
true as is_header -- this marks this row as a 'header'
FROM
table_1
JOIN table_2 ON table_2.table_1_id = table_1.table_1_id
UNION ALL
-- This query produces the actual data
SELECT
table_1.table_1_id,
phone,
name,
some_data,
false as is_header
FROM
table_1
JOIN table_2 ON table_2.table_1_id = table_1.table_1_id
) AS q
ORDER BY
q.table_1_id, is_header DESC /* Header goes first */, some_data ;
This produces:
table_1_id | phone | name | some_data
:--------- | :---- | :--- | :--------
1          | 502   | aa   |
           |       |      | a
           |       |      | b
           |       |      | j
           |       |      | n
2          | 268   | bb   |
           |       |      | a
           |       |      | s
           |       |      | y
5          | 984   | ee   |
           |       |      | a
           |       |      | n
           |       |      | w
You can see the whole setup and the simulated data at dbfiddle here
Note that the order that you specified is not preserved. You should have some logic on how things should get ordered. SQL doesn't have a concept of "insert order" or "natural order"; you always have to choose the one you want (otherwise, the database will choose whichever is most convenient for it, and that may change from one execution to the next).
There is a nice feature called grouping sets.
Using it, you can get several sets for different grouping conditions in a single query/result set. Thanks to joanolo for the example data.
Let's start with the simplest query:
SELECT
table_1.table_1_id, phone, name, some_data
FROM
table_1 JOIN table_2 ON table_2.table_1_id = table_1.table_1_id;
┌────────────┬───────┬──────┬───────────┐
│ table_1_id │ phone │ name │ some_data │
╞════════════╪═══════╪══════╪═══════════╡
│          1 │   502 │ aa   │ a         │
│          1 │   502 │ aa   │ b         │
│          1 │   502 │ aa   │ n         │
│          1 │   502 │ aa   │ j         │
│          5 │   984 │ ee   │ w         │
│          5 │   984 │ ee   │ a         │
│          5 │   984 │ ee   │ n         │
│          2 │   268 │ bb   │ s         │
│          2 │   268 │ bb   │ a         │
│          2 │   268 │ bb   │ y         │
└────────────┴───────┴──────┴───────────┘
Here we need to group by the first three columns:
SELECT
table_1.table_1_id, phone, name--, some_data
FROM
table_1 JOIN table_2 ON table_2.table_1_id = table_1.table_1_id
GROUP BY GROUPING SETS ((table_1.table_1_id, phone, name));
┌────────────┬───────┬──────┐
│ table_1_id │ phone │ name │
╞════════════╪═══════╪══════╡
│          2 │   268 │ bb   │
│          5 │   984 │ ee   │
│          1 │   502 │ aa   │
└────────────┴───────┴──────┘
Note the double parentheses. Actually, with a single grouping set, this is equal to the simple GROUP BY table_1.table_1_id, phone, name (that's because I commented out the some_data column, so it is not in the group).
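Spelled out, that equivalent plain GROUP BY (same tables as above) would be:

SELECT
    table_1.table_1_id, phone, name
FROM
    table_1 JOIN table_2 ON table_2.table_1_id = table_1.table_1_id
GROUP BY table_1.table_1_id, phone, name;

But we want to add more data to our query: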
SELECT
table_1.table_1_id, phone, name, some_data
FROM
table_1 JOIN table_2 ON table_2.table_1_id = table_1.table_1_id
GROUP BY GROUPING SETS ((table_1.table_1_id, phone, name), (table_1.table_1_id, phone, name, some_data));
┌────────────┬───────┬──────┬───────────┐
│ table_1_id │ phone │ name │ some_data │
╞════════════╪═══════╪══════╪═══════════╡
│          1 │   502 │ aa   │ a         │
│          1 │   502 │ aa   │ b         │
│          1 │   502 │ aa   │ j         │
│          1 │   502 │ aa   │ n         │
│          1 │   502 │ aa   │ ░░░░      │
│          2 │   268 │ bb   │ a         │
│          2 │   268 │ bb   │ s         │
│          2 │   268 │ bb   │ y         │
│          2 │   268 │ bb   │ ░░░░      │
│          5 │   984 │ ee   │ a         │
│          5 │   984 │ ee   │ n         │
│          5 │   984 │ ee   │ w         │
│          5 │   984 │ ee   │ ░░░░      │
└────────────┴───────┴──────┴───────────┘
It is almost what we want. We just need to order the data properly and blank out some columns:
SELECT
case when some_data is null then table_1.table_1_id end as table_1_id,
case when some_data is null then phone end as phone,
case when some_data is null then name end as name,
some_data
FROM
table_1
JOIN table_2 ON table_2.table_1_id = table_1.table_1_id
group by grouping sets((table_1.table_1_id, phone, name), (table_1.table_1_id, phone, name, some_data))
order by table_1.table_1_id, some_data nulls first;
┌────────────┬───────┬──────┬───────────┐
│ table_1_id │ phone │ name │ some_data │
╞════════════╪═══════╪══════╪═══════════╡
│          1 │   502 │ aa   │ ░░░░      │
│       ░░░░ │  ░░░░ │ ░░░░ │ a         │
│       ░░░░ │  ░░░░ │ ░░░░ │ b         │
│       ░░░░ │  ░░░░ │ ░░░░ │ j         │
│       ░░░░ │  ░░░░ │ ░░░░ │ n         │
│          2 │   268 │ bb   │ ░░░░      │
│       ░░░░ │  ░░░░ │ ░░░░ │ a         │
│       ░░░░ │  ░░░░ │ ░░░░ │ s         │
│       ░░░░ │  ░░░░ │ ░░░░ │ y         │
│          5 │   984 │ ee   │ ░░░░      │
│       ░░░░ │  ░░░░ │ ░░░░ │ a         │
│       ░░░░ │  ░░░░ │ ░░░░ │ n         │
│       ░░░░ │  ░░░░ │ ░░░░ │ w         │
└────────────┴───────┴──────┴───────────┘
Bingo! Hope this is what you were asking for.

Find clusters of values using Postgresql

Consider the following example table:
CREATE TABLE rndtbl AS
SELECT
generate_series(1, 10) AS id,
random() AS val;
and I want to find, for each id, a cluster_id such that the clusters are at least 0.1 apart from each other. How would I calculate such a cluster assignment?
A specific example would be:
select * from rndtbl ;
 id |        val
----+-------------------
  1 | 0.485714662820101
  2 | 0.185201027430594
  3 | 0.368477711919695
  4 | 0.687312887981534
  5 | 0.978742253035307
  6 | 0.961830694694072
  7 | 0.10397826647386
  8 | 0.644958863966167
  9 | 0.912827260326594
 10 | 0.196085536852479
(10 rows)
The result would be: ids (2,7,10) in a cluster and (5,6,9) in another cluster and (4,8) in another, and (1) and (3) as singleton clusters.
Starting from:
SELECT * FROM rndtbl ;
┌────┬────────────────────┐
│ id │ val                │
├────┼────────────────────┤
│  1 │ 0.153776332736015  │
│  2 │ 0.572575284633785  │
│  3 │ 0.998213059268892  │
│  4 │ 0.654628816060722  │
│  5 │ 0.692200613208115  │
│  6 │ 0.572836415842175  │
│  7 │ 0.0788379465229809 │
│  8 │ 0.390280921943486  │
│  9 │ 0.611408909317106  │
│ 10 │ 0.555164183024317  │
└────┴────────────────────┘
(10 rows)
Use the LAG window function to know whether the current row is in a new cluster or not:
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl ;
┌────┬────────────────────┬─────────────┐
│ id │ val                │ new_cluster │
├────┼────────────────────┼─────────────┤
│  7 │ 0.0788379465229809 │ (null)      │
│  1 │ 0.153776332736015  │ f           │
│  8 │ 0.390280921943486  │ t           │
│ 10 │ 0.555164183024317  │ t           │
│  2 │ 0.572575284633785  │ f           │
│  6 │ 0.572836415842175  │ f           │
│  9 │ 0.611408909317106  │ f           │
│  4 │ 0.654628816060722  │ f           │
│  5 │ 0.692200613208115  │ f           │
│  3 │ 0.998213059268892  │ t           │
└────┴────────────────────┴─────────────┘
(10 rows)
Finally, you can SUM the number of true values (still ordering by val) to get each row's cluster number (counting from 0):
SELECT *, SUM(COALESCE(new_cluster::int, 0)) OVER (ORDER BY val) AS nb_cluster
FROM (
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl
) t
;
┌────┬────────────────────┬─────────────┬────────────┐
│ id │ val                │ new_cluster │ nb_cluster │
├────┼────────────────────┼─────────────┼────────────┤
│  7 │ 0.0788379465229809 │ (null)      │          0 │
│  1 │ 0.153776332736015  │ f           │          0 │
│  8 │ 0.390280921943486  │ t           │          1 │
│ 10 │ 0.555164183024317  │ t           │          2 │
│  2 │ 0.572575284633785  │ f           │          2 │
│  6 │ 0.572836415842175  │ f           │          2 │
│  9 │ 0.611408909317106  │ f           │          2 │
│  4 │ 0.654628816060722  │ f           │          2 │
│  5 │ 0.692200613208115  │ f           │          2 │
│  3 │ 0.998213059268892  │ t           │          3 │
└────┴────────────────────┴─────────────┴────────────┘
(10 rows)
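If you only need the id-to-cluster_id mapping the question asks for, a final wrapper (a sketch) gives exactly that:

SELECT id, nb_cluster AS cluster_id
FROM (
    SELECT *, SUM(COALESCE(new_cluster::int, 0)) OVER (ORDER BY val) AS nb_cluster
    FROM (
        SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
        FROM rndtbl
    ) t
) c
ORDER BY id;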

SQL to return distinct records with priority on first found item

table:
 id | tag | pk
----+-----+----
  1 | 111 |  1
  2 | 111 |  2
  2 | 112 |  3
  3 | 111 |  4
  4 | 333 |  5
  4 | 334 |  6
  4 | 111 |  7
  5 | 335 |  8
... for 1,000,000 rows
Desired output
 id | tag | pk
----+-----+----
  1 | 111 |  1
  2 | 111 |  2
  3 | 111 |  4
  4 | 111 |  7
  5 | 335 |  8
... for limit of 500 rows
I want to return distinct ids, but where an id has a row with tag = 111, I want that row to be the one returned; otherwise, the first one found will do. I also want to limit the output to 500 rows.
I looked at unions, intersects, etc.; however, I was unable to produce the required results.
Use a CASE expression in the ORDER BY clause:
SELECT * FROM t;
┌────┬─────┬────┐
│ id │ tag │ pk │
├────┼─────┼────┤
│  1 │ 111 │  1 │
│  2 │ 111 │  2 │
│  2 │ 112 │  3 │
│  3 │ 111 │  4 │
│  4 │ 333 │  5 │
│  4 │ 334 │  6 │
│  4 │ 111 │  7 │
│  5 │ 335 │  8 │
└────┴─────┴────┘
(8 rows)
SELECT DISTINCT ON (id) *
FROM t
ORDER BY id, (CASE tag WHEN 111 THEN 0 ELSE 1 END);
┌────┬─────┬────┐
│ id │ tag │ pk │
├────┼─────┼────┤
│  1 │ 111 │  1 │
│  2 │ 111 │  2 │
│  3 │ 111 │  4 │
│  4 │ 111 │  7 │
│  5 │ 335 │  8 │
└────┴─────┴────┘
(5 rows)
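To also cap the output at 500 rows, as the question asks, just append a LIMIT (DISTINCT ON and ORDER BY are applied before the limit):

SELECT DISTINCT ON (id) *
FROM t
ORDER BY id, (CASE tag WHEN 111 THEN 0 ELSE 1 END)
LIMIT 500;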