PostgreSQL: Joining with Window Function / Partial Table

Issue
I'm working on a project with kernels in databases, and my PostgreSQL skills have hit a wall. I am cross joining a table with itself to compute pairwise dot products, i.e.
SELECT (d1.a * d2.a + d1.b * d2.b) AS dot FROM data d1, data d2
This gives me the dot product between every pair of vectors. Given the following data in my table
a | b | c
---+---+---
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
The above command yields
dot
-----
2
4
6
...
If I want to compute the dot product between, say, row 2 and its preceding and following rows, how would I do that efficiently?
Attempts
I have tried to use window functions, but failed, since they only compute aggregates over the window. I want to join a row with its neighbouring rows (i.e. its window) without computing any aggregate over them. Something along these lines:
SELECT a * a + b * b + c * c
OVER(rows between 1 preceding and 1 following) as value FROM data data;
I have also tried row_number() OVER(), which works, but seems clumsy and inefficient with its nested subqueries.
SELECT d1.a * d3.a + d1.b * d3.b + d1.c * d3.c
FROM data d1,
(SELECT * FROM
(SELECT *, row_number() OVER() as index from data) d2
WHERE d2.index >= 1 AND d2.index <=3) d3;
Lastly, I tried to dig into LATERALs with no luck.
Any thoughts?

You can get the values of the preceding/following rows with lag()/lead().
If the order of rows is determined by a, the query would look like this:
SELECT
a,
(lag(a, 1, 0) OVER (ORDER BY a)) * (lead(a, 1, 0) OVER (ORDER BY a))
+ (lag(b, 1, 0) OVER (ORDER BY a)) * (lead(b, 1, 0) OVER (ORDER BY a))
+ (lag(c, 1, 0) OVER (ORDER BY a)) * (lead(c, 1, 0) OVER (ORDER BY a)) AS dot_preceding_and_following
FROM ( VALUES
(1, 1, 1),
(2, 2, 2),
(3, 3, 3)
) T(a, b, c)
ORDER BY
a
;
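Since the question mentions LATERAL, here is a sketch of the same computation with LATERAL subqueries instead of window functions, assuming rows are ordered by a (my assumption, matching the query above). Unlike the lag()/lead() version, which defaults missing neighbours to 0, this drops the first and last rows; LEFT JOIN LATERAL ... ON true plus COALESCE would keep them:
SELECT d.a,
       p.a * f.a + p.b * f.b + p.c * f.c AS dot_preceding_and_following
FROM data d
CROSS JOIN LATERAL (  -- nearest preceding row, ordered by a
    SELECT * FROM data x WHERE x.a < d.a ORDER BY x.a DESC LIMIT 1
) p
CROSS JOIN LATERAL (  -- nearest following row
    SELECT * FROM data y WHERE y.a > d.a ORDER BY y.a LIMIT 1
) f;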

Related

Is there a smarter method to create series with different intervals for counts within a query?

I want to create different intervals:
0 to 10 in steps of 1
10 to 100 in steps of 10
100 to 1,000 in steps of 100
1,000 to 10,000 in steps of 1,000
to query a table for the count of items in each interval.
with "series" as (
(SELECT generate_series(0, 10, 1) AS r_from)
union
(select generate_series(10, 90, 10) as r_from)
union
(select generate_series(100, 900, 100) as r_from)
union
(select generate_series(1000, 9000, 1000) as r_from)
order by r_from
)
, "range" as ( select r_from
, case
when r_from < 10 then r_from + 1
when r_from < 100 then r_from + 10
when r_from < 1000 then r_from + 100
else r_from + 1000
end as r_to
from series)
select r_from, r_to,(SELECT count(*) FROM "my_table" WHERE "my_value" BETWEEN r_from AND r_to) as "Anz."
FROM "range";
I think generate_series is the right way; as another option, we can use simple math to calculate the numbers.
SELECT 0 as r_from, 1 as r_to
UNION ALL
SELECT power(10, steps) * v,
       power(10, steps) * v + power(10, steps)
FROM generate_series(1, 9, 1) v
CROSS JOIN generate_series(0, 3, 1) steps
so it might look like this:
with "range" as
(
SELECT 0 as r_from,1 as r_to
UNION ALL
SELECT power(10, steps) * v ,
power(10, steps) * v + power(10, steps)
FROM generate_series(1, 9, 1) v
CROSS JOIN generate_series(0, 3, 1) steps
)
select r_from, r_to,(SELECT count(*) FROM "my_table" WHERE "my_value" BETWEEN r_from AND r_to) as "Anz."
FROM "range";
Rather than generate_series you could create defined integer ranges (int4range), then test whether your value is included within the range (see Range/Multirange Functions and Operators). So:
with ranges (range_set) as
( values ( int4range(0,10,'[)') )
, ( int4range(10,100,'[)') )
, ( int4range(100,1000,'[)') )
, ( int4range(1000,10000,'[)') )
) --select * from ranges;
select lower(range_set) range_start
, upper(range_set) - 1 range_end
, count(my_value) cnt
from ranges r
left join my_table mt
on (mt.my_value <@ r.range_set)
group by r.range_set
order by lower(r.range_set);
Note the 3rd parameter when creating the ranges: '[)' makes each range inclusive of its lower bound and exclusive of its upper bound, so adjacent ranges do not overlap.
Creating a CTE as above is fine if your ranges are static; however, if dynamic ranges are required, you can put the ranges into a table. Changing the ranges then becomes a matter of managing the table. Not simpler, but it does not require code updates. The query then reduces to just the main part of the above:
select lower(range_set) range_start
, upper(range_set) - 1 range_end
, count(my_value) cnt
from range_tab r
left join my_table mt
on (mt.my_value <@ r.range_set)
group by r.range_set
order by lower(r.range_set);
See demo for both here.
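For completeness, a minimal sketch of what such a range table could look like (the name range_tab is taken from the query above; the DDL itself is my assumption):
create table range_tab (range_set int4range primary key);
insert into range_tab (range_set) values
  ( int4range(0,10,'[)') )
, ( int4range(10,100,'[)') )
, ( int4range(100,1000,'[)') )
, ( int4range(1000,10000,'[)') );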

Calculate rank without using rank or row_number functions, using a single column

Do not use any functions like rank or row_number.
Hint: formulate it as a matrix operation in SQL. The rank of an item indicates how many items are less than or equal to it.
A matrix can be simulated by a cross join, and the rank can be derived by counting the items smaller than or equal to the current item.
Table A:
x
----
d
b
a
g
c
k
k
g
Expected output:
x1 | rank
----+------
a | 1
b | 2
d | 3
g | 4
c | 5
k | 6
select x as x1, count(x) as rank
from (select DISTINCT x from A order by x) as sub
Your current query is on the right track with its DISTINCT subquery. For a working version, use a correlated subquery in the SELECT clause that takes the counts:
SELECT
x AS x1,
(SELECT COUNT(DISTINCT x) FROM A t WHERE t.x <= sub.x) rank
FROM (SELECT DISTINCT x FROM A) AS sub
ORDER BY
x;
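A sketch of the cross-join ("matrix") formulation the hint alludes to, which is equivalent:
SELECT t1.x AS x1, COUNT(DISTINCT t2.x) AS rank
FROM (SELECT DISTINCT x FROM A) t1
JOIN A t2 ON t2.x <= t1.x   -- count the distinct items <= the current item
GROUP BY t1.x
ORDER BY t1.x;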

How can you generate a date list from a range in Amazon Redshift?

Getting date list in a range in PostgreSQL shows how to get a date range in PostgreSQL. However, Redshift does not support generate_series():
ans=> select (generate_series('2012-06-29', '2012-07-03', '1 day'::interval))::date;
ERROR: function generate_series("unknown", "unknown", interval) does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
Is there a way to replicate what generate_series() does in Redshift?
A hack, but it works: use a table with many, many rows and a window function to generate the series. This works as long as the series you are generating is smaller than the number of rows in the table you're using to generate it.
WITH x(dt) AS (SELECT '2016-01-01'::date)
SELECT
dateadd(
day,
COUNT(*) over(rows between unbounded preceding and current row) - 1,
dt)
FROM users, x
LIMIT 100
The initial date 2016-01-01 controls the start date, and the LIMIT controls the number of days in the generated series.
Update: this will only run on the leader node.
Redshift has partial support for the generate_series function but unfortunately does not mention it in its documentation.
This will work and is the shortest & most legible way of generating a series of dates as of this date (2018-01-29):
SELECT ('2016-01-01'::date + x)::date
FROM generate_series(1, 100, 1) x
One option if you don't want to rely on any existing tables is to pre-generate a series table filled with a range of numbers, one for each row.
create table numbers as (
select
p0.n
+ p1.n*2
+ p2.n * power(2,2)
+ p3.n * power(2,3)
+ p4.n * power(2,4)
+ p5.n * power(2,5)
+ p6.n * power(2,6)
+ p7.n * power(2,7)
+ p8.n * power(2,8)
+ p9.n * power(2,9)
+ p10.n * power(2,10)
as number
from
(select 0 as n union select 1) p0,
(select 0 as n union select 1) p1,
(select 0 as n union select 1) p2,
(select 0 as n union select 1) p3,
(select 0 as n union select 1) p4,
(select 0 as n union select 1) p5,
(select 0 as n union select 1) p6,
(select 0 as n union select 1) p7,
(select 0 as n union select 1) p8,
(select 0 as n union select 1) p9,
(select 0 as n union select 1) p10
order by 1
);
This will create a table with numbers from 0 to 2^11 - 1 (2047). If you need more numbers, just add more clauses :D
Once you have this table, you can join to it as a substitute for generate_series
with date_range as (select
'2012-06-29'::timestamp as start_date ,
'2012-07-03'::timestamp as end_date
)
select
dateadd(day, number::int, start_date)
from date_range
inner join numbers on number <= datediff(day, start_date, end_date)
@michael_erasmus Interesting! I made a change, using bit shifts instead of power(), for possibly better performance.
CREATE OR REPLACE VIEW v_series_0_to_1024 AS SELECT
p0.n
| (p1.n << 1)
| (p2.n << 2)
| (p3.n << 3)
| (p4.n << 4)
| (p5.n << 5)
| (p6.n << 6)
| (p7.n << 7)
| (p8.n << 8)
| (p9.n << 9)
as number
from
(select 0 as n union select 1) p0,
(select 0 as n union select 1) p1,
(select 0 as n union select 1) p2,
(select 0 as n union select 1) p3,
(select 0 as n union select 1) p4,
(select 0 as n union select 1) p5,
(select 0 as n union select 1) p6,
(select 0 as n union select 1) p7,
(select 0 as n union select 1) p8,
(select 0 as n union select 1) p9
order by number
Last 30 days date series:
select dateadd(day, -number, current_date) as dt from v_series_0_to_1024 where number < 30
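A sketch combining this view with the date_range pattern from the earlier answer (my composition of the two snippets above, not from the original):
with date_range as (
  select '2012-06-29'::timestamp as start_date,
         '2012-07-03'::timestamp as end_date
)
select dateadd(day, number::int, start_date)
from date_range
join v_series_0_to_1024 on number <= datediff(day, start_date, end_date);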

How to do dead reckoning on a column of a table in PostgreSQL

I have a table that looks like:
  x | y
----+------
  1 | 2
  2 | null
  3 | null
  1 | null
 11 | null
I want to fill in the null values with a rolling computation that applies y_{i+1} = y_{i} + x_{i+1}, in SQL as simple as possible (in place), so the expected result is:
  x | y
----+----
  1 | 2
  2 | 4
  3 | 7
  1 | 8
 11 | 19
This is in PostgreSQL. I could encapsulate it in a custom window function, but implementing a custom function always seems complex.
WITH RECURSIVE t AS (
  select x, y, 1 as rank from my_table where y is not null
  UNION ALL
  SELECT A.x, A.x + t.y AS y, t.rank + 1 AS rank
  FROM t
  inner join (select row_number() over () rank, x, y from my_table) A
    on t.rank + 1 = A.rank
)
SELECT x, y FROM t;
You can iterate over rows using a recursive CTE. But in order to do so, you need a way to jump from row to row. Here's an example using an ID column:
; with recursive cte as
(
select id
, y
from Table1
where id = 1
union all
select cur.id
, prev.y + cur.x
from Table1 cur
join cte prev
on cur.id = prev.id + 1
)
select *
from cte
;
You can see the query at SQL Fiddle. If you don't have an ID column, but you do have another way to order the rows, you can use row_number() to get an ID:
; with recursive sorted as
(
-- Specify your ordering here. This example sorts by the dt column.
select row_number() over (order by dt) as id
, *
from Table1
)
, cte as
(
select id
, y
from sorted
where id = 1
union all
select cur.id
, prev.y + cur.x
from sorted cur
join cte prev
on cur.id = prev.id + 1
)
select *
from cte
;
Here's the SQL Fiddle link.
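Since the recurrence simply unrolls to a running sum (y_n = y_1 + x_2 + ... + x_n), a non-recursive sketch with a window sum is also possible, assuming an ordering column id and that only the first row has a non-null y (both are assumptions):
select x,
       first_value(y) over (order by id)   -- the seed y of the first row
         - first_value(x) over (order by id)
         + sum(x) over (order by id) as y  -- running sum of x
from Table1;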

Window functions and more "local" aggregation

Suppose I have this table:
select * from window_test;
k | v
---+---
a | 1
a | 2
b | 3
a | 4
Ultimately I want to get:
k | min_v | max_v
---+-------+-------
a | 1 | 2
b | 3 | 3
a | 4 | 4
But I would be just as happy to get this (since I can easily filter it with distinct):
k | min_v | max_v
---+-------+-------
a | 1 | 2
a | 1 | 2
b | 3 | 3
a | 4 | 4
Is it possible to achieve this with PostgreSQL 9.1+ window functions? I'm trying to understand whether I can get it to use a separate partition for the first and the last occurrence of k=a in this sample (ordered by v).
This returns your desired result with the sample data. Not sure if it will work for real world data:
select k,
min(v) over (partition by group_nr) as min_v,
max(v) over (partition by group_nr) as max_v
from (
select *,
sum(group_flag) over (order by v,k) as group_nr
from (
select *,
case
when lag(k) over (order by v) = k then null
else 1
end as group_flag
from window_test
) t1
) t2
order by min_v;
I left out the DISTINCT though.
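A sketch of the same gaps-and-islands idea collapsed with GROUP BY instead of the window min/max, which yields the first desired output without needing DISTINCT:
select k, min(v) as min_v, max(v) as max_v
from (
  select *,
         sum(group_flag) over (order by v, k) as group_nr
  from (
    select *,
           case when lag(k) over (order by v) = k then null else 1 end as group_flag
    from window_test
  ) t1
) t2
group by group_nr, k
order by min_v;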
EDIT: I've come up with the following query, without window functions at all:
WITH RECURSIVE tree AS (
SELECT k, v, ''::text as next_k, 0 as next_v, 0 AS level FROM window_test
UNION ALL
SELECT c.k, c.v, t.k, t.v + level, t.level + 1
FROM tree t JOIN window_test c ON c.k = t.k AND c.v + 1 = t.v),
partitions AS (
SELECT t.k, t.v, t.next_k,
coalesce(nullif(t.next_v, 0), t.v) AS next_v, t.level
FROM tree t
WHERE NOT EXISTS (SELECT 1 FROM tree WHERE next_k = t.k AND next_v = t.v))
SELECT min(k) AS k, v AS min_v, max(next_v) AS max_v
FROM partitions p
GROUP BY v
ORDER BY 2;
I've provided two working queries now; I hope one of them will suit you.
SQL Fiddle for this variant.
Another way to achieve this is to use a support sequence.
Create a support sequence:
CREATE SEQUENCE wt_rank START WITH 1;
The query:
WITH source AS (
SELECT k, v,
coalesce(lag(k) OVER (ORDER BY v), k) AS prev_k
FROM window_test
CROSS JOIN (SELECT setval('wt_rank', 1)) AS ri),
ranking AS (
SELECT k, v, prev_k,
CASE WHEN k = prev_k THEN currval('wt_rank')
ELSE nextval('wt_rank') END AS rank
FROM source)
SELECT r.k, min(s.v) AS min_v, max(s.v) AS max_v
FROM ranking r
JOIN source s ON r.v = s.v
GROUP BY r.rank, r.k
ORDER BY 2;
Would this not do the job for you, without the need for windows, partitions or coalescing? It just uses a traditional SQL trick: finding the nearest tuples via a self join and a min on the difference:
SELECT k, min(v), max(v) FROM (
SELECT k, v, v + min(d) lim FROM (
SELECT x.*, y.k n, y.v - x.v d FROM window_test x
LEFT JOIN window_test y ON x.k <> y.k AND y.v - x.v > 0)
z GROUP BY k, v, n)
w GROUP BY k, lim ORDER BY 2;
I think this is probably a more 'relational' solution, but I'm not sure about its efficiency.