A Postgres query to get the difference between each row's value and the previous row's value

I have a table (mytable) like this:
id | value
==========
 1 |  4
 2 |  5
 3 |  8
 4 | 16
 5 |  8
...
I need a query that gives me, for each row, its value minus the previous row's value:
id | value | diff
=================
 1 |  4    |  4   (4 - NULL)
 2 |  5    |  1   (5 - 4)
 3 |  8    |  3   (8 - 5)
 4 | 16    |  8   (16 - 8)
 5 |  8    | -8   (8 - 16)
...
Right now I use a Python script to do this, but I guess it would be faster to create a view on this table.

You should use window functions - LAG() in this case:
SELECT id, value, value - LAG(value, 1) OVER (ORDER BY id) AS diff
FROM mytable
ORDER BY id;
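With the query above the first row's diff is NULL, since there is no previous row. If you want it to show the row's own value (4), as in the desired output, LAG accepts a default value as its third argument; and since the question mentions a view, the whole thing can be wrapped in one. A minimal sketch, assuming the table above (the view name mytable_diff is made up here):
CREATE VIEW mytable_diff AS
SELECT id,
       value,
       -- the third argument of LAG supplies 0 when there is no previous row
       value - LAG(value, 1, 0) OVER (ORDER BY id) AS diff
FROM mytable;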

Related

How to get the change number?

How can I increase a value each time the source value changes?
I have tried rank, dense_rank, and row_number without success =(
id | src | how to get this?
---+-----+-----------------
 1 |  1  | 1
 2 |  1  | 1
 3 |  7  | 2
 4 |  1  | 3
 5 |  3  | 4
 6 |  3  | 4
 7 |  1  | 5
Note: src is guaranteed to be in the order shown.
Is there a simple way to do this?
You can achieve this by nesting two window functions - the first to get whether the src value changed from the previous row, the second to sum the number of changes. Unfortunately Postgres doesn't allow nesting window functions directly, but you can work around that with a subquery:
SELECT
    id,
    src,
    sum(incr) OVER (ORDER BY id)
FROM (
    SELECT
        *,
        (lag(src) OVER (ORDER BY id) IS DISTINCT FROM src)::int AS incr
    FROM example
) AS _;
(online demo)
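For reference, a minimal setup matching the sample data in the question (the table name example is the one used in the query above):
CREATE TABLE example (id int PRIMARY KEY, src int);
INSERT INTO example (id, src) VALUES
    (1, 1), (2, 1), (3, 7), (4, 1), (5, 3), (6, 3), (7, 1);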

PostgreSQL: for each row, generate new rows and merge

I have a table called example that looks as follows:
ID | MIN | MAX
---+-----+----
 1 |   1 |   5
 2 |  34 |  38
I need to take each ID and loop from its min to max, incrementing by 2, to get the following WITHOUT using INSERT statements, i.e. in a single SELECT:
ID | INDEX | VALUE
---+-------+------
 1 |     1 |     1
 1 |     2 |     3
 1 |     3 |     5
 2 |     1 |    34
 2 |     2 |    36
 2 |     3 |    38
Any ideas of how to do this?
The set-returning function generate_series does exactly that:
SELECT
    id,
    generate_series(1, (max - min) / 2 + 1) AS index,
    generate_series(min, max, 2) AS value
FROM
    example;
(online demo)
The index can alternatively be generated with RANK() (see also a_horse_with_no_name's answer below) if you don't want to rely on the two set-returning functions being evaluated in parallel.
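A sketch of that RANK() variant, assuming the same example table: the value series is generated in a subquery and then ranked within each id.
SELECT id,
       rank() OVER (PARTITION BY id ORDER BY value) AS index,
       value
FROM (
    SELECT id, generate_series(min, max, 2) AS value
    FROM example
) AS s;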
Use generate_series() to generate the numbers and a window function to calculate the index:
select e.id,
       row_number() over (partition by e.id order by g.value) as index,
       g.value
from example e
cross join generate_series(e.min, e.max, 2) as g(value);
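To try either query, a minimal table matching the question's sample data could be created like this:
CREATE TABLE example (id int, min int, max int);
INSERT INTO example (id, min, max) VALUES (1, 1, 5), (2, 34, 38);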

How would you create a group identifier based on one column, but sorted by another?

I am attempting to create column Group via T-SQL.
If a cluster of the same account appears in consecutive rows, consider that one group. If the account is seen again lower in the list (clustered or not), then consider it a new group. This seems straightforward, but I cannot seem to see the solution... Below there are three clusters of account 3456, each having a different group number (Groups 1, 4, and 6).
+-------+---------+------+
| Group | Account | Sort |
+-------+---------+------+
| 1 | 3456 | 1 |
| 1 | 3456 | 2 |
| 2 | 9878 | 3 |
| 3 | 5679 | 4 |
| 4 | 3456 | 5 |
| 4 | 3456 | 6 |
| 4 | 3456 | 7 |
| 5 | 1295 | 8 |
| 6 | 3456 | 9 |
+-------+---------+------+
UPDATE: I left this out of the original requirements, but a cluster of accounts could have more than two accounts. I updated the example data to include this scenario.
Here's how I'd do it:
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(3456,8);
--Solution
SELECT [Group] = DENSE_RANK() OVER (ORDER BY grouper.groupID), grouper.Account, grouper.Sort
FROM
(
    SELECT t.*, groupID = ROW_NUMBER() OVER (ORDER BY t.sort) +
        CASE t.Account WHEN LEAD(t.Account,1) OVER (ORDER BY t.sort) THEN 1 ELSE 0 END
    FROM @table AS t
) AS grouper;
Results:
Group   Account   Sort
-----   -------   ----
    1      3456      1
    1      3456      2
    2      9878      3
    3      5679      4
    4      3456      5
    4      3456      6
    5      1295      7
    6      3456      8
Update based on the OP's comment below (2019-05-08)
I spent a couple days banging my head on how to handle groups of three or more; it was surprisingly difficult but what I came up with handles bigger clusters and is way better than my first answer. I updated the sample data to include bigger clusters.
Note that I include a UNIQUE constraint on the Sort column - this creates a unique index. You don't need the constraint for this solution to work, but having an index on that column (clustered, nonclustered unique, or just nonclustered) will improve performance dramatically.
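On a persistent table (rather than the table variable used in the sample below), such an index might look like the following sketch; the table and index names here are made up:
CREATE UNIQUE NONCLUSTERED INDEX IX_Accounts_Sort
    ON dbo.Accounts (Sort)
    INCLUDE (Account);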
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT UNIQUE);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(1295,8),(1295,9),(1295,10),(3456,11);
-- Better solution
WITH Groups AS
(
    SELECT t.*, Grouper =
        CASE t.Account WHEN LAG(t.Account,1,t.Account) OVER (ORDER BY t.Sort) THEN 0 ELSE 1 END
    FROM @table AS t
)
SELECT [Group] = SUM(sg.Grouper) OVER (ORDER BY sg.Sort) + 1, sg.Account, sg.Sort
FROM Groups AS sg;
Results:
Group   Account   Sort
-----   -------   ----
    1      3456      1
    1      3456      2
    2      9878      3
    3      5679      4
    4      3456      5
    4      3456      6
    5      1295      7
    5      1295      8
    5      1295      9
    5      1295     10
    6      3456     11

PostgreSQL WITH RECURSIVE query to get ordered parent-child chain by a Partition Key

I am having trouble writing an SQL script on PostgreSQL 9.6.6 that orders the steps in a process using the steps' parent-child IDs, grouped/partitioned per process ID. I couldn't find this particular case covered here, so I apologize if I missed it; if so, please point me to the existing solution in the comments.
The case: I have a table which looks like this:
processID | stepID | parentID
----------+--------+---------
        1 |      1 | NULL
        1 |      3 | 5
        1 |      2 | 4
        1 |      4 | 3
        1 |      5 | 1
        2 |      1 | NULL
        2 |      3 | 5
        2 |      2 | 4
        2 |      4 | 3
        2 |      5 | 1
Now I have to order the steps, starting for each processID with the step whose parentID is NULL.
Note: I cannot simply order by stepID or parentID, because new steps inserted into the middle of a process get a higher stepID than the last step in the process (a continuously generated surrogate key).
I have to order the steps for every processID so that I receive the following output:
processID | stepID | parentID
----------+--------+---------
        1 |      1 | NULL
        1 |      5 | 1
        1 |      3 | 5
        1 |      4 | 3
        1 |      2 | 4
        2 |      1 | NULL
        2 |      5 | 1
        2 |      3 | 5
        2 |      4 | 3
        2 |      2 | 4
I tried to do this with a recursive CTE (WITH RECURSIVE):
WITH RECURSIVE
starting (processID, stepID, parentID) AS
(
    SELECT b.processID, b.stepID, b.parentID
    FROM process b
    WHERE b.parentID IS NULL
),
descendants (processID, stepID, parentID) AS
(
    SELECT b.processID, b.stepID, b.parentID
    FROM starting b
    UNION ALL
    SELECT b.processID, b.stepID, b.parentID
    FROM process b
    JOIN descendants AS c ON b.parentID = c.stepID
)
SELECT * FROM descendants;
The result is not what I am searching for. Since we have hundreds of processes, I receive a list whose first records are the rows from the different processIDs that have a NULL parentID.
I guess I have to apply the recursion per processID, but I have no idea how.
Thank you for your help!
You should calculate the level of each step:
with recursive starting as (
    select processid, stepid, parentid, 0 as level
    from process
    where parentid is null
    union all
    select p.processid, p.stepid, p.parentid, s.level + 1
    from starting s
    join process p on s.stepid = p.parentid and s.processid = p.processid
)
select *
from starting
order by processid, level;
 processid | stepid | parentid | level
-----------+--------+----------+-------
         1 |      1 |          |     0
         1 |      5 |        1 |     1
         1 |      3 |        5 |     2
         1 |      4 |        3 |     3
         1 |      2 |        4 |     4
         2 |      1 |          |     0
         2 |      5 |        1 |     1
         2 |      3 |        5 |     2
         2 |      4 |        3 |     3
         2 |      2 |        4 |     4
(10 rows)
Of course, you can skip the last column in the final select if you do not need it.
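For completeness, a minimal setup matching the question's sample data (using the table name process from the queries above):
CREATE TABLE process (processid int, stepid int, parentid int);
INSERT INTO process (processid, stepid, parentid) VALUES
    (1, 1, NULL), (1, 3, 5), (1, 2, 4), (1, 4, 3), (1, 5, 1),
    (2, 1, NULL), (2, 3, 5), (2, 2, 4), (2, 4, 3), (2, 5, 1);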

Add a key element for n rows in PySpark Dataframe

I have a dataframe like the one shown below.
id | run_id
---+-------
 4 | 12345
 6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that will have the value 1 for the first n rows and 2 for the next n rows. With n = 2 the result will look like:
id | run_id | key
---+--------+----
 4 | 12345  |  1
 6 | 12567  |  1
10 | 12890  |  2
13 | 12450  |  2
Is it possible to do the same with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd (sc is the SparkContext, available by default in a PySpark shell)
rdd = sc.parallelize([[4, 12345], [6, 12567], [10, 12890], [13, 12450]])
# group size for key
n = 2
# add a row number, then label rows in batches of size n
# (zipWithIndex yields (row, index) pairs; indexing the pair avoids the
# Python 2-only tuple-unpacking lambda syntax)
rdd = rdd.zipWithIndex().map(lambda pair: pair[0] + [int(pair[1] / n) + 1])
# convert to dataframe
df = rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)
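If the dataframe is registered as a temporary view (for example with df.createOrReplaceTempView), the same key can also be computed with a window function in Spark SQL. A sketch assuming a view named runs, a group size of 2, and that id reflects the intended row order (note the global ORDER BY pulls all rows into a single partition):
SELECT
    id,
    run_id,
    CAST(FLOOR((ROW_NUMBER() OVER (ORDER BY id) - 1) / 2) AS INT) + 1 AS key
FROM runs;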