I have a PySpark dataframe and need to compute a column that depends on the value of the same column in the previous row. But instead of using the old value of this column from the previous row, I need the new one, to which the calculation has already been applied.
Specifically, I need to compute the following: given a dataframe with a column A, ordered by a column order, compute a column B sequentially as B = MAX(0, LAG(B) - LAG(A)), starting with a default value of 0 for the first row.
Example:
Input:
order | A
------|----
0 | -1
1 | -2
2 | 4
3 | 4
4 | -1
5 | 4
6 | -1
Wanted output:
order | A | B
------|----|---
0 | -1 | 0 <- B is set to 0 here (first-row default)
1 | -2 | 1
2 | 4 | 3
3 | 4 | 0
4 | -1 | 0
5 | 4 | 1
6 | -1 | 0
Using the default F.lag window function does not work, since it only yields the old value of the previous row. That is by design: if the function returned already-updated values, the column would have to be computed sequentially and distributed computing would no longer be possible.
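One possible workaround (not from the original question, just a minimal sketch) is to accept that this particular column is inherently sequential and run the recurrence inside a single pandas group via applyInPandas. This assumes the whole frame fits in the memory of one task; the constant grouping column `_grp` and the helper `compute_b` are names invented for illustration.

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, -1), (1, -2), (2, 4), (3, 4), (4, -1), (5, 4), (6, -1)],
    ["order", "A"],
)

def compute_b(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("order").reset_index(drop=True)
    b_values = []
    prev_b = prev_a = None
    for a in pdf["A"]:
        # B = max(0, lag(B) - lag(A)), with 0 for the first row
        b = 0 if prev_a is None else max(0, prev_b - prev_a)
        b_values.append(b)
        prev_b, prev_a = b, a
    pdf["B"] = b_values
    return pdf[["order", "A", "B"]]

result = (
    df.withColumn("_grp", F.lit(1))        # single group => single sequential pass
      .groupBy("_grp")
      .applyInPandas(compute_b, schema="`order` long, A long, B long")
      .orderBy("order")
)
result.show()
```

On the example data this reproduces the B column shown above; for data that does not fit into one task, the recurrence itself is the bottleneck rather than Spark.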
Related
I have one dataframe with 3 columns and 20,000 rows. I need to convert all 20,000 transid values into columns.
table macro:
prodid | transid | flag
-------|---------|-----
A      | 1       | 1
B      | 2       | 1
C      | 3       | 1
so on..
Expected output would be like this, up to 20,000 columns:
prodid | 1 | 2 | 3
-------|---|---|---
A      | 1 | 1 | 1
B      | 1 | 1 | 1
C      | 1 | 1 | 1
I have tried the PIVOT/transpose function, but it takes too long for high-volume data: converting 20,000 rows into columns takes around 10 hours.
e.g.
// collect the distinct transaction ids to use as explicit pivot values
val array = a1.select("trans_id").distinct.collect.map(x => x.getString(0)).toSeq
// pivot so that each trans_id becomes its own column, summing flag
val a2 = a1.groupBy("prodid").pivot("trans_id", array).sum("flag")
When I used pivot on 200-300 rows it worked fast, but as the number of rows increases, PIVOT does not perform well. Can anyone please help me find a solution? Is there any method to avoid the PIVOT function, since PIVOT seems suited only to low-volume conversion? How should I deal with high-volume data?
I need this type of conversion for matrix multiplication.
My input for the matrix multiplication will look like the table below, and the final result will be the matrix product (one possible alternative is sketched after the table):
|col1|col2|col3|col4|
|----|----|----|----|
|1 | 0 | 1 | 0 |
|0 | 1 | 0 | 0 |
|1 | 1 | 1 | 1 |
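Not part of the original post, but since the wide table is only needed as a matrix, one way to avoid the 20,000-column pivot is to keep the data in long (prodid, transid, flag) form and build a distributed sparse matrix from it; Spark can then multiply without ever materializing the pivoted table. This is only a sketch, written in PySpark to keep one language across the thread; the numeric prod_idx column (prodid mapped to a row index) and the A * A^T product are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.getOrCreate()

# long-format input: (row index, column index, value); prodid is assumed to be
# already mapped to a numeric prod_idx for this sketch
long_df = spark.createDataFrame(
    [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)],
    ["prod_idx", "transid", "flag"],
)

entries = long_df.rdd.map(lambda r: MatrixEntry(r["prod_idx"], r["transid"], r["flag"]))
coord = CoordinateMatrix(entries)

# BlockMatrix supports distributed multiplication; no wide pivot is ever built
block = coord.toBlockMatrix()
product = block.multiply(block.transpose())  # e.g. A * A^T
print(product.toLocalMatrix())
```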
I have a table like this (mytable):
id | value
=========
1 | 4
2 | 5
3 | 8
4 | 16
5 | 8
...
I need a query that, for each row, subtracts the previous row's value from the current one:
id | value | diff
=================
1 | 4 | 4 (4-Null)
2 | 5 | 1 (5-4)
3 | 8 | 3 (8-5)
4 | 16 | 8 (16-8)
5 | 8 | -8 (8-16)
...
Right now I use a Python script to do this, but I guess it would be faster if I created a view from this table.
You should use window functions - LAG() in this case:
SELECT id, value, value - LAG(value, 1) OVER (ORDER BY id) AS diff
FROM mytable
ORDER BY id;
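Since the surrounding thread is about PySpark, here is a sketch of the same lag-based difference in PySpark; the DataFrame construction below is just illustrative sample data. As in the SQL version, the first row's diff comes out NULL rather than the 4 shown above, unless it is explicitly coalesced to the value itself.

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 4), (2, 5), (3, 8), (4, 16), (5, 8)], ["id", "value"]
)

w = Window.orderBy("id")
# current value minus the previous row's value (NULL for the first row)
df.withColumn("diff", F.col("value") - F.lag("value", 1).over(w)).show()
```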
I have a dataframe like the one shown below.
id | run_id
--------------
4 | 12345
6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that will have the value 1 for the first n rows and 2 for the next n rows. The result will look like:
id | run_id | key
----------------------
4 | 12345 | 1
6 | 12567 | 1
10 | 12890 | 2
13 | 12450 | 2
Is it possible to do the same with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd = sc.parallelize([[4, 12345], [6, 12567], [10, 12890], [13, 12450]])
# group size for key
n = 2
# add a row number, then label rows in batches of size n
# (tuple unpacking in a lambda is Python 2 only, so unpack the pair by index)
rdd = rdd.zipWithIndex().map(lambda pair: pair[0] + [pair[1] // n + 1])
# convert to dataframe
df = rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)
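A hedged DataFrame-only alternative (my own sketch, not part of the original answer): a window row_number can produce the same batching without dropping to the RDD API. Note that the single global window pulls all rows through one partition, so this only suits modest data sizes; df2 and n below are illustrative stand-ins for the data and group size above.

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df2 = spark.createDataFrame(
    [(4, 12345), (6, 12567), (10, 12890), (13, 12450)], ["id", "run_id"]
)
n = 2  # group size, as in the answer above

w = Window.orderBy(F.monotonically_increasing_id())  # keep the incoming row order
df_keyed = df2.withColumn(
    "key",
    (F.floor((F.row_number().over(w) - 1) / n) + 1).cast("int"),
)
df_keyed.show()
```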
I am trying to push my data in Tableau down by one cell. My current data is in the format below:
Month | Actual
--------------
1 | 0
2 | x
3 | y
4 | z
I want to create a calculated field that will push the data in Actual down by one cell based on a condition, so that I get a new field, Expected, as below:
Month | Actual | Expected
-------------------------
1 | 0 | 0
2 | x | 0
3 | y | x
4 | z | y
It would be helpful if anybody could tell me the correct way of updating the value.
You can do this using a table calculation, but it is important how you show the data in your view.
Create a new calculated field and use the following logic:
IFNULL(LOOKUP(MAX([Actual]),-1),'0')
Your view and output should look like this:
I have a query that returns a count of events on dates over the last year.
e.g.
|Date | ev_count|
------------+----------
|2015-09-23 | 12 |
|2016-01-01 | 56 |
|2016-01-15 | 34 |
|2016-04-08 | 65 |
| ...
I want to produce a graph (date on the X-axis, value on the Y-axis) that either shows a value for every date (0 when there is no data), or at least places the dates that do have values at correctly scaled positions along the time axis.
My current graph has just the values one after another. I have previously used dimple for generating graphs, and if you tell it that it's a time axis, it automagically places dates correctly spaced.
This is what I get
|
| *
| *
| *
|*_______________
9 1 1 4
This is what I want to have
|
| *
| *
| *
|_________*________________________________________
0 0 1 1 1 0 0 0 0 .....
8 9 0 1 2 1 2 3 4
Is there a function/trick in BIRT that will allow me to fill in the gaps with 0 or position/scale the date somehow (e.g. based on a max/min)? Or do I have to join my data with a date generator query to fill in the gaps?
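Not an answer from the thread, but the asker's own fallback idea (joining against a generated calendar to fill the gaps with 0) would look roughly like the sketch below if the data set were prepared in PySpark before charting. The sample rows are taken from the example table above; everything else (column names, date handling) is an assumption.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame(
    [("2015-09-23", 12), ("2016-01-01", 56), ("2016-01-15", 34), ("2016-04-08", 65)],
    ["Date", "ev_count"],
).withColumn("Date", F.to_date("Date"))

# one row per calendar day between the first and last event date
calendar = (
    events.select(F.min("Date").alias("lo"), F.max("Date").alias("hi"))
    .select(F.explode(F.sequence("lo", "hi")).alias("Date"))
)

# left-join the counts onto the full calendar and default missing days to 0
filled = (
    calendar.join(events, on="Date", how="left")
    .withColumn("ev_count", F.coalesce("ev_count", F.lit(0)))
    .orderBy("Date")
)
filled.show()
```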