Define custom window functions in PySpark

I have a dataframe in Spark for which I would like to create a column defined recursively like this:
new_column_row = f(last_column_row, other_parameters)
The best way to do that would be to define a custom window function, but I cannot find a way to do it in PySpark. Has anybody encountered the same problem?
The case I am working on is reconstructing an order book from a list of orders. I have a dataframe like this (output is the column I would like to calculate):
size | price | output
1    | 1     | {1: 1}
1.2  | 1.1   | {1: 1, 1.2: 1.1}
1.3  | -1    | {1.2: 1.1}
Output is updated at each row like this (in Python pseudocode):
if price not in output:
    output[price] = size
else:
    output[price] = output[price] + size
    if output[price] == 0:
        del output[price]
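For what it's worth, PySpark does not (as far as I know) let you register a custom stateful window function, so one workaround is to run the sequential update per group in pandas. A minimal sketch, assuming Spark 3.x and two columns that are not in the example above (a hypothetical ordering column ts and grouping key instrument):

import pandas as pd

# Assumed return schema: the input df is assumed to have exactly these columns,
# and the book is serialized to a string for simplicity.
schema = "instrument string, ts long, size double, price double, output string"

def rebuild_book(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per instrument, sequentially over the ordered rows
    pdf = pdf.sort_values("ts")
    book, outputs = {}, []
    for size, price in zip(pdf["size"], pdf["price"]):
        book[price] = book.get(price, 0) + size
        if book[price] == 0:
            del book[price]
        outputs.append(str(book))
    pdf["output"] = outputs
    return pdf

result = df.groupBy("instrument").applyInPandas(rebuild_book, schema=schema)

This parallelizes across instruments but not within a single order stream, which is inherent to the recursive definition.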

Related

pyspark dataframe: add a new indicator column with random sampling

I have a spark dataframe containing the following schema:
StructType(List(StructField(email_address,StringType,true), StructField(subject_line,StringType,true)))
I want to randomly sample 50% of the population into control and test groups. Currently I am doing it the following way:
df_segment_ctl = df_segment.sample(False, 0.5, seed=0)
df_segment_tmt = df_segment.join(df_segment_ctl, ["email_address"], "leftanti")
But I am certain there must be a better way, such as adding a column like the following:
| email_address        | segment_id | group     |
+----------------------+------------+-----------+
| xxxxxxxxxx#gmail.com | 1.1        | treatment |
| xxxxxxx#gmail.com    | 1.6        | control   |
Any help is appreciated. I am new to this world.
UPDATE:
I don't want to split the dataframe into two; I just want to add an indicator column.
UPDATE:
Is it possible to have multiple splits elegantly? Suppose instead of two groups I want a single control group and two treatment groups:
| email_address        | segment_id | group   |
+----------------------+------------+---------+
| xxxxxxxxxx#gmail.com | 1.1        | treat_1 |
| xxxxxxx#gmail.com    | 1.6        | control |
| xxxxx#gmail.com      | 1.6        | treat_2 |
You can split the Spark dataframe using randomSplit like below:
df_segment_ctl, df_segment_tmt = df_segment.randomSplit(weights=[0.5,0.5], seed=0)
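If you just want an indicator column (per the update) rather than two dataframes, here is a minimal sketch using rand plus when/otherwise; the 0.5/0.75 thresholds and the group labels are assumptions to adapt:

from pyspark.sql import functions as F

# 50% control, 25% treat_1, 25% treat_2 -- adjust thresholds as needed
df_with_group = (
    df_segment
    .withColumn("_rand", F.rand(seed=0))
    .withColumn("group",
                F.when(F.col("_rand") < 0.5, "control")
                 .when(F.col("_rand") < 0.75, "treat_1")
                 .otherwise("treat_2"))
    .drop("_rand")
)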

scala fast range lookup on 2 columns

I have a spark dataframe that I am broadcasting as Array[Array[String]].
My requirement is to do a range lookup on 2 columns.
Right now I have something like ->
val cols = data.filter(_(0).toLong <= ip).filter(_(1).toLong >= ip).take(1) match {
  case Array(t) => t
  case _ => Array()
}
The following data file is stored as an Array[Array[String]] (the header row is shown below only for reference) and passed to the filter function shown above.
sample data file ->
startIPInt | endIPInt | lat | lon
676211200 | 676211455 | 37.33053 | -121.83823
16777216 | 16777342 | -34.9210644736842 | 138.598709868421
17081712 | 17081712 | 0 | 0
sample value to search ->
ip = 676211325
Based on the startIPInt and endIPInt ranges, I want the rest of the matching row (the lat/lon mapping).
This lookup takes 1-2 seconds each, and I am not even sure the 2nd filter condition is getting executed (in debug mode it always seems to execute only the 1st condition). Can someone suggest a faster and more reliable lookup here?
Thanks!
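One possible speedup, since the ranges appear to be non-overlapping: sort the broadcast rows by startIPInt once, then binary-search per lookup instead of scanning with two filters. A sketch of the idea in Python for illustration (in Scala, java.util.Arrays.binarySearch or a sorted structure would play the same role):

import bisect

# rows: the broadcast data as a list of [startIPInt, endIPInt, lat, lon] strings,
# with ranges assumed to be non-overlapping
def build_index(rows):
    rows = sorted(rows, key=lambda r: int(r[0]))   # sort once by range start
    starts = [int(r[0]) for r in rows]
    return starts, rows

def lookup(ip, starts, rows):
    i = bisect.bisect_right(starts, ip) - 1        # last range starting at or before ip
    if i >= 0 and ip <= int(rows[i][1]):
        return rows[i]
    return None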

How to create a column of row id in Spark dataframe for each distinct column value using Scala

I have a data frame in Scala Spark as:
category | score
A | 0.2
A | 0.3
A | 0.3
B | 0.9
B | 0.8
B | 1
I would like to add a row-id column as:
category | score | row-id
A | 0.2 | 0
A | 0.3 | 1
A | 0.3 | 2
B | 0.9 | 0
B | 0.8 | 1
B | 1 | 2
Basically I want the row id to be monotonically increasing for each distinct value in the category column. I already have a sorted dataframe, so all rows with the same category are grouped together. However, I still don't know how to generate a row_id that restarts when a new category appears. Please help!
This is a good use case for Window aggregation functions
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import df.sparkSession.implicits._
val window = Window.partitionBy('category).orderBy('score)
df.withColumn("row-id", row_number.over(window))
Window functions work kind of like groupBy, except that instead of each group returning a single value, each row in each group returns a single value. In this case the value is the row's position within the group of rows with the same category. Also, if this is the effect you are trying to achieve, you don't need to pre-sort by category beforehand.
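For reference, a rough PySpark equivalent sketch of the same idea; note that row_number starts at 1, so subtract 1 if you want the 0-based row-id shown in the expected output:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy("category").orderBy("score")
df.withColumn("row-id", F.row_number().over(window) - 1)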

Calculate frequency of column in data frame using spark sql

I'm trying to get the frequency of distinct values in a Spark dataframe column, something like value_counts from Python pandas. By frequency I mean the highest-occurring value in a table column (rank 1 value, rank 2, rank 3, etc.). In the expected output, 1 has occurred 9 times in column a, so it has the topmost frequency.
I'm using Spark SQL but it is not working out, maybe because the reduce operation I have written is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x= parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql
(s"select 'ParquetRDD' as TableName,
'$field' as column,
min($field) as min, max($field) as max,
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field) as frequency from peopleRDDtable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is the query below:
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName  | column | min | max | frequency1
-----------+--------+-----+-----+-----------
ParquetRDD | a      | 1   | 30  | 9
ParquetRDD | b      | 2   | 21  | 5
How do I solve this? Please help.
I could solve the issue by using count($field) instead of approx_count_distinct($field), and then using the rank analytic function to get the first-ranked value. It worked.
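A minimal sketch of that approach (not the original code, which is not shown here; column a and the table name are taken from the question) using count and rank in Spark SQL:

# Counts each distinct value of column a, then keeps the most frequent one (rank 1)
top_freq = spark.sql("""
    SELECT value, number_cnt
    FROM (
        SELECT a AS value,
               COUNT(a) AS number_cnt,
               RANK() OVER (ORDER BY COUNT(a) DESC) AS rnk
        FROM peopleRDDtable
        GROUP BY a
    ) ranked
    WHERE rnk = 1
""")
top_freq.show()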

How to automatically calculate the SUS Score for a given spreadsheet in LibreOffice Calc?

I have several spreadsheets for a SUS-Score usability test.
They have this form:
                                           | Strongly disagree |   |   |   | Strongly agree |
I think that I would use this system often | x                 |   |   |   |                |
I found the system too complex             |                   | x |   |   |                |
(..)                                       |                   |   |   |   | x              |
(...)                                      | x                 |   |   |   |                |
To calculate the SUS-Score you have 3 rules:
Odd item: Pos - 1
Even item: 5 - Pos
Add up the scores and multiply by 2.5
So for the first entry (odd item) you have: Pos - 1 = 1 - 1 = 0
Second item (even): 5 - Pos = 5 - 2 = 3
Now I have several of those spreadsheets and want to calculate the SUS-Score automatically. I've changed the x to a 1 and tried to use IF(F5=1,5-1). But I would need an IF-condition for every column: =IF(F5=1;5-1;IF(E5=1;4-1;IF(D5=1;3-1;IF(C5=1;2-1;IF(B5=1;1-1))))), so is there an easier way to calculate the score, based on the position in the table?
I would use a helper table and then SUM() all the cells of the helper table and multiply by 2.5. This formula (modified as needed, see notes below) can start your helper table and be copy-pasted to fill out the entire table:
=IF(D2="x";IF(MOD(ROW();2)=1;5-D$1;D$1-1);"")
Here D is an answer column.
Depending on what row (odd/even) your answers start in, you may need to change the =1 after the MOD function to =0.
This assumes the position numbers are in row 1; if they are in a different row, change the number after the $ accordingly.