Calculate the autocorrelation in PySpark

I am currently migrating my scripts from pandas to PySpark. I want to calculate the autocorrelation of returns for each stock on each day. My data looks like:
+-----+-------+------+--------+--------+
|stock| date  | hour | minute | return |
+-----+-------+------+--------+--------+
| VOD | 01-02 |  10  |   13   |  0.05  |
| VOD | 01-02 |  10  |   14   |  0.02  |
| VOD | 01-02 |  10  |   16   | -0.02  |
| VOD | 01-02 |  11  |   13   |  0.05  |
| VOD | 01-02 |  12  |   03   |  0.02  |
| VOD | 01-02 |  13  |   45   | -0.02  |
| ... |  ...  | ...  |  ...   |  ...   |
| ABC | 01-02 |  11  |   13   |  0.01  |
| ABC | 01-02 |  11  |   14   |  0.02  |
| ABC | 01-02 |  11  |   15   |  0.03  |
+-----+-------+------+--------+--------+
The desired output should look like:
+-----+-------+------+
|stock| date  | auto |
+-----+-------+------+
| VOD | 01-02 | 0.04 |
| VOD | 01-03 | 0.07 |
| VOD | 01-04 | 0.01 |
| VOD | 01-05 | 0.05 |
+-----+-------+------+
It's very simple to do this in pandas:
df_auto = df.groupby(['stock', 'date'])['return'].apply(pd.Series.autocorr, lag=1).reset_index(name='auto')
However, can someone let me know how to get the autocorrelation of a factor in PySpark? Thanks.

sample data:
df.show()
+-----+-----+----+------+------+
|stock| date|hour|minute|return|
+-----+-----+----+------+------+
| VOD|01-02| 10| 13| 0.05|
| VOD|01-02| 10| 14| 0.02|
| VOD|01-02| 10| 16| -0.02|
| VOD|01-02| 11| 13| 0.05|
| VOD|01-02| 12| 3| 0.02|
| VOD|01-02| 13| 45| -0.02|
+-----+-----+----+------+------+
Use groupBy with collect_list and apply a UDF on the collected list. There is no autocorr function in Spark, so we have to use a pandas Series inside the UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
import pandas as pd

def autocorr(ret):
    return float(pd.Series(ret).autocorr(lag=1))

auto = F.udf(autocorr, FloatType())
df.groupBy("stock", "date") \
  .agg(F.collect_list(F.col("return")).alias("return")) \
  .withColumn("auto", auto("return")) \
  .select("stock", "date", "auto") \
  .show(truncate=False)
+-----+-----+-----------+
|stock|date |auto |
+-----+-----+-----------+
|VOD |01-02|-0.28925422|
+-----+-----+-----------+
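One caveat worth adding (not part of the original answer): collect_list does not guarantee the order of the collected values, so the lag-1 autocorrelation may be computed on out-of-order returns. A minimal sketch, assuming the same df as above, that collects (hour, minute, return) structs and sorts them inside the UDF before computing the autocorrelation:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
import pandas as pd

def autocorr_sorted(rows):
    # rows is a list of Row(hour, minute, return); sort by time before lagging
    ordered = sorted(rows, key=lambda r: (r["hour"], r["minute"]))
    return float(pd.Series([r["return"] for r in ordered]).autocorr(lag=1))

auto_sorted = F.udf(autocorr_sorted, FloatType())

df.groupBy("stock", "date") \
  .agg(F.collect_list(F.struct("hour", "minute", "return")).alias("obs")) \
  .withColumn("auto", auto_sorted("obs")) \
  .select("stock", "date", "auto") \
  .show(truncate=False)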

Related

PySpark rename multiple columns based on regex pattern list

I have a dataframe as shown below. I want to rename columns based on regex patterns.
patterns = ["price-usd-([0-9]+)", "list_price_([0-9]+)", "price_per_([0-9]+)_units", "pricefor([0-9]+)", "([0-9]+)_plus_price", "break_price_([0-9]+)", "price_break_pricing_([a-z]+)"]
Based on the above patterns, I want to rename the columns in the dataframe as below.
------------------------------------------------------------------------------------------------------------------------------------------
| item_name | price-usd-1 | break_price_7 | pricefor5 | price_per_9_units | price_break_pricing_a | 2_plus_price | list_price_8 |
------------------------------------------------------------------------------------------------------------------------------------------
| Samsung Z | 10000 | 5 | 9000 | 10 | 7000 | 4 | 21 |
| Moto G4 | 12000 | 10 | 10000 | 20 | 6000 | 3 | 43 |
| Mi 4i | 15000 | 8 | 12000 | 20 | 10000 | 5 | 25 |
| Moto G3 | 20000 | 5 | 18000 | 12 | 15000 | 10 | 15 |
------------------------------------------------------------------------------------------------------------------------------------------
Output:
----------------------------------------------------------------------------------------------------------------------
| item_name | price_1 | price_7 | price_5 | price_9 | price_a | price_2 | price_8 |
----------------------------------------------------------------------------------------------------------------------
| Samsung Z | 10000 | 5 | 9000 | 10 | 7000 | 4 | 21 |
| Moto G4 | 12000 | 10 | 10000 | 20 | 6000 | 3 | 43 |
| Mi 4i | 15000 | 8 | 12000 | 20 | 10000 | 5 | 25 |
| Moto G3 | 20000 | 5 | 18000 | 12 | 15000 | 10 | 15 |
----------------------------------------------------------------------------------------------------------------------
I would go with regex here: extract the relevant tokens from each column name and then rename.
Data
df = spark.createDataFrame(
    [('Samsung Z', 10000, 5, 9000, 10, 7000, 20, 'amazon.com'),
     ('Moto G4', 12000, 10, 10000, 20, 6000, 50, 'ebay.com'),
     ('Mi 4i', 15000, 8, 12000, 20, 10000, 25, 'deals.com'),
     ('Moto G3', 20000, 5, 18000, 12, 15000, 30, 'ebay.com')],
    ('item_name', 'price-usd-1', 'break_price_7', 'pricefor5',
     'price_per_9_units', 'price_3', 'price_break_pricing_a6', '2_plus_price'))
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
| item_name| price-usd-1|break_price_7 |pricefor5 |price_per_9_units|price_3|price_break_pricing_a6|2_plus_price|
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
|Samsung Z | 10000| 5| 9000| 10| 7000| 20| amazon.com|
| Moto G4| 12000| 10| 10000| 20| 6000| 50| ebay.com|
| Mi 4i | 15000| 8| 12000| 20| 10000| 25| deals.com|
| Moto G3| 20000| 5| 18000| 12| 15000| 30| ebay.com|
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
Solution
import re

# extract 'price' and any digits from each column name, sort, and join with '_'
new_cols = ['_'.join(sorted(re.findall(r'price|\d', c), reverse=True)) for c in df.columns if c != 'item_name']
df.toDF('item_name', *new_cols).show()  # pass the new names into toDF
+----------+-------+-------+-------+-------+-------+-------+----------+
| item_name|price_1|price_7|price_5|price_9|price_3|price_6| price_2|
+----------+-------+-------+-------+-------+-------+-------+----------+
|Samsung Z | 10000| 5| 9000| 10| 7000| 20|amazon.com|
| Moto G4| 12000| 10| 10000| 20| 6000| 50| ebay.com|
| Mi 4i | 15000| 8| 12000| 20| 10000| 25| deals.com|
| Moto G3| 20000| 5| 18000| 12| 15000| 30| ebay.com|
+----------+-------+-------+-------+-------+-------+-------+----------+
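If you would rather drive the renaming strictly from the pattern list in the question instead of the 'price plus digits' heuristic, a sketch of that idea (my addition; it assumes column names that actually match one of the listed patterns, and leaves anything else untouched):
import re

patterns = ["price-usd-([0-9]+)", "list_price_([0-9]+)", "price_per_([0-9]+)_units",
            "pricefor([0-9]+)", "([0-9]+)_plus_price", "break_price_([0-9]+)",
            "price_break_pricing_([a-z]+)"]

def rename(col_name):
    # return price_<captured group> for the first pattern that fully matches, else keep the name
    for p in patterns:
        m = re.fullmatch(p, col_name)
        if m:
            return "price_" + m.group(1)
    return col_name

df.toDF(*[rename(c) for c in df.columns]).show()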

Create a Dataframe based on ranges of other Dataframe

I have a Spark Dataframe containing ranges of numbers (a start column and an end column), and a column containing the type of each range.
I want to create a new Dataframe with two columns: the first lists every number in each range (incremented by one), and the second lists the range's type.
To explain more, this is the input Dataframe:
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result :
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?
Try this.
import org.apache.spark.sql.functions.{explode, sequence}
import spark.implicits._
val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")
df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
The solution provided by #Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}
df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
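For completeness, a PySpark sketch of the same explode-over-sequence idea (my addition; it assumes Spark 2.4+, where F.sequence is available, and an existing SparkSession named spark):
from pyspark.sql import functions as F

df = spark.createDataFrame([(10, 20, "LOW"), (21, 30, "MEDIUM"), (31, 40, "HIGH")],
                           ["start", "end", "type"])

df.withColumn("nbr", F.explode(F.sequence("start", "end"))) \
  .select("nbr", "type") \
  .show()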

Mean of different columns ignoring null values, Spark Scala

I have a dataframe with several columns, and I am trying to compute the row-wise mean of these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get a null result whenever any of the columns is null:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
import org.apache.spark.sql.functions.{array, col, explode, mean}

val result = df.join(
  df.select(col("Baller"), explode(array(col("Power"), col("Vision"), col("KXD"))))
    .groupBy("Baller")
    .agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+

Time series with Scala and Spark: rolling window

I'm trying to work through the following exercise using Scala and Spark.
Given a file containing two columns: a time in seconds and a value
Example:
|---------------------|------------------|
| seconds | value |
|---------------------|------------------|
| 225 | 1,5 |
| 245 | 0,5 |
| 300 | 2,4 |
| 319 | 1,2 |
| 320 | 4,6 |
|---------------------|------------------|
and given a value V to be used for the rolling window, this output should be created:
Example with V=20
|--------------|---------|--------------------|----------------------|
| seconds | value | num_row_in_window |sum_values_in_windows |
|--------------|---------|--------------------|----------------------|
| 225 | 1,5 | 1 | 1,5 |
| 245 | 0,5 | 2 | 2 |
| 300 | 2,4 | 1 | 2,4 |
| 319 | 1,2 | 2 | 3,6 |
| 320 | 4,6 | 3 | 8,2 |
|--------------|---------|--------------------|----------------------|
num_row_in_window is the number of rows contained in the current window and
sum_values_in_windows is the sum of the values contained in the current window.
I've been trying the sliding function and the SQL API, but it's unclear to me which is the best way to tackle this problem, given that I'm a Spark/Scala novice.
This is a perfect application for window functions. By using rangeBetween you can set your sliding window to 20 seconds. Note that in the example below no partitioning is specified (no partitionBy); without partitioning, this code will not scale:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._

val df = Seq(
  (225, 1.5),
  (245, 0.5),
  (300, 2.4),
  (319, 1.2),
  (320, 4.6)
).toDF("seconds", "value")

val window = Window.orderBy($"seconds").rangeBetween(-20L, 0L) // add partitioning here

df
  .withColumn("num_row_in_window", sum(lit(1)).over(window))
  .withColumn("sum_values_in_window", sum($"value").over(window))
  .show()
+-------+-----+-----------------+--------------------+
|seconds|value|num_row_in_window|sum_values_in_window|
+-------+-----+-----------------+--------------------+
| 225| 1.5| 1| 1.5|
| 245| 0.5| 2| 2.0|
| 300| 2.4| 1| 2.4|
| 319| 1.2| 2| 3.6|
| 320| 4.6| 3| 8.2|
+-------+-----+-----------------+--------------------+
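For anyone following along in PySpark, a sketch of the same rangeBetween window (my addition; same data, and again no partitionBy, so it will not scale):
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame([(225, 1.5), (245, 0.5), (300, 2.4), (319, 1.2), (320, 4.6)],
                           ["seconds", "value"])

w = Window.orderBy("seconds").rangeBetween(-20, 0)  # add partitionBy here on real data

df.withColumn("num_row_in_window", F.count(F.lit(1)).over(w)) \
  .withColumn("sum_values_in_window", F.sum("value").over(w)) \
  .show()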

Spark groupBy, filter, and sorting: top 3 read articles for each city

I have table data like the following:
+-----------+--------+-------------+
| City Name | URL | Read Count |
+-----------+--------+-------------+
| Gurgaon | URL1 | 3 |
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL4 | 1 |
| Gurgaon | URL5 | 5 |
| Delhi | URL3 | 4 |
| Delhi | URL7 | 2 |
| Delhi | URL5 | 1 |
| Delhi | URL6 | 6 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+-------------+
I would like to see something like this: the top 3 read articles (if they exist) for each city.
+-----------+--------+--------+
| City Name | URL | Count |
+-----------+--------+--------+
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL5 | 5 |
| Delhi | URL6 | 6 |
| Delhi | URL3 | 4 |
| Delhi | URL1 | 3 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+--------+
I am working on Spark 2.0.2, Scala 2.11.8
You can use a window function to get the output.
import org.apache.spark.sql.expressions.Window
val df = sc.parallelize(Seq(
  ("Gurgaon", "URL1", 3), ("Gurgaon", "URL3", 6), ("Gurgaon", "URL6", 5), ("Gurgaon", "URL4", 1), ("Gurgaon", "URL5", 5),
  ("DELHI", "URL3", 4), ("DELHI", "URL7", 2), ("DELHI", "URL5", 1), ("DELHI", "URL6", 6), ("Mumbai", "URL5", 5),
  ("Punjab", "URL6", 6), ("Punjab", "URL4", 1))).toDF("City", "URL", "Count")
df.show()
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL1| 3|
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL4| 1|
|Gurgaon|URL5| 5|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| DELHI|URL5| 1|
| DELHI|URL6| 6|
| Mumbai|URL5| 5|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
val w = Window.partitionBy($"City").orderBy($"Count".desc)
val dfTop = df.withColumn("row", rowNumber.over(w)).where($"row" <= 3).drop("row")
dfTop.show
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL5| 5|
| Mumbai|URL5| 5|
| DELHI|URL6| 6|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
Output tested on Spark 1.6.2
Window functions are probably the way to go, and there is a built-in function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}
val window = Window.partitionBy($"City").orderBy(desc("Count"))
val dfTop = df.withColumn("rank", rank.over(window)).where($"rank" <= 3)
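For reference, a PySpark sketch of the same partition-by-city, rank-over-window approach (my addition; it assumes the df shown above):
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("City").orderBy(F.desc("Count"))

df.withColumn("rank", F.rank().over(w)) \
  .where(F.col("rank") <= 3) \
  .drop("rank") \
  .show()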