Spark Exponential Moving Average - scala

I have a dataframe of timeseries pricing data, with an ID, Date and Price.
I need to compute the Exponential Moving Average for the Price Column, and add it as a new column to the dataframe.
I have been using Spark's window functions before, and it looked like a fit for this use case, but given the formula for the EMA:
EMA: {Price - EMA(previous day)} x multiplier + EMA(previous day)
where
multiplier = (2 / (Time periods + 1)) //let's assume Time period is 10 days for now
I got a bit confused as to how can I access to the previous computed value in the column, while actually window-ing over the column.
With a simple moving average, it's simple, since all you need to do is compute a new column while averaging the elements in the window:
var window = Window.partitionBy("ID").orderBy("Date").rowsBetween(-windowSize, Window.currentRow)
dataFrame.withColumn(avg(col("Price")).over(window).alias("SMA"))
But it seems that with EMA its a bit more complicated since at every step I need the previous computed value.
I have also looked at Weighted moving average in Pyspark but I need an approach for Spark/Scala, and for a 10 or 30 days EMA.
Any ideas?

In the end, I've analysed how exponential moving average is implemented in pandas dataframes. Besides the recursive formula which I described above, and which is difficult to implement in any sql or window function(because its recursive), there is another one, which is detailed on their issue tracker:
y[t] = (x[t] + (1-a)*x[t-1] + (1-a)^2*x[t-2] + ... + (1-a)^n*x[t-n]) /
((1-a)^0 + (1-a)^1 + (1-a)^2 + ... + (1-a)^n).
Given this, and with additional spark implementation help from here, I ended up with the following implementation, which is roughly equivalent with doing pandas_dataframe.ewm(span=window_size).mean().
def exponentialMovingAverage(partitionColumn: String, orderColumn: String, column: String, windowSize: Int): DataFrame = {
val window = Window.partitionBy(partitionColumn)
val exponentialMovingAveragePrefix = "_EMA_"
val emaUDF = udf((rowNumber: Int, columnPartitionValues: Seq[Double]) => {
val alpha = 2.0 / (windowSize + 1)
val adjustedWeights = (0 until rowNumber + 1).foldLeft(new Array[Double](rowNumber + 1)) { (accumulator, index) =>
accumulator(index) = pow(1 - alpha, rowNumber - index); accumulator
}
(adjustedWeights, columnPartitionValues.slice(0, rowNumber + 1)).zipped.map(_ * _).sum / adjustedWeights.sum
})
dataFrame.withColumn("row_nr", row_number().over(window.orderBy(orderColumn)) - lit(1))
.withColumn(s"$column$exponentialMovingAveragePrefix$windowSize", emaUDF(col("row_nr"), collect_list(column).over(window)))
.drop("row_nr")
}
(I am presuming the type of the column for which I need to compute the exponential moving average is Double.)
I hope this helps others.

Related

Spark ml cosine similarity: how to get 1 to n similarity score

I read that I could use the columnSimilarities method that comes with RowMatrix to find the cosine similarity of various records (content-based). My data looks something like this:
genre,actor
horror,mohanlal shobhana pranav
comedy,mammooty suraj dulquer
romance,fahad dileep manju
comedy,prithviraj
Now,I have created a spark-ml pipeline to calculate the tf-idf of the above text features (genre, actor) and uses the VectorAssembler in my pipeline to assemble both the features into a single column "features". After that, I convert my obtained DataFrame using this :
val vectorRdd = finalDF.map(row => row.getAs[Vector]("features"))
to convert it into an RDD[Vector]
Then, I obtain my RowMatrix by
val matrix = new RowMatrix(vectorRdd)
I am following this guide for a reference to cosine similarity and what I need is a method in spark-mllib to find the similarity between a particular record and all the others like this method in sklearn, as shown in the guide :
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
But, I am unable to find how to do this. I don't understand what matrix.columnSimilarities() is comparing and returning. Can someone help me with what I am looking for?
Any help is appreciated! Thanks.
I had calculated it myself with 2 small functions. Call cosineSimilarity on crossJoin of 2 dataframes.(seperate 1st line and others into 2)
def cosineSimilarity(vectorA: SparseVector,
vectorB:SparseVector,normASqrt:Double,normBSqrt:Double) :
(Double,Double) = {
var dotProduct = 0.0
for (i <- vectorA.indices){
dotProduct += vectorA(i) * vectorB(i)
}
val div = (normASqrt * normBSqrt)
if (div == 0 )
(dotProduct,0)
else
(dotProduct,dotProduct / div)
}
val normSqrt : (org.apache.spark.ml.linalg.SparseVector => Double) = (vector: org.apache.spark.ml.linalg.SparseVector) => {
var norm = 0.0
for (i <- vector.indices ) {
norm += Math.pow(vector(i), 2)
}
Math.sqrt(norm)
}

Apache Spark: Exponential Moving Average

I am writing an application in Spark/Scala in which I need to calculate the exponential moving average of a column.
EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6)
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column. Via mySQL this would be possible by using MODEL or by creating an EMA column which you can then update row per row, but I've tried this and neither work with the Spark SQL or Hive Context... Is there any way I can access this EMA_t-1?
My data looks like this :
timestamp price
15:31 132.3
15:32 132.48
15:33 132.76
15:34 132.66
15:35 132.71
15:36 132.52
15:37 132.63
15:38 132.575
15:39 132.57
So I would need to add a new column where my first value is just the price of the first row and then I would need to use the previous value: EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6) to calculate the following rows in that column.
My EMA column would have to be:
EMA
132.3
132.372
132.5272
132.58032
132.632192
132.5873152
132.6043891
132.5926335
132.5835801
I am currently trying to do it using Spark SQL and Hive but if it is possible to do it in another way, this would be just as welcome! I was also wondering how I could do this with Spark Streaming. My data is in a dataframe and I am using Spark 1.4.1.
Thanks a lot for any help provided!
To answer your question:
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column
I think you need two functions: Window and Lag. (I also make null value to zero for convenience when calculating EMA)
my_window = Window.orderBy("timestamp")
df.withColumn("price_lag_1",when(lag(col("price"),1).over(my_window).isNull,lit(0)).otherwise(lag(col("price"),1).over(my_window)))
I am new to Spark Scala also, and am trying to see if I can define an UDF to do the exponential average. But for now an obvious walk around would be manually adding up all lag column ( 0.4 * lag0 + 0.4*0.6*lag1 + 0.4 * 0.6^2*lag2 ...) Something like this
df.withColumn("ema_price",
price * lit(0.4) * Math.pow(0.6,0) +
lag(col("price"),1).over(my_window) * 0.4 * Math.pow(0.6,1) +
lag(col("price"),2).over(my_window) * 0.4 * Math.pow(0.6,2) + .... )
I ignored the when.otherwise to make it more clear. And this method works for me now..
----Update----
def emaFunc (y: org.apache.spark.sql.Column, group: org.apache.spark.sql.Column, order: org.apache.spark.sql.Column, beta: Double, lookBack: Int) : org.apache.spark.sql.Column = {
val ema_window = Window.partitionBy(group).orderBy(order)
var i = 1
var result = y
while (i < lookBack){
result = result + lit(1) * ( when(lag(y,i).over(ema_window).isNull,lit(0)).otherwise(lag(y,i).over(ema_window)) * beta * Math.pow((1-beta),i)
- when(lag(y,i).over(ema_window).isNull,lit(0)).otherwise(y * beta * Math.pow((1-beta),i)) )
i = i + 1
}
return result }
By using this fuction you should be able to get EMA of price like..
df.withColumn("one",lit(1))
.withColumn("ema_price", emaFunc('price,'one,'timestamp,0.1,10)
This will look back 10 days and calculate estimate EMA with beta=0.1. Column "one" is just a place holder since you don't have grouping column.
You should be able to do this with Spark Window Functions, which were introduced in 1.4: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
w = Window().partitionBy().orderBy(col("timestamp"))
df.select("*", lag("price").over(w).alias("ema"))
This would select the last price for you so you can do your calculations on it

What is the difference between Apache Spark compute and slice?

I trying to make unit test on DStream.
I put data in my stream with a mutable queue ssc.queueStream(red)
I set the ManualClock to 0
Start my streaming context.
Advance my ManualClock to batchDuration milis
When i'm doing a
stream.slice(Time(0), Time(clock.getTimeMillis())).map(_.collect().toList)
I got a result.
when I do
for (time <- 0L to stream.slideDuration.milliseconds + 10) {
println("time "+ time + " " +stream.compute(Time(time)).map(_.collect().toList))
}
None of them contain a result event the stream.compute(Time(clock.getTimeMillis()))
So what is the difference between this two functions without considerings the parameters differences?
Compute will return an RDD only if the provided time is a correct time in a sliding window i.e it's the zero time + a multiple of the slide duration.
Slice will align both the from and to times to slide duration and compute for each of them.
In slide you provide time interval and from time interval as long as it is valid we generate Seq[Time]
def to(that: Time, interval: Duration): Seq[Time] = {
(this.milliseconds) to (that.milliseconds) by (interval.milliseconds) map (new Time(_))
}
and then we "compute" for each instance of Seq[Time]
alignedFromTime.to(alignedToTime, slideDuration).flatMap { time =>
if (time >= zeroTime) getOrCompute(time) else None
}
as oppose to compute we only compute for the instance of Time that we pass the compute method...

Different result returned using Scala Collection par in a series of runs

I have tasks that I want to execute concurrently and each task takes substantial amount of memory so I have to execute them in batches of 2 to conserve memory.
def runme(n: Int = 120) = (1 to n).grouped(2).toList.flatMap{tuple =>
tuple.par.map{x => {
println(s"Running $x")
val s = (1 to 100000).toList // intentionally to make the JVM allocate a sizeable chunk of memory
s.sum.toLong
}}
}
val result = runme()
println(result.size + " => " + result.sum)
The result I expected from the output was 120 => 84609924480 but the output was rather random. The returned collection size differed from execution to execution. Most of the time there was missing count even though all the futures were executed looking at the console. I thought flatMap waits the parallel executions in map to complete before returning the complete. What should I do to always get the right result using par? Thanks
Just for the record: changing the underlying collection in this case shouldn't change the output of your program. The problem is related to this known bug. It's fixed from 2.11.6, so if you use that (or higher) Scala version, you should not see the strange behavior.
And about the overflow, I still think that your expected value is wrong. You can check that the sum is overflowing because the list is of integers (which are 32 bit) while the total sum exceeds the integer limits. You can check it with the following snippet:
val n = 100000
val s = (1 to n).toList // your original code
val yourValue = s.sum.toLong // your original code
val correctValue = 1l * n * (n + 1) / 2 // use math formula
var bruteForceValue = 0l // in case you don't trust math :) It's Long because of 0l
for (i ← 1 to n) bruteForceValue += i // iterate through range
println(s"yourValue = $yourValue")
println(s"correctvalue = $correctValue")
println(s"bruteForceValue = $bruteForceValue")
which produces the output
yourValue = 705082704
correctvalue = 5000050000
bruteForceValue = 5000050000
Cheers!
Thanks #kaktusito.
It worked after I changed the grouped list to Vector or Seq i.e. (1 to n).grouped(2).toList.flatMap{... to (1 to n).grouped(2).toVector.flatMap{...

Consolidating a data table in Scala

I am working on a small data analysis tool, and practicing/learning Scala in the process. However I got stuck at a small problem.
Assume data of type:
X Gr1 x_11 ... x_1n
X Gr2 x_21 ... x_2n
..
X GrK x_k1 ... x_kn
Y Gr1 y_11 ... y_1n
Y Gr3 y_31 ... y_3n
..
Y Gr(K-1) ...
Here I have entries (X,Y...) that may or may not exist in up to K groups, with a series of values for each group. What I want to do is pretty simple (in theory), I would like to consolidate the rows that belong to the same "entity" in different groups. so instead of multiple lines that start with X, I want to have one row with all values from x_11 to x_kn in columns.
What makes things complicated however is that not all entities exist in all groups. So wherever there's "missing data" I would like to pad with for instance zeroes, or some string that denotes a missing value. So if I have (X,Y,Z) in up to 3 groups, the type I table I want to have is as follows:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 N/A N/A y_31 y_32
Z N/A N/A z_21 z_22 N/A N/A
I have been stuck trying to figure this out, is there a smart way to use List functions to solve this?
I wrote this simple loop:
for {
(id, hitlist) <- hits.groupBy(_.acc)
h <- hitlist
} println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
to able to generate the tables that look like the example above. Note that, my original data is of a different format and layout,but that has little to do with the problem at hand, thus I have skipped all steps regarding parsing. I should be able to use groupBy in a better way that actually solves this for me, but I can't seem to get there.
Then I modified my loop mapping the hits to ratios and appending them to one another:
for ((id, hitlist) <- hits.groupBy(_.acc)){
val l = hitlist.map(_.ratios).foldRight(List[Double]()){
(l1: List[Double], l2: List[Double]) => l1 ::: l2
}
println(id + "\t" + l.mkString("\t"))
//println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
}
That gets me one step closer but still no cigar! Instead of a fully padded "matrix" I get a jagged table. Taking the example above:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 y_31 y_32
Z z_21 z_22
Any ideas as to how I can pad the table so that values from respective groups are aligned with one another? I should be able to use _.sampleId, which holds the "group membersip" for each "hit", but I am not sure how exactly. ´hits´ is a List of type Hit which is practically a wrapper for each row, giving convenience methods for getting individual values, so essentially a tuple which have "named indices" (such as .acc, .sampleId..)
(I would like to solve this problem without hardcoding the number of groups, as it might change from case to case)
Thanks!
This is a bit of a contrived example, but I think you can see where this is going:
case class Hit(acc:String, subAcc:String, value:Int)
val hits = List(Hit("X", "x_11", 1), Hit("X", "x_21", 2), Hit("X", "x_31", 3))
val kMax = 4
val nMax = 2
for {
(id, hitlist) <- hits.groupBy(_.acc)
k <- 1 to kMax
n <- 1 to nMax
} yield {
val subId = "x_%s%s".format(k, n)
val row = hitlist.find(h => h.subAcc == subId).getOrElse(Hit(id, subId, 0))
println(row)
}
//Prints
Hit(X,x_11,1)
Hit(X,x_12,0)
Hit(X,x_21,2)
Hit(X,x_22,0)
Hit(X,x_31,3)
Hit(X,x_32,0)
Hit(X,x_41,0)
Hit(X,x_42,0)
If you provide more information on your hits lists then we could probably come with something a little more accurate.
I have managed to solve this problem with the following code, I am putting it here as an answer in case someone else runs into a similar problem and requires some help. The use of find() from Noah's answer was definitely very useful, so do give him a +1 in case this code snippet helps you out.
val samples = hits.groupBy(_.sampleId).keys.toList.sorted
for ((id, hitlist) <- hits.groupBy(_.acc)) {
val ratios =
for (sample <- samples)
yield hitlist.find(h => h.sampleId == sample).map(_.ratios)
.getOrElse(List(Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN))
println(id + "\t" + ratios.flatten.mkString("\t"))
}
I figure it's not a very elegant or efficient solution, as I have two calls to groupBy and I would be interested to see better solutions to this problem.