I am writing an application in Spark/Scala in which I need to calculate the exponential moving average of a column.
EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6)
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column. Via MySQL this would be possible by using MODEL or by creating an EMA column which you can then update row by row, but I've tried this and neither works with the Spark SQL or Hive context... Is there any way I can access this EMA_t-1?
My data looks like this:
timestamp price
15:31 132.3
15:32 132.48
15:33 132.76
15:34 132.66
15:35 132.71
15:36 132.52
15:37 132.63
15:38 132.575
15:39 132.57
So I would need to add a new column where the first value is just the price of the first row, and then for every following row I would use the previous value: EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6).
My EMA column would have to be:
EMA
132.3
132.372
132.5272
132.58032
132.632192
132.5873152
132.6043891
132.5926335
132.5835801
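For example, the second value follows directly from the formula: 132.48 * 0.4 + 132.3 * 0.6 = 52.992 + 79.38 = 132.372, and each later value is built from the one before it in the same way.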
I am currently trying to do it using Spark SQL and Hive, but if it is possible to do it in another way, that would be just as welcome! I was also wondering how I could do this with Spark Streaming. My data is in a DataFrame and I am using Spark 1.4.1.
Thanks a lot for any help provided!
To answer your question:
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column
I think you need two things: a Window and the lag function. (I also convert null values to zero for convenience when calculating the EMA.)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val my_window = Window.orderBy("timestamp")
df.withColumn("price_lag_1", when(lag(col("price"), 1).over(my_window).isNull, lit(0)).otherwise(lag(col("price"), 1).over(my_window)))
I am also new to Spark/Scala, and am trying to see if I can define a UDF to do the exponential average. For now, an obvious workaround is to manually add up the lagged columns (0.4 * lag0 + 0.4 * 0.6 * lag1 + 0.4 * 0.6^2 * lag2 ...), something like this:
df.withColumn("ema_price",
price * lit(0.4) * Math.pow(0.6,0) +
lag(col("price"),1).over(my_window) * 0.4 * Math.pow(0.6,1) +
lag(col("price"),2).over(my_window) * 0.4 * Math.pow(0.6,2) + .... )
I left out the when/otherwise null handling here to keep it readable. This method works for me for now.
----Update----
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lit, when}

def emaFunc(y: Column, group: Column, order: Column, beta: Double, lookBack: Int): Column = {
  val ema_window = Window.partitionBy(group).orderBy(order)
  var i = 1
  var result = y
  while (i < lookBack) {
    // Treat missing lags as zero, and subtract the matching share of y only when the lag exists.
    val lagged = when(lag(y, i).over(ema_window).isNull, lit(0)).otherwise(lag(y, i).over(ema_window))
    result = result + (lagged * beta * Math.pow(1 - beta, i)
      - when(lag(y, i).over(ema_window).isNull, lit(0)).otherwise(y * beta * Math.pow(1 - beta, i)))
    i = i + 1
  }
  result
}
By using this function you should be able to get the EMA of price like:
df.withColumn("one",lit(1))
.withColumn("ema_price", emaFunc('price,'one,'timestamp,0.1,10)
This looks back 10 rows and calculates an approximate EMA with beta = 0.1. The column "one" is just a placeholder, since you don't have a grouping column.
You should be able to do this with Spark Window Functions, which were introduced in 1.4: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
w = Window.partitionBy().orderBy(col("timestamp"))
df.select("*", lag("price").over(w).alias("ema"))
This gives you the previous row's price so you can do your calculation on it.
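To make that concrete, here is a rough Scala sketch, assuming the question's df with "timestamp" and "price" columns. Note that this blends the current price with the previous price, not the previous EMA, so it is only a one-step approximation of the recursive formula rather than the exact EMA:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag}

val w = Window.orderBy("timestamp")

// Fall back to the current price on the first row, where lag() is null.
val withPrev = df.withColumn("prev_price", coalesce(lag(col("price"), 1).over(w), col("price")))

// One-step blend using the question's weights; not the true recursive EMA.
val approx = withPrev.withColumn("ema_step", col("price") * 0.4 + col("prev_price") * 0.6)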
Related
I have a Pyspark dataframe with below values -
[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]
I want only the values from the DF whose score falls between the input score value and the input score value + 1. Say the input score value is 36; then I want the output DF with only two ids - EDFG456 & LMNO1011 - as their scores fall between 36 & 37. I achieved this as follows -
input_score_value = 36
input_df = my_df.withColumn("score_num", substring(my_df.score, 1,2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))
The above code gives the output below, but it takes too long to process 2 million rows. I was wondering if there is a better way to do this to reduce the response time.
[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011',score='36.82352941176471')]
You can use the function floor.
from pyspark.sql.functions import floor
output_matched = my_df.filter(floor(my_df.score) == input_score_value)
print(output_matched.take(5))
It should be much faster than using substring. Let me know.
I have a dataframe of timeseries pricing data, with an ID, Date and Price.
I need to compute the Exponential Moving Average for the Price Column, and add it as a new column to the dataframe.
I have used Spark's window functions before, and this looked like a fit for the use case, but given the formula for the EMA:
EMA: {Price - EMA(previous day)} x multiplier + EMA(previous day)
where
multiplier = (2 / (Time periods + 1)) //let's assume Time period is 10 days for now
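(For a 10-day period that gives a multiplier of 2 / (10 + 1) ≈ 0.18.)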
I got a bit confused as to how I can access the previously computed value in the column while actually windowing over that column.
With a simple moving average it's straightforward, since all you need to do is compute a new column by averaging the elements in the window:
val window = Window.partitionBy("ID").orderBy("Date").rowsBetween(-windowSize, Window.currentRow)
dataFrame.withColumn("SMA", avg(col("Price")).over(window))
But it seems that with the EMA it's a bit more complicated, since at every step I need the previously computed value.
I have also looked at Weighted moving average in Pyspark, but I need an approach for Spark/Scala, and for a 10- or 30-day EMA.
Any ideas?
In the end, I analysed how the exponential moving average is implemented in pandas dataframes. Besides the recursive formula which I described above, and which is difficult to implement in any SQL or window function (because it is recursive), there is another formulation, which is detailed on their issue tracker:
y[t] = (x[t] + (1-a)*x[t-1] + (1-a)^2*x[t-2] + ... + (1-a)^n*x[t-n]) /
((1-a)^0 + (1-a)^1 + (1-a)^2 + ... + (1-a)^n).
Given this, and with additional Spark implementation help from here, I ended up with the following implementation, which is roughly equivalent to doing pandas_dataframe.ewm(span=window_size).mean().
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, lit, row_number, udf}
import scala.math.pow

// Note: assumes the source DataFrame is in scope as `dataFrame` (e.g. a field of the enclosing class).
def exponentialMovingAverage(partitionColumn: String, orderColumn: String, column: String, windowSize: Int): DataFrame = {
  val window = Window.partitionBy(partitionColumn)
  val exponentialMovingAveragePrefix = "_EMA_"

  // Weighted average over all values seen so far in the partition, with weights (1 - alpha)^k.
  val emaUDF = udf((rowNumber: Int, columnPartitionValues: Seq[Double]) => {
    val alpha = 2.0 / (windowSize + 1)
    val adjustedWeights = (0 until rowNumber + 1).foldLeft(new Array[Double](rowNumber + 1)) { (accumulator, index) =>
      accumulator(index) = pow(1 - alpha, rowNumber - index); accumulator
    }
    (adjustedWeights, columnPartitionValues.slice(0, rowNumber + 1)).zipped.map(_ * _).sum / adjustedWeights.sum
  })

  dataFrame.withColumn("row_nr", row_number().over(window.orderBy(orderColumn)) - lit(1))
    .withColumn(s"$column$exponentialMovingAveragePrefix$windowSize", emaUDF(col("row_nr"), collect_list(column).over(window)))
    .drop("row_nr")
}
(I am presuming the type of the column for which I need to compute the exponential moving average is Double.)
I hope this helps others.
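For reference, a hypothetical call could look like the following (the column names ID, Date and Price come from my schema; the function assumes the source DataFrame is reachable as dataFrame, as noted in the code above):

val withEma = exponentialMovingAverage(
  partitionColumn = "ID",
  orderColumn = "Date",
  column = "Price",
  windowSize = 10
)
// Adds a column named "Price_EMA_10" containing the 10-period EMA per ID.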
I am writing the following code in Spark, with the DataFrame API.
val cond = "col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5"
val Output1 = InputDF.where(cond)
I am passing all conditions as strings from external arguments, but this throws a parse error because cond should be of type Column.
For example:
col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5
As I want to pass multiple conditions dynamically, how can I convert a String to a Column?
Edit
Is there anything through which I can read a list of conditions externally as Columns? I have not found anything to convert a String to a Column using Scala code.
I believe you may want to do something like the following:
InputDF.where("firstValue >= 0.5 and secondValue >= 0.5 and thirdValue >= 0.5")
The error you are facing is a parse error at runtime; if it were caused by a wrong type being passed in, the code would not even have compiled.
As you can see in the official documentation (here provided for Spark 2.3.0) the where method can either take a sequence of Columns (like in your latter snippet) or a string representing a SQL predicate (as in my example).
The SQL predicate will be interpreted by Spark. However, I believe it's worth mentioning that you may be interested in composing your Columns instead of concatenating strings, as the former approach minimizes the error surface by getting rid of entire classes of possible errors (for example, parse errors).
You can achieve the same with the following code:
InputDF.where(col("firstValue") >= 0.5 and col("secondValue") >= 0.5 and col("thirdValue") >= 0.5)
or more concisely:
import spark.implicits._ // necessary for the $"" notation
InputDF.where($"firstValue" >= 0.5 and $"secondValue" >= 0.5 and $"thirdValue" >= 0.5)
Columns are easily composable, and more robustly so than raw strings. If you want a set of conditions to apply, you can easily and them together in a function that can be verified before you even run the program:
def allSatisfied(condition: Column, conditions: Column*): Column =
  conditions.foldLeft(condition)(_ and _)

InputDF.where(allSatisfied($"firstValue" >= 0.5, $"secondValue" >= 0.5, $"thirdValue" >= 0.5))
You can achieve the same with strings of course, but this would end up being less robust:
def allSatisfied(condition: String, conditions: String*): String =
  conditions.foldLeft(condition)(_ + " and " + _)

InputDF.where(allSatisfied("firstValue >= 0.5", "secondValue >= 0.5", "thirdValue >= 0.5"))
I was trying to achieve a similar thing, and for Scala the code below worked for me.
import org.apache.spark.sql.functions._

val cond = (col("firstValue") >= 0.5 &&
  col("secondValue") >= 0.5 &&
  col("thirdValue") >= 0.5)
val Output1 = InputDF.where(cond)
I read that I could use the columnSimilarities method that comes with RowMatrix to find the cosine similarity of various records (content-based). My data looks something like this:
genre,actor
horror,mohanlal shobhana pranav
comedy,mammooty suraj dulquer
romance,fahad dileep manju
comedy,prithviraj
Now, I have created a spark-ml pipeline to calculate the tf-idf of the above text features (genre, actor) and use the VectorAssembler in my pipeline to assemble both features into a single column "features". After that, I convert the resulting DataFrame using this:
val vectorRdd = finalDF.map(row => row.getAs[Vector]("features"))
to convert it into an RDD[Vector]
Then, I obtain my RowMatrix by
val matrix = new RowMatrix(vectorRdd)
I am following this guide as a reference for cosine similarity, and what I need is a method in spark-mllib to find the similarity between a particular record and all the others, like this method in sklearn, as shown in the guide:
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
But I am unable to find out how to do this. I don't understand what matrix.columnSimilarities() is comparing and what it returns. Can someone help me with what I am looking for?
Any help is appreciated! Thanks.
I calculated it myself with two small functions. Call cosineSimilarity on a crossJoin of two DataFrames (separate the first row and the remaining rows into two DataFrames).
import org.apache.spark.ml.linalg.SparseVector

def cosineSimilarity(vectorA: SparseVector, vectorB: SparseVector,
                     normASqrt: Double, normBSqrt: Double): (Double, Double) = {
  var dotProduct = 0.0
  for (i <- vectorA.indices) {
    dotProduct += vectorA(i) * vectorB(i)
  }
  val div = normASqrt * normBSqrt
  if (div == 0)
    (dotProduct, 0.0)
  else
    (dotProduct, dotProduct / div)
}
val normSqrt: (org.apache.spark.ml.linalg.SparseVector => Double) = (vector: org.apache.spark.ml.linalg.SparseVector) => {
  var norm = 0.0
  for (i <- vector.indices) {
    norm += Math.pow(vector(i), 2)
  }
  Math.sqrt(norm)
}
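A rough usage sketch of the two helpers (names such as finalDF and "features" are assumptions based on the question; this collects the vectors to the driver for simplicity, whereas the crossJoin mentioned above would be the distributed equivalent):

// Compare the first record's feature vector against every other record.
val vectors = finalDF.select("features").rdd
  .map(_.getAs[org.apache.spark.ml.linalg.Vector]("features").toSparse)
  .collect() // fine for a small catalogue, not for very large data

val first = vectors.head
val firstNorm = normSqrt(first)

val similarities = vectors.tail.map { v =>
  cosineSimilarity(first, v, firstNorm, normSqrt(v))._2
}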
I have some Cassandra data of type double that I need to round down to 1 decimal place in Spark.
The problem is how to extract it from Cassandra, convert it to a decimal, round it down to 1 decimal place, and then write it back to a table in Cassandra. My rounding code is as follows:
BigDecimal(number).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
This works great if the number going in is a decimal, but I don't know how to convert the double to a decimal before rounding. My Double needs to be divided by 1000000 prior to rounding.
For example, 510999000 would become 510.999 before being rounded down to 510.9.
EDIT: I was able to get it to do what I wanted with the following command.
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble
Not sure how good this is but it works.
Great answers, guys. Just chiming in with other ways to do the same:
1. If you are using a Spark DataFrame (x and y are DataFrames):
import org.apache.spark.sql.functions.round
val y = x.withColumn("col1", round($"col1", 3))
2. If you are working directly with the RDD of Rows:

val y = x.rdd.map(r => (r.getDouble(0) * 1000).round / 1000.toDouble)
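If you specifically need round-down (truncation) behaviour inside a DataFrame, a rough sketch with built-in functions could look like this (the column name "number" is an assumption; note that floor rounds toward negative infinity, so it only matches RoundingMode.DOWN for positive values):

import org.apache.spark.sql.functions.{col, floor}

// Divide by 1,000,000 and truncate to one decimal place, e.g. 510999000 -> 510.999 -> 510.9
val z = x.withColumn("rounded", floor(col("number") / 1000000 * 10) / 10)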
The answer I was able to work with was:
BigDecimal(((number).toDouble) / 1000000).setScale(1, BigDecimal.RoundingMode.DOWN).toDouble