Pass multiple conditions as a string in a where clause in Spark - Scala

I am writing the following code in Spark, with the DataFrame API.
val cond = "col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5"
val Output1 = InputDF.where(cond)
I am passing all the conditions as strings from external arguments, but it throws a parse error, as cond should be of type Column.
For example:
col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5
As I want to pass multiple conditions dynamically, how can I convert a String to a Column?
Edit
Is there any way to read a list of conditions externally as Columns? I have not found anything that converts a String to a Column in Scala.

I believe you may want to do something like the following:
InputDF.where("firstValue >= 0.5 and secondValue >= 0.5 and thirdValue >= 0.5")
The error you are facing is a parse error at runtime; if it were caused by a wrong type being passed in, the code would not even have compiled.
As you can see in the official documentation (here provided for Spark 2.3.0), the where method can take either a Column (as in your latter snippet) or a string representing a SQL predicate (as in my example).
The SQL predicate will be interpreted by Spark. However, I believe it's worth mentioning that you may be interested in composing your Columns instead of concatenating strings, as the former approach minimizes the error surface by getting rid of entire classes of possible errors (parse errors, for example).
You can achieve the same with the following code:
InputDF.where(col("firstValue") >= 0.5 and col("secondValue") >= 0.5 and col("thirdValue") >= 0.5)
or more concisely:
import spark.implicits._ // necessary for the $"" notation
InputDF.where($"firstValue" >= 0.5 and $"secondValue" >= 0.5 and $"thirdValue" >= 0.5)
Columns are easily composable, and more robustly so than raw strings. If you want to apply a set of conditions, you can easily and them together in a function that can be verified before you even run the program:
def allSatisfied(condition: Column, conditions: Column*): Column =
  conditions.foldLeft(condition)(_ and _)

InputDF.where(allSatisfied($"firstValue" >= 0.5, $"secondValue" >= 0.5, $"thirdValue" >= 0.5))
You can achieve the same with strings of course, but this would end up being less robust:
def allSatisfied(condition: String, conditions: String*): String =
  conditions.foldLeft(condition)(_ + " and " + _)

InputDF.where(allSatisfied("firstValue >= 0.5", "secondValue >= 0.5", "thirdValue >= 0.5"))

I was trying to achieve a similar thing, and for Scala the code below worked for me.
import org.apache.spark.sql.functions.{col, _}

val cond = (col("firstValue") >= 0.5 &&
  col("secondValue") >= 0.5 &&
  col("thirdValue") >= 0.5)

val Output1 = InputDF.where(cond)

Related

What is the difference between generating Range and NumericRange in Scala

I am new to Scala, and I tried to generate some Range objects.
val a = 0 to 10
// val a: scala.collection.immutable.Range.Inclusive = Range 0 to 10
This statement works perfectly fine and generates a range from 0 to 10, and the to keyword works without any imports.
But when I try to generate a NumericRange with floating-point numbers, I have to import a conversion from the BigDecimal object, as follows, to be able to use the to keyword.
import scala.math.BigDecimal.double2bigDecimal
val f = 0.1 to 10.1 by 0.5
// val f: scala.collection.immutable.NumericRange.Inclusive[scala.math.BigDecimal] = NumericRange 0.1 to 10.1 by 0.5
Can someone explain the reason for this and the mechanism behind range generation?
Thank you.
The import you are adding provides an "automatic conversion" from Double to BigDecimal, as its name suggests.
It's necessary because NumericRange only works with types T for which an Integral[T] exists; unfortunately, none exists for Double, but one does exist for BigDecimal.
Bringing that automatic conversion into scope converts the Doubles to BigDecimals, so the NumericRange can be defined.
You could achieve the same range without the import by declaring the numbers directly as BigDecimals:
BigDecimal("0.1") to BigDecimal("10.1") by BigDecimal("0.5")

Make absolute work inside filtering in Scala

I want to return a percentage of results from a dataset. Being a noob in Scala, I tried the following:
ds.filter(abs(hash(col("source"))) % 100 < percentage)
but I am getting "abs cannot be applied to (org.apache.spark.sql.Column)". I don't want to sample it; I want to select rows based on the hash of a column so that the result is deterministic even when the dataset changes.
This works just fine:
ds.filter(abs(hash(col("source"))) % 100 < percentage)
Probably you have multiple abs functions in your namespace (e.g. from imports like import math._). To be sure, use:
ds.filter(org.apache.spark.sql.functions.abs(hash(col("source"))) % 100 < percentage)
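Alternatively, here is a minimal sketch (assuming the clash comes from something like scala.math.abs being in scope; the sampled name is just illustrative) that renames the Spark function at import time so the short name stays unambiguous:

import org.apache.spark.sql.functions.{abs => sparkAbs, col, hash}

// Deterministic, hash-based filter on the "source" column
val sampled = ds.filter(sparkAbs(hash(col("source"))) % 100 < percentage)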
But I think this will not guarantee that you get the exact percentage, because hash values may not be equally distributed (think of a dataframe with only one unique value of source: the hash values will all be the same, so you get either all records or none). To get the exact percentage, you would need something like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit, row_number}

val newDF = df
  .withColumn("rnb", row_number().over(Window.orderBy($"source"))) // or order by the hash if you wish
  .withColumn("count", count("*").over())
  .where($"rnb" < lit(fraction) * $"count")

PySpark filtering gives inconsistent behavior

So I have a dataset on which I do some transformations, and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
The problem is a very strange behavior: if I set a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations. I've noticed weird behaviors in Spark in the past: sometimes doing very simple things like this after a join or a union gives very strange results, and eventually the only solution is to write out the data, read it back in, and do the operation in a completely separate script. I hope there is a better solution than this!

Apache Spark: Exponential Moving Average

I am writing an application in Spark/Scala in which I need to calculate the exponential moving average of a column.
EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6)
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column. Via MySQL this would be possible by using MODEL or by creating an EMA column which you then update row by row, but I've tried this and neither works with the Spark SQL or Hive Context... Is there any way I can access this EMA_t-1?
My data looks like this :
timestamp price
15:31 132.3
15:32 132.48
15:33 132.76
15:34 132.66
15:35 132.71
15:36 132.52
15:37 132.63
15:38 132.575
15:39 132.57
So I would need to add a new column in which the first value is just the price of the first row, and then I would use the previous value, EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6), to calculate the following rows in that column.
My EMA column would have to be:
EMA
132.3
132.372
132.5272
132.58032
132.632192
132.5873152
132.6043891
132.5926335
132.5835801
I am currently trying to do it using Spark SQL and Hive but if it is possible to do it in another way, this would be just as welcome! I was also wondering how I could do this with Spark Streaming. My data is in a dataframe and I am using Spark 1.4.1.
Thanks a lot for any help provided!
To answer your question:
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column
I think you need two things: a Window and the lag function. (I also set null values to zero for convenience when calculating the EMA.)
val my_window = Window.orderBy("timestamp")

df.withColumn("price_lag_1",
  when(lag(col("price"), 1).over(my_window).isNull, lit(0)).otherwise(lag(col("price"), 1).over(my_window)))
I am also new to Spark Scala, and I am trying to see if I can define a UDF to do the exponential average. But for now an obvious workaround would be to manually add up all the lag columns (0.4 * lag0 + 0.4 * 0.6 * lag1 + 0.4 * 0.6^2 * lag2 ...), something like this:
df.withColumn("ema_price",
price * lit(0.4) * Math.pow(0.6,0) +
lag(col("price"),1).over(my_window) * 0.4 * Math.pow(0.6,1) +
lag(col("price"),2).over(my_window) * 0.4 * Math.pow(0.6,2) + .... )
I left out the when/otherwise here to make it clearer, and this method works for me for now.
----Update----
def emaFunc(y: org.apache.spark.sql.Column, group: org.apache.spark.sql.Column,
            order: org.apache.spark.sql.Column, beta: Double, lookBack: Int): org.apache.spark.sql.Column = {
  val ema_window = Window.partitionBy(group).orderBy(order)
  var i = 1
  var result = y
  while (i < lookBack) {
    result = result + lit(1) * (
      when(lag(y, i).over(ema_window).isNull, lit(0)).otherwise(lag(y, i).over(ema_window)) * beta * Math.pow(1 - beta, i) -
        when(lag(y, i).over(ema_window).isNull, lit(0)).otherwise(y * beta * Math.pow(1 - beta, i)))
    i = i + 1
  }
  result
}
By using this function you should be able to get the EMA of price like:
df.withColumn("one",lit(1))
.withColumn("ema_price", emaFunc('price,'one,'timestamp,0.1,10)
This will look back 10 rows and calculate an estimated EMA with beta = 0.1. The column "one" is just a placeholder, since you don't have a grouping column.
You should be able to do this with Spark Window Functions, which were introduced in 1.4: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
w = Window.partitionBy().orderBy(col("timestamp"))
df.select("*", lag("price").over(w).alias("ema"))
This will select the previous price for you, so you can do your calculations on it.
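For reference, since the recurrence EMA_t = price_t * 0.4 + EMA_(t-1) * 0.6 is inherently sequential, one way to sanity-check the expected EMA column is to compute it exactly on the driver with scanLeft. This is only a minimal sketch over the sample prices from the question, not a distributed solution:

// Exact EMA via the recurrence, assuming the prices fit in memory
// and are already ordered by timestamp.
val prices = Seq(132.3, 132.48, 132.76, 132.66, 132.71, 132.52, 132.63, 132.575, 132.57)

val ema = prices.tail.scanLeft(prices.head)((prev, price) => price * 0.4 + prev * 0.6)
// ema(0) = 132.3, ema(1) = 132.48 * 0.4 + 132.3 * 0.6 = 132.372, and so on,
// matching the EMA column listed in the question (up to floating-point rounding).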

Pythonic way to write a for loop that doesn't use the loop index [duplicate]

This is to do with the following code, which uses a for loop to generate a series of random offsets for use elsewhere in the program.
The index of this for loop is unused, and this is resulting in the 'offending' code being highlighted as a warning by Eclipse / PyDev
def RandomSample(count):
    pattern = []
    for i in range(count):
        pattern.append((random() - 0.5, random() - 0.5))
    return pattern
So I either need a better way to write this loop that doesn't need a loop index, or a way to tell PyDev to ignore this particular instance of an unused variable.
Does anyone have any suggestions?
Just for reference, for ignoring variables in PyDev:
By default PyDev will ignore the following variables:
['_', 'empty', 'unused', 'dummy']
You can add more by passing suppression parameters:
-E, --unusednames ignore unused locals/arguments if name is one of these values
Ref:
http://eclipse-pydev.sourcearchive.com/documentation/1.0.3/PyCheckerLauncher_8java-source.html
How about itertools.repeat:
import itertools
from random import random

count = 5

def make_pat():
    return (random() - 0.5, random() - 0.5)

list(x() for x in itertools.repeat(make_pat, count))
Sample output:
[(-0.056940506273799985, 0.27886450895662607),
(-0.48772848046066863, 0.24359038079935535),
(0.1523758626306998, 0.34423337290256517),
(-0.018504578280469697, 0.33002406492294756),
(0.052096928160727196, -0.49089780124549254)]
randomSample = [(random() - 0.5, random() - 0.5) for _ in range(count)]
Sample output, for count=10 and assuming that you mean the Standard Library random() function:
[(-0.07, -0.40), (0.39, 0.18), (0.13, 0.29), (-0.11, -0.15),\
(-0.49, 0.42), (-0.20, 0.21), (-0.44, 0.36), (0.22, -0.08),\
(0.21, 0.31), (0.33, 0.02)]
If you really need to make it a function, then you can abbreviate by using a lambda:
f = lambda count: [(random() - 0.5, random() - 0.5) for _ in range(count)]
This way you can call it like:
>>> f(1)
[(0.03, -0.09)]
>>> f(2)
[(-0.13, 0.38), (0.10, -0.04)]
>>> f(5)
[(-0.38, -0.14), (0.31, -0.16), (-0.34, -0.46), (-0.45, 0.28), (-0.01, -0.18)]
>>> f(10)
[(0.01, -0.24), (0.39, -0.11), (-0.06, 0.09), (0.42, -0.26), (0.24, -0.44), (-0.29, -0.30), (-0.27, 0.45), (0.10, -0.41), (0.36, -0.07), (0.00, -0.42)]
>>>
you get the idea...
Late to the party, but here's a potential idea:
def RandomSample(count):
    f = lambda: random() - 0.5
    r = range if count < 100 else xrange  # or some other number
    return [(f(), f()) for _ in r(count)]
Strictly speaking, this is more or less the same as the other answers, but it does two things that look kind of nice to me.
First, it removes that duplicate code you have from writing random() - 0.5 twice by putting that into a lambda.
Second, for a certain size range, it chooses to use xrange() instead of range() so as not to unnecessarily generate a giant list of numbers you're going to throw away. You may want to adjust the exact number, because I haven't played with it at all; I just thought it might be a potential efficiency concern.
There should be a way to suppress code analysis errors in PyDev, like this:
http://pydev.org/manual_adv_assistants.html
Also, PyDev will ignore unused variables that begin with an underscore, as shown here:
http://pydev.org/manual_adv_code_analysis.html
Try this:
pattern = []
while count > 0:
    pattern.append((random() - 0.5, random() - 0.5))
    count -= 1
import itertools, random

def RandomSample2D(npoints, get_random=lambda: random.uniform(-.5, .5)):
    return ((r(), r()) for r in itertools.repeat(get_random, npoints))

uses random.uniform() explicitly
returns an iterator instead of a list