My UDF compares whether the time difference between two columns is within a 5-day limit. If the == operator is used, the expression compiles properly, but <= (or lt) fails with a type mismatch error. Code:
val isExpiration: (Column, Column, Column) => Column = (BCED, termEnd, agrEnd) => {
  if (abs(datediff(if (termEnd == null) {agrEnd} else {termEnd}, BCED)) lt 6)
    {lit(0)}
  else
    {lit(1)}
}
Error:
notebook:3: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
if(abs(datediff(if(termEnd == null) {agrEnd} else {termEnd}, BCED)) lt 6) {lit(0)}...
^
I must be missing something obvious - can anyone advise how to test whether a Column value is smaller than or equal to a constant?
It looks like you have mixed udf and Spark functions; you need to use only one of them. When possible, it's preferable not to use a udf, since udfs cannot be optimized (and are thus generally slower). Without a udf it could be done as follows:
df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
.withColumn("expired", when(abs(datediff($"end", $"BCED")) lt 6, 0).otherwise(1))
I introduced a temporary column to make the code a bit more readable.
Using a udf, it could for example be done as follows:
import java.sql.Date
import org.apache.spark.sql.functions.udf

val isExpired = udf((a: Date, b: Date) => {
  if ((math.abs(a.getTime() - b.getTime()) / (1000 * 3600 * 24)) < 6) {
    0
  } else {
    1
  }
})
df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
.withColumn("expired", isExpired($"end", $"BCED"))
Here, I again made use of a temporary column but this logic could be moved into the udf if preferred.
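For completeness, a minimal sketch of that variant with the null handling moved into the udf (the name isExpiredWithFallback is mine, and java.sql.Date is assumed for all three columns):
val isExpiredWithFallback = udf((bced: Date, termEnd: Date, agrEnd: Date) => {
  val end = if (termEnd == null) agrEnd else termEnd  // fall back to agrEnd when termEnd is null
  if (math.abs(end.getTime() - bced.getTime()) / (1000L * 3600 * 24) < 6) 0 else 1
})

df.withColumn("expired", isExpiredWithFallback($"BCED", $"termEnd", $"agrEnd"))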
That's because abs(col).lt(6) returns an object of type Column, while if expects its condition to evaluate to true or false, which is a Scala Boolean.
Plus, a UDF doesn't work on the Column data type; it works with Scala types (Int, String, Boolean, etc.).
Since all you're doing is using Spark SQL functions, you can rewrite your UDF like this:
val isExpiration =
  when(abs(datediff(coalesce($"termEnd", $"agrEnd"), $"BCED")) <= 6, lit(0))
    .otherwise(lit(1))
And, the usage would be:
df.show
//+----------+----------+----------+
//| BCED| termEnd| agrEnd|
//+----------+----------+----------+
//|2018-06-10|2018-06-25|2018-06-25|
//|2018-06-10| null|2018-06-15|
//+----------+----------+----------+
df.withColumn("x", isExpiration).show
//+----------+----------+----------+---+
//| BCED| termEnd| agrEnd| x|
//+----------+----------+----------+---+
//|2018-06-10|2018-06-25|2018-06-25| 1|
//|2018-06-10| null|2018-06-15| 0|
//+----------+----------+----------+---+
Related
I have a big dataframe with millions of rows as follows:
A B C Eqn
12 3 4 A+B
32 8 9 B*C
56 12 2 A+B*C
How to evaluate the expressions in the Eqn column?
You could create a custom UDF that evaluates these arithmetic expressions:
import org.apache.spark.sql.functions.udf

def evalUDF = udf((a: Int, b: Int, c: Int, eqn: String) => {
  // Substitute the variable names with the actual values, then split on word boundaries
  val eqnParts = eqn
    .replace("A", a.toString)
    .replace("B", b.toString)
    .replace("C", c.toString)
    .split("""\b""")
    .toList
  // Fold left to right, applying each operator as it is encountered (no operator precedence)
  val (sum, _) = eqnParts.tail.foldLeft((eqnParts.head.toInt, "")) {
    case ((runningTotal, "+"), num) => (runningTotal + num.toInt, "")
    case ((runningTotal, "-"), num) => (runningTotal - num.toInt, "")
    case ((runningTotal, "*"), num) => (runningTotal * num.toInt, "")
    case ((runningTotal, _), op)    => (runningTotal, op)
  }
  sum
})
evalDf
.withColumn("eval", evalUDF('A, 'B, 'C, 'Eqn))
.show()
Output:
+---+---+---+-----+----+
| A| B| C| Eqn|eval|
+---+---+---+-----+----+
| 12| 3| 4| A+B| 15|
| 32| 8| 9| B*C| 72|
| 56| 12| 2|A+B*C| 136|
+---+---+---+-----+----+
As you can see this works, but it is very fragile (spaces, unknown operators, etc. will break the code) and it doesn't adhere to order of operations (otherwise the last result should have been 80).
So you could either write all of that yourself or perhaps find some library that already does it (like https://gist.github.com/daixque/1610753).
Maybe the performance overhead will be large (especially if you start using recursive parsers), but at least you can perform it on a DataFrame instead of collecting it first.
I think the only way to execute SQLs that are inside a DataFrame is to select("Eqn").collect first, followed by executing the SQLs iteratively on the source Dataset.
Since the SQLs are in a DataFrame, which is nothing but a description of a distributed computation that will be executed on Spark executors, there is no way you could submit Spark jobs while processing the SQLs on executors. It is simply too late in the execution pipeline. You have to be back on the driver to be able to submit new Spark jobs, say to execute the SQLs.
With the SQLs on the driver, you'd then take the corresponding rows per SQL and simply use withColumn to execute each SQL (against its rows).
I think it's easier to describe than to develop a working Spark application, but that's how I'd go about it.
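For illustration, a minimal sketch of that approach, assuming the DataFrame is called df (as above) and the expressions only reference existing columns, so Spark's expr can evaluate them:
import org.apache.spark.sql.functions.{col, expr}

// 1. Collect the distinct expressions back to the driver.
val expressions = df.select("Eqn").distinct.collect.map(_.getString(0))

// 2. For each expression, take its rows and let Spark evaluate it via expr,
//    then 3. union the partial results back together (unionAll on Spark 1.x).
val evaluated = expressions
  .map(e => df.filter(col("Eqn") === e).withColumn("eval", expr(e)))
  .reduce(_ union _)

evaluated.show()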
I am late, but in case someone is looking for
a generic math expression interpreter using variables, or
a way to evaluate complex/unknown expressions that cannot be hardcoded into the UDF (as in the accepted answer),
then you can use javax.script.ScriptEngineManager:
import javax.script.SimpleBindings
import javax.script.ScriptEngineManager
import java.util.Map
import java.util.HashMap

def calculateFunction = (mathExpression: String, A: Double, B: Double, C: Double) => {
  // Bind the variable names to their values so the script engine can resolve them
  val vars: Map[String, Object] = new HashMap[String, Object]()
  vars.put("A", A.asInstanceOf[Object])
  vars.put("B", B.asInstanceOf[Object])
  vars.put("C", C.asInstanceOf[Object])
  // Evaluate the expression with the JavaScript engine
  val engine = new ScriptEngineManager().getEngineByExtension("js")
  val result = engine.eval(mathExpression, new SimpleBindings(vars))
  result.asInstanceOf[Double]
}

val calculateUDF = spark.udf.register("calculateFunction", calculateFunction)
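A possible way to apply it to the earlier example DataFrame (the name df and the columns A, B, C, Eqn are assumed from the question; the casts are needed because the example columns are integers):
import org.apache.spark.sql.functions.col

df.withColumn("eval",
    calculateUDF(col("Eqn"), col("A").cast("double"), col("B").cast("double"), col("C").cast("double")))
  .show()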
NOTE: This handles generic expressions and is robust, but it performs much worse than the accepted answer and is heavier on memory.
I have an array as a broadcast variable and it contains Integers:
broadcast_array.value
Array(72159153, 72159163, 72159202, 72159203, 72159238, 72159398, 72159447, 72159448, 72159455, 72159492...
I have a column in a dataset (call it col_id) which contains IntegerType values that might be in broadcast_array, but might not be.
I am trying to create a new column (call it new_col) that checks whether each row's col_id value is in broadcast_array. If so, the new column's value should be Available; otherwise it can be null.
So I have something like:
val my_new_df = df.withColumn("new_col", when(broadcast_array.value.contains($"col_id"), "Available"))
But I keep getting this error:
Name: Unknown Error
Message: <console>:45: error: type mismatch;
found : Boolean
required: org.apache.spark.sql.Column
val my_new_df = df.withColumn("new_col", when(broadcast_array.value.contains($"col_id"), "Available"))
^
StackTrace:
What is most confusing to me is that I thought the when statement requires a conditional that outputs some Boolean, but here it's saying it requires a Column.
How should I go about adding a value to a new column based on whether the value in an existing column can be found in a predefined Array or not?
If you look at the API of the when function
def when(condition : org.apache.spark.sql.Column, value : scala.Any) : org.apache.spark.sql.Column
it is clear that the condition required is a Column and not a Boolean.
So you can do a complicated lit combination to convert your Boolean into a Column, as
import org.apache.spark.sql.functions._
df.withColumn("new_col", when(lit(broadcast_array.value.mkString(",")).contains($"col_id"), lit("Available"))).show(false)
OR
You can achieve what you are trying by writing a simple udf function as
import org.apache.spark.sql.functions._
val broadcastContains = udf((id: Int) => broadcast_array.value.contains(id))
and just call the function as
df.withColumn("new_col", when(broadcastContains($"col_id"), lit("Available"))).show(false)
I added a broadcastArrayContains function to spark-daria that makes Ramesh's solution more reusable / accessible.
def broadcastArrayContains[T](col: Column, broadcastedArray: Broadcast[Array[T]]) = {
  when(col.isNull, null)
    .when(lit(broadcastedArray.value.mkString(",")).contains(col), lit(true))
    .otherwise(lit(false))
}
Suppose you have the following DataFrame (df):
+----+
| num|
+----+
| 123|
| hi|
|null|
+----+
You can identify all the values in the broadcasted array as follows:
val specialNumbers = spark.sparkContext.broadcast(Array("123", "456"))
df.withColumn(
"is_special_number",
functions.broadcastArrayContains[String](col("num"), specialNumbers)
)
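Showing the resulting DataFrame should then yield roughly the following (output reconstructed from the logic above, not from an actual run):
//+----+-----------------+
//| num|is_special_number|
//+----+-----------------+
//| 123|             true|
//|  hi|            false|
//|null|             null|
//+----+-----------------+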
//dataset
michael,swimming,silve,2016,USA
usha,running,silver,2014,IND
lisa,javellin,gold,2014,USA
michael,swimming,silver,2017,USA
Questions --
1) How many silver medals have been won by the USA in each sport -- and the code throws the error value === is not a member of String
val rdd = sc.textFile("/home/training/mydata/sports.txt")
val text = rdd.map(lines => lines.split(","))
  .map(arrays => (arrays(0), arrays(1), arrays(2), arrays(3), arrays(4)))
  .toDF("first_name", "sports", "medal_type", "year", "country")
text.filter(text("medal_type")==="silver" && ("country")==="USA" groupBy("year").count().show
2) What is the difference between === and ==
When I use filter and select with === with just one condition (no && or ||), they show me the string result and the boolean result respectively, but when I use select and filter with ==, it throws errors.
Using this:
text.filter(text("medal_type")==="silver" && text("country")==="USA").groupBy("year").count().show
+----+-----+
|year|count|
+----+-----+
|2017| 1|
+----+-----+
I will just answer your first question. (Note that there is a typo in "silver" in the first line of the dataset.)
About the second question:
== and === are just functions in Scala.
In Spark, === uses the equalTo method, which is the equality test:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#equalTo-java.lang.Object-
// Scala:
df.filter( df("colA") === df("colB") )
// Java
import static org.apache.spark.sql.functions.*;
df.filter( col("colA").equalTo(col("colB")) );
and == uses the equals method, which just tests whether two references are the same object:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#equals-java.lang.Object-
Notice the return types of each function: == (equals) returns a Boolean, while === (equalTo) returns a Column of the results.
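A small illustration of the two return types (df and its colA/colB columns are assumed):
val asColumn: org.apache.spark.sql.Column = df("colA") === df("colB") // a Column expression, usable in filter/where/when
val asBoolean: Boolean = df("colA") == df("colB")                     // a plain Scala Boolean, evaluated on the Column objects themselves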
I am trying to compare the counts of 2 different queries/tables. Is it possible to perform this operation in Scala (Spark SQL)?
Here is my code:
val parquetFile1 = sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2016/2016_000_njars_09665_mbr_addr.20161222031015221601.parquet")
val parquetFile2 =sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2017/part-r-00000-70ce4958-57fe-487f-a45b-d73b7ef20289.snappy.parquet")
parquetFile1.registerTempTable("parquetFile1")
parquetFile2.registerTempTable("parquetFile2")
scala> var first_table_count=sqlContext.sql("select count(*) from parquetFile1")
first_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> var second_table_count=sqlContext.sql("select count(*) from parquetFile1 where LINE1_ADDR is NULL and LINE2_ADDR is NULL")
second_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> first_table_count.show()
+------+
| _c0|
+------+
|119928|
+------+
scala> second_table_count.show()
+---+
|_c0|
+---+
|617|
+---+
I am trying to get the difference between these two counts but I am getting an error.
scala> first_table_count - second_table_count
<console>:30: error: value - is not a member of org.apache.spark.sql.DataFrame
first_table_count - second_table_count
whereas if I do normal subtraction, it works:
scala> 2 - 1
res7: Int = 1
It seems I have to do some data conversion, but I am not able to find an appropriate solution.
In newer versions of Spark, the count from a SQL query does not come back as a Long value; instead it is wrapped inside a DataFrame object, i.e. DataFrame[BigInt].
You can try this:
val difference = first_table_count.first.getLong(0) - second_table_count.first.getLong(0)
Also, a subtract method is not available on DataFrame.
You need something like the following to do the conversion:
first_table_count.first.getLong(0)
And here is why you need it:
A DataFrame represents a tabular data structure. So although your SQL seems to return a single value, it actually returns a table containing a single row, and the row contains a single column. Hence we use the above code to extract the first column (index 0) of the first row.
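As a side note, DataFrame.count() itself returns a Long, so the same difference can also be computed without going through a SQL result at all (a sketch reusing the temp tables registered above):
// count() on a DataFrame returns a Long directly, so plain arithmetic works
val difference = sqlContext.table("parquetFile1").count() -
  sqlContext.table("parquetFile1").filter("LINE1_ADDR is NULL and LINE2_ADDR is NULL").count()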
I have a dataframe df with the following columns:
ts: Timestamp
val: String
From my master df, I want to select dataframes that only match a certain ts value. I can achieve that using between like:
df.filter($"ts".between(targetDate, targetDate)) Here targetDate is the date I want to filter my df on. Is there an equivalent equal such as df.filter($"ts".equal(targetDate)) ?
As you can see in the Column documentation, you can use the === method to compare a column's values with a value of any type.
val df = sc.parallelize(
("2016-02-24T22:54:17Z", "foo") ::
("2010-08-01T00:00:12Z", "bar") ::
Nil
).toDF("ts", "val").withColumn("ts", $"ts".cast("timestamp"))
df.where($"ts" === "2010-08-01T00:00:12Z").show(10, false)
// +---------------------+---+
// |ts |val|
// +---------------------+---+
// |2010-08-01 02:00:12.0|bar|
// +---------------------+---+
If you want to be explicit about types you can replace
=== "2010-08-01T00:00:12Z"
with
=== lit("2010-08-01T00:00:12Z").cast("timestamp")
There is also the Column.equalTo method, designed for Java interoperability:
df.where($"ts".equalTo("2010-08-01T00:00:12Z")).show(10, false)
Finally, Spark supports NULL-safe equality operators (<=>, Column.eqNullSafe), but these require a Cartesian product in Spark < 1.6 (see SPARK-11111).
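A short sketch of the null-safe variant, using the <=> operator:
df.where($"ts" <=> lit("2010-08-01T00:00:12Z").cast("timestamp")).show(10, false)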