Dynamic evaluation of Boolean expressions in a Spark DataFrame - scala

Suppose I have a Spark DataFrame (in Scala) like
+---+---+---------------+
|  a|  b|           expr|
+---+---+---------------+
|  0|  0|a = 1 AND b = 0|
|  0|  1|          a = 0|
|  1|  0|a = 1 AND b = 1|
|  1|  1|a = 1 AND b = 1|
|  1|  1|           null|
|  1|  1| a = 0 OR b = 1|
+---+---+---------------+
in which the string column expr contains nullable Boolean expressions that refer to the other numeric columns in the same DataFrame (a and b).
I would like to derive a column eval(expr) that evaluates the Boolean expression expr row-wise, i.e.,
+---+---+---------------+----------+
|  a|  b|           expr|eval(expr)|
+---+---+---------------+----------+
|  0|  0|a = 1 AND b = 0|     false|
|  0|  1|          a = 0|      true|
|  1|  0|a = 1 AND b = 1|     false|
|  1|  1|a = 1 AND b = 1|      true|
|  1|  1|           null|      true|
|  1|  1| a = 0 OR b = 1|      true|
+---+---+---------------+----------+
(in particular, although this is an optional specification, null evaluates to true).
Question
What's the best way to create eval(expr)?
That is, how can I create a column in a Spark DataFrame that evaluates a column of Boolean expressions that refer to other columns in the DataFrame?
I have some not-quite-satisfactory solutions below. They assume the following DataFrame in scope:
val df: DataFrame = Seq(
  (0, 0, "a = 1 AND b = 0"),
  (0, 1, "a = 0"),
  (1, 0, "a = 1 AND b = 1"),
  (1, 1, "a = 1 AND b = 1"),
  (1, 1, null),
  (1, 1, "a = 0 OR b = 1")
).toDF("a", "b", "expr")
Solution 1
Create a large global expression out of the individual expressions:
val exprs: Column = concat(
  df.columns
    .filter(_ != "expr")
    .zipWithIndex
    .flatMap {
      case (name, i) =>
        if (i == 0)
          Seq(lit(s"($name = "), col(name))
        else
          Seq(lit(s" AND $name = "), col(name))
    } :+ lit(" AND (") :+ col("expr") :+ lit("))"): _*
)
// exprs: org.apache.spark.sql.Column = concat((a = , a, AND b = , b, AND (, expr, )))
val bigExprString = df.select(exprs).na.drop.as[String].collect.mkString(" OR ")
// bigExprString: String = (a = 0 AND b = 0 AND (a = 1 AND b = 0)) OR (a = 0 AND b = 1 AND (a = 0)) OR (a = 1 AND b = 0 AND (a = 1 AND b = 1)) OR (a = 1 AND b = 1 AND (a = 1 AND b = 1)) OR (a = 1 AND b = 1 AND (a = 0 OR b = 1))
val result: DataFrame = df.withColumn("eval(expr)", expr(bigExprString))
The downside here is that the resulting string is very large. In my actual use case, it would be many tens of thousands of characters long, if not longer, and I'm not sure whether that would cause problems (very large expressions can strain Catalyst's code generation).
Solution 2
Separate the DataFrame into multiple based on the value of the expression column, operate on each individually, and recombine into one DataFrame.
val exprs: Seq[String] = df.select("expr").distinct.as[String].collect
// exprs: Seq[String] = WrappedArray(a = 1 AND b = 1, a = 1 AND b = 0, null, a = 0, a = 0 OR b = 1)
val result: DataFrame = exprs.map(e =>
  df.filter(col("expr") <=> e) // null-safe equality, so the rows with a null expr are kept
    .withColumn("eval(expr)", if (e == null) lit(true) else when(expr(e), true).otherwise(false))
).reduce(_.union(_))
result.show()
I think the downside of this approach is that it creates many intermediate DataFrames (one per distinct expression). In my actual use case, there could be hundreds or thousands of them.
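A middle ground between the two solutions is to fold the distinct expressions into a single when chain, avoiding both the giant string and the per-expression DataFrames. A sketch of that idea:
import org.apache.spark.sql.functions.{coalesce, col, expr, lit, when}

// Build one when-chain over the distinct expressions: a row whose expr
// column matches a given string is evaluated with that parsed expression.
val distinctExprs = df.select("expr").distinct.as[String].collect
val evalCol = distinctExprs.filter(_ != null).foldLeft(when(col("expr").isNull, lit(true))) {
  (acc, e) => acc.when(col("expr") === e, expr(e))
}
df.withColumn("eval(expr)", coalesce(evalCol, lit(false))).show()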

Building on this answer, scala.tools.reflect.ToolBox can be used to evaluate the expression after transforming it into a valid Scala expression:
case class Result(a: Integer, b: Integer, expr: String, result: Boolean)

df.mapPartitions(it => {
  import scala.reflect.runtime.universe
  import scala.tools.reflect.ToolBox
  val tb = universe.runtimeMirror(this.getClass.getClassLoader).mkToolBox()
  val res = it.map(r => {
    val a = r.getInt(0)
    val b = r.getInt(1)
    val expr = r.getString(2)
    val exprResult =
      if (expr == null) {
        true
      } else {
        val scalaExpr = expr.replace("=", "==").replace("AND", "&").replace("OR", "|")
        val scalaExpr2 = s"var a=${a}; var b=${b}; ${scalaExpr}"
        tb.eval(tb.parse(scalaExpr2)).asInstanceOf[Boolean]
      }
    Result(a, b, expr, exprResult)
  })
  res
}).show()
Output:
+---+---+---------------+------+
|  a|  b|           expr|result|
+---+---+---------------+------+
|  0|  0|a = 1 AND b = 0| false|
|  0|  1|          a = 0|  true|
|  1|  0|a = 1 AND b = 1| false|
|  1|  1|a = 1 AND b = 1|  true|
|  1|  1|           null|  true|
|  1|  1| a = 0 OR b = 1|  true|
+---+---+---------------+------+
I am using mapPartitions here instead of a simple udf because the initialization of the toolbox takes some time. Instead of being initialized once per row, it is now initialized only once per partition.
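For comparison, a per-row UDF version of the same idea would look roughly like this (a sketch; without caching it rebuilds the toolbox on every call, which is exactly the cost mapPartitions avoids):
import org.apache.spark.sql.functions.udf
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox

// Per-row evaluation: correct, but toolbox construction dominates the runtime.
val evalExpr = udf((a: Int, b: Int, e: String) =>
  if (e == null) true
  else {
    val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
    val scalaExpr = e.replace("=", "==").replace("AND", "&").replace("OR", "|")
    tb.eval(tb.parse(s"var a=$a; var b=$b; $scalaExpr")).asInstanceOf[Boolean]
  })
df.withColumn("eval(expr)", evalExpr($"a", $"b", $"expr")).show()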

Related

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say that it contains 4 columns and 3 rows. I want to write a function to return the columns where all the values in that column are equal to 1.
This is Scala code. I want to use some Spark transformations to transform or filter the dataframe input. The filter should be implemented in a function.
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)
val example = Seq(
  Grade(1,3,1,1),
  Grade(1,1,null,1),
  Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return a 3-row, 2-column dataframe in which every value is 1.
A slightly more readable approach using Dataset[Grade]:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.Column
import scala.collection.mutable
import spark.implicits._

val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
  val withoutNullsCount = tmp.select(column).na.drop().count()
  if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}
dfInput.select(colsToRetain.toArray: _*).show()
+---+---+
| c4| c1|
+---+---+
|  1|  1|
|  1|  1|
|  1|  1|
+---+---+
And the case class
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
  def dropWhenNotEqualsTo(n: Integer): Grade = {
    Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
  }
  def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade in which every value that does not satisfy the condition is replaced by null
+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  1|null|   1|  1|
|  1|   1|null|  1|
|  1|null|null|  1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g. for c2 this will return
+---+
| c2|
+---+
|  1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> retain the column only if it contains no nulls, i.e. every original value was equal to 1
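The same null-counting idea can be collapsed into a single aggregation job instead of one count job per column. A sketch, assuming integer columns:
import org.apache.spark.sql.functions.{col, count, when}

// Count, per column, how many rows equal 1 (nulls never match); keep the
// columns whose count equals the total row count. One pass over the data.
val total = dfInput.count()
val counts = dfInput.select(
  dfInput.columns.map(c => count(when(col(c) === 1, 1)).as(c)): _*
).first()
val keep = dfInput.columns.filter(c => counts.getAs[Long](c) == total)
dfInput.select(keep.map(col): _*).show()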
One of the options is a reduce on the rdd:
import spark.implicits._
import org.apache.spark.sql.Row

val df = Seq(("1","A","3","4"),("1","2","?","4"),("1","2","3","4")).toDF()
df.show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  1|  A|  3|  4|
|  1|  2|  ?|  4|
|  1|  2|  3|  4|
+---+---+---+---+
val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"

def rowToArray(row: Row): Array[String] = {
  val arr = new Array[String](row.length)
  for (i <- 0 to row.length - 1) {
    arr(i) = row.getString(i)
  }
  arr
}

def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
  val arr = new Array[String](a1.length)
  for (i <- 0 to a1.length - 1) {
    arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
  }
  arr
}

val diff = df.rdd
  .map(rowToArray)
  .reduce(compareArrays)
val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s => df(s._1))
df.select(cols: _*).show()
+---+
| _1|
+---+
|  1|
|  1|
|  1|
+---+
I would try to prepare the dataset for processing without nulls. With only a few columns, this simple iterative approach might work fine (don't forget to import the spark implicits first: import spark.implicits._):
val example = spark.sparkContext.parallelize(Seq(
  Grade(1,3,1,1),
  Grade(1,1,0,1),
  Grade(1,10,2,1)
)).toDS().cache()

def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
  val row = ds.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
|  1|  1|
|  1|  1|
|  1|  1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("c1", IntegerType, nullable = true),
  StructField("c2", IntegerType, nullable = true),
  StructField("c3", IntegerType, nullable = true),
  StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
  Row(1,3,1,1),
  Row(1,1,null,1),
  Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()

def allOnes(colName: String, df: DataFrame): Boolean = {
  val row = df.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}

val resultColumns = dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()

How to retain the column structure of a Spark Dataframe following a map operation on rows

I am trying to apply a function to each row of a Spark DataFrame, as in the example.
val df = sc.parallelize(
  Seq((1, 2, 0), (0, 0, 1), (0, 0, 0))).toDF("x", "y", "z")
df.show()
which yields
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  2|  0|
|  0|  0|  1|
|  0|  0|  0|
+---+---+---+
Suppose I want to do something to the values in each row, for example changing 0 to 5.
val b = df.map(row => row.toSeq.map(x => x match {
  case 0 => 5
  case x: Int => x
}))
b.show()
+---------+
|    value|
+---------+
|[1, 2, 5]|
|[5, 5, 1]|
|[5, 5, 5]|
+---------+
The function worked, but I now have one column whose entries are Lists, instead of 3 columns of Ints. I would like my named columns back.
You can define a UDF to apply this substitution. For example:
def subsDef(k: Int): Int = if(k==0) 5 else k
val subs = udf[Int, Int](subsDef)
Then you can apply the UDF to a specific column or, if you like, to every column of the DF (note that df has to be declared as a var for the reassignment below to compile):
// to a single column, for example "x"
df = df.withColumn("x", subs(col("x")))
df.show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  2|  0|
|  5|  0|  1|
|  5|  0|  0|
+---+---+---+
// to every column of the DF
df.columns.foreach(c => {
  df = df.withColumn(c, subs(col(c)))
})
df.show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  2|  5|
|  5|  5|  1|
|  5|  5|  5|
+---+---+---+
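As a side note, the loop above mutates df, so it only compiles when df is a var; a fold over the columns gives the same result without reassignment (a sketch):
// Apply the substitution UDF to every column, threading the DataFrame
// through a fold instead of mutating a var.
val result = df.columns.foldLeft(df)((acc, c) => acc.withColumn(c, subs(col(c))))
result.show()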
Rather than transforming the DataFrame row-wise, consider using the built-in Spark API functions when/otherwise, which stay within Catalyst and avoid the serialization overhead of a UDF, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, 2, 0), (0, 0, 1), (0, 0, 0)).toDF("x", "y", "z")
val vFrom = 0
val vTo = 5
val cols = df.columns // Filter for specific columns if necessary
df.select(cols.map(c =>
    when(col(c) === vFrom, vTo).otherwise(col(c)).as(c)
  ): _*
).show
// +---+---+---+
// |  x|  y|  z|
// +---+---+---+
// |  1|  2|  5|
// |  5|  5|  1|
// |  5|  5|  5|
// +---+---+---+
There are various ways to do it; here are some:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

df.map(row => {
  val size = row.size
  var seq: Seq[Int] = Seq.empty[Int]
  for (a <- 0 to size - 1) {
    val value: Int = row(a).asInstanceOf[Int]
    val newVal: Int = value match {
      case 0 => 5
      case _ => value
    }
    seq = seq :+ newVal
  }
  Row.fromSeq(seq)
})(RowEncoder.apply(df.schema))
val columns = df.columns
df.select(
    columns.map(c => when(col(c) === 0, 5).otherwise(col(c)).as(c)): _*)
  .show()
def fun: (Int => Int) = { x =>
  if (x == 0) 5 else x
}
val function = udf(fun)
df.select(function(col("x")).as("x"),
    function(col("y")).as("y"),
    function(col("z")).as("z"))
  .show()
def checkZero(a: Int): Int = if (a == 0) 5 else a
df.map {
  case Row(a: Int, b: Int, c: Int) =>
    Row(checkZero(a), checkZero(b), checkZero(c))
} { RowEncoder.apply(df.schema) }
  .show()
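A typed alternative (a sketch, using a hypothetical case class XYZ): mapping through a case class keeps the column names without spelling out a RowEncoder:
// Map via a case class so Spark derives the encoder and schema itself,
// preserving the original column names x, y, z.
case class XYZ(x: Int, y: Int, z: Int)
df.as[XYZ].map(r => XYZ(checkZero(r.x), checkZero(r.y), checkZero(r.z))).show()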

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
for example : abc->1,def->2,xyz->3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hash code will be the same whenever the input value and datatype are the same (note, though, that distinct strings can collide, so this gives unique keys in practice rather than by guarantee). Try this.
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
|    96354|  1509487|
|    99333|  1694000|
|    96354|  1663279|
|   119193|  1509487|
|    96354|  1694000|
+---------+---------+
scala>
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames on the index keys:
val rdd = sc.parallelize(Seq(
  ("abc", "123a"),
  ("def", "783b"),
  ("abc", "674b"),
  ("xyz", "123a"),
  ("abc", "783b")
))

val df1 = rdd.map(_._1).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col1", "c1key")

val df2 = rdd.map(_._2).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col2", "c2key")

val dfJoined = rdd.toDF("col1", "col2").
  join(df1, Seq("col1")).
  join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc|    2|    1|
// |783b| def|    3|    1|
// |123a| xyz|    1|    2|
// |123a| abc|    2|    2|
// |674b| abc|    2|    3|
// +----+----+-----+-----+
dfJoined.
  select($"c1key".as("col1"), $"c2key".as("col2")).
  show
// +----+----+
// |col1|col2|
// +----+----+
// |   2|   1|
// |   3|   1|
// |   1|   2|
// |   2|   2|
// |   2|   3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.
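A DataFrame-only alternative (a sketch) is dense_rank over each column, which also assigns consecutive keys starting from 1; the caveat is that an unpartitioned window pulls all rows onto a single partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

// dense_rank gives each distinct value a consecutive key (1, 2, 3, ...)
// in alphabetical order; beware the single-partition window on large data.
val keyed = rdd.toDF("col1", "col2")
  .withColumn("col1", dense_rank().over(Window.orderBy("col1")))
  .withColumn("col2", dense_rank().over(Window.orderBy("col2")))
keyed.show()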

How to do a Spark dataframe(1 million rows) cartesian product with a list(1000 entries) efficiently to generate a new dataframe with 1 billion rows

I want to take each row of a dataframe which has 1 million rows and generate 1000 rows from each of them by taking a cross product with a list of 1000 entries, thereby generating a dataframe with 1 billion rows. What is the best approach to do this efficiently?
I have tried broadcasting the list and then using it while mapping over each row of the dataframe, but this seems to take too much time.
val mappedrdd = validationDataFrames.map(x => {
  val cutoffList: List[String] = cutoffListBroadcast.value
  val arrayTruthTableVal = arrayTruthTableBroadcast.value
  var listBufferRow: ListBuffer[Row] = new ListBuffer()
  for (cutOff <- cutoffList) {
    val conversion = x.get(0).asInstanceOf[Int]
    val probability = x.get(1).asInstanceOf[Double]
    var columnName: StringBuffer = new StringBuffer
    columnName = columnName.append(conversion)
    if (probability > cutOff.toDouble) {
      columnName = columnName.append("_").append("1")
    } else {
      columnName = columnName.append("_").append("0")
    }
    val index: Int = arrayTruthTableVal.indexOf(columnName.toString)
    var listBuffer: ListBuffer[String] = new ListBuffer()
    listBuffer :+= cutOff
    for (i <- 1 to 4) {
      if ((index + 1) == i) listBuffer :+= "1" else listBuffer :+= "0"
    }
    val row = Row.fromSeq(listBuffer)
    listBufferRow = listBufferRow :+ row
  }
  listBufferRow
})
Depending on your spark version you can do:
Spark 2.1.0
Add the list as a column and explode. A simplified example:
val df = spark.range(5)
val exploded = df.withColumn("a", lit(List(1,2,3).toArray)).withColumn("a", explode($"a"))
exploded.show()
+---+---+
| id|  a|
+---+---+
|  0|  1|
|  0|  2|
|  0|  3|
|  1|  1|
|  1|  2|
|  1|  3|
|  2|  1|
|  2|  2|
|  2|  3|
|  3|  1|
|  3|  2|
|  3|  3|
|  4|  1|
|  4|  2|
|  4|  3|
+---+---+
For timing you can do:
def time[R](block: => R): Long = {
  val t0 = System.currentTimeMillis()
  block // call-by-name
  val t1 = System.currentTimeMillis()
  t1 - t0
}
time(spark.range(1000000).withColumn("a",lit((0 until 1000).toArray)).withColumn("a", explode($"a")).count())
This took 5.41 seconds on a 16-core machine with plenty of memory, configured with a default parallelism of 60.
< Spark 2.1.0
You can define a simple UDF.
val xx = (0 until 1000).toArray.toSeq // replace with your list, but turn it into a Seq
val ff = udf(() => { xx })
time(spark.range(1000000).withColumn("a", ff()).withColumn("a", explode($"a")).count())
This took 8.25 seconds on the same server as above.
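Another option worth measuring (a sketch, Spark 2.1+): put the list in a one-column DataFrame and crossJoin it with a broadcast hint, so that Spark replicates only the small side:
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Cross join 1M rows with a broadcast 1000-row DataFrame -> 1B rows.
val listDF = (0 until 1000).toDF("a")
time(spark.range(1000000).crossJoin(broadcast(listDF)).count())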

Using stat.bloomFilter in Spark 2.0.0 to filter another dataframe

I have two large dataframes: [a] one which has all events identified by an id, and [b] a list of ids. I want to filter [a] based on the ids in [b] using the stat.bloomFilter implementation in Spark 2.0.0.
However, I don't see any operations in the Dataset API to join the bloom filter to dataframe [a].
val in1 = spark.sparkContext.parallelize(List(0, 1, 2, 3, 4, 5))
val df1 = in1.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val in2 = spark.sparkContext.parallelize(List(0, 1, 2))
val df2 = in2.map(x => (x)).toDF("c1")
val expectedNumItems: Long = 1000
val fpp: Double = 0.005
val sbf = df1.stat.bloomFilter($"c1", expectedNumItems, fpp)
val sbf2 = df2.stat.bloomFilter($"c1", expectedNumItems, fpp)
What is the best way to filter 'df1' based on values in df2?
Thanks!
You can use a UDF:
def might_contain(f: org.apache.spark.util.sketch.BloomFilter) = udf((x: Integer) =>
  if (x != null) f.mightContain(x) else false)

df1.where(might_contain(sbf2)($"c1"))
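Since a Bloom filter admits false positives, the where above is only a pre-filter; if an exact result is required, follow it with the real join (a sketch):
// The Bloom filter cheaply discards most non-matching rows; the join
// then removes any false positives that slipped through.
val exact = df1.where(might_contain(sbf2)($"c1")).join(df2, "c1")
exact.show()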
I think I found the correct way to do this, but would still like pointers to see if there are better ways to manage this.
Here's my solution -
val in1 = spark.sparkContext.parallelize(List(0, 1, 2, 3, 4, 5))
val d1 = in1.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val in2 = spark.sparkContext.parallelize(List(0, 1, 2))
val d2 = in2.map(x => (x)).toDF("c1")
val s2 = d2.stat.bloomFilter($"c1", expectedNumItems, fpp)
val a = spark.sparkContext.broadcast(s2)
val x = d1.rdd.filter(x => a.value.mightContain(x(0)))
case class newType(c1: Int, c2: Int, c3: Int) extends Serializable
val xDF = x.map(y => newType(y(0).toString.toInt, y(1).toString.toInt, y(2).toString.toInt)).toDF()
scala> d1.show(10)
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  0|  1|  2|
|  1|  2|  3|
|  2|  3|  4|
|  3|  4|  5|
|  4|  5|  6|
|  5|  6|  7|
+---+---+---+
scala> d2.show(10)
+---+
| c1|
+---+
|  0|
|  1|
|  2|
+---+
scala> xDF.show(10)
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  0|  1|  2|
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+
I built an implicit class that wraps https://stackoverflow.com/a/41989703/6723616
Comments welcome!
/**
 * Copyright 2017 Yahoo, Inc.
 * Zlib license: https://www.zlib.net/zlib_license.html
 */
package me.klotz.spark.utils

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.util.sketch.BloomFilter
import org.apache.spark.SparkContext

object BloomFilterEnhancedJoin {
  // not parameterized for field type; assumes string
  /**
   * Like bigDF.join(smallDF, ...), but accelerated with a Bloom filter.
   * You pass in a size estimate of the bigDF, and a ratio of acceptable false positives out of the expected result set size.
   * ratio=1 is a good start; that will result in about 50% false positives in the big-small join, so the filter accepts
   * about as many as it passes, rather than rejecting almost all. Pass in a size estimate of the big dataframe
   * to avoid enumerating it. The small DataFrame gets enumerated anyway.
   *
   * Example use:
   * <code>
   * import me.klotz.spark.utils.BloomFilterEnhancedJoin._
   * val (dups_joined, bloomFilterBroadcast) = df_big.joinBloom(1024L*1024L*1024L, dups, 10.0, "id")
   * dups_joined.write.format("orc").save("dups")
   * bloomFilterBroadcast.unpersist
   * </code>
   */
  implicit class BloomFilterEnhancedJoiner(bigdf: Dataset[Row]) {
    /**
     * You should call bloomFilterBroadcast.unpersist afterwards.
     */
    def joinBloom(bigDFCountEstimate: Long, smallDF: Dataset[Row], ratio: Double, field: String) = {
      val sc = smallDF.sparkSession.sparkContext
      val smallDFCount = smallDF.count
      val fpr = smallDFCount.toDouble / bigDFCountEstimate.toDouble / ratio
      println(s"fpr=${fpr} = smallDFCount=${smallDFCount} / bigDFCountEstimate=${bigDFCountEstimate} / ratio=${ratio}")
      val bloomFilterBroadcast = sc.broadcast(smallDF.stat.bloomFilter(field, smallDFCount, fpr))
      val mightContain = udf((x: String) => if (x != null) bloomFilterBroadcast.value.mightContainString(x) else false)
      (bigdf.filter(mightContain(col(field))).join(smallDF, field), bloomFilterBroadcast)
    }
  }
}