Using stat.bloomFilter in Spark 2.0.0 to filter another dataframe - scala

I have two large dataframes [a] one which has all events identified by an id [b] a list of ids. I want to filter [a] based on the ids in [b] using the stat.bloomFilter implementation in spark 2.0.0
However I don't see any operations in the dataset API to join the bloom filter to the data frame [a]
val in1 = spark.sparkContext.parallelize(List(0, 1, 2, 3, 4, 5))
val df1 = in1.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val in2 = spark.sparkContext.parallelize(List(0, 1, 2))
val df2 = in2.map(x => (x)).toDF("c1")
val expectedNumItems: Long = 1000
val fpp: Double = 0.005
val sbf = df.stat.bloomFilter($"c1", expectedNumItems, fpp)
val sbf2 = df2.stat.bloomFilter($"c1", expectedNumItems, fpp)
What is the best way to filter 'df1' based on values in df2?
Thanks!

You can use an UDF:
def might_contain(f: org.apache.spark.util.sketch.BloomFilter) = udf((x: Int) =>
if(x != null) f.mightContain(x) else false)
df1.where(might_contain(sbf2)($"C1"))

I think I found the correct way to do this, but would still like pointers to see if there are better ways to manage this.
Here's my solution -
val in1 = spark.sparkContext.parallelize(List(0, 1, 2, 3, 4, 5))
val d1 = in1.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val in2 = spark.sparkContext.parallelize(List(0, 1, 2))
val d2 = in2.map(x => (x)).toDF("c1")
val s2 = d2.stat.bloomFilter($"c1", expectedNumItems, fpp)
val a = spark.sparkContext.broadcast(s2)
val x = d1.rdd.filter(x => a.value.mightContain(x(0)))
case class newType(c1: Int, c2: Int, c3: Int) extends Serializable
val xDF = x.map(y => newType(y(0).toString.toInt, y(1).toString.toInt, y(2).toString.toInt)).toDF()
scala> d1.show(10)
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 0| 1| 2|
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
| 5| 6| 7|
+---+---+---+
scala> d2.show(10)
+---+
| c1|
+---+
| 0|
| 1|
| 2|
+---+
scala> xDF.show(10)
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 0| 1| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+

I built an implicit class that wraps https://stackoverflow.com/a/41989703/6723616
Comments welcome!
/**
* Copyright 2017 Yahoo, Inc.
* Zlib license: https://www.zlib.net/zlib_license.html
*/
package me.klotz.spark.utils
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.util.sketch.BloomFilter
import org.apache.spark.SparkContext
object BloomFilterEnhancedJoin {
// not parameterized for field typel; assumes string
/**
* Like .join(bigDF, smallDF, but accelerated with a Bloom filter.
* You pass in a size estimate of the bigDF, and a ratio of acceptable false positives out of the expected result set size.
* ratio=1 is a good start; that will result in about 50% false positives in the big-small join, so the filter accepts
* about as many as it passes, rather than rejecting almost all. Pass in a size estimate of the big dataframe
* to avoid enumerating it. The small DataFrame gets enumerated anyway.
*
* Example use:
* <code>
* import me.klotz.spark.utils.BloomFilterEnhancedJoin._
* val (dups_joined, bloomFilterBroadcast) = df_big.joinBloom(1024L*1024L*1024L, dups, 10.0, "id")
* dups_joined.write.format("orc").save("dups")
* bloomFilterBroadcast.unpersist
* <code>
*/
implicit class BloomFilterEnhancedJoiner(bigdf:Dataset[Row]) {
/**
* You should call bloomFilterBroadcast.unpersist after
*/
def joinBloom(bigDFCountEstimate:Long, smallDF: Dataset[Row], ratio:Double, field:String) = {
val sc = smallDF.sparkSession.sparkContext
val smallDFCount = smallDF.count
val fpr = smallDFCount.toDouble / bigDFCountEstimate.toDouble / ratio
println(s"fpr=${fpr} = smallDFCount=${smallDFCount} / bigDFCountEstimate=${bigDFCountEstimate} / ratio=${ratio}")
val bloomFilterBroadcast = sc.broadcast((smallDF.stat.bloomFilter(field, smallDFCount, fpr)))
val mightContain = udf((x: String) => if (x != null) bloomFilterBroadcast.value.mightContainString(x) else false)
(bigdf.filter(mightContain(col(field))).join(smallDF, field), bloomFilterBroadcast)
}
}
}

Related

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say that it contains 4 columns and 3 rows. I want to write a function to return the columns where all the values in that column are equal to 1.
This is a Scala code. I want to use some spark transformations to transform or filter the dataframe input. This filter should be implemented in a function.
case class Grade(c1: Integral, c2: Integral, c3: Integral, c4: Integral)
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return 3 row 2 columns dataframe with value all 1.
A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column
val tmp = dfInput.map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
val withoutNullsCount = tmp.select(column).na.drop().count()
if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}
dfInput.select(colsToRetain.toArray:_*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case object
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
def dropWhenNotEqualsTo(n: Integer): Grade = {
Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
}
def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade with values that not satisfies the condition replaced to nulls
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> if column contains nulls just drop it
one of the options is reduce on rdd:
import spark.implicits._
val df= Seq(("1","A","3","4"),("1","2","?","4"),("1","2","3","4")).toDF()
df.show()
val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"
def rowToArray(row: Row): Array[String] = {
val arr = new Array[String](row.length)
for (i <- 0 to row.length-1){
arr(i) = row.getString(i)
}
arr
}
def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
val arr = new Array[String](a1.length)
for (i <- 0 to a1.length-1){
arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
}
arr
}
val diff = df.rdd
.map(rowToArray)
.reduce(compareArrays)
val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s=>df(s._1))
df.select(cols:_*).show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| A| 3| 4|
| 1| 2| ?| 4|
| 1| 2| 3| 4|
+---+---+---+---+
+---+
| _1|
+---+
| 1|
| 1|
| 1|
+---+
I would try to prepare dataset for processing without nulls. In case of few columns this simple iterative approach might work fine (don't forget to import spark implicits before import spark.implicits._):
val example = spark.sparkContext.parallelize(Seq(
Grade(1,3,1,1),
Grade(1,1,0,1),
Grade(1,10,2,1)
)).toDS().cache()
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
val row = ds.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
val schema = StructType(Seq(
StructField("c1", IntegerType, nullable = true),
StructField("c2", IntegerType, nullable = true),
StructField("c3", IntegerType, nullable = true),
StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
Row(1,3,1,1),
Row(1,1,null,1),
Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()
def allOnes(colName: String, df: DataFrame): Boolean = {
val row = df.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns= dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()

How to retain the column structure of a Spark Dataframe following a map operation on rows

I am trying to apply a function to each row of a Spark DataFrame, as in the example.
val df = sc.parallelize(
Seq((1, 2, 0), (0, 0, 1), (0, 0, 0))).toDF("x", "y", "z")
df.show()
which yields
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Suppose I want to do something to the values in each row, for example changing 0 to 5.
val b = df.map(row => row.toSeq.map(x => x match{
case 0 => 5
case x: Int => x
}))
b.show()
+---------+
| value|
+---------+
|[1, 2, 5]|
|[5, 5, 1]|
|[5, 5, 5]|
+---------+
The function worked, but I now have one column whose entries are Lists, instead of 3 columns of Ints. I would like my named columns back.
You can define an UDF to apply this substitution. For example:
def subsDef(k: Int): Int = if(k==0) 5 else k
val subs = udf[Int, Int](subsDef)
Then you can apply the UDF to a specific column or, if you desire, to every columns of the DF:
// to a single column, for example "x"
df = df.withColumn("x", subs(col("x")))
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 5| 0| 1|
| 5| 0| 0|
+---+---+---+
// to every columns of DF
df.columns.foreach(c => {
df = df.withColumn(c, subs(col(c)))
})
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 5|
| 5| 5| 1|
| 5| 5| 5|
+---+---+---+
Rather than transforming the DataFrame row-wise, consider using built-in Spark API function when/otherwise, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, 2, 0), (0, 0, 1), (0, 0, 0)).toDF("x", "y", "z")
val vFrom = 0
val vTo = 5
val cols = df.columns // Filter for specific columns if necessary
df.select( cols.map( c =>
when(col(c) === vFrom, vTo).otherwise(col(c)).as(c)
): _*
).show
// +---+---+---+
// | x| y| z|
// +---+---+---+
// | 1| 2| 5|
// | 5| 5| 1|
// | 5| 5| 5|
// +---+---+---+
There are various ways to do it here are some:
df.map(row => {
val size = row.size
var seq: Seq[Int] = Seq.empty[Int]
for (a <- 0 to size - 1) {
val value: Int = row(a).asInstanceOf[Int]
val newVal: Int = value match {
case 0 =>
5
case _ =>
value
}
seq = seq :+ newVal
}
Row.fromSeq(seq)
})(RowEncoder.apply(df.schema))
val columns = df.columns
df.select(
columns.map(c => when(col(c) === 0, 5).otherwise(col(c)).as(c)): _*)
.show()
def fun: (Int => Int) = { x =>
if (x == 0) 5 else x
}
val function = udf(fun)
df.select(function(col("x")).as("x"),
function(col("y")).as("y"),
function(col("z")).as("z"))
.show()
def checkZero(a: Int): Int = if (a == 0) 5 else a
df.map {
case Row(a: Int, b: Int, c: Int) =>
Row(checkZero(a), checkZero(b), checkZero(c))
} { RowEncoder.apply(df.schema) }
.show()

How to add a column collection based on the maximum and minimum values in a dataframe

I've got this DataFrame
val for_df = Seq((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-2k")).toDF("min","max","salary")
I want to convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8.
Original DataFrame:
Desired DataFrame
a.select("min","max","salary")
.as[(Integer,Integer,String)]
.map{
case(min,max,salary) =>
(min,max,salary.split("-").flatMap(x => {
for(i <- 0 to x.length-1) yield (i)
}))
}.toDF("1","2","3").show()
you need to create a UDF to expand the limits. The following UDF will convert convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8 and so on
import org.apache.spark.sql.functions._
val inputDF = sc.parallelize(List((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
val extendUDF = udf((str: String) => {
val nums = str.replace("k","").split("-").map(_.toInt)
(nums(0) to nums(1)).toList.mkString(",")
})
val output = inputDF.withColumn("salary_level", extendUDF($"salary"))
Output:
scala> output.show
+---+---+------+----------------+
|min|max|salary| salary_level|
+---+---+------+----------------+
| 5| 7| 5k-7k| 5,6,7|
| 4| 8| 4k-8k| 4,5,6,7,8|
| 6| 12|6k-12k|6,7,8,9,10,11,12|
+---+---+------+----------------+
You can easily do this with a udf.
// The following defines a udf in spark which create a list as per your requirement.
val makeRangeLists = udf( (min: Int, max: Int) => List.range(min, max+1) )
val input = sc.parallelize(List((5,7,"5k-7k"),
(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
// Create a new column using the UDF and pass the max and min columns.
input.withColumn("salary_level", makeRangeLists($"min", $"max")).show
Here one quick option with an UDF
import org.apache.spark.sql.functions
val toSalary = functions.udf((value: String) => {
val array = value.filterNot(_ == 'k').split("-").map(_.trim.toInt).sorted
val (startSalary, endSalary) = (array.headOption, array.tail.headOption)
(startSalary, endSalary) match {
case (Some(s), Some(e)) => (s to e).toList.mkString(",")
case _ => ""
}
})
for_df.withColumn("salary_level", toSalary($"salary")).drop("salary")
Input
+---+---+------+
|min|max|salary|
+---+---+------+
| 5| 7| 5k-7k|
| 4| 8| 4k-8k|
| 6| 12| 6k-2k|
+---+---+------+
Result
+---+---+------------+
|min|max|salary_level|
+---+---+------------+
| 5| 7| 5,6,7|
| 4| 8| 4,5,6,7,8|
| 6| 12| 2,3,4,5,6|
+---+---+------------+
First you remove the k and split your string by the dash. Then you get the start and endSalary and perform a range beetwen them.

Scala & Spark: Add value to every cell of every row

I have a two DataFrames:
scala> df1.show()
+----+----+----+---+----+
|col1|col2|col3| |colN|
+----+----+----+ +----+
| 2|null| 3|...| 4|
| 4| 3| 3| | 1|
| 5| 2| 8| | 1|
+----+----+----+---+----+
scala> df2.show() // has one row only (avg())
+----+----+----+---+----+
|col1|col2|col3| |colN|
+----+----+----+ +----+
| 3.6|null| 4.6|...| 2|
+----+----+----+---+----+
and a constant val c : Double = 0.1.
Desired output is a df3: Dataframe that is given by
,
with n=numberOfRow and m=numberOfColumn.
I already looked through the list of sql.functions and failed implementing it myself with some nested map operations (fearing performance issues). One idea I had was:
val cBc = spark.sparkContext.broadcast(c)
val df2Bc = spark.sparkContext.broadcast(averageObservation)
df1.rdd.map(row => {
for (colIdx <- 0 until row.length) {
val correspondingDf2value = df2Bc.value.head().getDouble(colIdx)
row.getDouble(colIdx) * (1 - cBc.value) + correspondingDf2value * cBc.value
}
})
Thank you in advance!
(cross)join combined with select is more than enough and will be much more efficient than mapping. Required imports:
import org.apache.spark.sql.functions.{broadcast, col, lit}
and expression:
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }
join and select:
df1.crossJoin(broadcast(df2)).select(exprs: _*)

How to do a Spark dataframe(1 million rows) cartesian product with a list(1000 entries) efficiently to generate a new dataframe with 1 billion rows

I want to take each row of a dataframe which has 1 million rows and generate 1000 rows from each row of it by taking a cross product with a list having 1000 entries thereby generating a dataframe with 1 billion rows. What is the best approach to do it efficiently.
I have tried with broadcasting the list and then using it while mapping each row of the dataframe. But this seems to be taking too much time.
val mappedrdd = validationDataFrames.map(x => {
val cutoffList : List[String] = cutoffListBroadcast.value
val arrayTruthTableVal = arrayTruthTableBroadcast.value
var listBufferRow: ListBuffer[Row] = new ListBuffer()
for(cutOff <- cutoffList){
val conversion = x.get(0).asInstanceOf[Int]
val probability = x.get(1).asInstanceOf[Double]
var columnName : StringBuffer = new StringBuffer
columnName = columnName.append(conversion)
if(probability > cutOff.toDouble){
columnName = columnName.append("_").append("1")
}else{
columnName = columnName.append("_").append("0")
}
val index:Int = arrayTruthTableVal.indexOf(columnName.toString)
var listBuffer : ListBuffer[String] = new ListBuffer()
listBuffer :+= cutOff
for(i <- 1 to 4){
if((index + 1) == i) listBuffer :+= "1" else listBuffer :+= "0"
}
val row = Row.fromSeq(listBuffer)
listBufferRow = listBufferRow :+ row
}
listBufferRow
})
Depending on your spark version you can do:
Spark 2.1.0
Add the list as a column and explode. A simplified example:
val df = spark.range(5)
val exploded = df.withColumn("a",lit(List(1,2,3).toArray)).withColumn("a", explode($"a"))
df.show()
+---+---+
| id| a|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 2|
| 3| 3|
| 4| 1|
| 4| 2|
| 4| 3|
+---+---+
For timing you can do:
def time[R](block: => R): Long = {
val t0 = System.currentTimeMillis()
block // call-by-name
val t1 = System.currentTimeMillis()
t1 - t0
}
time(spark.range(1000000).withColumn("a",lit((0 until 1000).toArray)).withColumn("a", explode($"a")).count())
took 5.41 seconds on a 16 core computer with plenty of memory configured with default parallelism of 60.
< Spark 2.1.0
You can define a simple UDF.
val xx = (0 until 1000).toArray.toSeq // replace with your list but turn it to seq
val ff = udf(() => {xx})
time(spark.range(1000000).withColumn("a",ff()).withColumn("a", explode($"a")).count())
Took on the same server as above 8.25 seconds