Using stat.bloomFilter in Spark 2.0.0 to filter another dataframe

Using stat.bloomFilter in Spark 2.0.0 to filter another dataframe - scala

I have two large dataframes [a] one which has all events identified by an id [b] a list of ids. I want to filter [a] based on the ids in [b] using the stat.bloomFilter implementation in spark 2.0.0
However I don't see any operations in the dataset API to join the bloom filter to the data frame [a]
val in1 = spark.sparkContext.parallelize(List(0, 1, 2, 3, 4, 5))
val df1 = in1.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val in2 = spark.sparkContext.parallelize(List(0, 1, 2))
val df2 = in2.map(x => (x)).toDF("c1")
val expectedNumItems: Long = 1000
val fpp: Double = 0.005
val sbf = df.stat.bloomFilter($"c1", expectedNumItems, fpp)
val sbf2 = df2.stat.bloomFilter($"c1", expectedNumItems, fpp)
What is the best way to filter 'df1' based on values in df2?
Thanks!

You can use an UDF:
def might_contain(f: org.apache.spark.util.sketch.BloomFilter) = udf((x: Int) =>
if(x != null) f.mightContain(x) else false)
df1.where(might_contain(sbf2)($"C1"))

I think I found the correct way to do this, but would still like pointers to see if there are better ways to manage this.
Here's my solution -
val in1 = spark.sparkContext.parallelize(List(0, 1, 2, 3, 4, 5))
val d1 = in1.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val in2 = spark.sparkContext.parallelize(List(0, 1, 2))
val d2 = in2.map(x => (x)).toDF("c1")
val s2 = d2.stat.bloomFilter($"c1", expectedNumItems, fpp)
val a = spark.sparkContext.broadcast(s2)
val x = d1.rdd.filter(x => a.value.mightContain(x(0)))
case class newType(c1: Int, c2: Int, c3: Int) extends Serializable
val xDF = x.map(y => newType(y(0).toString.toInt, y(1).toString.toInt, y(2).toString.toInt)).toDF()
scala> d1.show(10)
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 0| 1| 2|
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
| 5| 6| 7|
+---+---+---+
scala> d2.show(10)
+---+
| c1|
+---+
| 0|
| 1|
| 2|
+---+
scala> xDF.show(10)
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 0| 1| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+

I built an implicit class that wraps https://stackoverflow.com/a/41989703/6723616
Comments welcome!
/**
* Copyright 2017 Yahoo, Inc.
* Zlib license: https://www.zlib.net/zlib_license.html
*/
package me.klotz.spark.utils
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.util.sketch.BloomFilter
import org.apache.spark.SparkContext
object BloomFilterEnhancedJoin {
// not parameterized for field typel; assumes string
/**
* Like .join(bigDF, smallDF, but accelerated with a Bloom filter.
* You pass in a size estimate of the bigDF, and a ratio of acceptable false positives out of the expected result set size.
* ratio=1 is a good start; that will result in about 50% false positives in the big-small join, so the filter accepts
* about as many as it passes, rather than rejecting almost all. Pass in a size estimate of the big dataframe
* to avoid enumerating it. The small DataFrame gets enumerated anyway.
*
* Example use:
* <code>
* import me.klotz.spark.utils.BloomFilterEnhancedJoin._
* val (dups_joined, bloomFilterBroadcast) = df_big.joinBloom(1024L*1024L*1024L, dups, 10.0, "id")
* dups_joined.write.format("orc").save("dups")
* bloomFilterBroadcast.unpersist
* <code>
*/
implicit class BloomFilterEnhancedJoiner(bigdf:Dataset[Row]) {
/**
* You should call bloomFilterBroadcast.unpersist after
*/
def joinBloom(bigDFCountEstimate:Long, smallDF: Dataset[Row], ratio:Double, field:String) = {
val sc = smallDF.sparkSession.sparkContext
val smallDFCount = smallDF.count
val fpr = smallDFCount.toDouble / bigDFCountEstimate.toDouble / ratio
println(s"fpr=${fpr} = smallDFCount=${smallDFCount} / bigDFCountEstimate=${bigDFCountEstimate} / ratio=${ratio}")
val bloomFilterBroadcast = sc.broadcast((smallDF.stat.bloomFilter(field, smallDFCount, fpr)))
val mightContain = udf((x: String) => if (x != null) bloomFilterBroadcast.value.mightContainString(x) else false)
(bigdf.filter(mightContain(col(field))).join(smallDF, field), bloomFilterBroadcast)
}
}
}

Related

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say that it contains 4 columns and 3 rows. I want to write a function to return the columns where all the values in that column are equal to 1.
This is a Scala code. I want to use some spark transformations to transform or filter the dataframe input. This filter should be implemented in a function.
case class Grade(c1: Integral, c2: Integral, c3: Integral, c4: Integral)
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return 3 row 2 columns dataframe with value all 1.

A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column
val tmp = dfInput.map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
val withoutNullsCount = tmp.select(column).na.drop().count()
if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}
dfInput.select(colsToRetain.toArray:_*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case object
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
def dropWhenNotEqualsTo(n: Integer): Grade = {
Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
}
def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade with values that not satisfies the condition replaced to nulls
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> if column contains nulls just drop it

one of the options is reduce on rdd:
import spark.implicits._
val df= Seq(("1","A","3","4"),("1","2","?","4"),("1","2","3","4")).toDF()
df.show()
val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"
def rowToArray(row: Row): Array[String] = {
val arr = new Array[String](row.length)
for (i <- 0 to row.length-1){
arr(i) = row.getString(i)
}
arr
}
def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
val arr = new Array[String](a1.length)
for (i <- 0 to a1.length-1){
arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
}
arr
}
val diff = df.rdd
.map(rowToArray)
.reduce(compareArrays)
val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s=>df(s._1))
df.select(cols:_*).show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| A| 3| 4|
| 1| 2| ?| 4|
| 1| 2| 3| 4|
+---+---+---+---+
+---+
| _1|
+---+
| 1|
| 1|
| 1|
+---+

I would try to prepare dataset for processing without nulls. In case of few columns this simple iterative approach might work fine (don't forget to import spark implicits before import spark.implicits._):
val example = spark.sparkContext.parallelize(Seq(
Grade(1,3,1,1),
Grade(1,1,0,1),
Grade(1,10,2,1)
)).toDS().cache()
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
val row = ds.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
val schema = StructType(Seq(
StructField("c1", IntegerType, nullable = true),
StructField("c2", IntegerType, nullable = true),
StructField("c3", IntegerType, nullable = true),
StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
Row(1,3,1,1),
Row(1,1,null,1),
Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()
def allOnes(colName: String, df: DataFrame): Boolean = {
val row = df.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns= dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()

How to retain the column structure of a Spark Dataframe following a map operation on rows

I am trying to apply a function to each row of a Spark DataFrame, as in the example.
val df = sc.parallelize(
Seq((1, 2, 0), (0, 0, 1), (0, 0, 0))).toDF("x", "y", "z")
df.show()
which yields
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Suppose I want to do something to the values in each row, for example changing 0 to 5.
val b = df.map(row => row.toSeq.map(x => x match{
case 0 => 5
case x: Int => x
}))
b.show()
+---------+
| value|
+---------+
|[1, 2, 5]|
|[5, 5, 1]|
|[5, 5, 5]|
+---------+
The function worked, but I now have one column whose entries are Lists, instead of 3 columns of Ints. I would like my named columns back.

You can define an UDF to apply this substitution. For example:
def subsDef(k: Int): Int = if(k==0) 5 else k
val subs = udf[Int, Int](subsDef)
Then you can apply the UDF to a specific column or, if you desire, to every columns of the DF:
// to a single column, for example "x"
df = df.withColumn("x", subs(col("x")))
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 5| 0| 1|
| 5| 0| 0|
+---+---+---+
// to every columns of DF
df.columns.foreach(c => {
df = df.withColumn(c, subs(col(c)))
})
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 5|
| 5| 5| 1|
| 5| 5| 5|
+---+---+---+

Rather than transforming the DataFrame row-wise, consider using built-in Spark API function when/otherwise, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, 2, 0), (0, 0, 1), (0, 0, 0)).toDF("x", "y", "z")
val vFrom = 0
val vTo = 5
val cols = df.columns // Filter for specific columns if necessary
df.select( cols.map( c =>
when(col(c) === vFrom, vTo).otherwise(col(c)).as(c)
): _*
).show
// +---+---+---+
// | x| y| z|
// +---+---+---+
// | 1| 2| 5|
// | 5| 5| 1|
// | 5| 5| 5|
// +---+---+---+

There are various ways to do it here are some:
df.map(row => {
val size = row.size
var seq: Seq[Int] = Seq.empty[Int]
for (a <- 0 to size - 1) {
val value: Int = row(a).asInstanceOf[Int]
val newVal: Int = value match {
case 0 =>
5
case _ =>
value
}
seq = seq :+ newVal
}
Row.fromSeq(seq)
})(RowEncoder.apply(df.schema))
val columns = df.columns
df.select(
columns.map(c => when(col(c) === 0, 5).otherwise(col(c)).as(c)): _*)
.show()
def fun: (Int => Int) = { x =>
if (x == 0) 5 else x
}
val function = udf(fun)
df.select(function(col("x")).as("x"),
function(col("y")).as("y"),
function(col("z")).as("z"))
.show()
def checkZero(a: Int): Int = if (a == 0) 5 else a
df.map {
case Row(a: Int, b: Int, c: Int) =>
Row(checkZero(a), checkZero(b), checkZero(c))
} { RowEncoder.apply(df.schema) }
.show()

How to add a column collection based on the maximum and minimum values in a dataframe

I've got this DataFrame
val for_df = Seq((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-2k")).toDF("min","max","salary")
I want to convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8.
Original DataFrame:
Desired DataFrame
a.select("min","max","salary")
.as[(Integer,Integer,String)]
.map{
case(min,max,salary) =>
(min,max,salary.split("-").flatMap(x => {
for(i <- 0 to x.length-1) yield (i)
}))
}.toDF("1","2","3").show()

you need to create a UDF to expand the limits. The following UDF will convert convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8 and so on
import org.apache.spark.sql.functions._
val inputDF = sc.parallelize(List((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
val extendUDF = udf((str: String) => {
val nums = str.replace("k","").split("-").map(_.toInt)
(nums(0) to nums(1)).toList.mkString(",")
})
val output = inputDF.withColumn("salary_level", extendUDF($"salary"))
Output:
scala> output.show
+---+---+------+----------------+
|min|max|salary| salary_level|
+---+---+------+----------------+
| 5| 7| 5k-7k| 5,6,7|
| 4| 8| 4k-8k| 4,5,6,7,8|
| 6| 12|6k-12k|6,7,8,9,10,11,12|
+---+---+------+----------------+

You can easily do this with a udf.
// The following defines a udf in spark which create a list as per your requirement.
val makeRangeLists = udf( (min: Int, max: Int) => List.range(min, max+1) )
val input = sc.parallelize(List((5,7,"5k-7k"),
(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
// Create a new column using the UDF and pass the max and min columns.
input.withColumn("salary_level", makeRangeLists($"min", $"max")).show

Here one quick option with an UDF
import org.apache.spark.sql.functions
val toSalary = functions.udf((value: String) => {
val array = value.filterNot(_ == 'k').split("-").map(_.trim.toInt).sorted
val (startSalary, endSalary) = (array.headOption, array.tail.headOption)
(startSalary, endSalary) match {
case (Some(s), Some(e)) => (s to e).toList.mkString(",")
case _ => ""
}
})
for_df.withColumn("salary_level", toSalary($"salary")).drop("salary")
Input
+---+---+------+
|min|max|salary|
+---+---+------+
| 5| 7| 5k-7k|
| 4| 8| 4k-8k|
| 6| 12| 6k-2k|
+---+---+------+
Result
+---+---+------------+
|min|max|salary_level|
+---+---+------------+
| 5| 7| 5,6,7|
| 4| 8| 4,5,6,7,8|
| 6| 12| 2,3,4,5,6|
+---+---+------------+
First you remove the k and split your string by the dash. Then you get the start and endSalary and perform a range beetwen them.

Scala & Spark: Add value to every cell of every row

I have a two DataFrames:
scala> df1.show()
+----+----+----+---+----+
|col1|col2|col3| |colN|
+----+----+----+ +----+
| 2|null| 3|...| 4|
| 4| 3| 3| | 1|
| 5| 2| 8| | 1|
+----+----+----+---+----+
scala> df2.show() // has one row only (avg())
+----+----+----+---+----+
|col1|col2|col3| |colN|
+----+----+----+ +----+
| 3.6|null| 4.6|...| 2|
+----+----+----+---+----+
and a constant val c : Double = 0.1.
Desired output is a df3: Dataframe that is given by
,
with n=numberOfRow and m=numberOfColumn.
I already looked through the list of sql.functions and failed implementing it myself with some nested map operations (fearing performance issues). One idea I had was:
val cBc = spark.sparkContext.broadcast(c)
val df2Bc = spark.sparkContext.broadcast(averageObservation)
df1.rdd.map(row => {
for (colIdx <- 0 until row.length) {
val correspondingDf2value = df2Bc.value.head().getDouble(colIdx)
row.getDouble(colIdx) * (1 - cBc.value) + correspondingDf2value * cBc.value
}
})
Thank you in advance!

(cross)join combined with select is more than enough and will be much more efficient than mapping. Required imports:
import org.apache.spark.sql.functions.{broadcast, col, lit}
and expression:
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }
join and select:
df1.crossJoin(broadcast(df2)).select(exprs: _*)

How to do a Spark dataframe(1 million rows) cartesian product with a list(1000 entries) efficiently to generate a new dataframe with 1 billion rows

I want to take each row of a dataframe which has 1 million rows and generate 1000 rows from each row of it by taking a cross product with a list having 1000 entries thereby generating a dataframe with 1 billion rows. What is the best approach to do it efficiently.
I have tried with broadcasting the list and then using it while mapping each row of the dataframe. But this seems to be taking too much time.
val mappedrdd = validationDataFrames.map(x => {
val cutoffList : List[String] = cutoffListBroadcast.value
val arrayTruthTableVal = arrayTruthTableBroadcast.value
var listBufferRow: ListBuffer[Row] = new ListBuffer()
for(cutOff <- cutoffList){
val conversion = x.get(0).asInstanceOf[Int]
val probability = x.get(1).asInstanceOf[Double]
var columnName : StringBuffer = new StringBuffer
columnName = columnName.append(conversion)
if(probability > cutOff.toDouble){
columnName = columnName.append("_").append("1")
}else{
columnName = columnName.append("_").append("0")
}
val index:Int = arrayTruthTableVal.indexOf(columnName.toString)
var listBuffer : ListBuffer[String] = new ListBuffer()
listBuffer :+= cutOff
for(i <- 1 to 4){
if((index + 1) == i) listBuffer :+= "1" else listBuffer :+= "0"
}
val row = Row.fromSeq(listBuffer)
listBufferRow = listBufferRow :+ row
}
listBufferRow
})

Depending on your spark version you can do:
Spark 2.1.0
Add the list as a column and explode. A simplified example:
val df = spark.range(5)
val exploded = df.withColumn("a",lit(List(1,2,3).toArray)).withColumn("a", explode($"a"))
df.show()
+---+---+
| id| a|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 2|
| 3| 3|
| 4| 1|
| 4| 2|
| 4| 3|
+---+---+
For timing you can do:
def time[R](block: => R): Long = {
val t0 = System.currentTimeMillis()
block // call-by-name
val t1 = System.currentTimeMillis()
t1 - t0
}
time(spark.range(1000000).withColumn("a",lit((0 until 1000).toArray)).withColumn("a", explode($"a")).count())
took 5.41 seconds on a 16 core computer with plenty of memory configured with default parallelism of 60.
< Spark 2.1.0
You can define a simple UDF.
val xx = (0 until 1000).toArray.toSeq // replace with your list but turn it to seq
val ff = udf(() => {xx})
time(spark.range(1000000).withColumn("a",ff()).withColumn("a", explode($"a")).count())
Took on the same server as above 8.25 seconds

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Using stat.bloomFilter in Spark 2.0.0 to filter another dataframe - scala

You can use an UDF: def might_contain(f: org.apache.spark.util.sketch.BloomFilter) = udf((x: Int) => if(x != null) f.mightContain(x) else false) df1.where(might_contain(sbf2)($"C1"))

Related

Scala spark, input dataframe, return columns where all values equal to 1

How to retain the column structure of a Spark Dataframe following a map operation on rows

How to add a column collection based on the maximum and minimum values in a dataframe

Scala & Spark: Add value to every cell of every row

How to do a Spark dataframe(1 million rows) cartesian product with a list(1000 entries) efficiently to generate a new dataframe with 1 billion rows

Categories

Resources