Spark: Query for multiple conditions at the same time - scala

So Dataframe.where can be used to filter a dataframe for the rows given by an expression, like this:
df.where(($"group_id" == 1234) || ($"group_id" == 4434))
or to give a more complex example
df.where(($"group_id" == 1234 && $"country" === "PL") || ($"group_id" == 4434 $"country" === "FR"))
I am interest in whether I can supply these conditions somehow as a list, so suppose I have a list of group_id's, List((1234, "PL"), (4434, "FR"), ....) then I would like to efficiently filter the dataframe.

You can try something like this:
val df = Seq((1,"a"),(2,"b"),(3,"c")).toDF()
df.show()
 
+---+---+
| _1| _2|
+---+---+
| 1| a|
| 2| b|
| 3| c|
+---+---+
 
val list = List((1,"a"),(3,"c"))
val cols = List("_1","_2")
def mkCol(values: List[(Any,Any)], columns: List[String]) = list.map(s=>col(columns.apply(0)) === s._1 && col(columns.apply(1)) === s._2)
.reduce((a,b)=>a.or(b))
val col = mkCol(list,cols)
col.explain(true)
 
((('_1 = 1) && ('_2 = a)) || (('_1 = 3) && ('_2 = c)))
 
df.where(filterCol).show()
 
+---+---+
| _1| _2|
+---+---+
| 1| a|
| 3| c|
+---+---+
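
As an alternative (a sketch, not part of the original answer): when the list of pairs gets long, you can also build a small lookup dataframe from it and filter df with a left-semi join instead of one large boolean expression:

import org.apache.spark.sql.functions.broadcast

// Build a small lookup dataframe from the pairs and keep only matching rows of df
val lookup = list.toDF("_1", "_2")
df.join(broadcast(lookup), Seq("_1", "_2"), "left_semi").show()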

Related

How to remove Spark values that are out of sequence

I need to remove some values from a dataframe that are not in the right place.
I have the following dataframe, for example:
+-----+-----+
|count|PHASE|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 6|
| 4| 6|
| 5| 8|
| 6| 4|
| 7| 4|
| 8| 4|
+-----+-----+
I need to remove 6 and 8 from the dataframe because of these rules:
phase === 3 and lastPhase.isNull
phase === 4 and lastPhase.isin(2, 3)
phase === 6 and lastPhase.isin(4, 5)
phase === 8 and lastPhase.isin(6, 7)
This is a huge dataframe and those misplaced values can happen many times.
Could you help with that, please?
Expected output:
+-----+-----+------+
|count|PHASE|CHANGE|
+-----+-----+------+
| 1| 3| 3|
| 2| 3| 3|
| 3| 6| 3|
| 4| 6| 3|
| 5| 8| 3|
| 6| 4| 4|
| 7| 4| 4|
| 8| 4| 4|
+-----+-----+------+
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rows = Seq(
  Row(1, 3),
  Row(2, 3),
  Row(3, 6),
  Row(4, 6),
  Row(5, 8),
  Row(6, 4),
  Row(7, 4),
  Row(8, 4)
)
val schema = StructType(
  Seq(StructField("count", IntegerType), StructField("PHASE", IntegerType))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rows),
  schema
)
Thanks in advance!
If I correctly understood your question, you want to populate the CHANGE column as follows:
For a dataframe sorted by the count column, for each row, if the value of the PHASE column matches a defined set of rules, set that value in the CHANGE column. If the value doesn't match the rules, set the latest valid PHASE value in the CHANGE column.
To do so, you can use a user-defined aggregate function to compute the CHANGE column over a window ordered by the count column.
First, define an Aggregator object whose buffer holds the last valid phase, and implement your set of rules in its reduce function:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

object LatestValidPhase extends Aggregator[Integer, Integer, Integer] {
  // The buffer is the last valid phase seen so far (null before any valid phase)
  def zero: Integer = null

  def reduce(lastPhase: Integer, phase: Integer): Integer = {
    if (lastPhase == null && phase == 3) {
      phase
    } else if (Set(2, 3).contains(lastPhase) && phase == 4) {
      phase
    } else if (Set(4, 5).contains(lastPhase) && phase == 6) {
      phase
    } else if (Set(6, 7).contains(lastPhase) && phase == 8) {
      phase
    } else {
      lastPhase
    }
  }

  def merge(b1: Integer, b2: Integer): Integer = {
    throw new NotImplementedError("should not be used as a general aggregation")
  }

  def finish(reduction: Integer): Integer = reduction

  def bufferEncoder: Encoder[Integer] = Encoders.INT

  def outputEncoder: Encoder[Integer] = Encoders.INT
}
Then you transform it into an aggregate user-defined function that you apply over your window ordered by COUNT column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val latest_valid_phase = udaf(LatestValidPhase)
val window = Window.orderBy("count")
df.withColumn("CHANGE", latest_valid_phase(col("PHASE")).over(window))

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say that it contains 4 columns and 3 rows. I want to write a function to return the columns where all the values in that column are equal to 1.
This is Scala code. I want to use some Spark transformations to transform or filter the dataframe input. This filter should be implemented in a function.
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return a dataframe with 3 rows and 2 columns where every value is 1.
A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import spark.implicits._

// Work on a typed Dataset[Grade] so each row can be mapped with dropWhenNotEqualsTo
val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1))

val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
  val withoutNullsCount = tmp.select(column).na.drop().count()
  if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}

dfInput.select(colsToRetain.toArray: _*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case class:
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
  def dropWhenNotEqualsTo(n: Integer): Grade = {
    Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
  }
  def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade in which the values that don't satisfy the condition are replaced with nulls
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g. for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> keep the column only if it contains no nulls (i.e. the counts match); columns with nulls are dropped
One of the options is a reduce on the RDD:
import spark.implicits._
import org.apache.spark.sql.Row

val df = Seq(("1", "A", "3", "4"), ("1", "2", "?", "4"), ("1", "2", "3", "4")).toDF()
df.show()

val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"

def rowToArray(row: Row): Array[String] = {
  val arr = new Array[String](row.length)
  for (i <- 0 until row.length) {
    arr(i) = row.getString(i)
  }
  arr
}

// Keep a cell value only when both rows agree on the target value; otherwise mark it as different
def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
  val arr = new Array[String](a1.length)
  for (i <- 0 until a1.length) {
    arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
  }
  arr
}

val diff = df.rdd
  .map(rowToArray)
  .reduce(compareArrays)

val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s => df(s._1))
df.select(cols: _*).show()
df.show() prints the original dataframe:

+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  1|  A|  3|  4|
|  1|  2|  ?|  4|
|  1|  2|  3|  4|
+---+---+---+---+

and the final select keeps only the column whose values are all "1":

+---+
| _1|
+---+
|  1|
|  1|
|  1|
+---+
I would try to prepare the dataset for processing without nulls. With only a few columns, this simple iterative approach might work fine (don't forget to add import spark.implicits._ first):
import org.apache.spark.sql.Dataset

val example = spark.sparkContext.parallelize(Seq(
  Grade(1, 3, 1, 1),
  Grade(1, 1, 0, 1),
  Grade(1, 10, 2, 1)
)).toDS().cache()

// A column qualifies when its only distinct value is 1
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
  val row = ds.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}

val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("c1", IntegerType, nullable = true),
  StructField("c2", IntegerType, nullable = true),
  StructField("c3", IntegerType, nullable = true),
  StructField("c4", IntegerType, nullable = true)
))

val example = spark.sparkContext.parallelize(Seq(
  Row(1, 3, 1, 1),
  Row(1, 1, null, 1),
  Row(1, 10, 2, 1)
))

val dfInput = spark.createDataFrame(example, schema).cache()

// Same check as before: keep a column only when its single distinct value is 1
def allOnes(colName: String, df: DataFrame): Boolean = {
  val row = df.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}

val resultColumns = dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()
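
As an additional sketch (not one of the answers above): the same check can be done in a single aggregation pass by counting, per column, the values that are null or different from 1, and keeping only the columns where that count is zero:

import org.apache.spark.sql.functions.{col, count, when}

// One row holding, for every column, the number of offending values (null or != 1)
val offending = dfInput.select(
  dfInput.columns.map(c => count(when(col(c).isNull || col(c) =!= 1, 1)).alias(c)): _*
).first()

val allOnesCols = dfInput.columns.filter(c => offending.getAs[Long](c) == 0L)
dfInput.select(allOnesCols.map(col): _*).show()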

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following:
when col1 = X, store the values of col2 and col3
and assign those values to the next row where col1 = Y
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note: Spark 1.6
Here's one approach using a Window function, with the following steps:
Add a row-identifying column (not needed if there is already one) and combine the non-key columns (presumably many of them) into one
Create tmp1 with conditional nulls and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
Create newcols conditionally from cols and tmp2
Expand newcols back to individual columns using foldLeft
Note that this solution uses a Window function without partitioning, thus it may not work for large datasets.
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  ("X", "x1", "x2"),
  ("Z", "z1", "z2"),
  ("Y", "", ""),
  ("X", "x3", "x4"),
  ("P", "p1", "p2"),
  ("Q", "q1", "q2"),
  ("Y", "", "")
).toDF("col1", "col2", "col3")

val colList = df.columns.filter(_ != "col1")

val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
  struct(colList.map(col): _*).as("cols")
)

val df3 = df2.
  withColumn( "tmp1", when($"col1" === "X", $"cols") ).
  withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
    Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
  ) )

df3.show
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
  when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
  (accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A work-around is to fall back to Spark SQL, which supports ignore-nulls in its last() method:
df2.
  withColumn( "tmp1", when($"col1" === "X", $"cols") ).
  registerTempTable("df2table")  // on Spark 2.x use createOrReplaceTempView instead

val df3 = sqlContext.sql("""
  select col1, id, cols, tmp1, last(tmp1, true) over (
    order by id rows between unbounded preceding and current row
  ) as tmp2
  from df2table
""")
Yes, there is a lag function; it requires an ordering:
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}

case class Temp(a: String, b: Option[String], c: Option[String])

// ss is the SparkSession
val input = ss.createDataFrame(
  Seq(
    Temp("A", Some("a1"), Some("a2")),
    Temp("D", Some("d1"), Some("d2")),
    Temp("B", Some("b1"), Some("b2")),
    Temp("E", None, None),
    Temp("C", None, None)
  ))

input.show() gives:
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+
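One caveat to this lag-based version (my note, not part of the answer): lag(1) only looks one row back, so several consecutive null rows would not all get filled. A back-fill with last(..., ignoreNulls = true) over an unbounded-preceding window, as in the previous answer, handles that case:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Fill each null with the most recent non-null value in the ordering by "a"
val w = Window.orderBy($"a").rowsBetween(Window.unboundedPreceding, 0)
input
  .withColumn("b", last($"b", ignoreNulls = true).over(w))
  .withColumn("c", last($"c", ignoreNulls = true).over(w))
  .show()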

Scala & Spark: Add value to every cell of every row

I have two DataFrames:
scala> df1.show()
+----+----+----+---+----+
|col1|col2|col3| |colN|
+----+----+----+ +----+
| 2|null| 3|...| 4|
| 4| 3| 3| | 1|
| 5| 2| 8| | 1|
+----+----+----+---+----+
scala> df2.show() // has one row only (avg())
+----+----+----+---+----+
|col1|col2|col3| |colN|
+----+----+----+ +----+
| 3.6|null| 4.6|...| 2|
+----+----+----+---+----+
and a constant val c : Double = 0.1.
Desired output is a df3: DataFrame that is given element-wise by
df3[n][m] = df1[n][m] * (1 - c) + df2[m] * c
with n = 1..numberOfRows and m = 1..numberOfColumns.
I already looked through the list of sql.functions and failed to implement it myself with some nested map operations (fearing performance issues). One idea I had was:
val cBc = spark.sparkContext.broadcast(c)
val df2Bc = spark.sparkContext.broadcast(averageObservation)
df1.rdd.map(row => {
for (colIdx <- 0 until row.length) {
val correspondingDf2value = df2Bc.value.head().getDouble(colIdx)
row.getDouble(colIdx) * (1 - cBc.value) + correspondingDf2value * cBc.value
}
})
Thank you in advance!
(cross)join combined with select is more than enough and will be much more efficient than mapping. Required imports:
import org.apache.spark.sql.functions.{broadcast, col, lit}
and expression:
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }
join and select:
df1.crossJoin(broadcast(df2)).select(exprs: _*)
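For reference, a minimal self-contained sketch of the whole approach (the sample values and column names are made up; they are not from the question):

import spark.implicits._
import org.apache.spark.sql.functions.broadcast

val c: Double = 0.1
val df1 = Seq((2.0, 3.0), (4.0, 3.0), (5.0, 8.0)).toDF("col1", "col2")
// Stand-in for the one-row dataframe of per-column averages
val df2 = Seq((11.0 / 3, 14.0 / 3)).toDF("col1", "col2")

// Every output cell is a weighted blend of the original value and its column average
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }
df1.crossJoin(broadcast(df2)).select(exprs: _*).show()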

How to do a Spark dataframe(1 million rows) cartesian product with a list(1000 entries) efficiently to generate a new dataframe with 1 billion rows

I want to take each row of a dataframe which has 1 million rows and generate 1000 rows from each of them by taking a cross product with a list of 1000 entries, thereby generating a dataframe with 1 billion rows. What is the best approach to do this efficiently?
I have tried broadcasting the list and then using it while mapping each row of the dataframe, but this seems to take too much time.
val mappedrdd = validationDataFrames.map(x => {
  val cutoffList: List[String] = cutoffListBroadcast.value
  val arrayTruthTableVal = arrayTruthTableBroadcast.value
  var listBufferRow: ListBuffer[Row] = new ListBuffer()
  for (cutOff <- cutoffList) {
    val conversion = x.get(0).asInstanceOf[Int]
    val probability = x.get(1).asInstanceOf[Double]
    var columnName: StringBuffer = new StringBuffer
    columnName = columnName.append(conversion)
    if (probability > cutOff.toDouble) {
      columnName = columnName.append("_").append("1")
    } else {
      columnName = columnName.append("_").append("0")
    }
    val index: Int = arrayTruthTableVal.indexOf(columnName.toString)
    var listBuffer: ListBuffer[String] = new ListBuffer()
    listBuffer :+= cutOff
    for (i <- 1 to 4) {
      if ((index + 1) == i) listBuffer :+= "1" else listBuffer :+= "0"
    }
    val row = Row.fromSeq(listBuffer)
    listBufferRow = listBufferRow :+ row
  }
  listBufferRow
})
Depending on your spark version you can do:
Spark 2.1.0
Add the list as a column and explode. A simplified example:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, lit}

val df = spark.range(5)
val exploded = df.withColumn("a", lit(List(1, 2, 3).toArray)).withColumn("a", explode($"a"))
exploded.show()
+---+---+
| id| a|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 2|
| 3| 3|
| 4| 1|
| 4| 2|
| 4| 3|
+---+---+
For timing you can do:
def time[R](block: => R): Long = {
  val t0 = System.currentTimeMillis()
  block // call-by-name
  val t1 = System.currentTimeMillis()
  t1 - t0
}
time(spark.range(1000000).withColumn("a",lit((0 until 1000).toArray)).withColumn("a", explode($"a")).count())
This took 5.41 seconds on a 16-core computer with plenty of memory, configured with a default parallelism of 60.
< Spark 2.1.0
You can define a simple UDF.
import org.apache.spark.sql.functions.udf

val xx = (0 until 1000).toArray.toSeq // replace with your list, but turn it into a Seq
val ff = udf(() => { xx })
time(spark.range(1000000).withColumn("a",ff()).withColumn("a", explode($"a")).count())
On the same server as above, this took 8.25 seconds.
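
As a side note (an addition, not part of the original answer): on Spark 2.2 and later, typedLit can build the array column directly, avoiding the UDF:

import org.apache.spark.sql.functions.{explode, typedLit}

// typedLit (Spark 2.2+) creates the literal array column in one step
time(spark.range(1000000).withColumn("a", explode(typedLit((0 until 1000).toArray))).count())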