This is my function for applying a rule: the columns mdp_codcat, mdp_idregl and usedRef change according to the data in the array bRef.
def withMdpCodcat(bRef: Broadcast[Array[RefRglSDC]])(dataFrame: DataFrame): DataFrame = {
var matchRule = false
var i = 0
while (i < bRef.value.size && !matchRule) {
if ((bRef.value(i).sensop.isEmpty || bRef.value(i).sensop.equals(col("signe")))
&& (bRef.value(i).cdopcz.isEmpty || Lib.matchCdopcz(strTail(col("cdopcz")).toString(), bRef.value(i).cdopcz))
&& (bRef.value(i).libope.isEmpty || Lib.matchRule(col("lib_ope").toString(), bRef.value(i).libope))
&& (bRef.value(i).qualib.isEmpty || Lib.matchRule(col("qualif_lib_ope").toString(), bRef.value(i).qualib))) {
matchRule = true
dataFrame.withColumn("mdp_codcat", lit(bRef.value(i).codcat))
dataFrame.withColumn("mdp_idregl", lit(bRef.value(i).idregl))
dataFrame.withColumn("usedRef", lit("SDC"))
}else{
dataFrame.withColumn("mdp_codcat", lit("NOT_CATEGORIZED"))
dataFrame.withColumn("mdp_idregl", lit("-1"))
dataFrame.withColumn("usedRef", lit(""))
}
i += 1
}
dataFrame
}
The DataFrame has the columns "cdenjp", "cdguic", "numcpt", "mdp_codcat", "mdp_idregl" and "usedRef". On a match, mdp_codcat, mdp_idregl and usedRef should be filled with the values from bRef.
Example - my dataframe :
val DF = Seq(("tt", "aa","bb"),("tt1", "aa1","bb2"),("tt1", "aa1","bb2")).toDF("t","a","b")
+---+---+---+---+
| t| a| b| c|
+---+---+---+---+
| tt| aa| bb| cc|
|tt1|aa1|bb2|cc3|
+---+---+---+---+
file.text content :
,aa,bb,cc
,aa1,bb2,cc3
tt4,aa4,bb4,cc4
tt1,aa1,,cc6
case class TOTO(a: String, b:String, c: String, d:String)
val textFromCsv = sc.textFile("file:///home/X176616/file")
val bRef= textFromCsv.map(row => row.split(",", -1))
.map(c => TOTO(c(0), c(1), c(2), c(3))).collect().sortBy(_.a)
def withMdpCodcat(bRef: Broadcast[Array[RefRglSDC]])(dataFrame: DataFrame):DataFrame
dataframe.withColumn("mdp_codcat_new", lit("NOT_FOUND")) // initialise to NOT_FOUND first; overwritten in the while loop on a match
var matchRule = false
var i = 0
while (i < bRef.value.size && !matchRule) {
if ((bRef.value(i).a.isEmpty || bRef.value(i).a.equals(signe))
&& (bRef.value(i).b.isEmpty || Lib.matchCdopcz(col(b), bRef.value(i).b))
&& (bRef.value(i).c.isEmpty || Lib.matchRule(col(c), bRef.value(i).c))
) {
matchRule = true
dataframe.withColumn("mdp_codcat_new", lit(bRef.value(i).d))
dataframe.withColumn("mdp_idregl_new", lit(bRef.value(i).e))
}
i += 1
}
The final DataFrame, when the following condition is true:
(bRef.value(i).a.isEmpty || bRef.value(i).a.equals(signe))
&& (bRef.value(i).b.isEmpty || Lib.matchCdopcz(b.substring(1).toInt.toString, bRef.value(i).b))
&& (bRef.value(i).c.isEmpty || Lib.matchRule(c, bRef.value(i).c))
+---+---+---+---+----------+----------+
|  t|  a|  b|  c|mdp_codcat|mdp_idregl|
+---+---+---+---+----------+----------+
| tt| aa| bb| cc|        cc|     other|
| ab|aa1|bb2|cc3|       cc4|      toto|  (from bRef if the condition is true in the while loop)
| cd|aa1|bb2|cc3|       cc4|      titi|
|  b| a1| b2| c3|  NO_FOUND|  NO_FOUND|  (NO_FOUND if the condition is false)
+---+---+---+---+----------+----------+
You cannot create a DataFrame schema that depends on a runtime value. I would try a simpler approach. First I'd create the three columns with a default value:
// withColumn returns a new DataFrame, so the calls are chained and the result is kept
val withDefaults = dataFrame
.withColumn("mdp_codcat", lit(""))
.withColumn("mdp_idregl", lit(""))
.withColumn("usedRef", lit(""))
Then you can use a udf with your broadcasted value:
def mdp_codcat(bRef: Broadcast[Array[RefRglSDC]]) = udf { (field: String) =>
{
// Your while and if stuff
// return your update data
}}
And apply each udf to each field:
dataframe.withColumn("mdp_codcat_new", mdp_codcat(bRef)(col("mdp_codcat")))
Maybe it can help
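The body of such a UDF can be kept as a plain function, which makes the "first matching rule wins" logic easy to check outside Spark. This is only a sketch: `Rule`, `RuleLookup` and `firstMatch` are hypothetical names, the rule shape copies the `TOTO` case class from the question, and plain equality stands in for `Lib.matchCdopcz` and `Lib.matchRule`, whose behaviour is not shown here.

```scala
// Hypothetical rule shape, modelled on the TOTO case class in the question.
final case class Rule(a: String, b: String, c: String, d: String)

object RuleLookup {
  // An empty rule field acts as a wildcard, as in the question's while loop.
  // Plain equality is an assumption standing in for the Lib.match* helpers.
  private def fieldMatches(ruleField: String, value: String): Boolean =
    ruleField.isEmpty || ruleField == value

  // Returns the `d` of the first matching rule, or "NOT_FOUND" when none matches.
  def firstMatch(rules: Seq[Rule], a: String, b: String, c: String): String =
    rules
      .find(r => fieldMatches(r.a, a) && fieldMatches(r.b, b) && fieldMatches(r.c, c))
      .map(_.d)
      .getOrElse("NOT_FOUND")
}
```

With the broadcast holding an array of such rules, the UDF body would reduce to a call like `RuleLookup.firstMatch(bRef.value, a, b, c)`. `find` stops at the first hit, which is what the `while` loop with the `matchRule` flag does by hand.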
Related
Suppose I have a Spark DataFrame (in Scala) like
+---+---+---------------+
| a| b| expr|
+---+---+---------------+
| 0| 0|a = 1 AND b = 0|
| 0| 1| a = 0|
| 1| 0|a = 1 AND b = 1|
| 1| 1|a = 1 AND b = 1|
| 1| 1| null|
| 1| 1| a = 0 OR b = 1|
+---+---+---------------+
in which the string column expr contains nullable Boolean expressions that refer to the other numeric columns in the same DataFrame (a and b).
I would like to derive a column eval(expr) that evaluates the Boolean expression expr row-wise, i.e.,
+---+---+---------------+----------+
| a| b| expr|eval(expr)|
+---+---+---------------+----------+
| 0| 0|a = 1 AND b = 0| false|
| 0| 1| a = 0| true|
| 1| 0|a = 1 AND b = 1| false|
| 1| 1|a = 1 AND b = 1| true|
| 1| 1| null| true|
| 1| 1| a = 0 OR b = 1| true|
+---+---+---------------+----------+
(in particular, although this is an optional specification, null evaluates to true).
Question
What's the best way to create eval(expr)?
That is, how can I create a column in a Spark DataFrame that evaluates a column of Boolean expressions that refer to other columns in the DataFrame?
I have some not-quite-satisfactory solutions below. They assume the following DataFrame in scope:
val df: DataFrame = Seq(
(0, 0, "a = 1 AND b = 0"),
(0, 1, "a = 0"),
(1, 0, "a = 1 AND b = 1"),
(1, 1, "a = 1 AND b = 1"),
(1, 1, null),
(1, 1, "a = 0 OR b = 1")
).toDF("a", "b", "expr")
Solution 1
Create a large global expression out of the individual expressions:
val exprs: Column = concat(
df.columns
.filter(_ != "expr")
.zipWithIndex
.flatMap {
case (name, i) =>
if (i == 0)
Seq(lit(s"($name = "), col(name))
else
Seq(lit(s" AND $name = "), col(name))
} :+ lit(" AND (") :+ col("expr") :+ lit("))"): _*
)
// exprs: org.apache.spark.sql.Column = concat((a = , a, AND b = , b, AND (, expr, )))
val bigExprString = df.select(exprs).na.drop.as[String].collect.mkString(" OR ")
// bigExprString: String = (a = 0 AND b = 0 AND (a = 1 AND b = 0)) OR (a = 0 AND b = 1 AND (a = 0)) OR (a = 1 AND b = 0 AND (a = 1 AND b = 1)) OR (a = 1 AND b = 1 AND (a = 1 AND b = 1)) OR (a = 1 AND b = 1 AND (a = 0 OR b = 1))
val result: DataFrame = df.withColumn("eval(expr)", expr(bigExprString))
The downside here is the resulting string is very large. In my actual use case, it would be many tens of thousands of characters long, if not longer. I'm not too sure whether this would cause problems.
Solution 2
Separate the DataFrame into multiple based on the value of the expression column, operate on each individually, and recombine into one DataFrame.
val exprs: Seq[String] = df.select("expr").distinct.as[String].collect
// exprs: Seq[String] = WrappedArray(a = 1 AND b = 1, a = 1 AND b = 0, null, a = 0, a = 0 OR b = 1)
val result: DataFrame = exprs.map(e =>
df.filter(col("expr") <=> e) // null-safe equality, so the rows with a null expression are kept
.withColumn("eval(expr)", if (e == null) lit(true) else when(expr(e), true).otherwise(false))
).reduce(_.union(_))
result.show()
I think the downside of this approach is that it creates many intermediate tables (one for each distinct expression). In my actual use case, this count is potentially hundreds or thousands.
Using this answer the scala.tools.reflect.ToolBox can be used to evaluate the expression after transforming it into a valid Scala expression:
case class Result(a: Integer, b: Integer, expr: String, result: Boolean)
df.mapPartitions(it => {
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(this.getClass.getClassLoader).mkToolBox()
val res = it.map(r => {
val a = r.getInt(0)
val b = r.getInt(1)
val expr = r.getString(2)
val exprResult =
if ( expr == null) {
true
}
else {
val scalaExpr = expr.replace("=", "==").replace("AND", "&").replace("OR", "|")
val scalaExpr2 = s"var a=${a}; var b=${b}; ${scalaExpr}"
tb.eval(tb.parse(scalaExpr2)).asInstanceOf[Boolean]
}
Result(a, b, expr, exprResult)
})
res
}).show()
Output:
+---+---+---------------+------+
| a| b| expr|result|
+---+---+---------------+------+
| 0| 0|a = 1 AND b = 0| false|
| 0| 1| a = 0| true|
| 1| 0|a = 1 AND b = 1| false|
| 1| 1|a = 1 AND b = 1| true|
| 1| 1| null| true|
| 1| 1| a = 0 OR b = 1| true|
+---+---+---------------+------+
I am using mapPartitions here instead of a simple udf because initializing the toolbox takes some time. Instead of being initialized once per row, it is now initialized only once per partition.
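The rewriting step that turns the SQL-like expression into Scala source can be looked at on its own, without Spark or the toolbox. A minimal sketch, where `ExprTranslation` and `toScalaSource` are hypothetical names and the substitutions are the ones used in the answer:

```scala
object ExprTranslation {
  // Builds the Scala source string that would be handed to the toolbox for one row.
  def toScalaSource(a: Int, b: Int, expr: String): String = {
    // Same textual substitutions as in the answer: "=" -> "==", "AND" -> "&", "OR" -> "|".
    val scalaExpr = expr.replace("=", "==").replace("AND", "&").replace("OR", "|")
    s"var a=$a; var b=$b; $scalaExpr"
  }
}

// ExprTranslation.toScalaSource(1, 0, "a = 1 AND b = 1")
// returns "var a=1; var b=0; a == 1 & b == 1"
```

Because these are plain substring replacements, an expression containing `>=`, `<=` or `!=`, or a column name containing `AND` or `OR`, would be mangled. The approach relies on the expressions staying as simple as the ones in the example.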
I have a property table, held in a DataFrame, like the one below.
Its columns-to-rename column tells me which columns of the main table to rename.
If the cust_id flag is yes, I want to join with the customer table.
In the final output I want to show the hashed column values under the actual column names.
val maintab_df = maintable
val cust_df = customertable
Joining main table and customer table after renaming the main table column e to a.
maintable.a = customertable.a
Here's an example of how to do it:
propertydf.show
+-----------------+------------+
|columns-to-rename|cust_id_flag|
+-----------------+------------+
|(e to a),(d to b)| Y|
+-----------------+------------+
val columns_to_rename = propertydf.head(1)(0).getAs[String]("columns-to-rename")
val cust_id_flag = propertydf.head(1)(0).getAs[String]("cust_id_flag")
val parsed_columns = columns_to_rename.split(",")
.map(c => c.replace("(", "").replace(")", "")
.split(" to "))
// parsed_columns: Array[Array[String]] = Array(Array(e, a), Array(d, b))
val rename_columns = maintab_df.columns.map(c => {
val matched = parsed_columns.filter(p => c == p(0))
if (matched.size != 0)
col(c).as(matched(0)(1).toString)
else
col(c)
})
// rename_columns: Array[org.apache.spark.sql.Column] = Array(e AS `a`, f, c, d AS `b`)
val select_columns = maintab_df.columns.map(c => {
val matched = parsed_columns.filter(p => c == p(0))
if (matched.size != 0)
col(matched(0)(1) + "_hash").as(matched(0)(1).toString)
else
col(c)
})
// select_columns: Array[org.apache.spark.sql.Column] = Array(a_hash AS `a`, f, c, b_hash AS `b`)
val join_cond = parsed_columns.map(_(1))
// join_cond: Array[String] = Array(a, b)
// if/else is an expression, so result is defined once and stays in scope for show
val result = if (cust_id_flag == "Y") {
maintab_df.select(rename_columns:_*)
.join(cust_df, join_cond)
.select(select_columns:_*)
} else {
maintab_df
}
result.show
+------+---+---+--------+
| a| f| c| b|
+------+---+---+--------+
|*****!| 1| 11| &&&&|
| ****%| 2| 12|;;;;;;;;|
|*****#| 3| 13| \\\\\\|
+------+---+---+--------+
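The parsing of the `columns-to-rename` value is plain string handling and can be tried without Spark. A small sketch, with `RenameSpec` and `parse` as hypothetical names, assuming the value always has the `(old to new),(old to new)` shape shown above:

```scala
object RenameSpec {
  // Turns "(e to a),(d to b)" into Map("e" -> "a", "d" -> "b").
  def parse(spec: String): Map[String, String] =
    spec
      .split(",")
      .map(_.trim.stripPrefix("(").stripSuffix(")").split(" to "))
      .collect { case Array(from, to) => from.trim -> to.trim } // skip malformed entries
      .toMap
}
```

With the result bound to a value, say `renames`, a lookup such as `renames.getOrElse(c, c)` gives the new name for a column `c`, or the unchanged name when no rename is configured.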
So Dataframe.where can be used to filter a dataframe for the rows given by an expression, like this:
df.where(($"group_id" === 1234) || ($"group_id" === 4434))
or to give a more complex example
df.where(($"group_id" === 1234 && $"country" === "PL") || ($"group_id" === 4434 && $"country" === "FR"))
I am interested in whether I can supply these conditions as a list. Suppose I have a list of (group_id, country) pairs, List((1234, "PL"), (4434, "FR"), ....); I would then like to filter the dataframe efficiently with it.
You can try something like this:
val df = Seq((1,"a"),(2,"b"),(3,"c")).toDF()
df.show()
+---+---+
| _1| _2|
+---+---+
| 1| a|
| 2| b|
| 3| c|
+---+---+
val list = List((1,"a"),(3,"c"))
val cols = List("_1","_2")
def mkCol(values: List[(Any, Any)], columns: List[String]) =
values.map(s => col(columns(0)) === s._1 && col(columns(1)) === s._2)
.reduce((a, b) => a.or(b))
val condition = mkCol(list, cols) // named condition so that it does not shadow functions.col
condition.explain(true)
((('_1 = 1) && ('_2 = a)) || (('_1 = 3) && ('_2 = c)))
df.where(condition).show()
+---+---+
| _1| _2|
+---+---+
| 1| a|
| 3| c|
+---+---+
Given a dataframe, say that it contains 4 columns and 3 rows. I want to write a function to return the columns where all the values in that column are equal to 1.
This is Scala code. I want to use Spark transformations to transform or filter the input dataframe, and the filter should be implemented in a function.
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return a dataframe with 3 rows and 2 columns in which every value is 1.
A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column
val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1)) // as[Grade] needs import spark.implicits._
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
val withoutNullsCount = tmp.select(column).na.drop().count()
if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}
dfInput.select(colsToRetain.toArray:_*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case class:
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
def dropWhenNotEqualsTo(n: Integer): Grade = {
Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
}
def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade in which the values that do not satisfy the condition are replaced by nulls
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> keep the column only if it contains no nulls; otherwise it is dropped
One option is to reduce over the RDD:
import org.apache.spark.sql.Row
import spark.implicits._
val df= Seq(("1","A","3","4"),("1","2","?","4"),("1","2","3","4")).toDF()
df.show()
val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"
def rowToArray(row: Row): Array[String] = {
val arr = new Array[String](row.length)
for (i <- 0 to row.length-1){
arr(i) = row.getString(i)
}
arr
}
def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
val arr = new Array[String](a1.length)
for (i <- 0 to a1.length-1){
arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
}
arr
}
val diff = df.rdd
.map(rowToArray)
.reduce(compareArrays)
val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s=>df(s._1))
df.select(cols:_*).show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| A| 3| 4|
| 1| 2| ?| 4|
| 1| 2| 3| 4|
+---+---+---+---+
+---+
| _1|
+---+
| 1|
| 1|
| 1|
+---+
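The column-wise reduce can also be tried on a local collection, which shows the combining step without an RDD. This is a sketch under assumptions: `AllOnesColumns` and `allEqualTo` are hypothetical names, rows are plain sequences of strings, and boolean flags replace the "#" marker used above.

```scala
object AllOnesColumns {
  // One flag per column: true when every row holds `target` in that position.
  // Assumes `rows` is non-empty and all rows have the same length.
  def allEqualTo(rows: Seq[Seq[String]], target: String): Seq[Boolean] =
    rows
      .map(row => row.map(_ == target)) // per-row flags
      .reduce((x, y) => x.zip(y).map { case (p, q) => p && q }) // AND the flags column by column
}
```

Mapping each row to flags before reducing means a single-row input is handled too, since `reduce` returns its only element untouched in that case.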
I would try to prepare the dataset so that it contains no nulls before processing. With only a few columns, this simple iterative approach might work fine (don't forget to import the Spark implicits first, with import spark.implicits._):
val example = spark.sparkContext.parallelize(Seq(
Grade(1,3,1,1),
Grade(1,1,0,1),
Grade(1,10,2,1)
)).toDS().cache()
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
val row = ds.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
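If nulls are unavoidable, use an untyped Dataset (that is, a DataFrame):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val schema = StructType(Seq(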
StructField("c1", IntegerType, nullable = true),
StructField("c2", IntegerType, nullable = true),
StructField("c3", IntegerType, nullable = true),
StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
Row(1,3,1,1),
Row(1,1,null,1),
Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()
def allOnes(colName: String, df: DataFrame): Boolean = {
val rows = df.select(colName).distinct().collect()
// isNullAt guards against a column that holds only nulls, where getInt would throw
rows.length == 1 && !rows.head.isNullAt(0) && rows.head.getInt(0) == 1
}
val resultColumns= dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()
Let's say I have following dataframe:
/*
+---------+--------+----------+--------+
|a |b | c |d |
+---------+--------+----------+--------+
| bob| -1| 5| -1|
| alice| -1| -1| -1|
+---------+--------+----------+--------+
*/
I want to remove columns which only have -1 in all rows (in this case b and d). I found a solution but when I run my job I found out it was very inefficient:
private def removeEmptyColumns(df: DataFrame): DataFrame = {
val types = List("IntegerType", "DoubleType", "LongType")
val dTypes: Array[(String, String)] = df.dtypes
dTypes.foldLeft(df)((d, t) => {
val colType = t._2
val colName = t._1
if (types.contains(colType)) {
if (colType.equals("IntegerType")) {
if (d.select(colName).filter(col(colName) =!= -1).take(1).length == 0) d.drop(colName)
else d
} else if (colType.equals("DoubleType")) {
if (d.select(colName).filter(col(colName) =!= -1.0).take(1).length == 0) d.drop(colName)
else d
} else {
if (d.select(colName).filter(col(colName) =!= -1).take(1).length == 0) d.drop(colName)
else d
}
} else {
d
}
})
}
Is there a better solution or way to improve my existing code?
(I think this line val count = d.select(colName).distinct.count is the bottleneck)
I am using Spark 2.2 atm.
Many thanks
Instead of counting the number of distinct values, check whether any value other than -1 exists:
d.select(colName).filter(col(colName) =!= -1).take(1).length == 0
Another approach
Instead of going through the dataframe n times (once for each column), you can try to collect the statistics all at once:
val summary = d.agg(
max(col1).as(s"${col1}_max"), min(col1).as(s"${col1}_min"),
max(col2).as(s"${col2}_max"), min(col2).as(s"${col2}_min"),
...)
.first
Then check, for each column, whether both the min and the max are -1.
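The min/max check can be illustrated on a local collection, with no Spark involved. A sketch under assumptions: `SentinelColumns` and `onlySentinel` are hypothetical names, and rows are plain sequences of `Long` values standing in for the numeric columns.

```scala
object SentinelColumns {
  // Tracks (min, max) per column position in a single pass over the rows,
  // then reports whether both equal `sentinel`. Assumes a non-empty input
  // whose rows all have the same length.
  def onlySentinel(rows: Seq[Seq[Long]], sentinel: Long): Seq[Boolean] = {
    val start = rows.head.map(v => (v, v))
    val minMax = rows.tail.foldLeft(start) { (acc, row) =>
      acc.zip(row).map { case ((lo, hi), v) => (math.min(lo, v), math.max(hi, v)) }
    }
    minMax.map { case (lo, hi) => lo == sentinel && hi == sentinel }
  }
}
```

A column whose min and max both equal the sentinel can hold no other value, so one pass over the data is enough to decide for every column at once.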