How to count a field with a condition in Spark (Scala)

I have a DataFrame with an enum field named `A` (values are 0 or 1) and another field `B`. I would like to implement the following scenario:
if `B` is null:
count(when `A` is 0) and put it in a column named `xx`
count(when `A` is 1) and put it in a column named `yy`
if `B` is not null:
count(when `A` is 0) and put it in a column named `zz`
count(when `A` is 1) and put it in a column named `mm`
How can I do this with Spark in Scala?

It's possible to conditionally populate columns in this way, but the output DataFrame needs a fixed schema, so all four columns have to exist for every row.
Assuming all of the scenarios you detailed can occur in one DataFrame, I would suggest creating each of the four columns ("xx", "yy", "zz" and "mm") and conditionally populating them.
In the example below I've populated the values with either "found" or "", primarily to make it easy to see where the values are populated. Using true and false here, or another enum, would likely make more sense in the real world.
Starting with a DataFrame (since you didn't specify the type of "B", I have gone for a nullable Option[String] in this example):
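import org.apache.spark.sql.functions.{col, when}
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

val df = List(
  (0, None),
  (1, None),
  (0, Some("hello")),
  (1, Some("world"))
).toDF("A", "B")
df.show(false)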
gives:
+---+-----+
|A  |B    |
+---+-----+
|0  |null |
|1  |null |
|0  |hello|
|1  |world|
+---+-----+
and to create the columns:
df
  .withColumn("xx", when(col("B").isNull && col("A") === 0, "found").otherwise(""))
  .withColumn("yy", when(col("B").isNull && col("A") === 1, "found").otherwise(""))
  .withColumn("zz", when(col("B").isNotNull && col("A") === 0, "found").otherwise(""))
  .withColumn("mm", when(col("B").isNotNull && col("A") === 1, "found").otherwise(""))
  .show(false)
gives:
+---+-----+-----+-----+-----+-----+
|A  |B    |xx   |yy   |zz   |mm   |
+---+-----+-----+-----+-----+-----+
|0  |null |found|     |     |     |
|1  |null |     |found|     |     |
|0  |hello|     |     |found|     |
|1  |world|     |     |     |found|
+---+-----+-----+-----+-----+-----+
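The flag columns mark which rows match each case, but they do not count anything. If actual counts are what is needed, one option is a conditional aggregation. The following is a minimal sketch on the same sample data, assuming Spark is on the classpath; the local SparkSession setup and the app name are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

// Local session for the sketch; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("conditional-count").getOrCreate()
import spark.implicits._

val df = List[(Int, Option[String])](
  (0, None),
  (1, None),
  (0, Some("hello")),
  (1, Some("world"))
).toDF("A", "B")

// when(...) without otherwise(...) yields null for non-matching rows,
// and count(...) skips nulls, so each aggregate counts only its own case.
val counts = df.agg(
  count(when(col("B").isNull && col("A") === 0, 1)).as("xx"),
  count(when(col("B").isNull && col("A") === 1, 1)).as("yy"),
  count(when(col("B").isNotNull && col("A") === 0, 1)).as("zz"),
  count(when(col("B").isNotNull && col("A") === 1, 1)).as("mm")
)
counts.show(false)
```

The result is a single row with the four counts. Adding a groupBy before agg would give the same counts per group.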

Related

Scala -- apply a custom if-then on a dataframe

I have this kind of dataset:
val cols = Seq("col_1", "col_2")
val data = List(("a", 1),
                ("b", 1),
                ("a", 2),
                ("c", 3),
                ("a", 3))
val df = spark.createDataFrame(data).toDF(cols: _*)
+-----+-----+
|col_1|col_2|
+-----+-----+
|a    |1    |
|b    |1    |
|a    |2    |
|c    |3    |
|a    |3    |
+-----+-----+
I want to add an if-then column based on the existing columns.
df
  .withColumn("col_new",
    when(col("col_2").isin(2, 5), "str_1")
      .when(col("col_2").isin(4, 6), "str_2")
      .when(col("col_2").isin(1) && col("col_1").contains("a"), "str_3")
      .when(col("col_2").isin(3) && col("col_1").contains("b"), "str_1")
      .when(col("col_2").isin(1, 2, 3), "str_4")
      .otherwise(lit("other")))
Instead of the list of when-then statements, I would prefer to apply a custom function. In Python I would run a lambda & map.
thank you!
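One way to apply a custom function, roughly what a Python lambda with map would do, is to wrap a plain Scala function in a UDF. The sketch below mirrors the when chain from the question; the names classify and classifyUdf are hypothetical, the session setup is an assumption, and the function assumes col_1 is never null. Built-in when expressions are usually easier for Spark to optimize than UDFs, so this is mainly worthwhile when the logic is awkward to express with column functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for the sketch; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("custom-if-then").getOrCreate()

val df = spark.createDataFrame(List(("a", 1), ("b", 1), ("a", 2), ("c", 3), ("a", 3)))
  .toDF("col_1", "col_2")

// Hypothetical helper: the same rules as the when chain, checked top to bottom.
val classify = (c1: String, c2: Int) => (c1, c2) match {
  case (_, 2 | 5)                => "str_1"
  case (_, 4 | 6)                => "str_2"
  case (s, 1) if s.contains("a") => "str_3"
  case (s, 3) if s.contains("b") => "str_1"
  case (_, 1 | 2 | 3)            => "str_4"
  case _                         => "other"
}

// Wrap the plain function so it can be applied to columns.
val classifyUdf = udf(classify)

val result = df.withColumn("col_new", classifyUdf(col("col_1"), col("col_2")))
result.show(false)
```

Because classify is an ordinary function, it can be unit-tested without a Spark session.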

How to efficiently map over DF and use combination of outputs?

Given a DF, let's say I have 3 classes, each with a method addCol that uses the columns in the DF to create and append a new column (based on different calculations).
What is the best way to get a resulting DF that contains the original DF and the 3 added columns?
val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")
// Each addCol below belongs to a different class.
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method1", col("num1") / col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method2", col("num1") * col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method3", col("num1") + col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is a more efficient way to do this?
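An efficient way to do this is to use select.
Each withColumn call adds a projection to the query plan, so a long foldLeft chain can produce a very large plan that is slow to analyze, whereas a single select adds all the columns in one projection. Both are lazy transformations, so the foldLeft version still runs distributed once an action is triggered. Check this post for details.
You can build the required expressions and use them inside select; see the code below.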
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1   |2   |
|2   |5   |
|3   |7   |
+----+----+
scala> val colExpr = Seq(
         $"num1",
         $"num2",
         ($"num1" / $"num2").as("method1"),
         ($"num1" * $"num2").as("method2"),
         ($"num1" + $"num2").as("method3")
       )
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
Update
Return a Column instead of a DataFrame. Using a higher-order function, all three of your functions can be replaced with the single function below.
scala> def add(
         num1: Column, // you could use varargs here instead, if you want
         num2: Column,
         f: (Column, Column) => Column
       ): Column = f(num1, num2)
For example, with varargs the function comes first and the required columns are passed at the end:
def add(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)
Invoking the add function (the calls below use the first, three-argument signature):
scala> val colExpr = Seq(
         $"num1",
         $"num2",
         add($"num1", $"num2", (_ / _)).as("method1"),
         add($"num1", $"num2", (_ * _)).as("method2"),
         add($"num1", $"num2", (_ + _)).as("method3")
       )
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
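To tie this back to the "3 classes" in the question, each class could expose a Column rather than a DataFrame transformation, so that all the new columns end up in a single select. This is only a sketch: the ColumnProvider trait and the Method1, Method2 and Method3 objects are hypothetical names, and the session setup is an assumption.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Local session for the sketch; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("select-columns").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// Hypothetical interface: each class describes its column instead of transforming the DataFrame.
trait ColumnProvider { def newCol: Column }

object Method1 extends ColumnProvider {
  val newCol: Column = (col("num1") / col("num2")).as("method1")
}
object Method2 extends ColumnProvider {
  val newCol: Column = (col("num1") * col("num2")).as("method2")
}
object Method3 extends ColumnProvider {
  val newCol: Column = (col("num1") + col("num2")).as("method3")
}

val actions: Seq[ColumnProvider] = Seq(Method1, Method2, Method3)

// One projection: all original columns plus every provided column.
val result = df.select(col("*") +: actions.map(_.newCol): _*)
result.show(false)
```

Adding a fourth calculation then only means adding another ColumnProvider to the sequence.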

scala: get column name corresponding to max column value from variable columns list

I have the following working solution in a databricks notebook as test.
var maxcol = udf((col1: Long, col2: Long, col3: Long) => {
  var res = ""
  if (col1 > col2 && col1 > col3) res = "col1"
  else if (col2 > col1 && col2 > col3) res = "col2"
  else res = "col3"
  res
})
val someDF = Seq(
  (8, 10, 12, "bat"),
  (64, 61, 59, "mouse"),
  (-27, -30, -15, "horse")
).toDF("number1", "number2", "number3", "word")
  .withColumn("maxColVal", greatest("number1", "number2", "number3"))
  .withColumn("maxColVal_Name", maxcol(col("number1"), col("number2"), col("number3")))
display(someDF)
Is there any way to make this generic? My use case is to pass a variable list of columns to this UDF and still get, as output, the name of the column holding the max value,
unlike above, where I have hard-coded the column names 'col1', 'col2' and 'col3' in the UDF.
Use the code below:
val df = List((1,2,3,5,"a"),(4,2,3,1,"a"),(1,20,3,1,"a"),(1,22,22,2,"a")).toDF("mycol1","mycol2","mycol3","mycol4","mycol5")
// list all the columns among which you want to find the max value
val colGroup = List(df("mycol1"), df("mycol2"), df("mycol3"), df("mycol4"))
// alternating (column value, column name) entries; map() turns them into a value -> name map
val colGroupMap = List(df("mycol1"), lit("mycol1"),
                       df("mycol2"), lit("mycol2"),
                       df("mycol3"), lit("mycol3"),
                       df("mycol4"), lit("mycol4"))
// Note: rows with equal values produce duplicate map keys. Spark 3.0+ rejects these
// by default unless spark.sql.mapKeyDedupPolicy is set to LAST_WIN.
val maxcol = udf((colVal: Map[Int, String]) => {
  colVal.max._2 // the entry with the largest key (column value) carries the column name
})
df.withColumn("maxColValue", greatest(colGroup: _*)).withColumn("maxColVal_Name", maxcol(map(colGroupMap: _*))).show(false)
+------+------+------+------+------+-----------+--------------+
|mycol1|mycol2|mycol3|mycol4|mycol5|maxColValue|maxColVal_Name|
+------+------+------+------+------+-----------+--------------+
|1     |2     |3     |5     |a     |5          |mycol4        |
|4     |2     |3     |1     |a     |4          |mycol1        |
|1     |20    |3     |1     |a     |20         |mycol2        |
|1     |22    |22    |2     |a     |22         |mycol3        |
+------+------+------+------+------+-----------+--------------+
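A UDF-free variant is also possible with built-in column functions, which sidesteps the map keyed by column values. The following is a sketch rather than the answer's method: the candidates list and the local session setup are assumptions, and on ties it returns the first matching column in list order (mycol2 for the last row, not mycol3 as in the output above).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, greatest, lit, when}

// Local session for the sketch; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("max-column-name").getOrCreate()
import spark.implicits._

val df = List((1, 2, 3, 5, "a"), (4, 2, 3, 1, "a"), (1, 20, 3, 1, "a"), (1, 22, 22, 2, "a"))
  .toDF("mycol1", "mycol2", "mycol3", "mycol4", "mycol5")

// Names of the columns to compare; this list can be built dynamically.
val candidates = Seq("mycol1", "mycol2", "mycol3", "mycol4")

val maxVal = greatest(candidates.map(c => col(c)): _*)

// when(...) without otherwise(...) is null unless the column equals the max;
// coalesce then keeps the first matching name in list order.
val maxName = coalesce(candidates.map(c => when(col(c) === maxVal, lit(c))): _*)

val result = df
  .withColumn("maxColValue", maxVal)
  .withColumn("maxColVal_Name", maxName)
result.show(false)
```

Reordering candidates changes which name wins on a tie.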

How to count the frequency of words with CountVectorizer in spark ML?

The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How can I get the total count of each word, like this:
+---+------+------+
|id |words |counts|
+---+------+------+
|0  |a     |3     |
|1  |b     |3     |
|2  |c     |2     |
+---+------+------+
Shankar's answer (explode and groupBy) counts the raw words, so it only matches the CountVectorizer output if the model keeps every single word in the corpus (e.g. no minDF or vocabSize limits). When the vocabulary is limited, you can instead use Summarizer to sum the feature vectors directly. Note: Summarizer requires Spark 2.3+.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer.metrics
// You need to select normL1 and another item (like mean) because, for some reason, Spark
// won't allow one Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
  .select(metrics("normL1", "mean").summary($"features").as("summary"))
  .select("summary.normL1", "summary.mean")
  .as[(Vector, Vector)]
  .first()
  ._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
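The zip step might look like the sketch below, which repeats the model fitting so that it runs on its own. It assumes spark-mllib is on the classpath; the wordCounts name and the session setup are assumptions. It relies on cvModel.vocabulary(i) being the word for index i of the feature vectors, and on normL1 equalling the sum because counts are never negative.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("word-totals").getOrCreate()
import spark.implicits._

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)

// Sum the per-row count vectors (the L1 norm equals the sum for non-negative counts).
val totalCounts: Vector = cvModel.transform(df)
  .select(Summarizer.metrics("normL1", "mean").summary($"features").as("summary"))
  .select("summary.normL1", "summary.mean")
  .as[(Vector, Vector)]
  .first()
  ._1

// Pair each vocabulary word with the total at the same vector index.
val wordCounts: Array[(String, Long)] =
  cvModel.vocabulary.zip(totalCounts.toArray).map { case (word, total) => (word, total.toLong) }

wordCounts.foreach { case (word, total) => println(s"$word -> $total") }
```

Words dropped from the vocabulary by minDF or vocabSize do not appear in wordCounts at all.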
You can simply explode and groupBy to get the count of each word:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, explode, row_number}
cvModel.transform(df).withColumn("words", explode($"words"))
  .groupBy($"words")
  .agg(count($"words").as("counts"))
  .withColumn("id", row_number().over(Window.orderBy("words")) - 1)
  .show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a    |3     |0  |
|b    |3     |1  |
|c    |2     |2  |
+-----+------+---+

Convert every value of a dataframe

I need to modify the values of every column of a dataframe so that they are all enclosed in double quotes after mapping, while the dataframe still retains its original structure with the headers.
I tried mapping the values by converting the rows to sequences, but the output dataframe loses its headers.
With this read in as the input dataframe:
|prodid|name   |city|
+------+-------+----+
|1     |Harshit|VNS |
|2     |Mohit  |BLR |
|2     |Mohit  |RAO |
|2     |Mohit  |BTR |
|3     |Rohit  |BOM |
|4     |Shobhit|KLK |
I tried the following code.
val columns = df.columns
df.map { row =>
  row.toSeq.map { col => "\"" + col + "\"" }
}.toDF(columns: _*)
But it throws an error stating there's only 1 header, i.e. value, in the mapped dataframe.
This is the actual result (if I remove ".toDF(columns:_*)"):
| value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+
And my expected result is something like:
|prodid|name     |city  |
+------+---------+------+
|"1"   |"Harshit"|"VNS" |
|"2"   |"Mohit"  |"BLR" |
|"2"   |"Mohit"  |"RAO" |
|"2"   |"Mohit"  |"BTR" |
|"3"   |"Rohit"  |"BOM" |
|"4"   |"Shobhit"|"KLK" |
Note: There are only 3 headers in this example, but my original data has a lot of headers, so manually typing each of them is not an option in case the file header changes. How do I get this modified-value dataframe from that?
Edit: What if I need the quotes on all values except the integers, so that the output is something like:
|prodid|name     |city  |
+------+---------+------+
|1     |"Harshit"|"VNS" |
|2     |"Mohit"  |"BLR" |
|2     |"Mohit"  |"RAO" |
|2     |"Mohit"  |"BTR" |
|3     |"Rohit"  |"BOM" |
|4     |"Shobhit"|"KLK" |
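It might be easier to use select instead:
import org.apache.spark.sql.functions.{col, format_string}
import org.apache.spark.sql.types.{IntegerType, StructField}
val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
  .toDF("prodid", "name", "city")
df.select(df.schema.fields.map {
  // leave integer columns untouched
  case StructField(name, IntegerType, _, _) => col(name)
  // wrap every other column's value in double quotes, keeping the column name
  case StructField(name, _, _, _) => format_string("\"%s\"", col(name)) as name
}: _*).show()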
Output:
+------+---------+-----+
|prodid|     name| city|
+------+---------+-----+
|     1|"Harshit"|"VNS"|
|     2|  "Mohit"|"BLR"|
+------+---------+-----+
Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle these too, or alternatively quote only StringType columns.
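The "quote only StringType" alternative could be sketched as follows. The price column is a hypothetical addition to the sample data, included only to show that a non-integer numeric column is left untouched; the session setup is likewise an assumption.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, format_string}
import org.apache.spark.sql.types.{StringType, StructField}

// Local session for the sketch; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("quote-strings").getOrCreate()
import spark.implicits._

// "price" is a hypothetical extra column of type Double.
val df = Seq((1, "Harshit", "VNS", 9.5), (2, "Mohit", "BLR", 12.0))
  .toDF("prodid", "name", "city", "price")

val quoted = df.select(df.schema.fields.map {
  // quote string columns, keeping the original column name
  case StructField(name, StringType, _, _) => format_string("\"%s\"", col(name)).as(name)
  // pass every other type through unchanged
  case StructField(name, _, _, _)          => col(name)
}: _*)
quoted.show(false)
```

Matching on StringType avoids having to enumerate every numeric type that should stay unquoted.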