Improve if condition with Scala

I wrote this:
if (fork == "0" || fork == "1" || fork == "3" || fork == "null") {
  list2 :: List(
    Wrapper(
      Location.PL_TYPES,
      subType,
      daFuncId,
      NA,
      name,
      code)
  )
} else {
  list2 :: List(
    Wrapper(
      Location.PL_TYPES,
      subType,
      NA,
      NA,
      name,
      code)
  )
}
I want to improve this by replacing the if-else with another pattern.
Best regards

It seems only the ID differs between the two cases. You could use pattern matching to choose the id first, and build the list afterwards so you don't repeat the Wrapper construction:
val id = fork match {
  case "0" | "1" | "3" | "null" => daFuncId
  case _ => NA
}

list2 :: List(
  Wrapper(
    Location.PL_TYPES,
    subType,
    id,
    NA,
    name,
    code)
)
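One thing worth double-checking, independent of the pattern match: list2 :: List(...) prepends list2 itself as a single element of the new list, which widens the element type if list2 is a List[Wrapper]. If the intent was to add the new Wrapper to list2, something like this is probably closer (an assumption about the intent):
// prepend the new element to list2
Wrapper(Location.PL_TYPES, subType, id, NA, name, code) :: list2
// or append it
list2 :+ Wrapper(Location.PL_TYPES, subType, id, NA, name, code)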

You can write the same if-else condition using pattern matching in Scala.
fork match {
  case "0" | "1" | "3" | "null" =>
    list2 :: List(
      Wrapper(
        Location.PL_TYPES,
        subType,
        daFuncId,
        NA,
        name,
        code)
    )
  case _ =>
    list2 :: List(
      Wrapper(
        Location.PL_TYPES,
        subType,
        NA,
        NA,
        name,
        code)
    )
}
Please let me know if this works out for you.

list2 :: List(fork)
  .map {
    case "0" | "1" | "3" | "null" => daFuncId
    case _ => NA
  }
  .map { id =>
    Wrapper(Location.PL_TYPES, subType, id, NA, name, code)
  }

Not really Scala-specific, but I'd suggest something like this:
if (List("0", "1", "3", "null").contains(fork)) {
} else {
}
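For completeness, a sketch of how this check could be combined with the single Wrapper construction shown in the first answer above (knownForks is just a name I introduced):
val knownForks = Set("0", "1", "3", "null")
val id = if (knownForks.contains(fork)) daFuncId else NA
list2 :: List(Wrapper(Location.PL_TYPES, subType, id, NA, name, code))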

Related

Spark - Drop null values from map column

I'm using Spark to read a CSV file and then gather all the fields to create a map. Some of the fields are empty and I'd like to remove them from the map.
So for a CSV that looks like this:
"animal", "colour", "age"
"cat" , "black" ,
"dog" , , "3"
I'd like to get a dataset with the following maps:
Map("animal" -> "cat", "colour" -> "black")
Map("animal" -> "dog", "age" -> "3")
This is what I have so far:
val csv_cols_n_vals: Array[Column] = csv.columns.flatMap { c => Array(lit(c), col(c)) }
sparkSession.read
.option("header", "true")
.csv(csvLocation)
.withColumn("allFieldsMap", map(csv_cols_n_vals: _*))
I've tried a few variations, but I can't seem to find the correct solution.
There is most certainly a better and more efficient way using the DataFrame API, but here is a map/flatMap solution:
val df = Seq(("cat", "black", null), ("dog", null, "3")).toDF("animal", "colour", "age")
val cols = df.columns
df.map(r => {
  cols.flatMap(c => {
    val v = r.getAs[String](c)
    if (v != null) {
      Some(Map(c -> v))
    } else {
      None
    }
  }).reduce(_ ++ _)
}).toDF("map").show(false)
Which produces:
+--------------------------------+
|map |
+--------------------------------+
|[animal -> cat, colour -> black]|
|[animal -> dog, age -> 3] |
+--------------------------------+
scala> df.show(false)
+------+------+----+
|animal|colour|age |
+------+------+----+
|cat |black |null|
|dog |null |3 |
+------+------+----+
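As an aside on the "better and more efficient way using the DataFrame API": since Spark 3.0 there is a map_filter higher-order function that can drop the null entries directly. A sketch against the question's pipeline (assuming Spark 3.0+, with csv_cols_n_vals and csvLocation as defined in the question):
import org.apache.spark.sql.functions.{map, map_filter}

sparkSession.read
  .option("header", "true")
  .csv(csvLocation)
  .withColumn("allFieldsMap",
    map_filter(map(csv_cols_n_vals: _*), (k, v) => v.isNotNull)) // keep only non-null values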
Building Expressions
val colExpr = df
  .columns // getting list of columns from dataframe.
  .map { columnName =>
    when(
      col(columnName).isNotNull, // checking if column is not null
      map(
        lit(columnName),
        col(columnName)
      ) // Adding column name and its value inside map
    )
    .otherwise(map())
  }
  .reduce(map_concat(_, _)) // finally using map_concat function to concat map values.
The above code will create the expressions below.
map_concat(
map_concat(
CASE WHEN (animal IS NOT NULL) THEN map(animal, animal) ELSE map() END,
CASE WHEN (colour IS NOT NULL) THEN map(colour, colour) ELSE map() END
),
CASE WHEN (age IS NOT NULL) THEN map(age, age) ELSE map() END
)
Applying colExpr on DataFrame.
scala>
df
.withColumn("allFieldsMap",colExpr)
.show(false)
+------+------+----+--------------------------------+
|animal|colour|age |allFieldsMap |
+------+------+----+--------------------------------+
|cat |black |null|[animal -> cat, colour -> black]|
|dog |null |3 |[animal -> dog, age -> 3] |
+------+------+----+--------------------------------+
Spark-sql solution:
val df = Seq(("cat", "black", null), ("dog", null, "3")).toDF("animal", "colour", "age")
df.show(false)
+------+------+----+
|animal|colour|age |
+------+------+----+
|cat |black |null|
|dog |null |3 |
+------+------+----+
df.createOrReplaceTempView("a_vw")
val cols_str = df.columns.flatMap( x => Array("\"".concat(x).concat("\""),x)).mkString(",")
spark.sql(s"""
  select collect_list(m2) res from (
    select id, key, value, map(key,value) m2 from (
      select id, explode(m) as (key,value) from
        ( select monotonically_increasing_id() id, map(${cols_str}) m from a_vw )
    )
    where value is not null
  ) group by id
""").show(false)
+------------------------------------+
|res |
+------------------------------------+
|[[animal -> cat], [colour -> black]]|
|[[animal -> dog], [age -> 3]] |
+------------------------------------+
Or much shorter
spark.sql(s"""
  select collect_list(case when value is not null then map(key,value) end) res from (
    select id, explode(m) as (key,value) from
      ( select monotonically_increasing_id() id, map(${cols_str}) m from a_vw )
  ) group by id
""").show(false)
+------------------------------------+
|res |
+------------------------------------+
|[[animal -> cat], [colour -> black]]|
|[[animal -> dog], [age -> 3]] |
+------------------------------------+
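Note that collect_list here returns an array of single-entry maps ([[animal -> cat], [colour -> black]]) rather than the single merged map per row that the question asked for. On Spark 2.4+ one way to get one map per row is map_from_entries over collected structs (a sketch, reusing cols_str and the a_vw view from above):
spark.sql(s"""
  select map_from_entries(
           collect_list(case when value is not null then struct(key, value) end)
         ) as allFieldsMap
  from (
    select id, explode(m) as (key, value) from
      ( select monotonically_increasing_id() id, map(${cols_str}) m from a_vw )
  ) group by id
""").show(false)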

Scala: filter with different conditions specified by tuples in a set

I have an RDD with its field1 containing a drug name and field2 containing the corresponding dosage of that drug.
I am trying to filter this RDD based on multiple criteria saved in a set of tuples, like:
val MyCriteria = Set(("drug a", ">", 1.2), ("drug b", ">=", 4.5), ("drug c", "<", 6.3))
I guess what I can do is something like:
val rslt = rdd.filter(x => MyCriteria.foreach(x.field1 == _._1 && x.field2 _._2 _._3))
But I don't know how to convert the 2nd element of the tuple (a string) into an actual operator that Scala understands. It throws this error message:
<console>:1: error: ')' expected but '.' found.
val rslt = rdd.filter(x => MyCriteria.foreach(x.field1 == _._1 && x.field2 _._2 _._3))
^
Or what would be a better way to realize the filter?
It won't work like this; Scala string literals won't be translated into operators.
Instead you need to use functions to compare the values from the RDD inside the filter.
Please see the code example below:
type Compare[T] = (T, T) => Boolean
type DoubleCompare = Compare[Double]

val > : DoubleCompare = _ > _
val < : DoubleCompare = _ < _
val >= : DoubleCompare = _ >= _

val myCriteria: Set[(String, DoubleCompare, Double)] = Set(
  ("drug a", > , 1.2),
  ("drug b", >=, 4.5),
  ("drug c", < , 6.3)
)

rdd.filter { x =>
  val fieldName = x.field1
  val fieldValue = x.field2
  // use exists (not foreach) so the block returns a Boolean for filter
  myCriteria.exists {
    case (filterFieldName, filter, filterValue) =>
      (fieldName == filterFieldName) && filter(fieldValue, filterValue)
  }
}
Hope this helps!
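To see it end to end, here is a small self-contained sketch with a hypothetical record type (field1/field2 as described in the question, and an existing SparkContext sc assumed):
// hypothetical element type standing in for the asker's records
case class Dose(field1: String, field2: Double)

val rdd = sc.parallelize(Seq(
  Dose("drug a", 1.5), // kept:    1.5 > 1.2
  Dose("drug b", 2.0), // dropped: 2.0 is not >= 4.5
  Dose("drug c", 7.0)  // dropped: 7.0 is not < 6.3
))

val kept = rdd.filter { x =>
  myCriteria.exists { case (name, cmp, threshold) =>
    x.field1 == name && cmp(x.field2, threshold)
  }
}
kept.collect() // Array(Dose(drug a,1.5))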
If you can use a DataFrame instead of an RDD, then you can use where with a String expression, which handles all the data types so you don't have to code for them.
import spark.implicits._
val df = spark.createDataset(Seq(("drug a",5),("drug b",4),("drug c",4))).toDF("drug","dose")
val criteria = Set(("drug a", ">", 1.2), ("drug b", ">=", 4.5), ("drug c", "<", 6.3))
df.show()
val criteriaExp = criteria.foldLeft("") { (cstr, cset) =>
  if (cstr == "")
    s"(drug = '${cset._1}' and dose ${cset._2} ${cset._3})"
  else
    s"$cstr or (drug = '${cset._1}' and dose ${cset._2} ${cset._3})"
}
println(criteriaExp)
df.where(criteriaExp).show()
Result
+------+----+
| drug|dose|
+------+----+
|drug a| 5|
|drug b| 4|
|drug c| 4|
+------+----+
criteria string - (drug = 'drug a' and dose > 1.2) or (drug = 'drug b' and dose >= 4.5) or (drug = 'drug c' and dose < 6.3)
+------+----+
| drug|dose|
+------+----+
|drug a| 5|
|drug c| 4|
+------+----+
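If you prefer to stay with typed Column expressions rather than building a SQL string, here is a sketch that maps the operator strings from the criteria set onto Column comparisons (same drug/dose column names as above):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// map each ("drug", "op", threshold) tuple onto a typed Column predicate
def toPredicate(drug: String, op: String, threshold: Double): Column = {
  val dose = col("dose")
  val cmp = op match {
    case ">"  => dose > threshold
    case ">=" => dose >= threshold
    case "<"  => dose < threshold
    case "<=" => dose <= threshold
  }
  col("drug") === drug && cmp
}

val predicate = criteria
  .map { case (drug, op, threshold) => toPredicate(drug, op, threshold) }
  .reduce(_ || _)

df.where(predicate).show()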

Write List of Map data into csv

val rdd = df.rdd.map(line => Row.fromSeq(
  scala.xml.XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(1)).child
    .filter(elem =>
      elem.label == "name1"
        || elem.label == "name2"
        || elem.label == "name3"
        || elem.label == "name4"
    )
    .map(elem => (elem.label -> elem.text)).toList
))
When I do rdd.take(10).foreach(println) on my RDD[Row], it produces output like this:
[(name1, value1), (name2, value2),(name3, value3)]
[(name1, value11), (name2, value22),(name3, value33)]
[(name1, value111), (name2, value222),(name4, value44)]
I want to save this as a CSV with name1..name4 as the header. Can anyone please help with how I can implement this with Apache Spark 2.4.0?
name1 | name2 | name3 | name4
value1 | value2 |value3 | null
value11 | value22 |value33 | null
value111 | value222 |null | value444
I adjusted your example and added some intermediate values to help follow each step:
// define the labels you want:
val labels = Seq("name1", "name2", "name3", "name4")
val result: RDD[Row] = rdd.map { line =>
  // your raw data
  val tuples: List[(String, String)] =
    scala.xml.XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(1)).child
      .filter(elem => labels.contains(elem.label)) // you can use the label list to filter
      .map(elem => (elem.label -> elem.text)).toList // no change here
  val values: Seq[String] =
    labels.map { l =>
      // take the value if there is a matching label
      tuples.find { case (k, _) => k == l }.map(_._2)
        // or just add an empty String
        .getOrElse("")
    }
  // create a Row
  Row.fromSeq(values)
}
Now I am not sure about your exact CSV writer, but in essence you also have to insert the header row as the first row:
[name1, name2, name3, name4]
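Alternatively, if you convert the RDD[Row] to a DataFrame with an explicit schema, Spark can write the header for you. A sketch, assuming a SparkSession named spark and the labels/result values defined above (the output path is hypothetical):
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// give the Rows a schema so Spark can write the header
val schema = StructType(labels.map(l => StructField(l, StringType, nullable = true)))
val resultDf = spark.createDataFrame(result, schema)

resultDf.write
  .option("header", "true")
  .csv("/path/to/output") // hypothetical output path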

Spark DataFrame - drop null values from column

Given a dataframe :
val df = sc.parallelize(Seq(("foo", ArrayBuffer(null,"bar",null)), ("bar", ArrayBuffer("one","two",null)))).toDF("key", "value")
df.show
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(null,bar,null)|
|bar|ArrayBuffer(one, two,null)|
+---+--------------------------+
I'd like to drop null from column value. After removal the dataframe should look like this :
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(bar) |
|bar|ArrayBuffer(one, two) |
+---+--------------------------+
Any suggestions welcome, thanks.
You'll need a UDF here. For example, with a flatMap:
val filterOutNull = udf((xs: Seq[String]) =>
Option(xs).map(_.flatMap(Option(_))))
df.withColumn("value", filterOutNull($"value"))
where external Option with map handles NULL columns:
Option(null: Seq[String]).map(identity)
Option[Seq[String]] = None
Option(Seq("foo", null, "bar")).map(identity)
Option[Seq[String]] = Some(List(foo, null, bar))
and ensures we don't fail with NPE when input is NULL / null by mapping
NULL -> null -> None -> None -> NULL
where null is a Scala null and NULL is a SQL NULL.
The internal flatMap flattens a sequence of Options effectively filtering nulls:
Seq("foo", null, "bar").flatMap(Option(_))
Seq[String] = List(foo, bar)
A more imperative equivalent could be something like this:
val imperativeFilterOutNull = udf((xs: Seq[String]) =>
if (xs == null) xs
else for {
x <- xs
if x != null
} yield x)
Option 1: using UDF:
val filterNull = udf((arr : Seq[String]) => arr.filter((x: String) => x != null))
df.withColumn("value", filterNull($"value")).show()
Option 2: no UDF
df.withColumn("value", explode($"value"))
  .filter($"value".isNotNull)
  .groupBy("key")
  .agg(collect_list($"value"))
  .show()
Note that this is less efficient...
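If you are on Spark 2.4 or later, the filter higher-order function gives a third option with no UDF and no explode (a sketch):
import org.apache.spark.sql.functions.expr

// keep only the non-null elements of the array column
df.withColumn("value", expr("filter(value, x -> x IS NOT NULL)")).show()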
You can also use spark-daria; it has com.github.mrpowers.spark.daria.sql.functions.arrayExNull.
From the documentation:
Like array but doesn't include null elements.

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get counts of individual columns to publish metrics. I have a df [customerId : string, totalRent : bigint, totalPurchase : bigint, itemTypeCounts : map<string, int>]
Right now I am doing:
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers) )
private def percentage(num: Long, denom: Long): Double = {
  // convert to Double so the division is not integer division
  val numD: Double = num.toDouble
  val denomD: Double = denom.toDouble
  if (denomD == 0.0) 0.0
  else (numD / denomD) * 100
}
But I am not sure how to do this for itemTypeCounts, which is a map. I want a count and a percentage based on each key entry. The issue is that the keys are dynamic; there is no way I know them beforehand. Can someone tell me how to get a count for each key value? I am new to Scala/Spark; any other efficient approaches to get the counts of each column are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1
You can accomplish this in Spark SQL. I show two examples of this below (one where the keys are known and can be enumerated in code, one where the keys are unknown). Note that by using Spark SQL, you take advantage of the Catalyst optimizer, so this will run very efficiently:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
//Only good if you can enumerate the keys
def countMapKey(name:String) = {
count(when($"itemTypeCounts".getItem(name).isNotNull,lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV","Blender","Cloths").map(key => countMapKey(key))
df.select(keysToCount :_*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+
I'm a Spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming itemTypeCounts into a data structure in Scala that you can work with. I converted each row to a list of (Name, Count) pairs, e.g. List((Blender,2), (TV,4)).
With this you have a list of such lists of pairs, one per row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it into the desired output is standard Scala.
A worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")
//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map { row =>
  val values = row.getStruct(0).mkString("", ",", "").split(",")
  val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
  fields.zip(values).filter(p => p._2 != "null")
}.toList
// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]): Map[String, Int] = frames match {
  case Nil => summary
  case _   => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}

//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
  val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
  val updatedSummary = summary.map { e => if (headMap.contains(e._1)) (e._1, e._2 + 1) else e }
  updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}
val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2 ))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
Borrowing the input from Nick and using a spark-sql pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType beforehand, we can use
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemTypes dynamically and pass them to pivot:
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+