Key corresponding to max value in a spark map column - scala

If I have a spark map column from string to double, is there an easy way to generate a new column with the key corresponding to the maximum value?
I was able to achieve it using collection functions as illustrated below:
import org.apache.spark.sql.functions._
val mockedDf = Seq(1, 2, 3)
.toDF("id")
.withColumn("optimized_probabilities_map", typedLit(Map("foo"->0.34333337, "bar"->0.23)))
val df = mockedDf
.withColumn("optimizer_probabilities", map_values($"optimized_probabilities_map"))
.withColumn("max_probability", array_max($"optimizer_probabilities"))
.withColumn("max_position", array_position($"optimizer_probabilities", $"max_probability"))
.withColumn("optimizer_ruler_names", map_keys($"optimized_probabilities_map"))
.withColumn("optimizer_ruler_name", $"optimizer_ruler_names"( $"max_position"))
However, this solution is unnecessarily long and not very efficient. There is also a possible precision issue, since array_position compares doubles for equality. I wonder if there is a better way to do this without UDFs, maybe using an expression string.

Since you can use Spark 2.4+, one way is to use the Spark SQL built-in function aggregate: iterate through all map_keys, compare the corresponding map value with the buffered value acc.val, and update acc.name accordingly:
mockedDf.withColumn("optimizer_ruler_name", expr("""
  aggregate(
    map_keys(optimized_probabilities_map),
    (string(NULL) as name, double(NULL) as val),
    (acc, y) ->
      IF(acc.val is NULL OR acc.val < optimized_probabilities_map[y]
        , (y as name, optimized_probabilities_map[y] as val)
        , acc
      ),
    acc -> acc.name
  )
""")).show(false)
+---+--------------------------------+--------------------+
|id |optimized_probabilities_map |optimizer_ruler_name|
+---+--------------------------------+--------------------+
|1 |[foo -> 0.34333337, bar -> 0.23]|foo |
|2 |[foo -> 0.34333337, bar -> 0.23]|foo |
|3 |[foo -> 0.34333337, bar -> 0.23]|foo |
+---+--------------------------------+--------------------+
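For intuition, the aggregate call above behaves like a left fold over the map keys. The following is only a plain-Scala sketch of that logic, with no Spark involved; the names KeyOfMaxSketch, Acc and keyOfMax are made up for this illustration, and None stands in for SQL NULL.

```scala
// Plain-Scala sketch of the fold performed by the SQL `aggregate` expression.
// `KeyOfMaxSketch`, `Acc` and `keyOfMax` are hypothetical names used only here.
object KeyOfMaxSketch {
  // Mirrors the (name, val) struct used as the buffer; None plays the role of NULL.
  final case class Acc(name: Option[String], value: Option[Double])

  def keyOfMax(m: Map[String, Double]): Option[String] =
    m.keys
      .foldLeft(Acc(None, None)) { (acc, k) =>
        // Replace the buffer when it is still empty or holds a strictly smaller value.
        if (acc.value.forall(_ < m(k))) Acc(Some(k), Some(m(k))) else acc
      }
      .name // corresponds to the finish lambda `acc -> acc.name`
}
```

Because the comparison is strict, a key that merely ties with the buffered maximum does not replace it, so the first key encountered with the maximum value is kept. An empty map yields None, just as the SQL version yields NULL.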

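Another solution would be to explode the map column and then use a window function to get the max value per id, like this (it assumes id identifies a row; if several keys share the maximum value, the row appears once per tied key, and the winning key ends up in the column named key):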
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id")
val df = mockedDf.select($"id", $"optimized_probabilities_map", explode($"optimized_probabilities_map"))
.withColumn("max_value", max($"value").over(w))
.where($"max_value" === $"value")
.drop("value", "max_value")

Related

Apply a transformation to all the columns with the same data type on Spark

I need to apply a transformation to all the Integer columns of my DataFrame before writing a CSV. The transformation consists of changing the type to String and then converting the format to the European one (e.g. 1234567 -> "1234567" -> "1.234.567").
Does Spark have any way to apply this transformation to all the Integer columns? I want it to be generic functionality (because I need to write multiple CSVs) instead of hardcoding the columns to transform for each dataframe.
DataFrame has a dtypes method, which returns the column names along with their data types as an Array[(String, String)], e.g. ("int_1", "IntegerType").
You can map this array, applying different expressions to each column, based on their data type. And you can then pass this mapped list to the select method:
import spark.implicits._
import org.apache.spark.sql.functions._
val dataSeq = Seq(
(1246984, 993922, "test_1"),
(246984, 993922, "test_2"),
(246984, 993922, "test_3"))
val df = dataSeq.toDF("int_1", "int_2", "str_3")
df.show
+-------+------+------+
| int_1| int_2| str_3|
+-------+------+------+
|1246984|993922|test_1|
| 246984|993922|test_2|
| 246984|993922|test_3|
+-------+------+------+
val columns =
df.dtypes.map{
case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
case (c, t) => col(c)
}
val df2 = df.select(columns:_*)
df2.show
+---------+-------+------+
| int_1| int_2| str_3|
+---------+-------+------+
|1.246.984|993.922|test_1|
| 246.984|993.922|test_2|
| 246.984|993.922|test_3|
+---------+-------+------+
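Since the question asks for something reusable across several dataframes, the same idea can be wrapped in a small helper. This is only a sketch: EuropeanFormat and formatIntegerColumns are hypothetical names, not part of Spark, and treating LongType columns the same way as IntegerType is an assumption that goes beyond the original question. It matches on the schema's DataType objects rather than on the type names returned by dtypes.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, format_number, regexp_replace}
import org.apache.spark.sql.types.{IntegerType, LongType}

// Hypothetical helper (not part of Spark): formats every integral column
// with "." as the thousands separator and leaves all other columns untouched.
object EuropeanFormat {
  def formatIntegerColumns(df: DataFrame): DataFrame = {
    val columns: Array[Column] = df.schema.fields.map { field =>
      field.dataType match {
        // format_number yields e.g. "1,234,567"; the commas are then swapped for dots.
        // Including LongType is an assumption; drop it to match IntegerType only.
        case IntegerType | LongType =>
          regexp_replace(format_number(col(field.name), 0), ",", ".").as(field.name)
        case _ => col(field.name)
      }
    }
    df.select(columns: _*)
  }
}
```

It would then be called once per dataframe before writing, for example EuropeanFormat.formatIntegerColumns(df).write.csv(...).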

Spark: apply sliding() to each row without UDF

I have a Dataframe with several columns. The i-th column contains strings. I want to apply the string sliding(n) function to each string in the column. Is there a way to do so without using user-defined functions?
Example:
My dataframe is
var df = Seq((0, "hello"), (1, "hola")).toDF("id", "text")
I want to apply the sliding(3) function to each element of column "text" to obtain a dataframe corresponding to
Seq(
(0, ("hel", "ell", "llo")),
(1, ("hol", "ola"))
)
How can I do this?
For Spark version >= 2.4.0, this can be done using the built-in functions array_repeat, transform and substring.
import org.apache.spark.sql.functions.{array_repeat, expr, length}
// Repeat the string once per window, i.e. length(text) - 3 + 1 times
val repeated_df = df.withColumn("tmp", array_repeat($"text", length($"text") - 3 + 1))
//Get the slices with transform higher order function
val res = repeated_df.withColumn("str_slices",
expr("transform(tmp,(x,i) -> substring(x from i+1 for 3))")
)
//res.show()
+---+-----+---------------------+---------------+
|id |text |tmp |str_slices |
+---+-----+---------------------+---------------+
|0 |hello|[hello, hello, hello]|[hel, ell, llo]|
|1 |hola |[hola, hola] |[hol, ola] |
+---+-----+---------------------+---------------+
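The transform expression above takes one substring per start index. A plain-Scala sketch of the same index arithmetic may make that easier to follow; SlidingSketch and slices are hypothetical names, and no Spark is involved. Note that SQL substring positions are 1-based (hence i+1 in the expression), while String.substring in Scala is 0-based.

```scala
// Plain-Scala sketch of the index-based slicing used in the `transform` expression.
// `SlidingSketch` and `slices` are hypothetical names used only for illustration.
object SlidingSketch {
  def slices(s: String, n: Int): List[String] =
    // One window per start index 0 .. length - n, which is why the Spark version
    // repeats the string length - n + 1 times before slicing.
    (0 to s.length - n).map(i => s.substring(i, i + n)).toList
}
```

For strings of at least n characters this gives the same windows as s.sliding(n).toList. For shorter strings the sketch returns an empty list, whereas Scala's sliding returns the whole short string as a single window.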

How to convert spark dataframe array to tuple

How can I convert a Spark dataframe array column to tuples of 2 in Scala?
I tried to explode the array and create a new column with the help of the lead function, so that I can use the two columns to create a tuple.
In order to use the lead function I need a column to sort by, and I don't have one.
Please suggest the best way to solve this.
Note: I need to retain the same order as in the array.
Input
For example, the input looks something like this:
id1 | [text1, text2, text3, text4]
id2 | [txt, txt2, txt4, txt5, txt6, txt7, txt8, txt9]
Expected output:
I need to get tuples of length 2:
id1 | [(text1, text2), (text2, text3), (text3,text4)]
id2 | [(txt, txt2), (txt2, txt4), (txt4, txt5), (txt5, txt6), (txt6, txt7), (txt7, txt8), (txt8, txt9)]
You can create a UDF that builds the list of tuples using the sliding function:
val df = Seq(
("id1", List("text1", "text2", "text3", "text4")),
("id2", List("txt", "txt2", "txt4", "txt5", "txt6", "txt7", "txt8", "txt9"))
).toDF("id", "text")
val sliding = udf((value: Seq[String]) => {
  // collect skips a window shorter than 2, which occurs for single-element arrays
  value.toList.sliding(2).collect { case List(a, b) => (a, b) }.toList
})
val result = df.withColumn("text", sliding($"text"))
Output:
+---+-------------------------------------------------------------------------------------------------+
|id |text |
+---+-------------------------------------------------------------------------------------------------+
|id1|[[text1, text2], [text2, text3], [text3, text4]] |
|id2|[[txt, txt2], [txt2, txt4], [txt4, txt5], [txt5, txt6], [txt6, txt7], [txt7, txt8], [txt8, txt9]]|
+---+-------------------------------------------------------------------------------------------------+
Hope this helps!
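The pairing step on its own can also be written by zipping the sequence with itself shifted by one, which keeps the original order and needs no pattern match on the window size. This is a plain-Scala sketch; PairsSketch and adjacentPairs are hypothetical names.

```scala
// Plain-Scala sketch; `PairsSketch` and `adjacentPairs` are hypothetical names.
object PairsSketch {
  def adjacentPairs[A](xs: Seq[A]): Seq[(A, A)] =
    // Pair each element with its successor; drop(1) is safe on an empty Seq,
    // and zip stops at the shorter side, so short inputs simply yield no pairs.
    xs.zip(xs.drop(1))
}
```

Inside a UDF it could be used the same way as above, for example udf((value: Seq[String]) => PairsSketch.adjacentPairs(value)).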

Use Map to replace column values in Spark

I have to map a list of columns to another column in a Spark dataset: think something like this
val translationMap: Map[Column, Column] = Map(
lit("foo") -> lit("bar"),
lit("baz") -> lit("bab")
)
And I have a dataframe like this one:
val df = Seq("foo", "baz").toDF("mov")
So I intend to perform the translation like this:
df.select(
col("mov"),
translationMap(col("mov"))
)
but this piece of code spits the following error
key not found: movs
java.util.NoSuchElementException: key not found: movs
Is there a way to perform such translation without concatenating hundreds of whens? think that translationMap could have lots of pairs key-value.
Instead of Map[Column, Column] you should use a Column containing a map literal:
import org.apache.spark.sql.functions.typedLit
val translationMap: Column = typedLit(Map(
"foo" -> "bar",
"baz" -> "bab"
))
The rest of your code can stay as-is:
df.select(
col("mov"),
translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo| bar|
|baz| bab|
+---+---------------------------------------+
You cannot refer to a Scala collection declared on the driver like this inside a distributed dataframe. An alternative would be to use a UDF, which will not perform well on a large dataset since UDFs are not optimized by Spark.
val translationMap = Map( "foo" -> "bar" , "baz" -> "bab" )
val getTranslationValue = udf ((x: String)=>translationMap.getOrElse(x,null.asInstanceOf[String]) )
df.select(col("mov"), getTranslationValue($"mov").as("value") ).show
//+---+-----+
//|mov|value|
//+---+-----+
//|foo| bar|
//|baz| bab|
//+---+-----+
Another solution would be to load the Map as a Dataset[(String, String)] and then join the two datasets, taking mov as the key.
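A minimal sketch of that join-based approach follows. TranslateByJoin and translate are hypothetical names, the output column name value is chosen to match the UDF example, and the broadcast hint assumes the lookup table is small enough to ship to every executor.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Sketch of the join-based translation; `TranslateByJoin` and `translate`
// are hypothetical names, not an existing API.
object TranslateByJoin {
  def translate(spark: SparkSession, df: DataFrame, translations: Map[String, String]): DataFrame = {
    import spark.implicits._
    // Turn the driver-side Map into a small two-column lookup table.
    val lookup = translations.toSeq.toDF("mov", "value")
    // A left join keeps rows that have no translation (their value is null);
    // broadcast is only a hint that the lookup table is small.
    df.join(broadcast(lookup), Seq("mov"), "left")
  }
}
```

With the example data this would be called as TranslateByJoin.translate(spark, df, Map("foo" -> "bar", "baz" -> "bab")).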

check data size spark dataframes

I have the following question :
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a Spark dataframe as follows:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
My aim is to check the length and type of each field in the dataframe following the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field but I am getting a compilation error:
data.map(line => {
val fields = line.toString.split(";")
fields(0).size
fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types I don't have any idea about how to implement it as we are using dataframes. Any idea about a function verifying the data format ?
Thanks a lot in advance for your replies.
I see you are trying to use a Dataframe, but since the file contains multiple double quotes, you can read it as a text file, remove them, and convert the result to a Dataframe as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
.map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
.map { r => val x = r.split(";"); (x(0), x(1)) }
.toDF(header.split(";"): _ *)
You get with data.show(false)
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn and the length function, and adapt it as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("maritalSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the columns are of type String.
Hope this helps!
You are using a dataframe, so when you call the map method, your lambda processes a Row: line is a Row.
Row.toString returns a string representation of the whole Row, which in your case holds two fields typed as String.
If you want to use map and process each Row, you have to get the values out of the fields manually, with getString or getAs[String].
Usually when you use dataframes, you should think in terms of columns, as in SQL, using select, where, ... or the SQL syntax directly.
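As an illustration of that column-oriented style, the length check could be expressed as an aggregation instead of a map over rows. This is a sketch under assumptions: LengthCheck and maxLengths are hypothetical names, and the limits to compare against (10 for job, 7 for marital) come from the rules in the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, max}

// Sketch only: `LengthCheck` and `maxLengths` are hypothetical names.
object LengthCheck {
  // Returns a single row holding, for each requested column,
  // the length of its longest value.
  def maxLengths(df: DataFrame, columns: Seq[String]): DataFrame =
    df.select(columns.map(c => max(length(col(c))).as(c)): _*)
}
```

The resulting row can then be compared with the allowed sizes, and the column types themselves are available through data.dtypes or data.schema.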