Select a literal based on a column value in Spark - scala

I have a map:
val map = Map("A" -> 1, "B" -> 2)
And I have a DataFrame. A column in the DataFrame contains the keys of the map. I am trying to select a column in a new DataFrame that holds the corresponding map values, looked up by key:
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
But this is resulting in the following error:
java.lang.RuntimeException: Unsupported literal type class scala.None$ None
I made sure that the column ColumnWithKey has As and Bs only and does not have empty values in it.
Is there another way to get the result I am looking for? Any help would be appreciated.

The problem with this statement (besides the syntax issues)
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
is that col(ColumnWithKey) does not take the value of a specific row; it is only a column expression derived from the schema, so map.get receives the column's name rather than a row value and returns None.
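A quick way to see what goes wrong (just a sketch; the column name is illustrative):
import org.apache.spark.sql.functions.col
val map = Map("A" -> 1, "B" -> 2)
col("ColumnWithKey").toString            // "ColumnWithKey" -- the column expression, not a row value
map.get(col("ColumnWithKey").toString)   // None, which is what lit(...) / .get subsequently chokes on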
In your case I would suggest joining your map to your dataframe:
val map = Map("A" -> 1, "B" -> 2)
val df_map = map.toSeq.toDF("key","value")
val DfThatContainsTheKeyColumn = Seq(
"A",
"A",
"B",
"B"
).toDF("myCol")
DfThatContainsTheKeyColumn
  .join(broadcast(df_map), $"myCol" === $"key")
  .select(concat($"myCol", lit("|"), $"value").as("newColumn"))
  .show()
gives
+---------+
|newColumn|
+---------+
|      A|1|
|      A|1|
|      B|2|
|      B|2|
+---------+
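If some rows contain keys that are missing from the map, the inner join above will drop them. A left join keeps them; the sketch below is just one way to handle the missing value (the "?" placeholder is an assumption):
import org.apache.spark.sql.functions.coalesce
DfThatContainsTheKeyColumn
  .join(broadcast(df_map), $"myCol" === $"key", "left")
  .select(concat($"myCol", lit("|"), coalesce($"value".cast("string"), lit("?"))).as("newColumn"))
  .show()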

You can use case classes to make it easy. This is an example:
Given this input
val givenMap = Map("A" -> 1, "B" -> 2)
import spark.implicits._
val df = Seq(
(1, "A"),
(2, "A"),
(3, "B"),
(4, "B")
).toDF("col_a", "col_b")
df.show()
The above code outputs:
+-----+-----+
|col_a|col_b|
+-----+-----+
| 1| A|
| 2| A|
| 3| B|
| 4| B|
+-----+-----+
The code that you need will look like:
case class MyInput(col_a: Int, col_b: String)
case class MyOutput(col_a: Int, col_b: String, new_column: Int)
df.as[MyInput].map(row=> MyOutput(row.col_a, row.col_b, givenMap(row.col_b))).show()
With the case classes you can cast your df and use object notation to access your column values within a .map. The above code will output:
+-----+-----+----------+
|col_a|col_b|new_column|
+-----+-----+----------+
| 1| A| 1|
| 2| A| 1|
| 3| B| 2|
| 4| B| 2|
+-----+-----+----------+
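Note that givenMap(row.col_b) throws a NoSuchElementException if col_b ever contains a key that is not in the map; a defensive variant (the default value 0 is just an assumption) would be:
df.as[MyInput].map(row => MyOutput(row.col_a, row.col_b, givenMap.getOrElse(row.col_b, 0))).show()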

You can look up a map using a key taken from a column as follows:
val map = Map("A" -> 1, "B" -> 2)
val df = spark.createDataset(Seq("dummy"))
.withColumn("key",lit("A"))
df.map { row =>
  val k = row.getAs[String]("key")
  val v = map.getOrElse(k, 0)
  (k, v)
}.toDF("key", "value").show(false)
Result -
+---+-----+
|key|value|
+---+-----+
|A |1 |
+---+-----+
You can look up a map present inside a column with a literal key using Column.getItem; see the example below.
val mapKeys = Array("A","B")
val mapValues = Array(1,2)
val df = spark.createDataset(Seq("dummy"))
.withColumn("key",lit("A"))
.withColumn("keys",lit(mapKeys))
.withColumn("values",lit(mapValues))
.withColumn("map",map_from_arrays($"keys",$"values"))
.withColumn("lookUpTheMap",$"map".getItem("A"))
//A dataframe with Map is created.
//A map is looked up using a hard coded String key.
df.show(false)
Result
+-----+---+------+------+----------------+------------+
|value|key|keys |values|map |lookUpTheMap|
+-----+---+------+------+----------------+------------+
|dummy|A |[A, B]|[1, 2]|[A -> 1, B -> 2]|1 |
+-----+---+------+------+----------------+------------+
To look up a map present inside a column based on another column containing the key, you can use a UDF or use the map function on the dataframe as shown below.
//A map is looked up using a Column key.
df.map { row =>
  val m = row.getAs[Map[String, Int]]("map")
  val k = row.getAs[String]("key")
  val v = m.getOrElse(k, 0)
  (m, k, v)
}.toDF("map", "key", "value").show(false)
Result
+----------------+---+-----+
|map |key|value|
+----------------+---+-----+
|[A -> 1, B -> 2]|A |1 |
+----------------+---+-----+
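For completeness, here is a sketch of the UDF alternative mentioned above (the UDF and output column names are made up); on Spark 2.4+ you should also be able to use element_at($"map", $"key") directly instead of a UDF.
import org.apache.spark.sql.functions.udf
val lookupKey = udf((m: Map[String, Int], k: String) => m.get(k))
df.withColumn("lookedUpValue", lookupKey($"map", $"key")).show(false)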

I think a simpler option could be to use typedLit:
val map = typedLit(Map("A" -> 1, "B" -> 2))
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
map(col(ColumnWithKey))) as newColumn)
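A self-contained sketch of that approach (the column names are made up, a SparkSession named spark is assumed, and typedLit requires Spark 2.2+):
import org.apache.spark.sql.functions.{col, concat, lit, typedLit}
import spark.implicits._
val df = Seq(("x", "A"), ("y", "B")).toDF("SomeColumn", "ColumnWithKey")
val lookup = typedLit(Map("A" -> 1, "B" -> 2))
df.select(concat(col("SomeColumn"), lit("|"), lookup(col("ColumnWithKey"))) as "newColumn").show()
// newColumn: x|1, y|2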

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

myFunc(Row): String = {
//process row
//returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
inputDF.withColumn("newcol",myFunc(Row))
inputDF
}
But no new column got created in my case. My myFunc passes this row to a knowledgebasesession object and that returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr(), SQL functions over columns, udf(x), and other techniques, but here my newcol is not derived directly from an existing column.
With a DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
  .rdd
  .map { x =>
    val y = myFunc(x)
    Row.fromSeq(x.toSeq ++ Seq(y))
  }
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType = StringType, nullable = false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With a Dataset:
import org.apache.spark.sql.Dataset
case class TestData(id: Int, col1: String)
case class TransformedData(id: Int, col1: String, col2: String)
val test: Dataset[TestData] = List(TestData(1, "abc"), TestData(2, "def"), TestData(3, "ghi")).toDS
val transformed: Dataset[TransformedData] = test
  .map { x: TestData =>
    val newCol = x.col1 + "xyz"
    TransformedData(x.id, x.col1, newCol)
  }
transformed.show
As you can see, the Dataset version is more readable and provides strong typing.
Since I'm unaware of your Spark version, I'm providing both solutions here. However, if you're using Spark >= 1.6, you should look into Datasets. Playing with RDDs is fun, but it can quickly devolve into longer job runs and a host of other issues that you won't foresee.
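A third option, not covered above, is to wrap the row-level function in a UDF over struct(...) so that it can be used directly with withColumn; this is just a sketch reusing the myFunc and testDf defined earlier.
import org.apache.spark.sql.functions.{col, struct, udf}
val myFuncUdf = udf(myFunc)   // myFunc: Row => String from above
testDf.withColumn("col2", myFuncUdf(struct(testDf.columns.map(col): _*))).show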

Perform lookup on a broadcasted Map conditioned on column value in Spark using Scala

I want to perform a lookup on myMap. When col2 value is "0000" I want to update it with the value related to col1 key. Otherwise I want to keep the existing col2 value.
val myDF :
+-----+-----+
|col1 |col2 |
+-----+-----+
|1 |a |
|2 |0000 |
|3 |c |
|4 |0000 |
+-----+-----+
val myMap : Map[String, String] = Map("2" -> "b", "4" -> "d")
val broadcastMyMap = spark.sparkContext.broadcast(myMap)
def lookup = udf((key:String) => broadcastMyMap.value.get(key))
myDF.withColumn("col2", when ($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
I've used the code above in spark-shell and it works fine but when I build the application jar and submit it to Spark using spark-submit it throws an error:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$5: (string) => string)
Caused by: java.lang.NullPointerException
Is there a way to perform the lookup without using a UDF, which isn't the best option in terms of performance, or to fix the error?
I think I can't just use a join because some values of myDF.col2 that have to be kept could be substituted in the operation.
Your NullPointerException does not come from this code. I proved it with the sample program below, which works perfectly fine; try executing it.
package com.example
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
object MapLookupDF {
Logger.getLogger("org").setLevel(Level.OFF)
def main(args: Array[String]) {
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.
master("local[*]")
.appName("MapLookupDF")
.getOrCreate()
import spark.implicits._
val mydf = Seq((1, "a"), (2, "0000"), (3, "c"), (4, "0000")).toDF("col1", "col2")
mydf.show
val myMap: Map[String, String] = Map("2" -> "b", "4" -> "d")
println(myMap.toString)
val broadcastMyMap = spark.sparkContext.broadcast(myMap)
def lookup: UserDefinedFunction = udf((key: String) => {
  println("getting the value for the key " + key)
  broadcastMyMap.value.get(key)
})
val finaldf = mydf.withColumn("col2", when($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
finaldf.show
}
}
Result :
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 2|0000|
| 3| c|
| 4|0000|
+----+----+
Map(2 -> b, 4 -> d)
getting the value for the key 2
getting the value for the key 4
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
+----+----+
Note: there won't be significant performance degradation from broadcasting a small map.
If you want to go with a dataframe approach, you can convert the map to a dataframe:
val df = myMap.toSeq.toDF("key", "val")
Map(2 -> b, 4 -> d) in dataframe format will look like:
+---+---+
|key|val|
+---+---+
|  2|  b|
|  4|  d|
+---+---+
and then join like this
DIY...
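For reference, the DIY part could look roughly like the sketch below (the left join plus when/otherwise logic is my assumption about the intended result, reusing mydf from the program above and df from the map conversion):
import org.apache.spark.sql.functions.{broadcast, col, when}
val joined = mydf
  .join(broadcast(df), mydf("col1").cast("string") === df("key"), "left")
  .withColumn("col2", when(col("col2") === "0000", col("val")).otherwise(col("col2")))
  .drop("key", "val")
joined.show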

Scala - Fill "null" column with another column

I want to replicate the problem mentioned here in Scala DataFrames. I have tried using the following approaches, to no success so far.
Input
Col1 Col2
A M
B K
null S
Expected Output
Col1 Col2
A M
B K
S <---- S
Approach 1
val output = df.na.fill("A", Seq("col1"))
The fill method does not take a column as the (first) input.
Approach 2
val output = df.where(df.col("col1").isNull)
I cannot find a suitable method to call after I have identified the null values.
Approach 3
val output = df.dtypes.map(column =>
column._2 match {
case "null" => (column._2 -> 0)
}).toMap
I get a StringType error.
I'd use when/otherwise, as shown below:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
("A", "M"), ("B", "K"), (null, "S")
).toDF("Col1", "Col2")
df.withColumn("Col1", when($"Col1".isNull, $"Col2").otherwise($"Col1")).show
// +----+----+
// |Col1|Col2|
// +----+----+
// | A| M|
// | B| K|
// | S| S|
// +----+----+
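Since Col1 is only replaced when it is null, coalesce gives the same result a bit more tersely (an equivalent variant, not part of the original answer):
df.withColumn("Col1", coalesce($"Col1", $"Col2")).show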

Spark expression: rename the column list after aggregation

I have written the code below to group and aggregate the columns:
val gmList = List("gc1","gc2","gc3")
val aList = List("val1","val2","val3","val4","val5")
val cype = "first"
val exprs = aList.map((_ -> cype )).toMap
df.groupBy(gmList.map(col): _*).agg(exprs).show
but this creates columns with the aggregation name prepended to every aggregated column, as shown below,
so I want to alias that name, first(val1) -> val1, and I want to make this generic as part of exprs:
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
| gc1 | gc2 | gc3 | first(val1) | first(val2)| first(val3) | first(val4) | first(val5) |
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
One approach would be to alias the aggregated columns to the original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) to a list of functions, as shown below:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 10, "a1", "a2", "a3"),
(1, 10, "b1", "b2", "b3"),
(2, 20, "c1", "c2", "c3"),
(2, 30, "d1", "d2", "d3"),
(2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")
val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")
// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")
val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))
df.
groupBy(gmList.map(col): _*).agg(afPairs.toMap).
select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// | 2| 20| c1| c2| c3|
// | 1| 10| a1| a2| a3|
// | 2| 30| d1| d2| d3|
// +---+---+----+----+----+
You can slightly change the way you are generating the expression and use the function alias in there:
import org.apache.spark.sql.functions.{col, first}
val aList = List("val1","val2","val3","val4","val5")
val exprs = aList.map(c => first(col(c)).alias(c) )
df.groupBy(gmList.map(col): _*).agg(exprs.head, exprs.tail: _*).show
Here's a more generic version that will work with any aggregate functions and doesn't require naming your aggregate columns up front. Build your grouped df as you normally would, then use:
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)
This will extract out only what is inside the parentheses, regardless of what aggregate function is called (e.g. first(val) -> val, sum(val) -> val, count(val) -> val, etc.).
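As a sketch of how this plugs into the grouping code (reusing the df, gmList, aList and the "first" aggregation from the example above):
val exprs = aList.map(_ -> "first").toMap
val grouped = df.groupBy(gmList.map(col): _*).agg(exprs)
val colRegex = raw"^.+\((.*?)\)".r
grouped.select(grouped.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1)))): _*).show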

Spark Scala: convert arbitrary N columns into Map

I have the following data structure representing movie ids (first column) and ratings for different users for that movie in the rest of columns - something like that:
+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|movieId| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 11| 12| 13| 14| 15|
+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 1580|null|null| 3.5| 5.0|null|null|null|null|null|null|null|null|null|null|null|
| 3175|null|null|null|null|null|null|null|null|null|null|null|null|null| 5.0|null|
| 3794|null|null|null|null|null|null|null|null|null|null|null| 3.0|null|null|null|
| 2659|null|null|null| 3.0|null|null|null|null|null|null|null|null|null|null|null|
I want to convert this DataFrame to a Dataset of
final case class MovieRatings(movie_id: Long, ratings: Map[Long, Double])
So that it would be something like
[1580, [1 -> null, 2 -> null, 3 -> 3.5, 4 -> 5.0, 5 -> null, 6 -> null, 7 -> null,...]]
Etc.
How can this be done?
The thing here is that number of users is arbitrary. And I want to zip those into a single column leaving the first column untouched.
First, you have to transform your DataFrame into one with a schema matching your case class; then you can use .as[MovieRatings] to convert the DataFrame into a Dataset[MovieRatings]:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// define a new MapType column using `functions.map`, passing a flattened-list of
// column name (as a Long column) and column value
val mapColumn: Column = map(df.columns.tail.flatMap(name => Seq(lit(name.toLong), $"$name")): _*)
// select movie id and map column with names matching the case class, and convert to Dataset:
df.select($"movieId" as "movie_id", mapColumn as "ratings")
.as[MovieRatings]
.show(false)
You can use the spark.sql.functions.map function to create a map from arbitrary columns. It expects a sequence alternating between keys and values, which can be Columns or Strings. Here is an example:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions
case class Input(movieId: Int, a: Option[Double], b: Option[Double], c: Option[Double])
val data = Input(1, None, Option(3.5), Option(1.4)) ::
Input(2, Option(4.2), Option(1.34), None) ::
Input(3, Option(1.11), None, Option(3.32)) :: Nil
val df = sc.parallelize(data).toDF
// Exclude the PK column from the map
val mapKeys = df.columns.filterNot(_ == "movieId")
// Build the sequence of key, value, key, value, ..
val pairs = mapKeys.map(k => Seq(lit(k), col(k))).flatten
val mapped = df.select($"movieId", functions.map(pairs:_*) as "map")
mapped.show(false)
Produces this output:
+-------+------------------------------------+
|movieId|map |
+-------+------------------------------------+
|1 |Map(a -> null, b -> 3.5, c -> 1.4) |
|2 |Map(a -> 4.2, b -> 1.34, c -> null) |
|3 |Map(a -> 1.11, b -> null, c -> 3.32)|
+-------+------------------------------------+