Convert keys into column names and values into rows (Map) - scala

I have a dataframe that contains a Map column and an id column:
key1 -> value1, key2 -> value2
key1 -> value3, key2 -> value4
I want the result to be a dataframe like this:
id key1 key2
1 value1 value2
2 value3 value4
Thanks for your help.

I assume you are talking about a Spark DataFrame. In that case, you can use the map method of the DataFrame to extract the values you want. Here is an example using spark-shell (which automatically imports many of the implicit methods).
Note that toDF is used twice: once to build a DataFrame from the built-in data structures, and again to rename the columns of the new DataFrame obtained from the map method of the original DataFrame.
The show method is called to display "before" and "after".
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val m = Map(1-> Map("key1" -> "v1", "key2" -> "v2"), 2 -> Map("key1" -> "v3", "key2" -> "v4"))
m: scala.collection.immutable.Map[Int,scala.collection.immutable.Map[String,String]] = Map(1 -> Map(key1 -> v1, key2 -> v2), 2 -> Map(key1 -> v3, key2 -> v4))
scala> val df = m.toSeq.toDF("id", "map_value")
df: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]
scala> df.show()
+---+--------------------+
| id| map_value|
+---+--------------------+
| 1|[key1 -> v1, key2...|
| 2|[key1 -> v3, key2...|
+---+--------------------+
scala> val get_map:Function1[Row, Map[String,String]] = r => r.getAs[Map[String, String]]("map_value")
get_map: org.apache.spark.sql.Row => Map[String,String] = <function1>
scala> df.map(r => (r.getAs[Int]("id"), get_map(r).get("key1"), get_map(r).get("key2"))).toDF("id", "val1", "val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| v1| v2|
| 2| v3| v4|
+---+----+----+
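If you want the new columns to be named after the map keys themselves (key1 and key2, as in the question), the same map call works with different toDF names; the empty-string default in getOrElse below is just an assumption for missing keys.
scala> df.map(r => (r.getAs[Int]("id"), get_map(r).getOrElse("key1", ""), get_map(r).getOrElse("key2", ""))).toDF("id", "key1", "key2").show()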
Edit:
This addresses how to handle a variable number of columns. Here, N is the number of map keys plus one (so there are 7 keys and N is 8), because Range(1, N) excludes its upper bound. Likewise, the 3 in Range(1, 3) is the number of rows plus one (here there are 2 rows).
It is more convenient to use the select method of the DataFrame in this case, to avoid having to dynamically create tuples.
scala> val N = 8
N: Int = 8
scala> val map_value:Function1[Int,Map[String,String]] = (i: Int) => Map((for (n <- Range(1, N)) yield (s"k${n}", s"v${n*i}")).toList:_*)
map_value: Int => Map[String,String] = <function1>
scala> val m = Map((for (i <- Range(1, 3)) yield (i, map_value(i))).toList:_*)
m: scala.collection.immutable.Map[Int,Map[String,String]] = Map(1 -> Map(k2 -> v2, k5 -> v5, k6 -> v6, k7 -> v7, k1 -> v1, k4 -> v4, k3 -> v3), 2 -> Map(k2 -> v4, k5 -> v10, k6 -> v12, k7 -> v14, k1 -> v2, k4 -> v8, k3 -> v6))
scala> val df0 = m.toSeq.toDF("id", "map_value")
df0: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]
scala> val column_names:List[String] = (for (n <- Range(1, N)) yield (s"map_value.k${n}")).toList
column_names: List[String] = List(map_value.k1, map_value.k2, map_value.k3, map_value.k4, map_value.k5, map_value.k6, map_value.k7)
scala> df0.select("id", column_names:_*).show()
+---+---+---+---+---+---+---+---+
| id| k1| k2| k3| k4| k5| k6| k7|
+---+---+---+---+---+---+---+---+
| 1| v1| v2| v3| v4| v5| v6| v7|
| 2| v2| v4| v6| v8|v10|v12|v14|
+---+---+---+---+---+---+---+---+
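If the key names are not fixed in advance (so you cannot assume k1..kN), one possible sketch is to derive them from the data first; map_keys, explode and the collect below are assumptions that suit maps whose key set is small enough to gather on the driver.
scala> import org.apache.spark.sql.functions._
scala> val keys = df0.select(explode(map_keys($"map_value"))).distinct().as[String].collect().sorted
scala> val keyCols = keys.map(k => $"map_value".getItem(k).as(k))
scala> df0.select($"id" +: keyCols: _*).show()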

Related

Scala - How to convert Spark DataFrame to Map

How to convert a Spark DataFrame to a Map like below: I want to convert it into a Map and then to Json. Pivot didn't work to reshape the column.
Any help converting it to a Map like below will be appreciated.
Input DataFrame:
+-----+-----+-------+--------------------+
|col1 |col2 |object |values              |
+-----+-----+-------+--------------------+
|one  |two  |main   |[101 -> A, 202 -> B]|
+-----+-----+-------+--------------------+
Expected Output DataFrame:
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|col1 |col2 |object | values | newMap |
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|one | two |main |[101 -> A, 202 -> B]|[col1 -> one, col2 -> two, object -> main, main -> [101 -> A, 202 -> B]]|
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
I tried like below, but with no success:
val toMap = udf((col1: String, col2: String, obj: String, values: Map[String, String]) => {
  col1.zip(values).toMap // need help for logic
  // col1 -> col1_value, col2 -> col2_values, object -> object_value, object_value -> [values_of_Col_Values].toMap
})
df.withColumn("newMap", toMap($"col1", $"col2", $"object", $"values"))
I am stuck on formatting the code properly and getting the output; please help, either in Scala or Spark.
It's quite straightforward. The precondition is that all the columns must have the same type, otherwise you will get a Spark error (a map's values must share a single type).
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val df = Seq(("Foo", "L", "10"), ("Boo", "XL", "20"))
  .toDF("brand", "size", "sales")

// Prepare your map columns. A bit of nasty iteration work is required.
var preCol: Column = null
var counter = 1
val size = df.schema.fields.length

val mapColumns = df.schema.flatMap { field =>
  val res = if (counter == size)
    Seq(preCol, col(field.name))          // last column: key is the previous column's value
  else
    Seq(lit(field.name), col(field.name)) // otherwise: key is the column name itself
  // keep the current column for the next iteration and increment the counter by 1
  preCol = col(field.name)
  counter += 1
  res
}

df.withColumn("new", map(mapColumns: _*)).show(false)
Result
+-----+----+-----+---------------------------------------+
|brand|size|sales|new |
+-----+----+-----+---------------------------------------+
|Foo |L |10 |Map(brand -> Foo, size -> L, L -> 10) |
|Boo |XL |20 |Map(brand -> Boo, size -> XL, XL -> 20)|
+-----+----+-----+---------------------------------------+
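For what it's worth, the same pairing can be written without the mutable counter by working from the column-name list directly; this is only a sketch of an equivalent formulation of the iteration above, producing the same map.
import org.apache.spark.sql.functions._

// every column becomes name -> value, except the last one,
// which is keyed by the previous column's value
val names = df.columns
val pairs = names.init.flatMap(n => Seq(lit(n), col(n))) ++
  Seq(col(names.init.last), col(names.last))

df.withColumn("new", map(pairs: _*)).show(false)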

Select a literal based on a column value in Spark

I have a map:
val map = Map("A" -> 1, "B" -> 2)
And I have a DataFrame. A column in the DataFrame contains the keys of the map. I am trying to select into a new DF a column that holds the corresponding map values, based on the key:
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
But this is resulting in the following error:
java.lang.RuntimeException: Unsupported literal type class scala.None$ None
I made sure that the column ColumnWithKey has As and Bs only and does not have empty values in it.
Is there another way to get the result I am looking for? Any help would be appreciated.
The problem in this statement (besides syntax problems)
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
is that col(ColumnWithKey) will not take the value of a specific row; it is just a column reference derived from the schema, so map.get(...) is evaluated on the driver with a constant, non-matching key and returns None, which cannot be turned into a literal.
In your case I would suggest joining your map to your dataframe:
val map = Map("A" -> 1, "B" -> 2)
val df_map = map.toSeq.toDF("key","value")
val DfThatContainsTheKeyColumn = Seq(
"A",
"A",
"B",
"B"
).toDF("myCol")
DfThatContainsTheKeyColumn
.join(broadcast(df_map),$"mycol"===$"key")
.select(concat($"mycol",lit("|"),$"value").as("newColumn"))
.show()
gives
+---------+
|newColumn|
+---------+
|      A|1|
|      A|1|
|      B|2|
|      B|2|
+---------+
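If some keys in the dataframe can be missing from the map, note that the inner join above silently drops those rows; a left join with a default (the 0 below is just an assumed placeholder) keeps them.
DfThatContainsTheKeyColumn
  .join(broadcast(df_map), $"myCol" === $"key", "left")
  .select(concat($"myCol", lit("|"), coalesce($"value", lit(0))).as("newColumn"))
  .show()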
You can use case classes to make it easy. This is an example:
Given this input
val givenMap = Map("A" -> 1, "B" -> 2)
import spark.implicits._
val df = Seq(
  (1, "A"),
  (2, "A"),
  (3, "B"),
  (4, "B")
).toDF("col_a", "col_b")
df.show()
The above code outputs:
+-----+-----+
|col_a|col_b|
+-----+-----+
| 1| A|
| 2| A|
| 3| B|
| 4| B|
+-----+-----+
givenMap: scala.collection.immutable.Map[String,Int] = Map(A -> 1, B -> 2)
import spark.implicits._
df: org.apache.spark.sql.DataFrame = [col_a: int, col_b: string]
The code that you need will look like:
case class MyInput(col_a: Int, col_b: String)
case class MyOutput(col_a: Int, col_b: String, new_column: Int)
df.as[MyInput].map(row=> MyOutput(row.col_a, row.col_b, givenMap(row.col_b))).show()
With the case classes you can cast your df and use object notation to access your column values within a .map. The above code will output:
+-----+-----+----------+
|col_a|col_b|new_column|
+-----+-----+----------+
| 1| A| 1|
| 2| A| 1|
| 3| B| 2|
| 4| B| 2|
+-----+-----+----------+
defined class MyInput
defined class MyOutput
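One caveat with the case-class version: givenMap(row.col_b) throws a NoSuchElementException if a key is missing from the map, so a getOrElse with a default (the 0 here is only an assumption) is safer:
df.as[MyInput].map(row => MyOutput(row.col_a, row.col_b, givenMap.getOrElse(row.col_b, 0))).show()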
You can look up a map using a key from a column as follows:
val map = Map("A" -> 1, "B" -> 2)
val df = spark.createDataset(Seq("dummy"))
.withColumn("key",lit("A"))
df.map{ row =>
val k = row.getAs[String]("key")
val v = map.getOrElse(k,0)
(k,v)
}.toDF("key", "value").show(false)
Result -
+---+-----+
|key|value|
+---+-----+
|A |1 |
+---+-----+
You can look up a map present inside a column with a literal key using Column.getItem; please see the example below.
import spark.implicits._
import org.apache.spark.sql.functions._

// A dataframe with a Map column is created,
// and the map is looked up using a hard-coded String key.
val mapKeys = Array("A", "B")
val mapValues = Array(1, 2)
val df = spark.createDataset(Seq("dummy"))
  .withColumn("key", lit("A"))
  .withColumn("keys", lit(mapKeys))
  .withColumn("values", lit(mapValues))
  .withColumn("map", map_from_arrays($"keys", $"values"))
  .withColumn("lookUpTheMap", $"map".getItem("A"))

df.show(false)
Result
+-----+---+------+------+----------------+------------+
|value|key|keys |values|map |lookUpTheMap|
+-----+---+------+------+----------------+------------+
|dummy|A |[A, B]|[1, 2]|[A -> 1, B -> 2]|1 |
+-----+---+------+------+----------------+------------+
To look up a map present inside a column based on another column containing the key, you can use a UDF or use the map function on the dataframe, as I am showing below.
//A map is looked up using a Column key.
df.map { row =>
  val m = row.getAs[Map[String, Int]]("map")
  val k = row.getAs[String]("key")
  val v = m.getOrElse(k, 0)
  (m, k, v)
}.toDF("map", "key", "value").show(false)
Result
+----------------+---+-----+
|map |key|value|
+----------------+---+-----+
|[A -> 1, B -> 2]|A |1 |
+----------------+---+-----+
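The UDF alternative mentioned above could look roughly like the sketch below (and on Spark 2.4+, element_at($"map", $"key") may do the same lookup without a UDF):
import org.apache.spark.sql.functions._

val lookUpByKeyColumn = udf((m: Map[String, Int], k: String) => m.getOrElse(k, 0))
df.withColumn("lookedUp", lookUpByKeyColumn($"map", $"key"))
  .select("map", "key", "lookedUp")
  .show(false)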
I think a simpler option could be to use typedLit:
val map = typedLit(Map("A" -> 1, "B" -> 2))
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
map(col(ColumnWithKey))) as newColumn)
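Spelled out with the sample data used earlier in this question (and assuming the key column is called myCol), the typedLit version might look like this:
import spark.implicits._
import org.apache.spark.sql.functions._

val keyMap = typedLit(Map("A" -> 1, "B" -> 2))
val df = Seq("A", "A", "B", "B").toDF("myCol")

df.select(concat($"myCol", lit("|"), keyMap($"myCol")).as("newColumn")).show()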

In Spark, iterate through each column and find the max length

I am new to Spark Scala and I have the following situation:
I have a table "TEST_TABLE" on the cluster (it can be a Hive table).
I am converting it to a dataframe as:
scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")
Now the DF can be viewed as
scala> testDF.show()
COL1|COL2|COL3
----------------
abc|abcd|abcdef
a|BCBDFG|qddfde
MN|1234B678|sd
I want an output like below
COLUMN_NAME|MAX_LENGTH
COL1|3
COL2|8
COL3|6
Is this feasible in Spark Scala?
Plain and simple:
import org.apache.spark.sql.functions._
val df = spark.table("TEST_TABLE")
df.select(df.columns.map(c => max(length(col(c)))): _*).show()
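Since the question asks for a vertical COLUMN_NAME | MAX_LENGTH layout, the single aggregated row can be reshaped on the driver; this is a sketch assuming the TEST_TABLE columns are all strings.
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.table("TEST_TABLE")
// one row containing the max length of each column, aliased by column name
val maxRow = df.select(df.columns.map(c => max(length(col(c))).as(c)): _*).first()

df.columns
  .map(c => (c, maxRow.getAs[Int](c)))
  .toSeq
  .toDF("COLUMN_NAME", "MAX_LENGTH")
  .show(false)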
You can try it in the following way:
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._

val df = Seq(("abc", "abcd", "abcdef"),
             ("a", "BCBDFG", "qddfde"),
             ("MN", "1234B678", "sd"),
             (null, "", "sd")).toDF("COL1", "COL2", "COL3")
df.cache()

val output = df.columns
  .map(c => (c, df.agg(max(length(df(s"$c")))).as[Int].first()))
  .toSeq
  .toDF("COLUMN_NAME", "MAX_LENGTH")
output.show()
+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
| COL1| 3|
| COL2| 8|
| COL3| 6|
+-----------+----------+
I think it's a good idea to cache the input dataframe df to make the computation faster, since the aggregation above scans the data once per column.
Here is one more way to get the report with the column names laid out vertically.
scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.show(false)
+----+--------+------+
|COL1|COL2 |COL3 |
+----+--------+------+
|abc |abcd |abcdef|
|a |BCBDFG |qddfde|
|MN |1234B678|sd |
+----+--------+------+
scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)
scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]
scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
df3: org.apache.spark.sql.DataFrame = [max(COL1): int, max(COL2): int ... 1 more field]
scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3 |8 |6 |
+---------+---------+---------+
scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1 |_2 |
+----+---+
|COL1|3 |
|COL2|8 |
|COL3|6 |
+----+---+
scala>
To get the results into a Scala collection, say a Map():
scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala> result
res47: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala>

Spark 'join' DataFrame with List and return String

I have the following DataFrame:
DF1:
+------+---------+
|key1 |Value |
+------+---------+
|[k, l]| 1 |
|[m, n]| 2 |
|[o] | 3 |
+------+---------+
that needs to be 'joined' with another dataframe
DF2:
+----+
|key2|
+----+
|k |
|l |
|m |
|n |
|o |
+----+
so that the output looks like this:
DF3:
+--------------------+---------+
|key3 |Value |
+--------------------+---------+
|k:1 l:1 m:0 n:0 o:0 | 1 |
|k:0 l:0 m:1 n:1 o:0 | 2 |
|k:0 l:0 m:0 n:0 o:1 | 3 |
+--------------------+---------+
In other words, the output dataframe should have a column that is a string of all rows in DF2, and each element should be followed by a 1 or 0 indicating whether that element was present in the list in the column key1 of DF1.
I am not sure how to go about it. Is there a simple UDF I can write to accomplish what I want?
For an operation like this to be possible, DF2 has to be small enough to be collected to the driver, so you can just use a udf:
import spark.implicits._
import org.apache.spark.sql.functions._

val df1 = Seq(
  (Seq("k", "l"), 1), (Seq("m", "n"), 2), (Seq("o"), 3)
).toDF("key1", "value")

val df2 = Seq("k", "l", "m", "n", "o").toDF("key2")

// collect the full key list and mark every key as absent (0) by default
val keys = df2.as[String].collect.map((_, 0)).toMap

// flip the entries present in the row's key1 array to 1
val toKeyMap = udf((xs: Seq[String]) =>
  xs.foldLeft(keys)((acc, x) => acc + (x -> 1)))

df1.select(toKeyMap($"key1").alias("key3"), $"value").show(false)
// +-------------------------------------------+-----+
// |key3 |value|
// +-------------------------------------------+-----+
// |Map(n -> 0, m -> 0, l -> 1, k -> 1, o -> 0)|1 |
// |Map(n -> 1, m -> 1, l -> 0, k -> 0, o -> 0)|2 |
// |Map(n -> 0, m -> 0, l -> 0, k -> 0, o -> 1)|3 |
// +-------------------------------------------+-----+
If you want just a string:
val toKeyMapString = udf((xs: Seq[String]) =>
  xs.foldLeft(keys)((acc, x) => acc + (x -> 1))
    .map { case (k, v) => s"$k: $v" }
    .mkString(" ")
)

df1.select(toKeyMapString($"key1").alias("key3"), $"value").show(false)
// +------------------------+-----+
// |key3 |value|
// +------------------------+-----+
// |n: 0 m: 0 l: 1 k: 1 o: 0|1 |
// |n: 1 m: 1 l: 0 k: 0 o: 0|2 |
// |n: 0 m: 0 l: 0 k: 0 o: 1|3 |
// +------------------------+-----+
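The string above follows the arbitrary iteration order of the Scala Map; if the order of DF2 (k l m n o) matters, as in the expected DF3, one sketch is to fold over the ordered key list instead:
val orderedKeys = df2.as[String].collect()   // k, l, m, n, o in DF2 order

val toOrderedString = udf((xs: Seq[String]) => {
  val present = xs.toSet
  orderedKeys.map(k => s"$k:${if (present(k)) 1 else 0}").mkString(" ")
})

df1.select(toOrderedString($"key1").alias("key3"), $"value").show(false)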

How to convert map to dataframe?

m is a map as follows:
scala> m
res119: scala.collection.mutable.Map[Any,Any] = Map(A-> 0.11164610291904906, B-> 0.11856755943424617, C -> 0.1023171832681312)
I want to get:
name score
A 0.11164610291904906
B 0.11856755943424617
C 0.1023171832681312
How to get the final dataframe?
First convert it to a Seq, then you can use the toDF() function.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val m = Map("A" -> 0.11164610291904906, "B" -> 0.11856755943424617, "C" -> 0.1023171832681312)
val df = m.toSeq.toDF("name", "score")
df.show()
Will give you:
+----+-------------------+
|name| score|
+----+-------------------+
| A|0.11164610291904906|
| B|0.11856755943424617|
| C| 0.1023171832681312|
+----+-------------------+
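If, as in the question, the map is really a mutable Map[Any,Any], toDF needs concrete element types (there is no encoder for Any), so one sketch is to cast the entries first:
import spark.implicits._

val m: scala.collection.mutable.Map[Any, Any] =
  scala.collection.mutable.Map(
    "A" -> 0.11164610291904906,
    "B" -> 0.11856755943424617,
    "C" -> 0.1023171832681312)

val df = m.toSeq
  .map { case (k, v) => (k.toString, v.toString.toDouble) } // concrete (String, Double) pairs
  .toDF("name", "score")

df.show()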