Converting a dataframe into a hashmap where Key is int and Value is a list in Scala - scala

I have a dataframe which looks like this:
key    words
1      ['a', 'test']
2      ['hi', 'there']
And I would like to create the following hashmap:
Map(1 -> ['a', 'test'], 2 -> ['hi', 'there'])
But I cannot figure out how to do this. Can anyone help me?
Thanks!

There must be dozens of ways of doing this. One would be:
import scala.collection.mutable

df.collect()
  .map(row => row.getAs[Int](0) -> row.getAs[mutable.WrappedArray[String]](1))
  .toMap
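An alternative sketch that avoids referencing WrappedArray directly, by reading the array column through Row.getSeq (assuming the column names from the question):
val result: Map[Int, Seq[String]] =
  df.collect().map(row => row.getAs[Int]("key") -> row.getSeq[String](1)).toMap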

This is very similar to the solution in this question. The following should give you the output you want. It gathers all of the per-row maps into a collection and then uses a UDF to merge them into a single map. This comes with the usual caveats about the potentially poor performance of UDFs.
import org.apache.spark.sql.functions.{col, collect_list, lit, map, udf}
import spark.implicits._ // for .toDF

val joinMap = udf { values: Seq[Map[Int, Seq[String]]] =>
  values.flatten.toMap
}

val df = Seq((1, Seq("a", "test")), (2, Seq("hi", "there"))).toDF("key", "words")

val rDf = df
  .select(lit(1) as "id", map(col("key"), col("words")) as "kwMap")
  .groupBy("id")
  .agg(collect_list(col("kwMap")) as "kwMaps")
  .select(joinMap(col("kwMaps")) as "map")
rDf.show
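If you then want the result as a plain Scala Map on the driver, a minimal follow-up sketch (assuming the single aggregated row produced above):
val asScalaMap: Map[Int, Seq[String]] =
  rDf.head().getMap[Int, Seq[String]](0).toMap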

Related

How to extract data from MapType Scala Spark Column as Scala Map?

Well, the question is pretty much that. Let me provide a sample:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}

val data = List(
  Row("miley",
    Map(
      "good_songs" -> "wrecking ball",
      "bad_songs" -> "younger now"
    )
  ),
  Row("kesha",
    Map(
      "good_songs" -> "tik tok",
      "bad_songs" -> "rainbow"
    )
  )
)

val schema = List(
  StructField("singer", StringType, true),
  StructField("songs", MapType(StringType, StringType, true))
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// This returns scala.collection.Map[Nothing,Nothing]
someDF.select($"songs").head().getMap(0)

// Therefore, this won't work:
val myHappyMap: Map[String, String] = someDF.select($"songs").head().getMap(0)
I don't understand why I'm getting a Map[Nothing, Nothing] when I have properly described the desired schema for the MapType column. Not only that: when I run someDF.schema, I get
org.apache.spark.sql.types.StructType = StructType(StructField(singer,StringType,true), StructField(songs,MapType(StringType,StringType,true),true)), showing that the DataFrame schema is set correctly.
I've read "extract or filter MapType of Spark DataFrame" and also "How to get keys and values from MapType column in SparkSQL DataFrame". I thought the latter would solve my problem by at least letting me extract the keys and the values separately, but I still get the values as WrappedArray(Nothing), which just adds extra complication for no real gain.
What am I missing here?
.getMap is a typed method, and it cannot infer the types of your map, so you have to tell it explicitly:
val myHappyMap: Map[String, String] = someDF.select($"songs").head().getMap[String, String](0).toMap
The .toMap at the end just converts the result from scala.collection.Map to scala.collection.immutable.Map (they are different types, and when you declare the type you are usually referring to the latter).
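For example, a usage sketch with the DataFrame from the question (the songs column is at index 1 when taking the whole row; spark.implicits._ is assumed to be in scope, as in the snippets above):
val mileySongs: Map[String, String] =
  someDF.filter($"singer" === "miley").head().getMap[String, String](1).toMap
// mileySongs("good_songs")  // "wrecking ball"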

unable to store row elements of a dataset, via mapPartitions(), in variables

I am trying to create a Spark Dataset and then, using mapPartitions, access each of its elements and store them in variables. I am using the following piece of code:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sql("select col1,col2,col3 from table limit 10")
val schema = StructType(Seq(
StructField("col1", StringType),
StructField("col2", StringType),
StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions{iterator => { val myList = iterator.toList
myList.map(x=> { val value1 = x.getString(0)
val value2 = x.getString(1)
val value3 = x.getString(2)}).iterator}} (encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Ultimately, I want to store the row elements in variables and perform some operations on them. I am not sure what I am missing here. Any help would be highly appreciated!
Actually, there are several problems with your code:
Your map statement has no return value, so its result type is Unit.
If you return a tuple of Strings from mapPartitions, you don't need a RowEncoder: you are not returning a Row but a Tuple3, which needs no explicit encoder because it is a Product.
You can write your code like this:
import spark.implicits._ // for .toDF and the $-notation below

df
  .mapPartitions { itr => itr.map(x => (x.getString(0), x.getString(1), x.getString(2))) }
  .toDF("col1", "col2", "col3") // convert the Dataset of tuples back to a DataFrame with the desired field names
But you could just use a simple select statement in the DataFrame API; there is no need for mapPartitions here:
df
.select($"col1",$"col2",$"col3")
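For completeness, if you do want to keep returning Row objects from mapPartitions, a sketch that reuses the RowEncoder from the question would look like this (the explicit encoder is needed because Row itself carries no type information):
df.mapPartitions { iterator =>
  iterator.map(x => Row(x.getString(0), x.getString(1), x.getString(2)))
}(encoder)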

Aggregating a JSON array into Map<key, list> in Spark 2.x

I am quite new to Spark. I have an input JSON file which I am reading as
val df = spark.read.json("/Users/user/Desktop/resource.json");
Contents of resource.json looks like this:
{"path":"path1","key":"key1","region":"region1"}
{"path":"path112","key":"key1","region":"region1"}
{"path":"path22","key":"key2","region":"region1"}
Is there any way to process this dataframe and aggregate the result as
Map<key, List<data>>
where data is each JSON object in which the key is present?
For example, the expected result is:
Map<key1 =[{"path":"path1","key":"key1","region":"region1"}, {"path":"path112","key":"key1","region":"region1"}] ,
key2 = [{"path":"path22","key":"key2","region":"region1"}]>
Any references/documents/links to help me proceed further would be a great help.
Thank you.
Here is what you can do:
import org.json4s._
import org.json4s.jackson.Serialization.read
case class cC(path: String, key: String, region: String)
val df = spark.read.json("/Users/user/Desktop/resource.json");
scala> df.show
+----+-------+-------+
| key| path| region|
+----+-------+-------+
|key1| path1|region1|
|key1|path112|region1|
|key2| path22|region1|
+----+-------+-------+
// Please note that the original json structure is gone. Use .toJSON to get the json back,
// extract the key from each json string, and build an RDD[(String, String)], i.e. RDD[(key, json)].
val rdd = df.toJSON.rdd.map { m =>
  implicit val formats = DefaultFormats
  val parsedObj = read[cC](m)
  (parsedObj.key, m)
}
scala> rdd.collect.groupBy(_._1).map(m => (m._1,m._2.map(_._2).toList))
res39: scala.collection.immutable.Map[String,List[String]] = Map(key2 -> List({"key":"key2","path":"path22","region":"region1"}), key1 -> List({"key":"key1","path":"path1","region":"region1"}, {"key":"key1","path":"path112","region":"region1"}))
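If the data is too large to collect and group on the driver, a variant of the last step that groups on the cluster first (a sketch using the same rdd as above):
val grouped: Map[String, List[String]] =
  rdd.groupByKey().mapValues(_.toList).collect().toMap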
You can use groupBy with collect_list, which is an aggregation function that collects all matching values into a list per key.
Notice that the original JSON strings are already "gone" (Spark parses them into individual columns), so if you really want a list of all records (with all their columns, including the key), you can use the struct function to combine columns into one column:
import org.apache.spark.sql.functions._
import spark.implicits._
df.groupBy($"key")
.agg(collect_list(struct($"path", $"key", $"region")) as "value")
The result would be:
+----+--------------------------------------------------+
|key |value |
+----+--------------------------------------------------+
|key1|[[path1, key1, region1], [path112, key1, region1]]|
|key2|[[path22, key2, region1]] |
+----+--------------------------------------------------+
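If you then need the result on the driver in the Map<key, List<data>> shape from the question, a minimal follow-up sketch that collects the aggregated rows into a Scala Map[String, Seq[Row]]:
import org.apache.spark.sql.Row

val resultMap: Map[String, Seq[Row]] =
  df.groupBy($"key")
    .agg(collect_list(struct($"path", $"key", $"region")) as "value")
    .collect()
    .map(r => r.getString(0) -> r.getSeq[Row](1))
    .toMap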

Convert HadoopRDD to DataFrame

In EMR Spark, I have a HadoopRDD
org.apache.spark.rdd.RDD[(org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable)] = HadoopRDD[0] at hadoopRDD
I want to convert this to DataFrame org.apache.spark.sql.DataFrame.
Does anyone know how to do this?
First convert it to simple types. Let's say your DynamoDBItemWritable has just one string column:
val simple: RDD[(String, String)] = rdd.map {
  case (text, dbwritable) => (text.toString, dbwritable.getString(0))
}
Then you can use toDF to get a DataFrame:
import sqlContext.implicits._
val df: DataFrame = simple.toDF()
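In Spark 2.x the same conversion works through the SparkSession's implicits, and you can name the columns directly (a sketch, assuming the two-column shape above):
import spark.implicits._
val named = simple.toDF("text", "value")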

Spark-Scala RDD

I have an RDD, RDD1, with the following schema:
RDD[(String, Array[String])]
and I would like to create a new RDD, RDD2, where each row is a (String, String) pair whose key and value come from RDD1.
For example:
RDD1 = Array(("Fruit", ("Orange", "Apple", "Peach")), ("Shape", ("Square", "Rectangle")), ("Mathematician", ("Aryabhatt")))
I want the output to be as:
RDD2 = Array(("Fruit","Orange"),("Fruit","Apple"),("Fruit","Peach"),("Shape","Square"),("Shape","Rectangle"),("Mathematician","Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that this is because the inner map is being applied to each string/char rather than to an RDD. Please help me with a way to use nested functions for this purpose in Spark.
Break down the problem.
("Fruit",Array("Orange","Apple","Peach") -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x)
Apply that function to your rdd:
rdd1.flatMap(flattenLine)
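A usage sketch with the example data from the question (sc is the SparkContext):
val rdd1 = sc.parallelize(Seq(
  ("Fruit", Array("Orange", "Apple", "Peach")),
  ("Shape", Array("Square", "Rectangle")),
  ("Mathematician", Array("Aryabhatt"))
))
val rdd2 = rdd1.flatMap(flattenLine)
// rdd2.collect() returns Array((Fruit,Orange), (Fruit,Apple), (Fruit,Peach), (Shape,Square), (Shape,Rectangle), (Mathematician,Aryabhatt))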