Scala: how to enclose a map in a struct?

I have the following schema and called a udf on the column named referenceTypes:
|-- referenceTypes: struct (nullable = true)
| |-- map: map (nullable = true)
| | |-- key: string
| | |-- value: long (valueContainsNull = true)
The udf
val mapfilter = udf[Map[String,Long],Map[String,Long]](map => {
  map.keySet.exists(_ != "Family") // result is discarded, so this line has no effect
  val newMap = map.updated("Family",1L)
  newMap
})
Now after the udf is used my schema goes to this
|-- referenceTypes: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = false)
What do I do to get referenceTypes back as a struct with map as a sub-field? In other words, how do I convert it back to the original schema at the top, with the struct and the map one level below it? The bottom schema has to look like the top one again, but I don't know what changes to make to the udf.
I tried toArray (thinking it could become a struct) and toMap as well.
Basically I need to bring back the []:
actual : Map(Family -> 1)
EXPECTED : [Map(Family -> 1)]

You have to wrap the result in struct:
import org.apache.spark.sql.functions.struct
df.withColumn(
  "referenceTypes",
  struct(mapfilter($"referenceTypes.map").alias("map")))
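An alternative to wrapping the call in struct(...) is to have the UDF itself return a case class, which Spark encodes as a struct column. This is only a sketch, assuming Spark 2.x or later. The names ReferenceTypes, ReferenceTypesUdf, withFamily and mapFilterStruct are made up for illustration, and treating a null input map as empty is an assumption that the question does not specify.

```scala
import org.apache.spark.sql.functions.udf

// Hypothetical wrapper: its single field is named `map`, so the resulting
// struct has the same shape as the original referenceTypes column.
case class ReferenceTypes(map: Map[String, Long])

object ReferenceTypesUdf {
  // Pure function, so the logic can be checked without a Spark session.
  // Assumption: a null input map is treated as an empty map.
  def withFamily(m: Map[String, Long]): ReferenceTypes =
    ReferenceTypes(Option(m).getOrElse(Map.empty[String, Long]).updated("Family", 1L))

  // A UDF whose return type is a case class produces a struct column.
  val mapFilterStruct = udf(withFamily _)
}
```

It would be applied as df.withColumn("referenceTypes", ReferenceTypesUdf.mapFilterStruct($"referenceTypes.map")), with no separate struct(...) call.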

Related

Select column based on Map

I have the following dataframe df with the following schema:
|-- type: string (nullable = true)
|-- record_sales: array (nullable = false)
| |-- element: string (containsNull = false)
|-- record_marketing: array (nullable = false)
| |-- element: string (containsNull = false)
and a map
val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")
I want a new column "record" that is either the value of record_sales or record_marketing based on the value of type.
I've tried some variants of this:
val typeMapCol = typedLit(typemap)
val df2 = df.withColumn("record", col(typeMapCol(col("type"))))
But nothing has worked. Does anyone have any idea? Thanks!
You can iterate over the map typemap and use the when function to build one case/when expression per entry, depending on the value of the type column. coalesce then picks the first expression that is not null:
val recordCol = typemap.map{case (k,v) => when(col("type") === k, col(v))}.toSeq
val df2 = df.withColumn("record", coalesce(recordCol: _*))

How to 'flatten' a Spark schema with a variable number of columns?

This is the schema of a Spark DataFrame that I've created:
root
|-- id: double (nullable = true)
|-- sim_scores: struct (nullable = true)
| |-- scores: map (nullable = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: integer
| | | |-- value: vector (valueContainsNull = true)
The 'sim_scores' struct represents a Scala case class that I am using for aggregation purposes. I have a custom UDAF designed to merge these structs. They are shaped this way to make them merge-safe for all the edge cases. Let's assume for this question that they have to stay this way.
I would like to 'flatten' this DataFrame into something like:
root
|-- id: double (nullable = true)
|-- score_1: map (valueContainsNull = true)
| |-- key: integer
| |-- value: vector (valueContainsNull = true)
|-- score_2: map (valueContainsNull = true)
| |-- key: integer
| |-- value: vector (valueContainsNull = true)
|-- score_3: map (valueContainsNull = true)
| |-- key: integer
| |-- value: vector (valueContainsNull = true)
...
The outer MapType in the 'scores' struct maps score topics to documents; the inner maps, representing a document, map sentence position within a document to a vector score. The 'score_1', 'score_2', ... represent all possible keys of the 'scores' MapType in the initial DF.
In json-ish terms, if I had an input that looks like:
{ "id": 739874.0,
"sim_scores": {
"firstTopicName": {
1: [1,9,1,0,1,1,4,6],
2: [5,7,8,2,4,3,1,3],
...
},
"anotherTopic": {
1: [6,8,4,1,3,4,2,0],
2: [0,1,3,2,4,5,6,2],
...
}
}
}
then I would get an output
{ "id": 739874.0,
"firstTopicName": {
1: [1,9,1,0,1,1,4,6],
2: [5,7,8,2,4,3,1,3],
...
},
"anotherTopic": {
1: [6,8,4,1,3,4,2,0],
2: [0,1,3,2,4,5,6,2],
...
}
}
If I knew the total number of topic columns, this would be easy; but I do not. The number of topics is set by the user at runtime, so the output DataFrame has a variable number of columns. It is guaranteed to be >=1, but I need to design this so that it could work with 100 different topic columns, if necessary.
How can I implement this?
Last note: I'm stuck using Spark 1.6.3; so solutions that work with that version are best. However, I'll take any way of doing it in hopes of future implementation.
At a high level, I think you have two options here:
Use the DataFrame API
Switch to an RDD
If you want to keep using Spark SQL, then you could use selectExpr and generate the select query:
it("should flatten using dataframes and spark sql") {
  val sqlContext = new SQLContext(sc)
  val df = sqlContext.createDataFrame(sc.parallelize(rows), schema)
  df.printSchema()
  df.show()
  val numTopics = 3 // input from user
  // fancy logic to generate the select expression
  val selectColumns: Seq[String] = "id" +: 1.to(numTopics).map(i => s"sim_scores['scores']['topic${i}']")
  val df2 = df.selectExpr(selectColumns:_*)
  df2.printSchema()
  df2.show()
}
Given this sample data:
val schema = sql.types.StructType(List(
  sql.types.StructField("id", sql.types.DoubleType, nullable = true),
  sql.types.StructField("sim_scores", sql.types.StructType(List(
    sql.types.StructField("scores", sql.types.MapType(sql.types.StringType, sql.types.MapType(sql.types.IntegerType, sql.types.StringType)), nullable = true)
  )), nullable = true)
))
val rows = Seq(
  sql.Row(1d, sql.Row(Map("topic1" -> Map(1 -> "scores1"), "topic2" -> Map(1 -> "scores2")))),
  sql.Row(2d, sql.Row(Map("topic1" -> Map(1 -> "scores1"), "topic2" -> Map(1 -> "scores2")))),
  sql.Row(3d, sql.Row(Map("topic1" -> Map(1 -> "scores1"), "topic2" -> Map(1 -> "scores2"), "topic3" -> Map(1 -> "scores3"))))
)
You get this result:
root
|-- id: double (nullable = true)
|-- sim_scores.scores[topic1]: map (nullable = true)
| |-- key: integer
| |-- value: string (valueContainsNull = true)
|-- sim_scores.scores[topic2]: map (nullable = true)
| |-- key: integer
| |-- value: string (valueContainsNull = true)
|-- sim_scores.scores[topic3]: map (nullable = true)
| |-- key: integer
| |-- value: string (valueContainsNull = true)
+---+-------------------------+-------------------------+-------------------------+
| id|sim_scores.scores[topic1]|sim_scores.scores[topic2]|sim_scores.scores[topic3]|
+---+-------------------------+-------------------------+-------------------------+
|1.0| Map(1 -> scores1)| Map(1 -> scores2)| null|
|2.0| Map(1 -> scores1)| Map(1 -> scores2)| null|
|3.0| Map(1 -> scores1)| Map(1 -> scores2)| Map(1 -> scores3)|
+---+-------------------------+-------------------------+-------------------------+
The other option is to switch to processing an RDD where you could add more powerful flattening logic based on the keys in the map.
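When the topic names are not known in advance, one possible approach is to collect the distinct map keys first and then generate one column per key. The sketch below is an assumption-laden illustration, not a tested solution: the object FlattenTopics and its methods are hypothetical names, it costs an extra pass over the data to find the keys, and although it only uses calls I believe exist in Spark 1.6 (DataFrame.rdd, Row.getMap, Column.getItem), it has not been run against that version.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object FlattenTopics {
  // Collect every distinct topic name found in sim_scores.scores.
  // Rows where the struct or the map is null contribute no keys.
  def topicKeys(df: DataFrame): Seq[String] =
    df.select("sim_scores.scores").rdd
      .flatMap { row =>
        if (row.isNullAt(0)) Seq.empty[String]
        else row.getMap[String, Any](0).keys.toSeq
      }
      .distinct()
      .collect()
      .sorted
      .toSeq

  // Build "id" plus one column per topic, each named after its key.
  def topicColumns(keys: Seq[String]): Seq[Column] =
    col("id") +: keys.map(k => col("sim_scores.scores").getItem(k).as(k))

  // Rows that lack a given topic get null in that topic's column.
  def flatten(df: DataFrame): DataFrame =
    df.select(topicColumns(topicKeys(df)): _*)
}
```

Calling FlattenTopics.flatten(df) on the sample data above should then yield the columns id, topic1, topic2 and topic3, without hard-coding numTopics.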

Spark - How to parse a JSON-escaped String field as a JSON Object in DataFrames?

I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field on these JSON objects is a JSON-escaped String. Example
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}
When I create a data frame by reading the JSON file, I get the following:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]
As we can see, "escapedJsonPayload" is a String and I need it to be a Struct.
Note: I found a similar question on Stack Overflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?) but it gives me "[_corrupt_record: string]".
I have tried the steps below:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") (Work file)
val escapedJsons: RDD[String] = sc.parallelize(Seq("""df"""))
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) (This results in [_corrupt_record: string])
Any help would be appreciated
First of all, the JSON you have provided is syntactically invalid. The corrected JSON is as follows:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}
Next, to parse the above JSON correctly, you have to use the following code:
val rdd = spark.read.textFile("file:///home/akaspate/sample.json").toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).rdd
val df = spark.read.json(rdd)
The above code will give you the following output:
df.show(false)
+----------------+-------------------------------------+
|clientAttributes|escapedJsonPayload |
+----------------+-------------------------------------+
|[null,abc] |[WrappedArray([abc,xyz]),Akash,Patel]|
+----------------+-------------------------------------+
With the following schema:
df.printSchema
root
|-- clientAttributes: struct (nullable = true)
| |-- backfillId: string (nullable = true)
| |-- clientPrimaryKey: string (nullable = true)
|-- escapedJsonPayload: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemId: string (nullable = true)
| | | |-- itemName: string (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
I hope this helps!
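If the payload really is a quoted, JSON-escaped string as in the question, another option is to keep reading the file with spark.read.json and parse only that one column with from_json, which avoids string replacement altogether. This is a sketch, assuming Spark 2.1 or later (where from_json is available). The object name ParseEscapedPayload is hypothetical, and the schema is copied by hand from the printSchema output above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object ParseEscapedPayload {
  // Schema of the escaped payload, written out explicitly
  // (field names taken from the printed schema in the answer).
  val payloadSchema: StructType = StructType(Seq(
    StructField("name", StringType),
    StructField("surname", StringType),
    StructField("items", ArrayType(StructType(Seq(
      StructField("itemId", StringType),
      StructField("itemName", StringType)
    ))))
  ))

  // Replace the string column with a struct parsed from its contents.
  def parse(df: DataFrame): DataFrame =
    df.withColumn("escapedJsonPayload", from_json(col("escapedJsonPayload"), payloadSchema))
}
```

Usage would be ParseEscapedPayload.parse(spark.read.json("file:///home/akaspate/sample.json")). A payload that is itself malformed, such as one with an unclosed items array, will not parse cleanly and should be expected to come back null or incomplete.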

How to access elements in a Row RDD in Scala

My row RDD looks like this:
Array[org.apache.spark.sql.Row] = Array([1,[example1,WrappedArray([**Standford,Organisation,NNP], [is,O,VP], [good,LOCATION,ADP**])]])
I have got this from converting dataframe to rdd, dataframe schema was :
root
|-- article_id: long (nullable = true)
|-- sentence: struct (nullable = true)
| |-- sentence: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tokens: string (nullable = true)
| | | |-- ner: string (nullable = true)
| | | |-- pos: string (nullable = true)
Now how do I access elements in the row RDD? In a dataframe I can use df.select("sentence"). I would like to access elements like Stanford and other nested elements.
As @SarveshKumarSingh wrote in a comment, you can access the rows in an RDD[Row] like you would access any other element in an RDD. Accessing the elements in the row can be done in a couple of ways. Either simply call get like this:
rowRDD.map(row => row.get(2).asInstanceOf[MyType])
or if it is a built-in type, you can avoid the type cast:
rowRDD.map(row => row.getList(4))
or you might want to simply use pattern matching, like:
rowRDD.map{case Row(field1: Long, field2: MyType) => field2}
I hope this helps :)
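Applied to the schema in the question, the positional getters can be chained to reach the nested fields. The following is a minimal sketch: RowAccess and tokens are hypothetical names, the indices are read off the printed schema (sentence is field 1 of the row, attributes is field 1 of sentence, tokens is field 0 of each attribute), and null handling is left out for brevity.

```scala
import org.apache.spark.sql.Row

object RowAccess {
  // Extract the `tokens` field of every attribute in one row.
  // Assumes neither the sentence struct nor the attributes array is null.
  def tokens(row: Row): Seq[String] = {
    val sentence = row.getStruct(1)           // sentence: struct
    val attributes = sentence.getSeq[Row](1)  // attributes: array of structs
    attributes.map(_.getString(0)).toSeq      // tokens: string
  }
}
```

With the row RDD from the question this would be used as rowRDD.map(RowAccess.tokens).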

How to modify a Spark Dataframe with a complex nested structure?

I've a complex DataFrame structure and would like to null a column easily. I've created implicit classes that wire functionality and easily address 2D DataFrame structures but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example:
I have schema defined as:
StructType(
StructField(name,StringType,true),
StructField(data,ArrayType(
StructType(
StructField(name,StringType,true),
StructField(values,
MapType(StringType,StringType,true),
true)
),
true
),
true)
)
I'd like to produce a new DF that has the field data.value of MapType set to null, but as this is an element of an array I have not been able to figure out how. I would think it would be similar to:
df.withColumn("data.values", functions.array(functions.lit(null)))
but this ultimately creates a new column of data.values and does not modify the values element of the data array.
Since Spark 1.6, you can use case classes to map your dataframes (called datasets). Then, you can map your data and transform it to the new schema you want. For example:
case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])
val nullableDF = df.as[Root].map { root =>
val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
NullableRoot(root.name, nullableData)
}.toDF()
The resulting schema of nullableDF will be:
root
|-- name: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- values: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
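If the goal is to null out the existing values field rather than add a new one, the same Dataset approach can reuse the original case classes with copy, so no second pair of classes is needed. This is a sketch under assumptions: NullOutValues and clearValues are hypothetical names, and Root and Data are repeated here only to keep the block self-contained.

```scala
// Same shapes as the Root and Data case classes defined above.
case class Data(name: String, values: Map[String, String])
case class Root(name: String, data: Seq[Data])

object NullOutValues {
  // Return a copy of the record with `values` set to null in every
  // element of `data`; a null `data` array is left as it is.
  def clearValues(root: Root): Root =
    if (root.data == null) root
    else root.copy(data = root.data.map(_.copy(values = null)))
}
```

With the Dataset encoders in scope (for example via import spark.implicits._), this could be applied as df.as[Root].map(NullOutValues.clearValues).toDF(), which keeps the same field names and types as the input.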
I ran into the same issue. Assuming you don't need the result to have any new fields or fields with different types, here is a solution that can do this without having to redefine the whole struct: Change value of nested column in DataFrame