spark: how to merge rows to array of jsons - scala

Input:
id1   id2   name  value        epid
"xxx" "yyy" "EAN" "5057723043" "1299"
"xxx" "yyy" "MPN" "EVBD"       "1299"
I want:
{ "id1": "xxx",
"id2": "yyy",
"item_specifics": [
{
"name": "EAN",
"value": "5057723043"
},
{
"name": "MPN",
"value": "EVBD"
},
{
"name": "EPID",
"value": "1299"
}
]
}
I tried the following, based on the two solutions from "How to aggregate columns into json array?" and "how to merge rows into column of spark dataframe as valid json to write it in mysql":
pi_df.groupBy(col("id1"), col("id2"))
//.agg(collect_list(to_json(struct(col("name"), col("value"))).alias("item_specifics"))) // => not working
.agg(collect_list(struct(col("name"),col("value"))).alias("item_specifics"))
But I got:
{ "name":"EAN","value":"5057723043", "EPID": "1299", "id1": "xxx", "id2": "yyy" }
How to fix this? Thanks

For Spark < 2.4
You can create two DataFrames, one with name and value and the other with "EPID" as the name and the epid value as the value, and union them together. Then aggregate with collect_set and build the JSON. The code should look like this.
//Creating Test Data
val df = Seq(("xxx","yyy" ,"EAN" ,"5057723043","1299"), ("xxx","yyy" ,"MPN" ,"EVBD", "1299") )
.toDF("id1", "id2", "name", "value", "epid")
df.show(false)
+---+---+----+----------+----+
|id1|id2|name|value |epid|
+---+---+----+----------+----+
|xxx|yyy|EAN |5057723043|1299|
|xxx|yyy|MPN |EVBD |1299|
+---+---+----+----------+----+
val df1 = df.withColumn("map", struct(col("name"), col("value")))
.select("id1", "id2", "map")
val df2 = df.withColumn("map", struct(lit("EPID").as("name"), col("epid").as("value")))
.select("id1", "id2", "map")
val jsonDF = df1.union(df2).groupBy("id1", "id2")
.agg(collect_set("map").as("item_specifics"))
.withColumn("json", to_json(struct("id1", "id2", "item_specifics")))
jsonDF.select("json").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------+
|json |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|{"id1":"xxx","id2":"yyy","item_specifics":[{"name":"MPN","value":"EVBD"},{"name":"EAN","value":"5057723043"},{"name":"EPID","value":"1299"}]}|
+---------------------------------------------------------------------------------------------------------------------------------------------+
For Spark >= 2.4
It provides an array_union method, which might help do this without the union of two DataFrames. I haven't tried it, though.
val jsonDF = df.withColumn("map1", struct(col("name"), col("value")))
.withColumn("map2", struct(lit("epid").as("name"), col("epid").as("value")))
.groupBy("id1", "id2")
.agg(collect_set("map1").as("item_specifics1"),
collect_set("map2").as("item_specifics2"))
.withColumn("item_specifics", array_union(col("item_specifics1"), col("item_specifics2")))
.withColumn("json", to_json(struct("id1", "id2", "item_specifics2")))

You're pretty close. I believe you're looking for something like this:
val pi_df2 = pi_df.withColumn("name", lit("EPID")).
withColumnRenamed("epid", "value").
select("id1", "id2", "name","value")
pi_df.select("id1", "id2", "name","value").
union(pi_df2).withColumn("item_specific", struct(col("name"), col("value"))).
groupBy(col("id1"), col("id2")).
agg(collect_list(col("item_specific")).alias("item_specifics")).
write.json(...)
The union brings the epid values back into item_specifics.
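One caveat: with collect_list, every row in a group contributes an EPID struct, so groups with more than one row typically end up with duplicate entries. Swapping in collect_set (or applying array_distinct on Spark 2.4+) deduplicates them, e.g.:
agg(collect_set(col("item_specific")).alias("item_specifics")).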

Here is what you need to do:
import scala.util.parsing.json.JSONObject
import scala.collection.mutable.WrappedArray
//Define the UDF
val jsonFun = udf((id1: String, id2: String, item_specifics: WrappedArray[Map[String, String]], epid: String) => {
  //Add epid to the item_specifics entries
  val item_withEPID = item_specifics :+ Map("epid" -> epid)
  val item_specificsArray = item_withEPID
    .map(m => Map("name" -> m.keys.toSeq(0), "value" -> m.values.toSeq(0)))
    .map(mi => JSONObject(mi).toString().replace("\\", ""))
    .mkString("[", ",", "]")
  //Add id1 and id2 to the output json
  //.toSeq keeps JSONObject's formatter from quoting and escaping the pre-built array string
  val m = Map("id1" -> id1, "id2" -> id2, "item_specifics" -> item_specificsArray.toSeq)
  JSONObject(m).toString().replace("\\", "")
})
val pi_df = Seq( ("xxx","yyy","EAN","5057723043","1299"), ("xxx","yyy","MPN","EVBD","1299")).toDF("id1","id2","name","value","epid")
//Add epid to the group-by columns, otherwise it will not be available after the group by and aggregation
val df = pi_df
  .groupBy(col("id1"), col("id2"), col("epid"))
  .agg(collect_list(map(col("name"), col("value"))).as("item_specifics"))
  .withColumn("item_specifics", jsonFun($"id1", $"id2", $"item_specifics", $"epid"))
df.show(false)
+---+---+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id1|id2|epid|item_specifics |
+---+---+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|xxx|yyy|1299|{"id1" : "xxx", "id2" : "yyy", "item_specifics" : [{"name" : "MPN", "value" : "EVBD"},{"name" : "EAN", "value" : "5057723043"},{"name" : "epid", "value" : "1299"}]}|
+---+---+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Content of the item_specifics column (pretty-printed):
{
"id1": "xxx",
"id2": "yyy",
"item_specifics": [{
"name": "MPN",
"value": "EVBD"
}, {
"name": "EAN",
"value": "5057723043"
}, {
"name": "epid",
"value": "1299"
}]
}

Related

Assign SQL schema to Spark DataFrame

I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField, and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.
This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?
create_table_sql = '''
CREATE TABLE public.example (
id LONG,
example VARCHAR(80)
)'''
spark.sql(create_table_sql)
schema = spark.sql("DESCRIBE public.example").collect()
s3_data = spark.read.\
option("delimiter", "|")\
.csv(
path="s3a://"+s3_bucket_path,
schema=schema
)\
.saveAsTable('public.example')
Yes, there is a way to create a schema from a string, although I am not sure if it really looks like SQL! You can use:
from pyspark.sql.types import _parse_datatype_string
_parse_datatype_string("id: long, example: string")
This will create the following schema:
StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))
Or you may have a complex schema as well:
schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")
StructType(
List(StructField(
customers,ArrayType(
StructType(
List(
StructField(id,LongType,true),
StructField(name,StringType,true),
StructField(address,StringType,true)
)
),true),true)
)
)
You can check for more examples here
Adding to what has already been said, building a schema (e.g. StructType-based or from DDL/JSON) is more straightforward in Scala Spark than in PySpark:
> import org.apache.spark.sql.types.StructType
> val s = StructType.fromDDL("customers array<struct<id: long, name: string, address: string>>")
> s
res3: org.apache.spark.sql.types.StructType = StructType(StructField(customers,ArrayType(StructType(StructField(id,LongType,true),StructField(name,StringType,true),StructField(address,StringType,true)),true),true))
> s.prettyJson
res9: String =
{
"type" : "struct",
"fields" : [ {
"name" : "customers",
"type" : {
"type" : "array",
"elementType" : {
"type" : "struct",
"fields" : [ {
"name" : "id",
"type" : "long",
"nullable" : true,
"metadata" : { }
}, {
"name" : "name",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "address",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
},
"containsNull" : true
},
"nullable" : true,
"metadata" : { }
} ]
}
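As a side note on the original problem: since Spark 2.3, a DDL string can also be turned into a schema (or passed to the reader directly), so the empty-table workaround can plausibly be skipped. A minimal Scala sketch, with an assumed bucket path and column types:
import org.apache.spark.sql.types.StructType

// Build the schema straight from a DDL string instead of creating an empty table
val schema = StructType.fromDDL("id LONG, example STRING")

spark.read
  .option("delimiter", "|")
  .schema(schema)                        // .schema("id LONG, example STRING") also works on 2.3+
  .csv("s3a://some-bucket/some-prefix")  // hypothetical path
  .write
  .saveAsTable("public.example")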

Recursive traverse JSON with circe-optics

I have a json with complex structure. Something like this:
{
"a":"aa",
"b":"bb",
"c":[
"aaa",
"bbb"
],
"d":{
"e":"ee",
"f":"ff"
}
}
And I want to uppercase all string values. The documentation says:
root.each.string.modify(_.toUpperCase)
But only the top-level values are updated, which is expected.
How can I make circe-optics traverse all string values recursively?
JSON structure is unknown in advance.
Here is the example on Scastie.
via comments:
I am expecting all string values uppercased, not only root values:
{
"a":"AA",
"b":"BB",
"c":[
"AAA",
"BBB"
],
"d":{
"e":"EE",
"f":"FF"
}
}
Here is a partial solution: it is not fully recursive, but it does handle the JSON from your example:
val level1UpperCase = root.each.string.modify(s => s.toUpperCase)
val level2UpperCase = root.each.each.string.modify(s => s.toUpperCase)
val uppered = (level1UpperCase andThen level2UpperCase)(json.right.get)
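For arbitrary nesting depth, one hand-rolled option (a sketch using circe's Json.fold rather than circe-optics) is:
import io.circe.Json

// Recursively uppercase every string value, whatever the nesting depth
def upperStrings(j: Json): Json = j.fold(
  Json.Null,
  b => Json.fromBoolean(b),
  n => Json.fromJsonNumber(n),
  s => Json.fromString(s.toUpperCase),
  arr => Json.fromValues(arr.map(upperStrings)),
  obj => Json.fromJsonObject(obj.mapValues(upperStrings))
)

val upperedAll = upperStrings(json.right.get)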
The following might be a new way to do this. Adding it here for completeness.
import io.circe.Json
import io.circe.parser.parse
import io.circe.optics.JsonOptics._
import monocle.function.Plated
val json = parse(
"""
|{
| "a":"aa",
| "b":"bb",
| "c":[
| "aaa",
| {"k": "asdads"}
| ],
| "d":{
| "e":"ee",
| "f":"ff"
| }
|}
|""".stripMargin).right.get
val transformed = Plated.transform[Json] { j =>
j.asString match {
case Some(s) => Json.fromString(s.toUpperCase)
case None => j
}
}(json)
println(transformed.spaces2)
gives
{
"a" : "AA",
"b" : "BB",
"c" : [
"AAA",
{
"k" : "ASDADS"
}
],
"d" : {
"e" : "EE",
"f" : "FF"
}
}

Parse nested maps in a function (gatling)

I have a map like this:
{
"user":
{
"name": "Jon Doe",
"age": "6",
"birthdate": {
"timestamp": 1456424096
},
"gender": "M"
}
}
and a function like this
def setUser(user: Map[String, Any]): Map[String, Any]={
var usr = Map("name"-> user.get("name").getOrElse(""),
"gender" -> user.get("gender").getOrElse(""),
"age" -> user.get("age").getOrElse(""),
"birthday" -> patient.get("birthdate"))
return usr
}
And I want to have the value of "timestamp" (1456424096) mapped to the "birthday" field.
For now I get this: Some%28%7Btimestamp%3D1456424096%7D%29 (i.e. Some({timestamp=1456424096}), URL-encoded).
I'm very new to this. Can someone help me get the value of "timestamp"?
Assuming that you just want to get rid of the nested birthday (not the nested user), it can look like this:
val oldData: Map[String, Any] = Map(
"user" -> Map(
"name" -> "John Doe",
"age" -> 6,
"birthday" -> Map("timestamp" -> 1234454666),
"gender" -> "M"
)
)
def flattenBirthday(userMap: Map[String, Any]) = Map(
"user" -> (userMap("user").asInstanceOf[Map[String, Any]] + (
"birthday" -> userMap("user").asInstanceOf[Map[String, Any]]("birthday").asInstanceOf[Map[String, Any]]("timestamp")
))
)
val newData = flattenBirthday(oldData)
But in general, dealing with nested immutable maps will be ugly. If you extract that data from JSON (as in your example), it is better to use a library to deserialize it into case class objects.
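If you do want to stay with plain Maps, here is a minimal sketch of the extraction the question asks for (field names are taken from the question; the empty-string fallbacks are an assumption):
def setUser(user: Map[String, Any]): Map[String, Any] = {
  // Pull the nested "timestamp" out of "birthdate", falling back to "" like the other fields
  val birthday = user.get("birthdate") match {
    case Some(b: Map[_, _]) => b.asInstanceOf[Map[String, Any]].getOrElse("timestamp", "")
    case _                  => ""
  }
  Map(
    "name"     -> user.getOrElse("name", ""),
    "gender"   -> user.getOrElse("gender", ""),
    "age"      -> user.getOrElse("age", ""),
    "birthday" -> birthday
  )
}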

How to convert column to vector type?

I have an RDD in Spark where the objects are based on a case class:
ExampleCaseClass(user: User, stuff: Stuff)
I want to use Spark's ML pipeline, so I convert this to a Spark data frame. As part of the pipeline, I want to transform one of the columns into a column whose entries are vectors. Since I want the length of that vector to vary with the model, it should be built into the pipeline as part of the feature transformation.
So I attempted to define a Transformer as follows:
class MyTransformer extends Transformer {
val uid = ""
val num: IntParam = new IntParam(this, "", "")
def setNum(value: Int): this.type = set(num, value)
setDefault(num -> 50)
def transform(df: DataFrame): DataFrame = {
...
}
def transformSchema(schema: StructType): StructType = {
val inputFields = schema.fields
StructType(inputFields :+ StructField("colName", ???, true))
}
def copy (extra: ParamMap): Transformer = defaultCopy(extra)
}
How do I specify the DataType of the resulting field (i.e. fill in the ???)? It will be a Vector of some simple class (Boolean, Int, Double, etc). It seems VectorUDT might have worked, but that's private to Spark. Since any RDD can be converted to a DataFrame, any case class can be converted to a custom DataType. However I can't figure out how to manually do this conversion, otherwise I could apply it to some simple case class wrapping the vector.
Furthermore, if I specify a vector type for the column, will VectorAssembler correctly process the vector into separate features when I go to fit the model?
Still new to Spark and especially to the ML Pipeline, so appreciate any advice.
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
def transformSchema(schema: StructType): StructType = {
val inputFields = schema.fields
StructType(inputFields :+ StructField("colName", VectorType, true))
}
Since Spark 2.0, SQLDataTypes.VectorType makes VectorUDT publicly available:
package org.apache.spark.ml.linalg
import org.apache.spark.annotation.{DeveloperApi, Since}
import org.apache.spark.sql.types.DataType
/**
* :: DeveloperApi ::
* SQL data types for vectors and matrices.
*/
#Since("2.0.0")
#DeveloperApi
object SQLDataTypes {
/** Data type for [[Vector]]. */
val VectorType: DataType = new VectorUDT
/** Data type for [[Matrix]]. */
val MatrixType: DataType = new MatrixUDT
}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
case class MyVector(vector: Vector)
val vectorDF = Seq(
MyVector(Vectors.dense(1.0,3.4,4.4)),
MyVector(Vectors.dense(5.5,6.7))
).toDF
vectorDF.printSchema
root
|-- vector: vector (nullable = true)
println(vectorDF.schema.fields(0).dataType.prettyJson)
{
"type" : "udt",
"class" : "org.apache.spark.mllib.linalg.VectorUDT",
"pyClass" : "pyspark.mllib.linalg.VectorUDT",
"sqlType" : {
"type" : "struct",
"fields" : [ {
"name" : "type",
"type" : "byte",
"nullable" : false,
"metadata" : { }
}, {
"name" : "size",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "indices",
"type" : {
"type" : "array",
"elementType" : "integer",
"containsNull" : false
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "values",
"type" : {
"type" : "array",
"elementType" : "double",
"containsNull" : false
},
"nullable" : true,
"metadata" : { }
} ]
}
}
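Putting it together, a minimal sketch of a custom Transformer that declares a vector output column via SQLDataTypes.VectorType (the input column "x", the output column "features", and the uid prefix are made up for illustration):
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.ml.param.{IntParam, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{StructField, StructType}

class MyTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("myTransformer"))

  val num: IntParam = new IntParam(this, "num", "length of the output vector")
  def setNum(value: Int): this.type = set(num, value)
  setDefault(num -> 50)

  override def transform(ds: Dataset[_]): DataFrame = {
    // Hypothetical logic: repeat the numeric column "x" into a dense vector of length $(num)
    val toVec = udf((x: Double) => Vectors.dense(Array.fill($(num))(x)))
    ds.withColumn("features", toVec(ds("x")))
  }

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("features", SQLDataTypes.VectorType, nullable = true))

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}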

PlayFramework: how to transform each element of a JSON array

Given the following JSON...
{
"values" : [
"one",
"two",
"three"
]
}
... how do I transform it like this in Scala/Play?
{
"values" : [
{ "elem": "one" },
{ "elem": "two" },
{ "elem": "three" }
]
}
It's easy with Play's JSON Transformers:
val json = Json.parse(
"""{
| "somethingOther": 5,
| "values" : [
| "one",
| "two",
| "three"
| ]
|}
""".stripMargin
)
// transform the array of strings to an array of objects
val valuesTransformer = __.read[JsArray].map {
case JsArray(values) =>
JsArray(values.map { e => Json.obj("elem" -> e) })
}
// update the "values" field in the original json
val jsonTransformer = (__ \ 'values).json.update(valuesTransformer)
// carry out the transformation
val transformedJson = json.transform(jsonTransformer)
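json.transform returns a JsResult, so you would typically unwrap it before use, for example:
transformedJson match {
  case JsSuccess(updated, _) => println(Json.prettyPrint(updated))
  case JsError(errors)       => println(s"Transformation failed: $errors")
}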
You can use Play's JSON APIs:
import play.api.libs.json._
val json = Json parse """
{
"values" : [
"one",
"two",
"three"
]
}
"""
val newArray = json \ "values" match {
case JsArray(values) => values.map { v => JsObject(Seq("elem" -> v)) }
}
// or Json.stringify if you don't need readability
val str = Json.prettyPrint(JsObject(Seq("values" -> JsArray(newArray))))
Output:
{
"values" : [ {
"elem" : "one"
}, {
"elem" : "two"
}, {
"elem" : "three"
} ]
}