Making a column a map in Spark Scala

Say I have this db:
{"customerIdMarketplace": 1234, itemId: "rocks"}
{"customerIdMarketplace": 1234, itemId: "pebbles"}
{"customerIdMarketplace": 1234, itemId: "papers"}
{"customerIdMarketplace": 2345, itemId: "socks"}
{"customerIdMarketplace": 2345, itemId: "shoes"}
I want to create this dataset:
{"customerIdMarketplace": 1234, items: [{"id": "rocks"}, {"id":"pebbles"}, {"id": "papers"}]}
{"customerIdMarketplace": 2345, items: [{"id": "socks"}, {id: "shoes"}]}
The goal is to group all the items into a list if the customerIdMarketplace is the same, and represent it as a map in a sense where each item has the key id.
What I have so far is:
db.select("customerIdMarketplace", "itemId")
.groupby("customerIdMarketplace")
.agg(collect_set("itemId").as("items"))
But this doesn't make the id key. How can I do this correctly?

The collect_set call is fine, but you need to make a JSON object for each itemId, so you need to create a map first. Here is the snippet from the Spark shell:
// customerId is short for customerIdMarketplace in the snippets below
scala> db.select("customerId", "itemId").groupBy("customerId").agg(collect_set(to_json(map(lit("id"), col("itemId"))))).show(false)
+----------+---------------------------------------------------+
|customerId|collect_set(to_json(map(id, itemId)))              |
+----------+---------------------------------------------------+
|1234      |[{"id":"papers"}, {"id":"pebbles"}, {"id":"rocks"}]|
|2345      |[{"id":"shoes"}, {"id":"socks"}]                   |
+----------+---------------------------------------------------+
If you don't need JSON, just use map:
scala> db.select("customerId", "itemId").groupBy("customerId").agg(collect_list(map(lit("id"), col("itemId")))).show(false)
+----------+------------------------------------------------+
|customerId|collect_list(map(id, itemId))                   |
+----------+------------------------------------------------+
|1234      |[[id -> rocks], [id -> pebbles], [id -> papers]]|
|2345      |[[id -> socks], [id -> shoes]]                  |
+----------+------------------------------------------------+
Note that if you want to store them as maps instead of JSON, you cannot use the collect_set method, since comparing maps (needed to check for duplicate insertion) is not supported in Spark; use collect_list as above.
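If the target is exactly the shape from the question (an items array where every element has an id key), a struct works as well; unlike maps, structs can be compared, so collect_set can still deduplicate them. A minimal sketch, assuming the original column names from the question:
import org.apache.spark.sql.functions._

// Each itemId becomes a one-field struct, so the grouped column
// serializes as [{"id": "rocks"}, {"id": "pebbles"}, ...] when written as JSON.
val result = db
  .groupBy("customerIdMarketplace")
  .agg(collect_set(struct(col("itemId").as("id"))).as("items"))

// e.g. result.write.json("output") yields records shaped like
// {"customerIdMarketplace":1234,"items":[{"id":"rocks"},{"id":"pebbles"},{"id":"papers"}]}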

Related

Splitting an array containing nested arrays using Scala in Azure Databricks

I'm currently working on a project where I have to extract some horribly nested data out of a JSON document (output from a Log Analytics REST API call). An example of the document structure is below (I have a lot more columns):
{
  "tables": [
    {
      "name": "PrimaryResult",
      "columns": [
        {
          "name": "Category",
          "type": "string"
        },
        {
          "name": "count_",
          "type": "long"
        }
      ],
      "rows": [
        [
          "Administrative",
          20839
        ],
        [
          "Recommendation",
          122
        ],
        [
          "Alert",
          64
        ],
        [
          "ServiceHealth",
          11
        ]
      ]
    }
  ]
}
I have managed to extract this JSON document into a DataFrame, but I am stumped as to where to go from here.
The goal I am trying to achieve is an output like the below:
[{
  "Category": "Administrative",
  "count_": 20839
},
{
  "Category": "Recommendation",
  "count_": 122
},
{
  "Category": "Alert",
  "count_": 64
},
{
  "Category": "ServiceHealth",
  "count_": 11
}]
Ideally, I would like to use my columns array as the headers for each record. Then I want to split out each record array from the parent rows array into its own record.
So far, I have tried flattening my raw imported data frame but this won't work as the rows data is an array of arrays.
How would I go about solving this conundrum?
It's a bit messy to deal with this, but here's a way to do it:
val df = spark.read.option("multiline",true).json("filepath")
val result = df.select(explode($"tables").as("tables"))
.select($"tables.columns".as("col"), explode($"tables.rows").as("row"))
.selectExpr("inline(arrays_zip(col, row))")
.groupBy()
.pivot($"col.name")
.agg(collect_list($"row"))
.selectExpr("inline(arrays_zip(Category, count_))")
result.show
+--------------+------+
| Category|count_|
+--------------+------+
|Administrative| 20839|
|Recommendation| 122|
| Alert| 64|
| ServiceHealth| 11|
+--------------+------+
To get the JSON output, you can do
val result_json = result.agg(to_json(collect_list(struct("Category", "count_"))).as("json"))
result_json.show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"Category":"Administrative","count_":"20839"},{"Category":"Recommendation","count_":"122"},{"Category":"Alert","count_":"64"},{"Category":"ServiceHealth","count_":"11"}]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Or you can save to JSON files, e.g. with result.write.json("output").
Another way, using the transform function (available in Spark 2.4+):
import org.apache.spark.sql.functions._
// the $ column syntax assumes spark.implicits._ is in scope (it is in spark-shell)

val df = spark.read.option("multiline", true).json(inPath)
val df1 = df.withColumn("tables", explode($"tables"))
  .select($"tables.rows".as("rows"))
  .select(expr("inline(transform(rows, x -> struct(x[0] as Category, x[1] as count_)))"))
df1.show
//+--------------+------+
//|      Category|count_|
//+--------------+------+
//|Administrative| 20839|
//|Recommendation|   122|
//|         Alert|    64|
//| ServiceHealth|    11|
//+--------------+------+
Then save to a JSON file:
df1.write.json(outPath)
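If you want a single JSON array string here as well, the same collect_list/to_json step from the first approach applies to df1:
val df1_json = df1.agg(to_json(collect_list(struct("Category", "count_"))).as("json"))
df1_json.show(false)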

How to deal with an ambiguous column in nested JSON in Apache Spark

I have the nested JSON below with an ambiguous column. Basically, the objective is to rename one of the duplicated columns after reading:
[
  {
    "name": "Nish",
    "product": "Headphone",
    "Delivery": {
      "name": "Nisha",
      "address": "Chennai",
      "mob": "1234567"
    }
  }
]
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local")
  .appName("dealWithAmbigousColumnNestedJson").getOrCreate()

val readJson = spark.read.option("multiLine", true).json("input1.json")
val dropDF = readJson.select("*", "Delivery.*").drop("Delivery")
I attempted till here but don't know how to proceed further.
You can simply use withColumnRenamed and change the name of one or both of those columns:
readJson.withColumnRenamed("name", "buyer_name")
.select("*","Delivery.*")
.withColumnRenamed("name", "delivery_name")
.drop("Delivery")
.show()
Which gives:
+----------+---------+-------+-------+-------------+
|buyer_name| product|address| mob|delivery_name|
+----------+---------+-------+-------+-------------+
| Nish|Headphone|Chennai|1234567| Nisha|
+----------+---------+-------+-------+-------------+
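Equivalently, you can do both renames in a single select by aliasing the top-level and nested fields explicitly; a sketch assuming the same schema as above:
import org.apache.spark.sql.functions.col

readJson.select(
  col("name").as("buyer_name"),
  col("product"),
  col("Delivery.name").as("delivery_name"),
  col("Delivery.address"),
  col("Delivery.mob")
).show()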

How to read a JSON file into a Map, using Scala

How can I read a JSON file into a Map using Scala? I've been trying to accomplish this, but the JSON I am reading is nested and I have not found a way to easily extract it into keys because of that. Scala also seems to want to convert the nested JSON string into an object; instead, I want the nested JSON as a String "value". I am hoping someone can clarify or give me a hint on how I might do this.
My JSON source might look something like this:
{
  "authKey": "34534645645653455454363",
  "member": {
    "memberId": "whatever",
    "firstName": "Jon",
    "lastName": "Doe",
    "address": {
      "line1": "Whatever Rd",
      "city": "White Salmon",
      "state": "WA",
      "zip": "98672"
    },
    "anotherProp": "wahtever"
  }
}
I want to extract this JSON into a Map of 2 keys without drilling into the nested JSON. Is this possible? Once I have the Map, my intention is to add the key-values to my POST request headers, like so:
val sentHeaders = Map("Content-Type" -> "application/javascript",
"Accept" -> "text/html", "authKey" -> extractedValue,
"member" -> theMemberInfoAsStringJson)
http("Custom headers")
.post("myUrl")
.headers(sentHeaders)
Since the question is tagged 'gatling': under the hood this library depends on Jackson (fasterxml) for JSON processing, so we can make use of it.
There is no way to retrieve a nested structured part of the JSON as a String directly, but with very little additional code the result can still be achieved.
So, having the input JSON:
val json = """{
| "authKey": "34534645645653455454363",
| "member": {
| "memberId": "whatever",
| "firstName": "Jon",
| "lastName": "Doe",
| "address": {
| "line1": "Whatever Rd",
| "city": "White Salmon",
| "state": "WA",
| "zip": "98672"
| },
| "anotherProp": "wahtever"
| }
|}""".stripMargin
A Jackson ObjectMapper can be created and configured for use with Scala:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
To parse the input json easily, a dedicated case class is useful:
case class SrcJson(authKey: String, member: Any) {
  val memberAsString = mapper.writeValueAsString(member)
}
We also include val memberAsString, which will contain our target JSON string, obtained by serializing the initially parsed member back to JSON (after parsing, member is actually a Map).
Now, to parse the input json:
val parsed = mapper.readValue(json, classOf[SrcJson])
The references parsed.authKey and parsed.memberAsString will contain the values you are looking for.
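Tying this back to the original goal, those values can then be dropped straight into the header map from the question:
val sentHeaders = Map(
  "Content-Type" -> "application/javascript",
  "Accept"       -> "text/html",
  "authKey"      -> parsed.authKey,
  "member"       -> parsed.memberAsString
)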
Have a look at the Scala Play library - it has support for handling JSON. From what you describe, it should be pretty straightforward to read in the JSON and get the string value from any desired node.
Scala Play - JSON

What is a good ExtJS component to manage a Map data structure with dynamic key/value pairs

I'd like to ask for your advice with the following problem.
I have an entity representation in JSON that looks like this:
{
  id: 1,
  prop1: "someValue",
  prop2: "someValue",
  dynamicProperties: {
    name1: "value1",
    name2: "value2",
    name3: "value3"
  }
}
As you can see, my entity has properties "prop1" and "prop2" which can take any value, and it also has a property "dynamicProperties" which can have a variable number of properties (e.g. "name1", "name2", "name3", and so on).
I want my users to be able to create/update/delete the property dynamicProperties. That is, it will be possible to add a new property "name4" with value "value4" and change the value of the property "name2" and/or delete the pair "name1"/"value1".
I initially thought about using Ext.grid.PropertyGrid in order to show the dynamicProperties. This allows me to edit the values of my properties, but it doesn't allow me to add new properties. By default, the name column of PropertyGrid is not editable and I haven't been able to change this.
Is there a way to achieve what I am looking for?
The only alternative that I have thought of is to change my JSON representation to something like the following JSON and use a regular Ext.grid.Panel to manage it:
{
  id: 1,
  prop1: "someValue",
  prop2: "someValue",
  dynamicProperties: [
    {
      name: "name1",
      value: "value1"
    },
    {
      name: "name2",
      value: "value2"
    },
    {
      name: "name3",
      value: "value3"
    }
  ]
}
I don't like this approach because I would need to validate that the name must be unique, and maybe add an id property.
On the backend I use a Java Entity where the dynamicProperties is a HashMap like this:
@ElementCollection
@MapKeyColumn(name="name")
@Column(name="value")
@CollectionTable(name="entity_dynamic_properties", joinColumns=@JoinColumn(name="entity_id"))
private Map<String, String> dynamicProperties;
Thanks for your advice.
You can include additional components inside the row editor.
In your case I would suggest putting a cell-editing grid beneath the non-dynamic properties that allows for two fields (name/value), where users can type what they want.
I ran into a similar problem and solved it with this question and answer, which I think covers the basics of how to do this.

How do you model case classes to reflect database query results in a reusable manner

I will go with an example.
Say I have three tables defined like this:
(pseudocode)
Realm
  id: number, pk
  name: text, not null

Family
  id: number, pk
  realm_id: number, fk to Realm, pk
  name: text, not null

Species
  id: number, pk
  realm_id: number, fk to Family (and therefore to Realm), pk
  family_id: number, fk to Family, pk
  name: text, not null
A tentative case class definition would be:
case class Realm(
  id: Int,
  name: String
)

case class Family(
  id: Int,
  realm: Realm,
  name: String
)

case class Species(
  id: Int,
  family: Family,
  name: String
)
If I make JSON out of this after querying the database with the following query, it would look like this:
SELECT *
FROM realm
JOIN family
ON family.realm_id = realm.id
JOIN species
ON species.family_id = family.id
AND species.realm_id = family.realm_id
Example data:
[{
  "id": 1,
  "family": {
    "id": 1,
    "name": "Mammal",
    "realm": {
      "id": 1,
      "name": "Animal"
    }
  },
  "name": "Human"
},
{
  "id": 2,
  "family": {
    "id": 1,
    "name": "Mammal",
    "realm": {
      "id": 1,
      "name": "Animal"
    }
  },
  "name": "Cat"
}]
OK so far. This is usable: if I need to show every species grouped by realm, I can transform the JsValue, or do the filtering in JavaScript, etc. However, when posting data back to the server, these classes seem a little awkward. If I want to add a new species I would have to post something like this:
{
  "id": ???,
  "family": {
    "id": 1,
    "name": "Mammal", // Awkward
    "realm": {
      "id": 1,
      "name": "Animal" // Awkward
    }
  },
  "name": "Cat"
}
Should my classes then be:
case class Realm(
  id: Int,
  name: Option[String]
)

case class Family(
  id: Int,
  realm: Realm,
  name: Option[String]
)

case class Species(
  id: Option[Int],
  family: Family,
  name: String
)
Like this, I can omit posting what seems to be unnecessary data, but then the class definitions don't reflect what is in the database, where these fields are not nullable.
Queries are projections of data, more or less like Table.map(function) => Table2. When data is extracted from the database and I don't get the name field, it doesn't mean it is null. How do you deal with these things?
One way to deal with it is to represent the interconnection using other data structures instead of letting each level know about the next.
For example, in the places where you need to represent the entire tree, you could represent it with:
Map[Realm, Map[Family, Seq[Species]]]
And then use just Realm in some places, for example as a REST/JSON resource, and maybe a (Species, Family, Realm) tuple where you only want to work with one species but need to know about the other two levels in the hierarchy.
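As a hedged illustration of that idea (all definitions below are hypothetical, not from the original post): flat case classes carry only their own columns plus foreign-key ids, and the tree is assembled separately:
// Flat classes: each level knows only its own columns and parent ids.
case class Realm(id: Int, name: String)
case class Family(id: Int, realmId: Int, name: String)
case class Species(id: Int, realmId: Int, familyId: Int, name: String)

// Assemble the full hierarchy only where it is actually needed.
def buildTree(
    realms: Seq[Realm],
    families: Seq[Family],
    species: Seq[Species]
): Map[Realm, Map[Family, Seq[Species]]] = {
  val familiesByRealm = families.groupBy(_.realmId)
  val speciesByFamily = species.groupBy(_.familyId)
  realms.map { realm =>
    val fams = familiesByRealm.getOrElse(realm.id, Seq.empty)
    realm -> fams.map(f => f -> speciesByFamily.getOrElse(f.id, Seq.empty)).toMap
  }.toMap
}
The same flat classes also work as POST bodies, since the client only needs to send ids for the parent levels rather than the whole nested structure.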
I would also advise you to think two or three times about letting your model structure define your JSON structure: what happens to the code that consumes your JSON when you change anything in your model classes? (And if you really want that, do you actually need to go via a model structure at all? Why not build your JSON directly from the database results and skip one level of data transformation?)