Splitting an array containing nested arrays using Scala in Azure Databricks

I'm currently working on a project where I have to extract some horribly nested data out of a JSON document (the output of a Log Analytics REST API call). An example of the document structure is below (I have a lot more columns):
{
  "tables": [
    {
      "name": "PrimaryResult",
      "columns": [
        {
          "name": "Category",
          "type": "string"
        },
        {
          "name": "count_",
          "type": "long"
        }
      ],
      "rows": [
        [
          "Administrative",
          20839
        ],
        [
          "Recommendation",
          122
        ],
        [
          "Alert",
          64
        ],
        [
          "ServiceHealth",
          11
        ]
      ]
    }
  ]
}
I have managed to extract this json document into a data frame but I am stumped as to where to go from here.
The goal I am trying to achieve is an output like the below:
[
  {
    "Category": "Administrative",
    "count_": 20839
  },
  {
    "Category": "Recommendation",
    "count_": 122
  },
  {
    "Category": "Alert",
    "count_": 64
  },
  {
    "Category": "ServiceHealth",
    "count_": 11
  }
]
Ideally, I would like to use my columns array as the headers for each record. Then I want to split each record array out of the parent rows array into its own record.
So far, I have tried flattening my raw imported data frame, but this won't work as the rows data is an array of arrays.
How would I go about solving this conundrum?

It's a bit messy to deal with this, but here's a way to do it:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("multiline", true).json("filepath")

val result = df.select(explode($"tables").as("tables"))
  .select($"tables.columns".as("col"), explode($"tables.rows").as("row"))
  .selectExpr("inline(arrays_zip(col, row))")
  .groupBy()
  .pivot($"col.name")
  .agg(collect_list($"row"))
  .selectExpr("inline(arrays_zip(Category, count_))")

result.show
result.show
+--------------+------+
|      Category|count_|
+--------------+------+
|Administrative| 20839|
|Recommendation|   122|
|         Alert|    64|
| ServiceHealth|    11|
+--------------+------+
To get the JSON output, you can do
val result_json = result.agg(to_json(collect_list(struct("Category", "count_"))).as("json"))
result_json.show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"Category":"Administrative","count_":"20839"},{"Category":"Recommendation","count_":"122"},{"Category":"Alert","count_":"64"},{"Category":"ServiceHealth","count_":"11"}]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Or you can save it as JSON, e.g. with result.write.json("output").

Another way, using the transform function:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("multiline", true).json(inPath)

val df1 = df.withColumn("tables", explode($"tables"))
  .select($"tables.rows".as("rows"))
  .select(expr("inline(transform(rows, x -> struct(x[0] as Category, x[1] as count_)))"))

df1.show
//+--------------+------+
//|      Category|count_|
//+--------------+------+
//|Administrative| 20839|
//|Recommendation|   122|
//|         Alert|    64|
//| ServiceHealth|    11|
//+--------------+------+
Then save to a JSON file:
df1.write.json(outPath)

Related

How to deal with an ambiguous column in nested JSON in Apache Spark

I have the nested JSON below with an ambiguous column. Basically, the objective is to rename one of the duplicated columns after reading:
[
  {
    "name": "Nish",
    "product": "Headphone",
    "Delivery": {
      "name": "Nisha",
      "address": "Chennai",
      "mob": "1234567"
    }
  }
]
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local")
  .appName("dealWithAmbigousColumnNestedJson").getOrCreate()
val readJson = spark.read.option("multiLine", true).json("input1.json")
val dropDF = readJson.select("*", "Delivery.*").drop("Delivery")
I got this far but don't know how to proceed further.
You can simply use withColumnRenamed and change the name of one or both of those columns:
readJson.withColumnRenamed("name", "buyer_name")
.select("*","Delivery.*")
.withColumnRenamed("name", "delivery_name")
.drop("Delivery")
.show()
Which gives:
+----------+---------+-------+-------+-------------+
|buyer_name|  product|address|    mob|delivery_name|
+----------+---------+-------+-------+-------------+
|      Nish|Headphone|Chennai|1234567|        Nisha|
+----------+---------+-------+-------+-------------+
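If you prefer to avoid the ambiguity altogether, an alternative sketch (not part of the answer above) is to select the nested fields explicitly and alias the two name columns:
import org.apache.spark.sql.functions.col

// Minimal sketch: select each field explicitly so no duplicate column name
// ever appears in the result.
val renamed = readJson.select(
  col("name").as("buyer_name"),
  col("product"),
  col("Delivery.name").as("delivery_name"),
  col("Delivery.address"),
  col("Delivery.mob"))
renamed.show()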

How to read a JSON file into a Map, using Scala

How can I read a JSON file into a Map using Scala? I've been trying to accomplish this, but the JSON I am reading is nested and I have not found a way to easily extract it into keys because of that. Scala seems to want to also convert the nested JSON string into an object. Instead, I want the nested JSON as a String "value". I am hoping someone can clarify or give me a hint on how I might do this.
My JSON source might look something like this:
{
  "authKey": "34534645645653455454363",
  "member": {
    "memberId": "whatever",
    "firstName": "Jon",
    "lastName": "Doe",
    "address": {
      "line1": "Whatever Rd",
      "city": "White Salmon",
      "state": "WA",
      "zip": "98672"
    },
    "anotherProp": "wahtever"
  }
}
I want to extract this JSON into a Map of 2 keys without drilling into the nested JSON. Is this possible? Once I have the Map, my intention is to add the key-values to my POST request headers, like so:
val sentHeaders = Map(
  "Content-Type" -> "application/javascript",
  "Accept" -> "text/html",
  "authKey" -> extractedValue,
  "member" -> theMemberInfoAsStringJson)

http("Custom headers")
  .post("myUrl")
  .headers(sentHeaders)
Since the question is tagged 'gatling', and behind the scenes this library depends on Jackson (fasterxml) for JSON processing, we can make use of it.
There is no way to retrieve a nested, structured part of the JSON as a String directly, but with very little additional code the result can still be achieved.
So, having the input JSON:
val json = """{
             |  "authKey": "34534645645653455454363",
             |  "member": {
             |    "memberId": "whatever",
             |    "firstName": "Jon",
             |    "lastName": "Doe",
             |    "address": {
             |      "line1": "Whatever Rd",
             |      "city": "White Salmon",
             |      "state": "WA",
             |      "zip": "98672"
             |    },
             |    "anotherProp": "wahtever"
             |  }
             |}""".stripMargin
A Jackson ObjectMapper can be created and configured for use in Scala:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
To parse the input JSON easily, a dedicated case class is useful:
case class SrcJson(authKey: String, member: Any) {
  val memberAsString = mapper.writeValueAsString(member)
}
We also include val memberAsString, which will contain our target JSON string, obtained through a reverse conversion from the initially parsed member, which is actually a Map.
Now, to parse the input JSON:
val parsed = mapper.readValue(json, classOf[SrcJson])
The references parsed.authKey and parsed.memberAsString will contain the values you are looking for.
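To tie this back to the headers map from the question, the parsed values can then be dropped straight in (a usage sketch; the Gatling DSL pieces are as in the question):
val sentHeaders = Map(
  "Content-Type" -> "application/javascript",
  "Accept" -> "text/html",
  "authKey" -> parsed.authKey,
  "member" -> parsed.memberAsString)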
Have a look at the Scala Play library - it has support for handling JSON. From what you describe, it should be pretty straightforward to read in the JSON and get the string value from any desired node.
Scala Play - JSON
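For example, a minimal play-json sketch (assuming the play-json dependency is available and json is the input string from above) that keeps the member node as a raw string:
import play.api.libs.json.{JsValue, Json}

// Parse once, read authKey as a plain String, and keep member as a JSON string.
val parsed: JsValue = Json.parse(json)
val authKey: String = (parsed \ "authKey").as[String]
val memberAsString: String = Json.stringify((parsed \ "member").as[JsValue])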

Cross-venue visitor reporting approach in Location Based Service system

I'm looking for an approach to build a cross-venue visitor report for my client. He wants an HTTP API that returns the total unique count of customers who have visited more than one shop in a given day range (the API must respond within 1-2 seconds).
The raw data sample (millions of records in reality):
--------------------------
DAY | CUSTOMER | VENUE
--------------------------
  1 | cust_1   | A
  2 | cust_2   | A
  3 | cust_1   | B
  3 | cust_2   | A
  4 | cust_1   | C
  5 | cust_3   | C
  6 | cust_3   | A
Now, I want to calculate the cross-visitor report. IMO the steps would be as follows:
Step 1: aggregate raw data from day 1 to 6
--------------------------
CUSTOMER | VENUE VISIT
--------------------------
cust_1   | [A, B, C]
cust_2   | [A]
cust_3   | [A, C]
Step 2: produce the final result
Total unique cross-customers: 2 (cust_1 and cust_3)
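In plain Scala, these two steps boil down to grouping visits by customer and counting the customers whose venue set has more than one element; a minimal sketch over the sample data (field names assumed from the table above):
case class Visit(day: Int, customer: String, venue: String)

val visits = Seq(
  Visit(1, "cust_1", "A"), Visit(2, "cust_2", "A"), Visit(3, "cust_1", "B"),
  Visit(3, "cust_2", "A"), Visit(4, "cust_1", "C"), Visit(5, "cust_3", "C"),
  Visit(6, "cust_3", "A"))

// Step 1: the set of venues visited by each customer
val venuesPerCustomer: Map[String, Set[String]] =
  visits.groupBy(_.customer).map { case (cust, vs) => cust -> vs.map(_.venue).toSet }

// Step 2: count customers who visited more than one venue
val crossVisitors = venuesPerCustomer.count { case (_, venues) => venues.size > 1 } // 2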
I've tried some solutions:
I first used MongoDB to store the data, then Flask to write an API that uses MongoDB's utilities: aggregation, addToSet, group, count... But the API's response time was unacceptable.
Then I switched to Elasticsearch, hoping its aggregation command sets would help, but they do not support a pipeline group command on the output of the first "terms" aggregation.
After that, I read about Redis Sets, Sorted Sets, ... but they couldn't help.
Could you please give me a clue to solve my problem?
Thanks in advance!
You can easily do this with Elasticsearch by leveraging one date_histogram aggregation to bucket by day, two terms aggregations (first bucket by customer and then by venue), and then only selecting the customers who visited more than one venue on any given day using the bucket_selector pipeline aggregation. It looks like this:
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "customers": {
          "terms": {
            "field": "customer.keyword"
          },
          "aggs": {
            "venues": {
              "terms": {
                "field": "venue.keyword"
              }
            },
            "cross_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "venues_count": "venues._bucket_count"
                },
                "script": {
                  "source": "params.venues_count > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
In the result set, you'll get customers 1 and 3 as expected.
UPDATE:
Another approach involves using a scripted_metric aggregation in order to implement the logic yourself. It's a bit more complicated and might not perform well depending on the number of documents and hardware you have, but the following algorithm would yield the response 2 exactly as you expect:
POST sales/_search
{
  "size": 0,
  "aggs": {
    "unique": {
      "scripted_metric": {
        "init_script": "params._agg.visits = new HashMap()",
        "map_script": "def cust = doc['customer.keyword'].value; def venue = doc['venue.keyword'].value; def venues = params._agg.visits.get(cust); if (venues == null) { venues = new HashSet(); } venues.add(venue); params._agg.visits.put(cust, venues)",
        "combine_script": "def merged = new HashMap(); for (v in params._agg.visits.entrySet()) { def cust = merged.get(v.key); if (cust == null) { merged.put(v.key, v.value) } else { cust.addAll(v.value); } } return merged",
        "reduce_script": "def merged = new HashMap(); for (agg in params._aggs) { for (v in agg.entrySet()) {def cust = merged.get(v.key); if (cust == null) {merged.put(v.key, v.value)} else {cust.addAll(v.value); }}} def unique = 0; for (m in merged.entrySet()) { if (m.value.size() > 1) unique++;} return unique"
      }
    }
  }
}
Response:
{
  "took": 1413,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "unique": {
      "value": 2
    }
  }
}

How to select child tag from JSON file using scala

Good Day!!
I am writing Scala code to select multiple child tags from a JSON file; however, I am not getting the exact solution. The code looks like the below:
Code:
val spark = SparkSession.builder.master("local").appName("")
  .config("spark.sql.warehouse.dir", "C:/temp").getOrCreate()
val df = spark.read.option("header", "true").json("C:/Users/Desktop/data.json")
  .select("type", "city", "id", "name")
df.show()
Data.json
{
  "claims": [
    {
      "type": "Part B",
      "city": "Chennai",
      "subscriber": [
        { "id": 11 },
        { "name": "Harvey" }
      ]
    },
    {
      "type": "Part D",
      "city": "Bangalore",
      "subscriber": [
        { "id": 12 },
        { "name": "andrew" }
      ]
    }
  ]
}
Expected Result:
type     city        subscriber/0/id   subscriber/1/name
Part B   Chennai     11                Harvey
Part D   Bangalore   12                Andrew
Please help me with the above code.
If I'm not mistaken, Apache Spark by default expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file without the multiline option.
https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
http://jsonlines.org/examples/
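A hedged sketch of how the expected table could be produced, using the multiline option seen in the earlier answers (the column accessors are assumptions based on the schema Spark would infer from Data.json):
import org.apache.spark.sql.functions._

// Read the pretty-printed file as one JSON document, explode the claims array,
// and pull the two subscriber entries out by position.
val claims = spark.read.option("multiline", true).json("C:/Users/Desktop/data.json")
  .select(explode(col("claims")).as("claim"))
  .select(
    col("claim.type").as("type"),
    col("claim.city").as("city"),
    col("claim.subscriber").getItem(0).getField("id").as("id"),
    col("claim.subscriber").getItem(1).getField("name").as("name"))
claims.show()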

Parse response to JSON format

I am using BigQuery from Scala. I tried the sample Scala code to call the Google BigQuery API:
Scala:
val queryInfo: QueryRequest =
new QueryRequest().setQuery(s"SELECT * FROM $PROJECT_ID:$dataSetId.$tableId;")
val queryRequest: Bigquery#Jobs#Query =
bigquery.jobs().query(PROJECT_ID, queryInfo)
val queryResponse: QueryResponse =
queryRequest.execute()
The above call returns:
{
  "jobComplete": true,
  "jobReference": {
    "jobId": "job_xxx",
    "projectId": "xxx"
  },
  "kind": "bigquery#queryResponse",
  "rows": [
    { "f": [ { "v": "1" }, { "v": "1364206559422" } ] }
  ],
  "schema": {
    "fields": [
      { "mode": "NULLABLE", "name": "id", "type": "STRING" },
      { "mode": "NULLABLE", "name": "timestamp", "type": "INTEGER" }
    ]
  },
  "totalRows": "1",
  "pageToken": "xxxx"
}
Please help me parse the values from the above result into JSON format, or change the query so it returns a result in a format like this:
{"id": "1", "timestamp": "1364206559422"}
I like lift-json. Look at the lotto example; it's straightforward with case classes.
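A minimal lift-json sketch, assuming responseJson holds the raw response string shown above (the case class names are mine, not from the BigQuery API):
import net.liftweb.json._
import net.liftweb.json.Serialization.write

case class Field(name: String)
case class Schema(fields: List[Field])
case class Cell(v: String)
case class Row(f: List[Cell])
case class QueryResp(schema: Schema, rows: List[Row])

implicit val formats: Formats = DefaultFormats

// Parse the response and zip each row's values with the schema's field names.
val resp = parse(responseJson).extract[QueryResp]
val fieldNames = resp.schema.fields.map(_.name)
val records = resp.rows.map(r => fieldNames.zip(r.f.map(_.v)).toMap)

// Prints {"id":"1","timestamp":"1364206559422"}
records.foreach(m => println(write(m)))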