I want to extract keys from nested JSON using Spark.
I have the following JSON:
{
"predicates": {
"API_No": "http://www.oilandgas.com/api_no",
"Facility_ID": "http://www.oilandgas.com/facility_id"
},
"prefixes": {
"API_No": "http://www.oilandgas.com/api_no/ ",
"Facility_ID": "http://www.oilandgas.com/facility_id/ "
},
"relations": {
"API_No": [
"Facility_ID",
"County"
]
},
"grahName": "http://www.oilandgas.com/data"
}
I wrote the code below to read the JSON:
val df = spark.read.option("multiline", "true").json("path/to/above/json")
df.select(explode(array(col("relations")))).columns.foreach(println)
I want to get the key inside 'relations' (i.e. 'API_No') from the dataframe.
Thanks in advance.
To get the key inside relations (API_No) from the dataframe, you just have to project (select) the relations key. Since relations is of struct type, projecting it gives you the desired result, like the following:
df.select("relations.*").columns.foreach(println)
It will give the following result:
API_No
I hope it helps!
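For completeness, a minimal end-to-end sketch (the path is the placeholder from the question, and the local master setting is an assumption):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("relations-keys").getOrCreate()

// multiline lets Spark parse a JSON document that spans several lines
val df = spark.read.option("multiline", "true").json("path/to/above/json")

// "relations" is inferred as a struct, so its column names are exactly the keys
val relationKeys: Array[String] = df.select("relations.*").columns
relationKeys.foreach(println)   // prints: API_No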
I am using Azure Data Factory and a data transformation flow. I have a CSV that contains a column with a JSON object string; below is an example, including the header:
"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-16 02:12:30.0000000","state","{""jobEndState"":""negative""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 06:33:00.0000000","23","{""unit"":""kg""}"
I want to store the data in a JSON like this:
{
"id": "99c9347ab7c34733a4fe0623e1496ffd",
"name": "data1",
"values": [
{
"timestamp": "2021-03-18 05:53:00.0000000",
"value": "0",
"metadata": {
"unit": "%"
}
},
{
"timestamp": "2021-03-19 05:53:00.0000000",
"value": "4",
"metadata": {
"jobName": "RecipeB"
}
}
....
]
}
The challenge is that metadata has dynamic content, meaning that it will always be a JSON object but its content can vary. Therefore I cannot define a schema. Currently the column "metadata" in the sink schema is defined as object, but whenever I run the transformation I run into an exception:
Conversion from ArrayType(StructType(StructField(timestamp,StringType,false),
StructField(value,StringType,false), StructField(metadata,StringType,false)),true) to ArrayType(StructType(StructField(timestamp,StringType,true),
StructField(value,StringType,true), StructField(metadata,StructType(StructField(,StringType,true)),true)),false) not defined
We can get the output you expected; we need an expression to extract the object from the Metadata string.
Please refer to my steps. Here's my source:
Derived column expression, creating a JSON schema to convert the data:
#(id=Id,
name=Name,
values=#(timestamp=Timestamp,
value=Value,
metadata=#(unit=substring(split(Metadata,':')[2], 3, length(split(Metadata,':')[2])-6))))
Sink mapping and output data preview:
The key point is that your metadata value is an object that may have a different schema and content each time ('unit', 'jobName', or some other key). We can only build the schema manually; it doesn't support expressions. That's the limitation.
We can't achieve that within Data Factory.
HTH.
Good Day!!
I am writing Scala code to select multiple child tags from a JSON file; however, I am not getting the exact solution. The code looks like below.
Code:
val spark = SparkSession.builder.master("local").appName("").config("spark.sql.warehouse.dir", "C:/temp").getOrCreate()
val df = spark.read.option("header", "true").json("C:/Users/Desktop/data.json").select("type", "city", "id","name")
println(df.show())
Data.json
{"claims":[
{ "type":"Part B",
"city":"Chennai",
"subscriber":[
{ "id":11 },
{ "name":"Harvey" }
] },
{ "type":"Part D",
"city":"Bangalore",
"subscriber":[
{ "id":12 },
{ "name":"andrew" }
] } ]}
Expected Result:
type    city       subscriber/0/id  subscriber/1/name
Part B  Chennai    11               Harvey
Part D  Bangalore  12               Andrew
Please help me with the above code.
If I'm not mistaken, Apache Spark expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file.
https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
http://jsonlines.org/examples/
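That said, Spark 2.2+ can read a pretty-printed file directly with the multiLine option. A minimal sketch against the Data.json above (the output column names are my own) that should give roughly the expected table:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local").appName("claims").getOrCreate()
import spark.implicits._

// multiLine lets Spark parse a single JSON document that spans several lines
val df = spark.read.option("multiLine", "true").json("C:/Users/Desktop/data.json")

val result = df
  .select(explode($"claims").as("claim"))   // one row per claim
  .select(
    $"claim.type".as("type"),
    $"claim.city".as("city"),
    $"claim.subscriber".getItem(0).getField("id").as("subscriber_0_id"),
    $"claim.subscriber".getItem(1).getField("name").as("subscriber_1_name"))

result.show()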
I am having difficulties getting multiple datasets out of my database with RestTemplate. I have many routines that extract a single row, with a format like:
IndicatorModel indicatorModel = restTemplate.getForObject(URL + id,
IndicatorModel.class);
and they work fine. However, if I try to extract a set of data, such as:
Map<String, List<S_ServiceCoreTypeModel>> coreTypesMap =
restTemplate.getForObject(URL + id, Map.class);
this returns values in a
Map<String, LinkedHashMap<>>
format. Is there an easy way to return a List<> or Set<> in the desired format?
Fundamentally, the issue is that your Java object model does not match the structure of your JSON document. You are attempting to deserialize a single JSON element into a Java List. Your JSON document looks like:
{
"serviceCoreTypes":[
{
"serviceCoreType":{
"name":"ALL",
"description":"All",
"dateCreated":"2016-06-23 14:46:32.09",
"dateModified":"2016-06-23 14:46:32.09",
"deleted":false,
"id":1
}
},
{
"serviceCoreType":{
"name":"HSI",
"description":"High-speed Internet",
"dateCreated":"2016-06-23 14:47:31.317",
"dateModified":"2016-06-23 14:47:31.317",
"deleted":false,
"id":2
}
}
]
}
But you cannot turn a serviceCoreTypes object into a List; you can only turn a JSON array into a List. For instance, if you removed the unnecessary wrapper elements from your JSON and your input document looked like:
[
{
"name": "ALL",
"description": "All",
"dateCreated": "2016-06-23 14:46:32.09",
"dateModified": "2016-06-23 14:46:32.09",
"deleted": false,
"id": 1
},
{
"name": "HSI",
"description": "High-speed Internet",
"dateCreated": "2016-06-23 14:47:31.317",
"dateModified": "2016-06-23 14:47:31.317",
"deleted": false,
"id": 2
}
]
You should then be able to deserialize THAT into a List<S_ServiceCoreTypeModel>. Alternatively, if you cannot change the JSON structure, you could create a Java object model that mirrors the JSON document by creating some wrapper classes. Something like:
class ServiceCoreTypes {
List<ServiceCoreType> serviceCoreTypes;
...
}
class ServiceCoreTypeWrapper {
ServiceCoreType serviceCoreType;
...
}
class ServiceCoreType {
String name;
String description;
...
}
I'm assuming you don't actually mean a database, but rather a RESTful service, since you're using RestTemplate.
The problem you're facing is that you want to get a Collection back, but the getForObject method can only take in a single type parameter and cannot figure out what the type of the returned collection is.
I'd encourage you to consider using RestTemplate.exchange(...),
which should allow you to request and receive back a collection type.
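For reference, a rough sketch of that approach, written in Scala like the rest of this page (the same Spring calls exist in Java). It assumes the response body is a plain JSON array of core types, and URL, id, and S_ServiceCoreTypeModel are the names from the question:

import java.util.{List => JList}
import org.springframework.core.ParameterizedTypeReference
import org.springframework.http.{HttpMethod, ResponseEntity}
import org.springframework.web.client.RestTemplate

val restTemplate = new RestTemplate()

// ParameterizedTypeReference carries the full generic type, so the response body is
// deserialized into a typed list instead of Map/LinkedHashMap
val response: ResponseEntity[JList[S_ServiceCoreTypeModel]] =
  restTemplate.exchange(
    URL + id,
    HttpMethod.GET,
    null,
    new ParameterizedTypeReference[JList[S_ServiceCoreTypeModel]]() {})

val coreTypes: JList[S_ServiceCoreTypeModel] = response.getBody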
I have a solution that works, for now at least. I would prefer a solution such as the one proposed by Ben, where I can get the HTTP response body as a list of items in the format I chose, but at least here I can extract each individual item from the JSON node. The code:
S_ServiceCoreTypeModel endModel;
RestTemplate restTemplate = new RestTemplate();
JsonNode node = restTemplate.getForObject(URL, JsonNode.class);
JsonNode allNodes = node.get("serviceCoreTypes");
JsonNode oneNode = allNodes.get(1);
ObjectMapper objectMapper = new ObjectMapper();
endModel = objectMapper.readValue(oneNode.toString(), S_ServiceCoreTypeModel.class);
If anyone has thoughts on how to make Ben's solution work, I would love to hear it.
I am trying to save a nested JSON to Redshift using the spark-redshift connector.
The problem is Redshift won't accept the structure of the dataframe because it has an array.
So my question is: is there a way to flatten the arrays in columns foo and bar and convert their values to a string?
Here is what I have so far to get the items as an array:
val basketItems = df.select($"OrderContainer.BasketInfo.BasketId",
$"OrderContainer.BasketInfo.MenuId",
explode($"OrderContainer.BasketInfo.Items")).toDF("BasketId","MenuId","Items")
And here is the JSON I am using (formatted for readability):
{
"OrderContainer":{
"BasketInfo":{
"BasketId":"kjOIxlJFc0WYdQXm2AXksg",
"MenuId":119949,
"Items":[
{
"ProductId":12310,
"UnitPrice":5.5,
"foo":[1,2,3],
"bar":["a","b","c"]
},
{
"ProductId":456323,
"UnitPrice":5.5,
"foo":[1,2,3],
"bar":["a","b","c"]
},
{
"ProductId":23432432,
"UnitPrice":5.5,
"foo":[1,2,3],
"bar":["a","b","c"]
}
]
}
}
}
FYI: I solved it by creating a function that turns the array into a string.
val mkString = udf((a: Seq[Any]) => a.mkString(","))
Make sure to import the udf function.
Then all you have to use is the withColumn function.
.withColumn("foo", mkString($"foo"))
I am using BigQuery from Scala. I tried the sample Scala code to call the Google BigQuery API.
Scala:
val queryInfo: QueryRequest =
new QueryRequest().setQuery(s"SELECT * FROM $PROJECT_ID:$dataSetId.$tableId;")
val queryRequest: Bigquery#Jobs#Query =
bigquery.jobs().query(PROJECT_ID, queryInfo)
val queryResponse: QueryResponse =
queryRequest.execute()
The above BQ call returns:
{
"jobComplete":true,
"jobReference":{
"jobId":"job_xxx",
"projectId":"xxx"
},
"kind":"bigquery#queryResponse",
"rows":[{"f":[{"v":"1"},{"v":"1364206559422"}]}],
"schema": {
"fields":[
{"mode":"NULLABLE","name":"id","type":"STRING"},
{"mode":"NULLABLE","name":"timestamp","type":"INTEGER"}
]
},
"totalRows":"1",
"pageToken":"xxxx"
}
Please help me parse the values from the above results in JSON format, or change the query to return results in a format like this:
{"id": "1", "timestamp": "1364206559422"}
I like lift-json.
Look at the Lotto example; it's straightforward with case classes.
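A rough sketch with lift-json, assuming the response JSON shown above is available as a String (for the Google client, e.g. via toPrettyString); it zips the schema field names with each row's cells:

import net.liftweb.json._
import net.liftweb.json.JsonDSL._

implicit val formats: Formats = DefaultFormats

// minimal case classes for just the parts of the response we need
case class Field(name: String)
case class Cell(v: String)
case class Row(f: List[Cell])

val json = parse(responseJson)   // responseJson: String (placeholder for the raw response body)

val names = (json \ "schema" \ "fields").extract[List[Field]].map(_.name)
val rows  = (json \ "rows").extract[List[Row]]

// pair each column name with its cell value, e.g. Map("id" -> "1", "timestamp" -> "1364206559422")
val records: List[Map[String, String]] = rows.map(r => names.zip(r.f.map(_.v)).toMap)

// render each record back to JSON: {"id":"1","timestamp":"1364206559422"}
val out = records.map(m => compact(render(JObject(m.toList.map { case (k, v) => JField(k, JString(v)) }))))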