AWS Glue: PySpark: Problems while flattening DynamoDB JSON

Need help with reading and converting a JSON element.
From a Glue ETL job, I am using the DynamoDB connector to export data from a cross-account DynamoDB table.
Example Data from DynamoDB:
{
  "header_name": {
    "S": "2caee47b-4e6d-4eba-812d-65e4098f1f78"
  },
  "additional_info": {
    "S": "test additionalIdentifier"
  },
  "biz_group": {
    "S": "64b27e6a-863d-4ee6-9a33-386675ce348a"
  },
  "key_id": {
    "S": "2caee47b-4e6d-4eba-812d-65e4098f1f78"
  },
  "workflow_id": {
    "M": {
      "48281078dd41": {
        "N": "1"
      }
    }
  }
}
There are 2 million unique “workflow_id”s.
I'm reading the data using the Glue connector for DynamoDB exports:
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.s3.bucket": "s3_bucket",
        "dynamodb.s3.prefix": "prefix-",
        "dynamodb.tableArn": "TABLE_ARN",
        "dynamodb.unnestDDBJson": False,
        "dynamodb.sts.roleArn": "STS_ROLE_ARN",
        "dynamodb.region": "us-west-2",
        "dynamodb.s3.bucketOwner": "ACCOUNT ID BUCKET OWNER"
    },
    transformation_ctx="S3bucket_node1",
)
When the schema is inferred, the keys inside workflow_id are interpreted as individual columns, so workflow_id gets exploded into one struct field per key (pasting only a few below).
root
 |-- header_name: string
 |-- additional_info: string
 |-- biz_group: string
 |-- key_id: string
 |-- workflow_id: struct
 |    |-- 48281078dd41: long
 |    |-- 48281078dd42: long
 |    |-- ... (one field per unique workflow_id key)
I would like to treat the workflow_id element as a string instead of having it interpreted as a struct.
Question: is there a way to convert the workflow_id field to a string? Is there a way to manipulate the schema while reading the data?
I tried using a crawler and changing the type to string, but that causes "internal exception" errors; I am trying to avoid the crawler anyway and keep the number of layers to a minimum.
Use Case:
DynamoDB (cross account)->AWS Glue DDB connectors->dataframe->flatten->Parquet->Redshift Spectrum
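A possible workaround (a sketch only, not from the original post: it assumes the S3bucket_node1 frame created above and uses Spark's standard to_json; with millions of distinct keys the wide inferred schema may still be expensive to materialize):

from pyspark.sql.functions import to_json, col

# Convert the DynamicFrame produced above into a plain Spark DataFrame.
df = S3bucket_node1.toDF()

# Re-serialize the inferred workflow_id struct into a single JSON string column,
# so the Parquet written for Redshift Spectrum carries one string field instead of
# one column per workflow_id key.
df = df.withColumn("workflow_id", to_json(col("workflow_id")))
df.printSchema()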

Related

Read CosmosDb items from pyspark (Databricks) with an inconsistent model

Let's say I have two items in CosmosDb:
{
  "Test": {
    "InconsistentA": 10
  },
  "Common": 1
}
{
  "Test": {
    "InconsistentB": 10
  },
  "Common": 2
}
How do I read this data so as to have the following schema:
Test: string (the JSON string of the inconsistent part of the model)
Common: int (the consistent part of the model)
I don't know in advance what the model is, and the Spark CosmosDb driver (com.microsoft.azure.cosmosdb.spark) only samples the first X items in CosmosDb to infer the schema.
What I have tried is enforcing the schema this way:
|-- Test: string (nullable = true)
|-- Common: integer (nullable = true)
But the result of the Test column is:
'{ InconsistentA=10 }'
Instead of:
'{ "InconsistentA": 10 }'

Postgres graph with TypeORM or something better?

I need to store, and build fast queries over, the following structure:
class Model {
  id: number;
  alias: string;
  schema: Record<string, any>;
}
where schema can be something like:
{
  someField: '$model_alias',
  otherField: {
    nestedField: '$other_model_alias'
  }
}
Example data:
{ id: 1, alias: "model_one", schema: { field1: "test", field2: "demo" } }
{ id: 2, alias: "model_second", schema: { someField: "$model_one", otherField: { nestedField: 5 } } }
{ id: 3, alias: "model_third", schema: { field5: "$model_second", field6: "$model_one", field7: "$model_fourth" } }
{ id: 4, alias: "model_fourth", schema: { field8: "$model_second" } }
As you can see, the JSON schema field contains values that may refer to other models (and their schemas) by alias. There can be a lot of nesting, and the relationships can be many-to-many.
Is it possible to achieve such a structure with Postgres, or should some alternative be used? I need to be able to manage the structure easily and run very fast queries (get a tree's children or get a tree's parents).
Thanks.
Choosing the right db type is tricky at the best of times. Can you provide more information about what sorts of queries you'd be doing? And how big is your dataset?
If your requirement is to exclusively get the parents and children of a model, a relational db such as postgres would do it. If the relations are many-to-many, you'll have a bridging table (https://dzone.com/articles/how-to-handle-a-many-to-many-relationship-in-datab), and will be able to do efficient queries on that.
If you're doing significant, multi-hop traversals between the relationships, you might indeed want to look at a graph database to avoid expensive joins. Postgres even has a plugin that allows this: https://www.postgresql.org/about/news/announcing-age-a-multi-model-graph-database-extension-for-postgresql-2050/
I wouldn't recommend a document store for data that's heavily relational like this, just because managing relationships between documents has to be handled manually by the user, and that's normally more trouble than it's worth.
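For the relational route, a minimal sketch of the bridging-table idea (the model_refs table, its columns, and the connection string are hypothetical; the recursive CTE is standard Postgres, run here through psycopg2):

import psycopg2

# Hypothetical bridging table: one row per "$alias" reference found in a model's schema.
DDL = """
CREATE TABLE IF NOT EXISTS model_refs (
    parent_id integer NOT NULL,  -- the model whose schema contains the reference
    child_id  integer NOT NULL,  -- the model referred to by "$alias"
    PRIMARY KEY (parent_id, child_id)
);
"""

# Recursive CTE: every model reachable (all children) from a given root model.
CHILDREN = """
WITH RECURSIVE tree AS (
    SELECT child_id FROM model_refs WHERE parent_id = %s
    UNION
    SELECT r.child_id FROM model_refs r JOIN tree t ON r.parent_id = t.child_id
)
SELECT child_id FROM tree;
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(CHILDREN, (3,))  # e.g. everything model_third depends on
    print(cur.fetchall())

The same table queried in the other direction (filter on child_id, recurse on parent_id) gives a model's parents.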

How to map a json string into object type in sink transformation

Using Azure Data Factory and a data transformation flow. I have a CSV that contains a column with a JSON object string; below is an example including the header:
"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-16 02:12:30.0000000","state","{""jobEndState"":""negative""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 06:33:00.0000000","23","{""unit"":""kg""}"
I want to store the data in a JSON like this:
{
  "id": "99c9347ab7c34733a4fe0623e1496ffd",
  "name": "data1",
  "values": [
    {
      "timestamp": "2021-03-18 05:53:00.0000000",
      "value": "0",
      "metadata": {
        "unit": "%"
      }
    },
    {
      "timestamp": "2021-03-19 05:53:00.0000000",
      "value": "4",
      "metadata": {
        "jobName": "RecipeB"
      }
    }
    ....
  ]
}
The challenge is that metadata has dynamic content, meaning it will always be a JSON object but its content can vary, so I cannot define a schema. Currently the column "metadata" on the sink schema is defined as object, but whenever I run the transformation I run into an exception:
Conversion from ArrayType(StructType(StructField(timestamp,StringType,false),
StructField(value,StringType,false), StructField(metadata,StringType,false)),true) to ArrayType(StructType(StructField(timestamp,StringType,true),
StructField(value,StringType,true), StructField(metadata,StructType(StructField(,StringType,true)),true)),false) not defined
We can get the output you expected; we need an expression to build the object from the Metadata value.
Please follow my steps. Here's my source:
Derived column expressions, create a JSON schema to convert the data:
#(id=Id,
  name=Name,
  values=#(timestamp=Timestamp,
    value=Value,
    metadata=#(unit=substring(split(Metadata,':')[2], 3, length(split(Metadata,':')[2])-6))))
Sink mapping and output data preview:
The key point is that your metadata value is an object whose schema and content can vary from row to row. We can only build the schema manually; it doesn't support expressions. That's the limit.
We can't achieve that within Data Factory.
HTH.
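Outside of Data Factory, the same shape can be produced with plain Spark, since the dynamic metadata can be parsed as a generic map rather than a fixed struct. A sketch (the file path and output location are hypothetical; from_json with MapType is standard PySpark):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, struct, collect_list
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Read the CSV shown above; escape='"' handles the doubled quotes inside quoted fields.
df = spark.read.option("header", True).option("escape", '"').csv("data.csv")

# Dynamic metadata: parse it as a generic string-to-string map instead of a fixed struct.
df = df.withColumn("metadata", from_json(col("Metadata"), MapType(StringType(), StringType())))

# Group the rows into the nested shape shown in the question.
result = (df.groupBy(col("Id").alias("id"), col("Name").alias("name"))
            .agg(collect_list(struct(col("Timestamp").alias("timestamp"),
                                     col("Value").alias("value"),
                                     col("metadata"))).alias("values")))
result.write.mode("overwrite").json("output/")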

How to read a JSON file into a Map, using Scala

How can I read a JSON file into a Map using Scala? I've been trying to accomplish this, but the JSON I am reading is nested, and I have not found a way to easily extract it into keys because of that. Scala seems to want to also convert the nested JSON string into an object; instead, I want the nested JSON as a string "value". I am hoping someone can clarify or give me a hint on how I might do this.
My JSON source might look something like this:
{
  "authKey": "34534645645653455454363",
  "member": {
    "memberId": "whatever",
    "firstName": "Jon",
    "lastName": "Doe",
    "address": {
      "line1": "Whatever Rd",
      "city": "White Salmon",
      "state": "WA",
      "zip": "98672"
    },
    "anotherProp": "wahtever"
  }
}
I want to extract this JSON into a Map of 2 keys without drilling into the nested JSON. Is this possible? Once I have the Map, my intention is to add the key-values to my POST request headers, like so:
val sentHeaders = Map(
  "Content-Type" -> "application/javascript",
  "Accept" -> "text/html",
  "authKey" -> extractedValue,
  "member" -> theMemberInfoAsStringJson)

http("Custom headers")
  .post("myUrl")
  .headers(sentHeaders)
Since the question is tagged 'gatling': behind the scenes this lib depends on Jackson/fasterxml for JSON processing, so we can make use of it.
There is no way to retrieve a nested, structured part of the JSON as a String directly, but with very little additional code the result can still be achieved.
So, having the input JSON:
val json = """{
  |  "authKey": "34534645645653455454363",
  |  "member": {
  |    "memberId": "whatever",
  |    "firstName": "Jon",
  |    "lastName": "Doe",
  |    "address": {
  |      "line1": "Whatever Rd",
  |      "city": "White Salmon",
  |      "state": "WA",
  |      "zip": "98672"
  |    },
  |    "anotherProp": "wahtever"
  |  }
  |}""".stripMargin
A Jackson ObjectMapper can be created and configured for use in Scala:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
To parse the input json easily, a dedicated case class is useful:
case class SrcJson(authKey: String, member: Any) {
  val memberAsString = mapper.writeValueAsString(member)
}
We also include val memberAsString in it, which will contain our target JSON string, obtained through a reverse conversion from the initially parsed member, which is actually a Map.
Now, to parse the input json:
val parsed = mapper.readValue(json, classOf[SrcJson])
The references parsed.authKey and parsed.memberAsString will contain the values you are after.
Have a look at the Scala Play library; it has support for handling JSON. From what you describe, it should be pretty straightforward to read in the JSON and get the string value from any desired node.
Scala Play - JSON

How to select child tag from JSON file using scala

Good Day!!
I am writing Scala code to select multiple child tags from a JSON file; however, I am not getting the exact solution. The code looks like below:
Code:
val spark = SparkSession.builder.master("local").appName("").config("spark.sql.warehouse.dir", "C:/temp").getOrCreate()
val df = spark.read.option("header", "true").json("C:/Users/Desktop/data.json").select("type", "city", "id","name")
println(df.show())
Data.json
{"claims":[
{ "type":"Part B",
"city":"Chennai",
"subscriber":[
{ "id":11 },
{ "name":"Harvey" }
] },
{ "type":"Part D",
"city":"Bangalore",
"subscriber":[
{ "id":12 },
{ "name":"andrew" }
] } ]}
Expected Result:
type    city       subscriber/0/id   subscriber/1/name
Part B  Chennai    11                Harvey
Part D  Bangalore  12                Andrew
Please help me with the above code.
If I'm not mistaken, Apache Spark expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file.
https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
http://jsonlines.org/examples/
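The question's code is Scala, but the idea translates directly; here is a PySpark sketch of getting the expected columns, assuming a Spark version that supports the multiLine JSON option (explode and getField are standard Spark functions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.master("local").getOrCreate()

# multiLine lets Spark read the pretty-printed file as one JSON document;
# without it, the file must contain one JSON object per line, as noted above.
df = spark.read.option("multiLine", True).json("C:/Users/Desktop/data.json")

# Each element of the claims array becomes a row, then the nested fields are selected.
claims = df.select(explode(col("claims")).alias("claim"))
result = claims.select(
    col("claim.type").alias("type"),
    col("claim.city").alias("city"),
    col("claim.subscriber")[0].getField("id").alias("subscriber_0_id"),
    col("claim.subscriber")[1].getField("name").alias("subscriber_1_name"),
)
result.show()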