{
"cars": {
"Nissan": {
"Sentra": {"doors":4, "transmission":"automatic"},
"Maxima": {"doors":4, "transmission":"automatic","colors":["b#lack","pin###k"]}
},
"Ford": {
"Taurus": {"doors":4, "transmission":"automatic"},
"Escort": {"doors":4, "transmission":"auto#matic"}
}
}
}
I have this JSON that I have read, and I want to remove every # symbol in every string that may exist. My problem is making this function generic, so that it works on every schema I may encounter and not only the schema in the JSON above.
You could do something like this: get all the fields from the schema, use foldLeft with the DataFrame itself as the accumulator, and apply the function you want:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}

def replaceSymbol(df: DataFrame): DataFrame =
  df.schema.fieldNames.foldLeft(df)((acc, field) =>
    acc.withColumn(field, regexp_replace(col(field), "#", "")))
You might need to check if the column is String or not.
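For example, a minimal sketch that restricts the fold to top-level StringType columns (the name replaceSymbolInStrings is mine; nested fields such as the colors array in the example would still need recursion into the schema):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Only rewrite top-level string columns; leave structs, arrays,
// and numeric columns untouched.
def replaceSymbolInStrings(df: DataFrame): DataFrame =
  df.schema.fields
    .filter(_.dataType == StringType)
    .foldLeft(df)((acc, f) =>
      acc.withColumn(f.name, regexp_replace(col(f.name), "#", "")))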
My "Item" model has a JSONB column named "formula".
I want to get it either as an object or as a JSON string, but what it returns is a string with unquoted keys that can't be parsed as JSON.
My code:
async function itemsRead (where) {
const items = await models.item.findAll({
where
})
console.log(items)
return items
}
And what I get is:
[
item {
dataValues: {
id: 123,
formula: '{a:-0.81, x:5.12}',
}
},
.
.
.
]
My mistake was in the insert (create) phase. I had to pass the original formula object (not a JSON-stringified or other string form) to create():
let formula = {a:-0.81, x:5.12}
models.item.create({id:123, formula})
I want to extract keys from nested JSON using Spark.
I have the JSON below:
{
"predicates": {
"API_No": "http://www.oilandgas.com/api_no",
"Facility_ID": "http://www.oilandgas.com/facility_id"
},
"prefixes": {
"API_No": "http://www.oilandgas.com/api_no/ ",
"Facility_ID": "http://www.oilandgas.com/facility_id/ "
},
"relations": {
"API_No": [
"Facility_ID",
"County"
]
},
"grahName": "http://www.oilandgas.com/data"
}
I wrote the code below to read the JSON:
val df = spark.read.option("multiline", "true").json("path/to/above/json")
df.select(explode(array(col("relations")))).columns.foreach(println)
I want to get the key in 'relations' (i.e. 'API_No') from the DataFrame.
Thanks in advance.
To get the key in relations (API_No) from the DataFrame, you just have to project (select) the relations key. Since relations is of struct type, projecting it gives you the desired result, like the following:
df.select("relations.*").columns.foreach(println)
It will give the following result:
API_No
I hope it helps!
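As an alternative sketch with the same df, you can also read the nested field names straight off the schema without projecting anything:

import org.apache.spark.sql.types.StructType

// Inspect the schema directly: relations is a struct, so its
// field names are the keys we want.
df.schema("relations").dataType match {
  case s: StructType => s.fieldNames.foreach(println) // prints API_No
  case other         => println(s"relations is not a struct: $other")
}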
I have some JSON like below. When I loaded this JSON, some fields were themselves strings of JSON.
How can I parse this JSON using Spark and Scala and search it for the keywords I need?
{"main":"{\"payload\": { \"mode\": [\"Node\"], \"currentSatate\": \"Ready\", \"Previousstate\": \"slow\", \"trigger\": [\"11\", \"12\"], \"AllStates\": [\"Ready\", \"slow\", \"fast\", \"new\"],\"UnusedStates\": [\"slow\", \"new\"],\"Percentage\": \"70\",\"trigger\": [\"11\"]}"}
{"main":"{\"payload\": {\"trigger\": [\"11\", \"22\"],\"mode\": [\"None\"],\"cangeState\": \"Open\"}}"}
{"main":"{\"payload\": { \"trigger\": [\"23\", \"45\"], \"mode\": [\"Edge\"], \"node.postions\": [\"12\", \"23\", \"45\", \"67\"], \"node.names\": [\"aa\", \"bb\", \"cc\", \"dd\"]}}" }
This is how it looks after loading it into a DataFrame:
val df = spark.read.json("<pathtojson>")
df.show(false)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|main |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"payload": { "mode": ["Node"], "currentSatate": "Ready", "Previousstate": "slow", "trigger": ["11", "12"], "AllStates": ["Ready", "slow", "fast", "new"],"UnusedStates": ["slow", "new"],"Percentage": "70","trigger": ["11"]}|
|{"payload": {"trigger": ["11", "22"],"mode": ["None"],"cangeState": "Open"}} |
|{"payload": { "trigger": ["23", "45"], "mode": ["Edge"], "node.postions": ["12", "23", "45", "67"], "node.names": ["aa", "bb", "cc", "dd"]}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Since my JSON field is different for all 3 JSON strings, is there a way to define 3 case classes and match against them?
I only know how to match to one class:
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val parsedJson = mapper.readValue[classname](jsonstring)
Is there a way to create multiple case classes and match to any particular class?
You are using Spark SQL, so the first thing to do is turn the data into a Dataset and then use Spark's methods to deal with it; don't pass raw JSON all over the place (e.g., as in Play).
You can deserialize the JSON into a case class:
// StudentRecord is a case class you define to mirror your JSON schema
import sparkSession.implicits._ // required for .as[StudentRecord]

val jsonFilePath: String = "/whatever/data.json"
val myDataSet = sparkSession.read.json(jsonFilePath).as[StudentRecord]
You now have a Dataset of StudentRecord, so you can use Spark's groupBy method to get the data of the column you want from the Dataset:
myDataSet.groupBy("whateverTable.whateverColumn").max() //could be min(), count(), etc...
Extra note: your JSON should be "cleaned up" a little. For example, if it lives inside your program, you can use the multi-line way of declaring your JSON, and then you don't need escape characters all over the place:
val myJson: String =
  """
    |{
    |}
  """.stripMargin
If it is in a file, then the JSON you wrote is not correct, so first make sure you have syntactically correct JSON to work with.
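As a further option, if you only need a few known fields rather than full case-class matching, here is a minimal sketch using from_json (Spark 2.1+) to parse the nested JSON string in the main column; the schema below covers only the fields shared by the sample rows and is an assumption:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import spark.implicits._

// Schema for the fields common to all three sample rows; extend as needed.
val payloadSchema = new StructType()
  .add("payload", new StructType()
    .add("trigger", ArrayType(StringType))
    .add("mode", ArrayType(StringType)))

val parsed = df.withColumn("parsed", from_json($"main", payloadSchema))
parsed.select($"parsed.payload.trigger", $"parsed.payload.mode").show(false)

Note that from_json returns null for the first sample row as written, since its inner JSON is missing a closing brace, which is another reason to fix the JSON first.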
I am using Spark 1.6.3 to parse a JSON structure.
I have the JSON structure below:
{
"events":[
{
"_update_date":1500301647576,
"eventKey":"depth2Name",
"depth2Name":"XYZ"
},
{
"_update_date":1500301647577,
"eventKey":"journey_start",
"journey_start":"2017-07-17T14:27:27.144Z"
}]
}
I want to parse the above JSON into 3 columns in a DataFrame. eventKey's value (e.g. depth2Name) is also the name of a node in the JSON, and I want to read the value from the corresponding node into an "eventValue" column so that I can accommodate any new events dynamically.
Here is the expected output:
_update_date,eventKey,eventValue
1500301647576,depth2Name,XYZ
1500301647577,journey_start,2017-07-17T14:27:27.144Z
Sample code:
val x = sc.wholeTextFiles("/user/jx665240/events.json").map(x => x._2)
val namesJson = sqlContext.read.json(x)
namesJson.printSchema()
namesJson.registerTempTable("namesJson")
val eventJson=namesJson.select("events")
val mentions1 =eventJson.select(explode($"events")).toDF("events").select($"events._update_date",$"events.eventKey",$"events.$"events.eventKey"")
$"events.$"events.eventKey"" is not working.
Can you please suggest how to fix this issue?
Thanks,
Sree
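A sketch of one possible approach: since every eventKey value is also the name of a string field under events, you can build eventValue dynamically by collecting the distinct keys and chaining when with coalesce (this assumes Spark 1.6's DataFrame API and that all event value fields are strings, as in the sample):

import org.apache.spark.sql.functions.{coalesce, col, explode, when}
import sqlContext.implicits._

// Explode the events array so each event becomes its own row.
val exploded = namesJson.select(explode($"events").as("events"))

// Collect the distinct event names; each one is also a field under events.
val eventKeys = exploded.select($"events.eventKey")
  .distinct().collect().map(_.getString(0))

// eventValue reads the field whose name matches this row's eventKey.
val eventValue = eventKeys
  .map(k => when($"events.eventKey" === k, col(s"events.$k")))
  .reduce((a, b) => coalesce(a, b))

exploded.select($"events._update_date", $"events.eventKey",
  eventValue.as("eventValue")).show(false)

For the sample data this yields the three expected columns, with the depth2Name and journey_start values landing in eventValue.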