{
"cars": {
"Nissan": {
"Sentra": {"doors":4, "transmission":"automatic"},
"Maxima": {"doors":4, "transmission":"automatic","colors":["b#lack","pin###k"]}
},
"Ford": {
"Taurus": {"doors":4, "transmission":"automatic"},
"Escort": {"doors":4, "transmission":"auto#matic"}
}
}
}
I have read this JSON and I want to remove every # symbol from every string that may contain one. My problem is making the function generic, so that it works on any schema I may encounter and not only the schema of the JSON above.
You could do something like this: get all the field names from the schema, use foldLeft with the DataFrame itself as the accumulator, and apply the function you want to each column:
def replaceSymbol(df: DataFrame): DataFrame =
  df.schema.fieldNames.foldLeft(df) { (acc, field) =>
    acc.withColumn(field, regexp_replace(col(field), "#", ""))
  }
You might need to check if the column is String or not.
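For instance, a minimal sketch of that check, assuming only top-level columns need cleaning (nested structs would need extra handling):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Only rewrite columns whose type is StringType; everything else is left untouched.
def replaceSymbolInStrings(df: DataFrame): DataFrame =
  df.schema.fields
    .filter(_.dataType == StringType)
    .map(_.name)
    .foldLeft(df)((acc, field) => acc.withColumn(field, regexp_replace(col(field), "#", "")))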
I have an issue that I'm hoping someone can help me with. I'm pretty new to coding and Gatling, so I'm not sure how to proceed.
I'm using Gatling (with Scala) to create a performance test scenario that contains two API-calls.
GetInformation
SendInformation
I'm storing some of the values from the GetInformation response so I can use it in the body for the SendInformation request. The problem is that some information from the GetInformation response needs to be edited/removed before it is included in the body for SendInformation.
Extract of the GetInformation response:
{
"parameter": [
{
"name": "ResponseFromGetInfo",
"type": "document",
"total": 3,
"entry": [
{
"fullUrl": "urn:uuid:4ea859d0-daa4-4d2a-8fbc-1571cd7dfdb0",
"resource": {
"resourceType": "Composition"
}
},
{
"fullUrl": "urn:uuid:1b10ed79-333b-4838-93a5-a40d22508f0a",
"resource": {
"resourceType": "Practitioner"
}
},
{
"fullUrl": "urn:uuid:650b8e7a-2cfc-4b0b-a23b-a85d1bf782de",
"resource": {
"resourceType": "Dispense"
}
}
]
}
]
}
What I want is to store the list in "entry" and remove the entries with resourceType = "Dispense" so I can use it in the body for SendInformation.
It would have been OK if the entry list always had the same number of entries and the same order, but that is not the case. The number of entries can be several hundred, and the order of the entries varies. The number of entries is equal to the "total" value that is included in the GetInformation response.
I've thought about a few ways to solve it, but now I'm stuck. Some alternatives:
Extract the entire "entry" list using .check(jsonPath("$.parameter[0].entry").saveAs("entryList")) and then iterate through the list to remove the entries with resourceType = "Dispense".
But I don't know how to iterate over a value of type io.gatling.core.session.SessionAttribute, or if this is even possible. It would have been nice if I could iterate over the entry list, check whether parameter[0].entry[0].resourceType = "Dispense", and remove the entry if the statement is true.
I'm also considering whether I can use a StringBuilder in some way. Maybe I could check one entry at a time using something like .check(parameter[0].entry[X].resourceType != "Dispense") and, if true, append that entry to a StringBuilder.
Does someone know how I can do this? Either by one of the alternatives that I listed, or in a different way? All help is appreciated :)
So maybe in the end it will look something like this:
val scn = scenario("getAndSendInformation")
.exec(http("getInformation")
.post("/Information/$getInformation")
.body(ElFileBody("bodies/getInformtion.json"))
// I can save total, so I know the total number of entries in the entry list
.check(jsonPath("$.parameter[0].total").saveAs("total"))
//Store entire entry list
.check(jsonPath("$.parameter[0].entry").saveAs("entryList"))
//Or store all entries separately and check afterwards which have resourceType = "Dispense"? Not sure how to do this...
.check(jsonPath("$.parameter[0].entry[0]").saveAs("entry_0"))
.check(jsonPath("$.parameter[0].entry[1]").saveAs("entry_1"))
//...
.check(jsonPath("$.parameter[0].entry[X]").saveAs("entry_X"))
)
//Alternative 1
.repeat("${total}", "counter") {
exec(session => {
//Do some magic here
//Check if session("parameter[0]_entry[counter].resourceType") = "Dispense" {
// if yes, remove entry from entry list}
session})}
//Alternative 2
val entryString = new StringBuilder("")
.repeat("${total}", "counter") {
exec(session => {
//Do some magic here
//Check if session("parameter[0]_entry[counter].resourceType") != "Dispense" {
// if yes, add to StringBuilder}
// entryString.append(session("parameter[0]_entry[counter]").as[String] + ", ")
session})}
.exec(http("sendInformation")
.post("/Information/$sendInformation")
.body(ElFileBody("bodies/sendInformationRequest.json")))
I'm pretty new to coding
I'm using Gatling (with Scala)
Gatling with Java would probably be an easier solution for you.
check(jsonPath("$.parameter[0].entry").saveAs("entryList"))
This is going to capture a String, not a list. In order to be able to iterate, you have to use ofXXX/ofType[], see https://gatling.io/docs/gatling/reference/current/core/check/#jsonpath
Then, in order to generate the next request's body, you could consider a templating engine such as Pebble (https://gatling.io/docs/gatling/reference/current/http/request/#pebblestringbody) or indeed use StringBody with a function and a StringBuilder.
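To make that concrete, here is a rough sketch of how it could be wired up; the names dropDispense and keptEntries are my own, the endpoint and body file are taken from the question, and the snippet is an untested assumption rather than a verified solution:

import io.gatling.core.Predef._
import io.gatling.http.Predef._

val getInformation = http("getInformation")
  .post("/Information/$getInformation")
  .body(ElFileBody("bodies/getInformtion.json"))
  // Capture every entry as a Map so it can be filtered in plain Scala.
  .check(jsonPath("$.parameter[0].entry[*]").ofType[Map[String, Any]].findAll.saveAs("entries"))

// Drop the Dispense entries in a session function and keep the rest for the next request.
val dropDispense = exec { session =>
  val entries = session("entries").as[Seq[Map[String, Any]]]
  val kept = entries.filterNot { entry =>
    entry.get("resource") match {
      case Some(resource: Map[_, _]) =>
        resource.asInstanceOf[Map[String, Any]].get("resourceType").contains("Dispense")
      case _ => false
    }
  }
  session.set("keptEntries", kept)
}

The kept entries could then be serialized back to JSON (for example with Jackson) inside a StringBody(session => ...) when building the SendInformation request.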
I have some JSON like the one below. When I loaded it, some fields are themselves strings of JSON.
How can I parse this JSON using Spark and Scala and look for the keywords I am interested in within that JSON?
{"main":"{\"payload\": { \"mode\": [\"Node\"], \"currentSatate\": \"Ready\", \"Previousstate\": \"slow\", \"trigger\": [\"11\", \"12\"], \"AllStates\": [\"Ready\", \"slow\", \"fast\", \"new\"],\"UnusedStates\": [\"slow\", \"new\"],\"Percentage\": \"70\",\"trigger\": [\"11\"]}"}
{"main":"{\"payload\": {\"trigger\": [\"11\", \"22\"],\"mode\": [\"None\"],\"cangeState\": \"Open\"}}"}
{"main":"{\"payload\": { \"trigger\": [\"23\", \"45\"], \"mode\": [\"Edge\"], \"node.postions\": [\"12\", \"23\", \"45\", \"67\"], \"node.names\": [\"aa\", \"bb\", \"cc\", \"dd\"]}}" }
This is how it looks after loading it into a DataFrame:
val df = spark.read.json("<pathtojson")
df.show(false)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|main |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"payload": { "mode": ["Node"], "currentSatate": "Ready", "Previousstate": "slow", "trigger": ["11", "12"], "AllStates": ["Ready", "slow", "fast", "new"],"UnusedStates": ["slow", "new"],"Percentage": "70","trigger": ["11"]}|
|{"payload": {"trigger": ["11", "22"],"mode": ["None"],"cangeState": "Open"}} |
|{"payload": { "trigger": ["23", "45"], "mode": ["Edge"], "node.postions": ["12", "23", "45", "67"], "node.names": ["aa", "bb", "cc", "dd"]}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Since the JSON field is different for all 3 JSON strings, is there a way to define 3 case classes and match against them?
I only know how to match against a single class:
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val parsedJson = mapper.readValue[classname](jsonstring)
Is there a way to create multiple case classes and match the input against whichever one applies?
You are using Spark SQL, so the first thing to do is to turn the data into a Dataset and then use Spark's methods to deal with it, rather than passing raw JSON around (as you might in, e.g., Play).
You can deserialize the JSON into a case class:
val jsonFilePath: String = "/whatever/data.json"
val myDataSet = sparkSession.read.json(jsonFilePath).as[StudentRecord]
You then have a Dataset of StudentRecord, so you can use Spark's groupBy method to aggregate the column you want from the dataset:
myDataSet.groupBy("whateverTable.whateverColumn").max() //could be min(), count(), etc...
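For illustration, a minimal sketch of the whole flow; StudentRecord here is a hypothetical case class standing in for whatever your records actually contain:

import org.apache.spark.sql.SparkSession

// Hypothetical record type; replace the fields with the ones in your JSON.
case class StudentRecord(name: String, subject: String, score: Double)

val sparkSession = SparkSession.builder().appName("json-to-dataset").getOrCreate()
import sparkSession.implicits._

val myDataSet = sparkSession.read.json("/whatever/data.json").as[StudentRecord]

// Aggregate per student, e.g. the best score.
myDataSet.groupBy("name").max("score").show()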
Extra note: your JSON should be cleaned up a little. For example, if it lives inside your program you can declare it as a multi-line string, so you don't need escape characters all over the place:
val myJson: String =
"""
{
}
""".stripMargin
If it is in a file, then the JSON you posted is not valid. So first, make sure you have syntactically correct JSON to work on.
I have some JSON data (fastxml.jackson objects) and I want to convert it into a GenericAvro Record. Since I don't know beforehand what data I will be getting, only that an Avro schema is available in a schema repository, I can't use predefined classes; hence the generic record.
When I pretty-print my schema, I can see my keys/values and their aliases. However, the GenericRecord "put" method does not seem to know about these aliases.
I get the following exception: Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a valid schema field: device/id
Is this by design? How can I make this schema look at the aliases as well?
schema extract:
"fields" : [ {
"name" : "device_id",
"type" : "long",
"doc" : " The id of the device.",
"aliases" : [ "deviceid", "device/id" ]
}, {
............
}]
code:
def jsonToAvro(jSONObject: JsonNode, schema: Schema): GenericRecord = {
  val converter = new JsonAvroConverter
  println(jSONObject.toString) // correct
  println(schema.toString(true)) // correct
  println(schema.getField("device_id")) // correct
  println(schema.getField("device_id").aliases().toString) // correct
  val avroRecord = new GenericData.Record(schema)
  val iter = jSONObject.fields()
  while (iter.hasNext) {
    // fields() yields java.util.Map.Entry[String, JsonNode] entries
    val entry = iter.next()
    println(s"adding ${entry.getKey} and ${entry.getValue} with ${entry.getValue.getClass.getName}") // adding device/id and 8711 with com.fasterxml.jackson.databind.node.IntNode
    avroRecord.put(entry.getKey, entry.getValue) // throws
  }
  avroRecord
}
I tried with Avro 1.8.2, and it still throws this exception when I read a JSON string into a GenericRecord:
org.apache.avro.AvroTypeException: Expected field name not found:
But I saw a sample that used aliases correctly two years ago:
https://www.waitingforcode.com/apache-avro/serialization-and-deserialization-with-schemas-in-apache-avro/read
So I guess Avro changed that behaviour recently
It seems like the schema is only this flexible when reading: writing Avro only looks at the current field name.
Not only that, but I'm using "/" in my JSON field names, which is not supported as an Avro field name.
Schema validation does not complain when it appears in an alias, so that might work (I haven't tested this).
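As a workaround sketch (my own assumption, not something the thread confirms): since GenericData.Record.put only accepts the actual field names, the incoming JSON keys could be translated through the schema's aliases before calling put.

import scala.collection.JavaConverters._
import org.apache.avro.Schema

// Build a lookup from every alias (and the field name itself) to the real field name.
def aliasLookup(schema: Schema): Map[String, String] =
  schema.getFields.asScala.flatMap { field =>
    (field.aliases().asScala.toSeq :+ field.name()).map(_ -> field.name())
  }.toMap

// Hypothetical usage inside the loop above:
//   val fieldName = aliasLookup(schema).getOrElse(entry.getKey, entry.getKey)
//   avroRecord.put(fieldName, entry.getValue)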
I am using Spark 1.6.3 to parse a JSON structure.
I have the JSON structure below:
{
"events":[
{
"_update_date":1500301647576,
"eventKey":"depth2Name",
"depth2Name":"XYZ"
},
{
"_update_date":1500301647577,
"eventKey":"journey_start",
"journey_start":"2017-07-17T14:27:27.144Z"
}]
}
I want to parse the above JSON into 3 columns in a DataFrame. The eventKey's value (e.g. depth2Name) is itself the name of a node in the JSON, and I want to read the value from that corresponding node into a column "eventValue", so that I can accommodate any new events dynamically.
Here is the expected output:
_update_date,eventKey,eventValue
1500301647576,depth2Name,XYZ
1500301647577,journey_start,2017-07-17T14:27:27.144Z
sample code:
val x = sc.wholeTextFiles("/user/jx665240/events.json").map(x => x._2)
val namesJson = sqlContext.read.json(x)
namesJson.printSchema()
namesJson.registerTempTable("namesJson")
val eventJson=namesJson.select("events")
val mentions1 =eventJson.select(explode($"events")).toDF("events").select($"events._update_date",$"events.eventKey",$"events.$"events.eventKey"")
$"events.$"events.eventKey"" is not working.
Can you please suggest how to fix this issue.
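The thread has no answer, but one possible sketch (an untested assumption, not a confirmed fix for Spark 1.6.3): rather than interpolating eventKey into a column name, derive the candidate event fields from the struct schema and coalesce a when(...) expression per field.

import org.apache.spark.sql.functions.{coalesce, explode, when}
import org.apache.spark.sql.types.StructType

// Assumes sqlContext.implicits._ is imported so the $"..." syntax is available.
val exploded = namesJson.select(explode($"events").as("events"))

// Candidate event fields = every struct field except the two fixed ones.
val eventFields = exploded.schema("events").dataType.asInstanceOf[StructType].fieldNames
  .filterNot(Set("_update_date", "eventKey"))

// Emit each field's value only when eventKey names it; coalesce keeps the one non-null match.
val eventValue = coalesce(eventFields.map(f => when($"events.eventKey" === f, $"events.$f".cast("string"))): _*)

val result = exploded.select($"events._update_date", $"events.eventKey", eventValue.as("eventValue"))
result.show(false)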
I am trying to save a nested JSON to Redshift using the spark-redshift connector.
The problem is that Redshift won't accept the structure of the DataFrame because it contains arrays.
So my question is: is there a way to flatten the arrays in columns foo and bar and convert their values to a string?
Here is what I have so far to get the items as an array:
val basketItems = df.select($"OrderContainer.BasketInfo.BasketId",
$"OrderContainer.BasketInfo.MenuId",
explode($"OrderContainer.BasketInfo.Items")).toDF("BasketId","MenuId","Items")
and here is the json I am using (formatted for readability):
{
"OrderContainer":{
"BasketInfo":{
"BasketId":"kjOIxlJFc0WYdQXm2AXksg",
"MenuId":119949,
"Items":[
{
"ProductId":12310,
"UnitPrice":5.5,
"foo":[1,2,3],
"bar":["a","b","c"]
},
{
"ProductId":456323,
"UnitPrice":5.5,
"foo":[1,2,3],
"bar":["a","b","c"]
},
{
"ProductId":23432432,
"UnitPrice":5.5,
"foo":[1,2,3],
"bar":["a","b","c"]
}
]
}
}
}
FYI, I have solved it by creating a function that turns the array into a string.
val mkString = udf((a: Seq[Any]) => a.mkString(","))
Make sure to import the udf function.
Then all you have to use is the withColumn function.
.withColumn("foo", mkString($"foo"))