How to convert Array<String> to Array<Struct> in Spark Scala?

I have an array of JSON strings, which I need to parse and convert into structs.
transDf schema:
root
|-- logs: array (nullable = true)
| |-- element: string (containsNull = true)
This is the code I tried
val logsSchema = new ArrayType(spark.read.json(transDf.select("logs").as[String]).schema, true)
transDf = transDf.withColumn("logs", from_json(col("logs"), logsSchema))
but the above only works for string -> struct, not for an array of structs.
How can I convert the array of JSON strings into Array<Struct> without knowing the schema of the JSON?

You can use the schema_of_json function to get the schema from a JSON string and pass it to the from_json function to get a struct type.
val logsSchema = schema_of_json(transDf.select(col("logs").cast("string")).as[String].first())
transDf = transDf.withColumn("logs", from_json(col("logs"), logsSchema))
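An alternative, hedged sketch (not part of the original answer): it assumes Spark 3.x for the transform function and that every element of logs shares the same JSON shape, and it infers the element schema from a single sample value before parsing each array element with from_json.
import org.apache.spark.sql.functions._
import spark.implicits._
// Pull one JSON string out of the array to infer the element schema from a sample.
val sampleJson = transDf.select(explode(col("logs")).as("log")).as[String].first()
val elementSchema = spark.read.json(Seq(sampleJson).toDS()).schema
// Parse every element of the array in place, yielding array<struct<...>>.
transDf = transDf.withColumn("logs", transform(col("logs"), x => from_json(x, elementSchema)))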

Related

How can I access data from a DynamicFrame in nested json fields / structs with AWS Glue

Within a Map.apply() function on an AWS Glue DynamicFrame, I am trying to access data from a nested JSON column, but the DynamicFrame returned is empty.
Data structure:
root
|-- id: string
|-- policyId: string
|-- productId: string
|-- createdBy: string
|-- status: string
|-- data: struct
| |-- values: struct
| | |-- G1Q1: string
| | |-- G1Q2: string
My code:
dyf1 = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table1",
    transformation_ctx="table_1",
)
dyf1 = Unbox.apply(frame = dyf1, path = "data", format = "json")
def ProcessEntry(r):
    r["question1"] = r.data.values.G1Q1
    return r
dyf2 = Map.apply(frame = dyf1, f = ProcessEntry)
dyf2.toDF().show()
I have also tried to use this:
r["question1"] = `r.data.values.G1Q1`
and this:
r["question1"] = r["data.values.G1Q1"]
But the result which is returned is always empty instead of the full DataFrame with the additional column "question1" and the values from the nested column:
++
||
++
++
How can I correctly access the (nested) value in the (unboxed) DataFrame within the Map.apply() function?
Finally figured it out myself from this link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-map.html
It requires Python dictionary syntax:
r["question1"] = r["data"]["values"]["G1Q1"]

Casting an array of Doubles to String in spark sql

I'm trying to read data from a JSON file that has an array with lat/long values, something like [48.597315,-43.206085], and I want to parse them in Spark SQL as a single string. Is there a way I can do that?
My JSON input will look something like below.
{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
I'm trying to push this to an RDBMS store, and when I try to cast position.coordinates to string it gives me
Can't get JDBC type for array<string>
as the destination datatype is nvarchar. Any help is appreciated!
You can read your json file into a DataFrame, then 1) use concat_ws to stringify your lat/lon array into a single column, and 2) use struct to re-assemble the position struct-type column as follows:
// jsonfile:
// {"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
import org.apache.spark.sql.functions._
val df = spark.read.json("/path/to/jsonfile")
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = true)
// | |-- coordinates: array (nullable = true)
// | | |-- element: double (containsNull = true)
// | |-- type: string (nullable = true)
df.withColumn("coordinates", concat_ws(",", $"position.coordinates")).
select($"id", struct($"coordinates", $"position.type").as("position")).
show(false)
// +-----+----------------------------+
// |id |position |
// +-----+----------------------------+
// |11700|[48.597315,-43.206085,Point]|
// +-----+----------------------------+
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = false)
// | |-- coordinates: string (nullable = false)
// | |-- type: string (nullable = true)
[UPDATE]
Using Spark SQL:
df.createOrReplaceTempView("position_table")
spark.sql("""
select id, concat_ws(',', position.coordinates) as position_coordinates
from position_table
""").
show(false)
//+-----+--------------------+
//|id |position_coordinates|
//+-----+--------------------+
//|11700|48.597315,-43.206085|
//|11800|49.611254,-43.90223 |
//+-----+--------------------+
You have to transform the given column into a string before loading it into the target data source. For example, the following code creates a new column, position.coordinates, whose value is the joined string of the given array of doubles, by casting the array to its string representation and removing the brackets afterward.
df.withColumn("position.coordinates", regexp_replace($"position.coordinates".cast("string"), "\\[|\\]", ""))
Alternatively, you can use a UDF to create a custom transformation function on Row objects. That way you can maintain the nested structure of the column. The following source (the second answer) can give you some idea of how to use a UDF for your case: Spark UDF with nested structure as input parameter.
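A hedged sketch of that UDF idea (the Position case class and function name below are assumptions for illustration, not from the linked answer): rebuild the position struct with the coordinates joined into one string, so the nested shape is kept while the array<double> becomes a plain string.
import org.apache.spark.sql.functions.udf
import spark.implicits._
// Hypothetical case class mirroring the position struct from the question.
case class Position(coordinates: String, `type`: String)
// Join the array of doubles into "lat,lon" and rebuild the struct around it.
val stringifyPosition = udf { (coords: Seq[Double], typ: String) =>
  Position(Option(coords).map(_.mkString(",")).orNull, typ)
}
val out = df.withColumn("position", stringifyPosition($"position.coordinates", $"position.type"))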

Spark 1.6 Data Frame - convert all field values in struct to uppercase

I have the following data frame and am using Spark 1.6:
scala> dfShow.printSchema
root
|-- email: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- EML: string (nullable = true)
| | |-- EMT: string (nullable = true)
This reads fine from the source JSON. How can I convert all the values to uppercase? show on the data frame gives:
scala> dfShow.show
[{"EML":"a.b#c.COM","EMT":"b"},{"EML":"d.e#x.COM","EMT":"h"}]
[{"EML":"g#g.COM","EMT":"h"},{"EML":"f#f.com","EMT":"x"}]
null
[{"EML":"j#j.org","EMT":"b"},{"EML":"r.r#r.COM","EMT":"t"}]
How do I change the values in the two fields EML and EMT to uppercase? I would like the fields to look as follows:
[{"EML":"A.B#C.COM","EMT":"B"},{"EML":"D.E#X.COM","EMT":"H"}]
[{"EML":"G#G.COM","EMT":"H"},{"EML":"F#F.com","EMT":"X"}]
null
[{"EML":"J#J.org","EMT":"B"},{"EML":"R.R#R.COM","EMT":"T"}]
This is one struct field. I have other struct fields like address (line1, line2, city, state, zip, country), phone, etc. Is there a generic way to convert the values of all struct fields to uppercase?
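One possible approach, as a hedged sketch only (written against the general Spark UDF pattern; the entry point in 1.6 is sqlContext rather than spark, so verify it on that version): map over the array of structs with a UDF and return case-class instances, so the struct schema is preserved. This handles the email field shown above; a fully generic version would have to walk the schema and apply the same idea to every string field.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._
// Assumed mirror of the email element struct from the schema above.
case class Email(EML: String, EMT: String)
// Assumes the struct fields arrive in schema order (EML, EMT).
val upperEmails = udf { (emails: Seq[Row]) =>
  if (emails == null) null
  else emails.map { r =>
    Email(
      Option(r.getString(0)).map(_.toUpperCase).orNull,
      Option(r.getString(1)).map(_.toUpperCase).orNull
    )
  }
}
val dfUpper = dfShow.withColumn("email", upperEmails($"email"))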

Spark Dataframe - How to get a particular field from a struct type column

I have a data frame with a structure like this:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
I want to retrieve all npaNumber from all the rows in the dataframe.
My approach was to iterate over all rows in the data frame and extract, for each one, the value stored in the column npaHeaderData, in the field npaNumber. So I wrote the following lines:
parquetFileDF.foreach { newRow =>
  // To retrieve the second column
  val column = newRow.get(1)
  // The following line is not allowed
  // val npaNumber = column.getAs[String]("npaNumber")
  println(column)
}
The content of column printed in each iteration looks like:
[207400956,27FEB17,09.30.00]
But column is of type Any, and I am not able to extract any of its fields. Can anyone tell me what I am doing wrong, or what approach I should follow instead?
Thanks
If you are looking to extract only npaNumber, then you can do
parquetFileDF.select($"npaHeaderData.npaNumber".as("npaNumber"))
You should then have a dataframe with only the npaNumber column.
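A small follow-up sketch, as an assumption about what you might do next rather than part of the answer above: collect the selected values to the driver as a local array.
import spark.implicits._
val npaNumbers: Array[String] = parquetFileDF
  .select($"npaHeaderData.npaNumber".as("npaNumber"))
  .as[String]
  .collect()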
You can call select() on the dataframe, which will give you a new dataframe with only the specified column:
var newDataFrame = dataFrame.select(dataFrame("npaHeaderData.npaNumber").as("npaNumber"))
You can do as below, which will avoid the [] while reading data from a data frame.
// ids: DataFrame with columns {id, name}
val idRDDs = ids.rdd.map(x => x.getAs[String](0))
for (id <- idRDDs) {
  println(id)
}
The above will solve your issue.

Get elements of type structure of row by name in SPARK SCALA

In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract structure values by name?
I am using the code below to extract by name, but I am facing a problem with how to read the struct value.
If values had been of type string then we could have done this:
val resultDF = joinedDF.rdd.map { row =>
  val id = row.getAs[Long]("id")
  val values = row.getAs[String]("slotSize")
  val feilds = row.getAs[String](values)
  (id, values, feilds)
}.toDF("id", "values", "feilds")
But in my case values has the below schema
v1: struct (nullable = true)
| |-- level1: string (nullable = true)
| |-- level2: string (nullable = true)
| |-- level3: string (nullable = true)
| |-- level4: string (nullable = true)
| |-- level5: string (nullable = true)
What should I replace this line with to make the code work, given that values has the above structure?
row.getAs[String](values)
You can access the struct elements by first extracting another Row (structs are modeled as another Row in Spark) from the top-level Row, like this:
Scala Implementation
val level1 = row.getAs[Row]("struct").getAs[String]("level1")
Java Implementation
String level1 = f.<Row>getAs("struct").getAs("level1").toString();
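Applied to the code from the question, a hedged sketch might look like the following (it assumes the column named by slotSize is the struct shown above and that level1 is the field you want):
import org.apache.spark.sql.Row
import spark.implicits._
val resultDF = joinedDF.rdd.map { row =>
  val id = row.getAs[Long]("id")
  val values = row.getAs[String]("slotSize") // name of the struct column
  val struct = row.getAs[Row](values)        // extract the nested Row first
  val level1 = struct.getAs[String]("level1")
  (id, values, level1)
}.toDF("id", "values", "level1")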