Spark Scala DataFrame Single Row conversion to JSON for PostgreSQL Insertion

With a DataFrame called lastTail, I can iterate like this:
import scalikejdbc._
// ...
// Do Kafka Streaming to create DataFrame lastTail
// ...
lastTail.printSchema
lastTail.foreachPartition(iter => {
  // open database connection from connection pool
  // with scalikeJDBC (to PostgreSQL)
  while (iter.hasNext) {
    val item = iter.next()
    println("****")
    println(item.getClass)
    println(item.getAs("fileGid"))
    println("Schema: " + item.schema)
    println("String: " + item.toString())
    println("Seqnce: " + item.toSeq)
    // convert this item into an XXX format (like JSON)
    // write row to DB in the selected format
  }
})
This outputs something like the following (redacted):
root
|-- fileGid: string (nullable = true)
|-- eventStruct: struct (nullable = false)
| |-- eventIndex: integer (nullable = true)
| |-- eventGid: string (nullable = true)
| |-- eventType: string (nullable = true)
|-- revisionStruct: struct (nullable = false)
| |-- eventIndex: integer (nullable = true)
| |-- eventGid: string (nullable = true)
| |-- eventType: string (nullable = true)
and, for a single iteration item (redacted, but hopefully with the syntax intact):
****
class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
12345
Schema: StructType(StructField(fileGid,StringType,true), StructField(eventStruct,StructType(StructField(eventIndex,IntegerType,true), StructField(eventGid,StringType,true), StructField(eventType,StringType,true)), StructField(revisionStruct,StructType(StructField(eventIndex,IntegerType,true), StructField(eventGid,StringType,true), StructField(eventType,StringType,true), StructField(editIndex,IntegerType,true)),false))
String: [12345,[1,4,edit],[1,4,revision]]
Seqnce: WrappedArray(12345, [1,4,edit], [1,4,revision])
Note: I am doing the equivalent of the val metric = iter.sum part of https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartition.scala, but with DataFrames instead. I am also following "Design Patterns for using foreachRDD" from http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning.
How can I convert this
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
(see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala)
iteration item into something that is easily written (JSON or ...? - I'm open) to PostgreSQL? (If not JSON, please suggest how to read this value back into a DataFrame for use at another point.)

Well, I figured out a different way to do this as a workaround.
val ltk = lastTail.select($"fileGid").rdd.map(fileGid => fileGid.toString)
val ltv = lastTail.toJSON
val kvPair = ltk.zip(ltv)
Then I would simply iterate over the RDD instead of the DataFrame.
kvPair.foreachPartition(iter => {
  while (iter.hasNext) {
    val item = iter.next()
    println(item.getClass)
    println(item)
  }
})
The data aside, I get class scala.Tuple2, which makes for an easier way to store KV pairs via JDBC in PostgreSQL. I'm sure there are other ways that are not workarounds; a sketch of the JDBC write itself is below.
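For the write step, here is a minimal scalikeJDBC sketch. It assumes the connection pool has been initialized on each executor, and the table and column names (events, file_gid, payload) are hypothetical, not from the original setup:
import scalikejdbc._

kvPair.foreachPartition { iter =>
  // Assumes ConnectionPool.singleton(url, user, password) has already
  // been called on this executor (e.g. from a lazily initialized object)
  DB.localTx { implicit session =>
    iter.foreach { case (k, v) =>
      // "events", "file_gid" and "payload" are hypothetical names;
      // the interpolated values are bound as JDBC parameters
      sql"insert into events (file_gid, payload) values ($k, $v)".update.apply()
    }
  }
}
One transaction per partition keeps commit overhead low, which matches the connection-per-partition pattern from the streaming guide linked above.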

Related

Error while adding a new utf8 string column to Row in Scala spark

I am trying to add a new column to each row of a DataFrame like this:
def addNamespace(iter: Iterator[Row]): Iterator[Row] = {
  iter.map(row => {
    println(row.getString(0))
    // Row.fromSeq(row.toSeq ++ Array[String]("shared"))
    val newseq = row.toSeq ++ Array[String]("shared")
    Row(newseq: _*)
  })
  iter
}
def transformDf(source: DataFrame)(implicit spark: SparkSession): DataFrame = {
  val newSchema = StructType(source.schema.fields ++ Array(StructField("namespace", StringType, nullable = true)))
  val df = spark.sqlContext.createDataFrame(source.rdd.mapPartitions(addNamespace), newSchema)
  df.show()
  df
}
But I keep getting this error on the line df.show():
Caused by: java.lang.RuntimeException: org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
Can somebody please help me figure this out? I have searched through multiple posts, but whatever I have tried gives me this error.
I have also tried val again = sourceDF.withColumn("namespace", functions.lit("shared")), but it has the same issue.
Schema of already read data
root
|-- name: string (nullable = true)
|-- data: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- description: string (nullable = true)
| |-- activates_on: timestamp (nullable = true)
| |-- expires_on: timestamp (nullable = true)
| |-- created_by: string (nullable = true)
| |-- created_on: timestamp (nullable = true)
| |-- updated_by: string (nullable = true)
| |-- updated_on: timestamp (nullable = true)
| |-- properties: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
Caused by: java.lang.RuntimeException: org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
means Spark is unable to interpret the newly added "namespace" column as string type. This clearly indicates a datatype mismatch error at the Catalyst level. See the Spark code here:
override def eval(input: InternalRow): Any = {
  val result = child.eval(input)
  if (checkType(result)) {
    result
  } else {
    throw new RuntimeException(s"${result.getClass.getName}$errMsg")
  }
}
and the error message is s"is not a valid external type for schema of ${expected.catalogString}".
So a UTF8String is not a real String; you need to encode/decode it before passing it as string type, otherwise Catalyst will not be able to understand what you are passing.
How to fix it?
Below is SO content that addresses how to encode/decode between UTF8String and String; you may need to apply a suitable solution from it.
https://stackoverflow.com/a/5943395/647053
string decode utf-8
Note:
An online UTF-8 encoder/decoder tool is very handy for putting in sample data and converting it to a string. Try that first.
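For illustration, here is a minimal sketch of converting between Spark's internal UTF8String and a plain Java String (UTF8String.fromString and toString live in org.apache.spark.unsafe.types; where exactly to apply them depends on where the UTF8String leaks into your rows):
import org.apache.spark.unsafe.types.UTF8String

// String -> UTF8String: fromString encodes a Java String as UTF-8 bytes
val internal: UTF8String = UTF8String.fromString("shared")

// UTF8String -> String: toString decodes back to a plain Java String,
// which Catalyst accepts as an external value for a string column
val external: String = internal.toString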

Casting an array of Doubles to String in spark sql

I'm trying to read data from a JSON file which has an array holding lat/long values, something like [48.597315,-43.206085], and I want to parse them in Spark SQL as a single string. Is there a way I can do that?
My JSON input looks something like this:
{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
I'm trying to push this to an RDBMS store, and when I try to cast position.coordinates to string, it gives me
Can't get JDBC type for array<string>
since the destination datatype is nvarchar. Any kind of help is appreciated!
You can read your json file into a DataFrame, then 1) use concat_ws to stringify your lat/lon array into a single column, and 2) use struct to re-assemble the position struct-type column as follows:
// jsonfile:
// {"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
import org.apache.spark.sql.functions._
val df = spark.read.json("/path/to/jsonfile")
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = true)
// | |-- coordinates: array (nullable = true)
// | | |-- element: double (containsNull = true)
// | |-- type: string (nullable = true)
df.withColumn("coordinates", concat_ws(",", $"position.coordinates")).
  select($"id", struct($"coordinates", $"position.type").as("position")).
  show(false)
// +-----+----------------------------+
// |id |position |
// +-----+----------------------------+
// |11700|[48.597315,-43.206085,Point]|
// +-----+----------------------------+
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = false)
// | |-- coordinates: string (nullable = false)
// | |-- type: string (nullable = true)
[UPDATE]
Using Spark SQL:
df.createOrReplaceTempView("position_table")
spark.sql("""
select id, concat_ws(',', position.coordinates) as position_coordinates
from position_table
""").
show(false)
//+-----+--------------------+
//|id |position_coordinates|
//+-----+--------------------+
//|11700|48.597315,-43.206085|
//|11800|49.611254,-43.90223 |
//+-----+--------------------+
You have to transform the given column into a string before loading it into the target datasource. For example, the following code creates a new column named position.coordinates whose value is the array of doubles cast to a string, with the enclosing brackets removed afterward via regexp_replace:
df.withColumn("position.coordinates", regexp_replace($"position.coordinates".cast("string"), "\\[|\\]", ""))
Alternatively, you can use a UDF to create a custom transformation function on Row objects. That way you can maintain the nested structure of the column. The following source (answer number 2) can give you some ideas on how to use a UDF for your case: Spark UDF with nested structure as input parameter. A sketch of such a UDF follows.
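As a minimal sketch under the schema above (where position.coordinates is an array of doubles), a UDF that joins the array into one comma-separated string could look like this; the UDF and output column names are illustrative:
import org.apache.spark.sql.functions.{col, udf}

// Joins an array-of-doubles column into a single comma-separated string
val coordsToString = udf((coords: Seq[Double]) => coords.mkString(","))

// Adds the stringified form as a new column; "coordinates_str" is an illustrative name
val result = df.withColumn("coordinates_str", coordsToString(col("position.coordinates")))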

Spark Dataframe - How to get a particular field from a struct type column

I have a data frame with a structure like this:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
I want to retrieve all npaNumber values from all the rows in the dataframe.
My approach was to iterate over all rows in the data frame and extract, for each one, the value stored in the field npaNumber of the column npaHeaderData. So I coded the following lines:
parquetFileDF.foreach { newRow =>
  // To retrieve the second column
  val column = newRow.get(1)
  // The following line is not allowed
  // val npaNumber = column.getAs[String]("npaNumber")
  println(column)
}
The content of column printed in each iteration looks like:
[207400956,27FEB17,09.30.00]
But column is of type Any and I am not able to extract any of its fields. Can anyone tell me what I am doing wrong, or what approach I should follow instead?
Thanks
If you are looking to extract only npaNumber, then you can do
parquetFileDF.select($"npaHeaderData.npaNumber".as("npaNumber"))
and you should have a dataframe with only the npaNumber column.
You can call select() on a dataframe, which will give you a new dataframe with only the specified column:
var newDataFrame = dataFrame.select(dataFrame("npaHeaderData.npaNumber").as("npaNumber"))
You can do as below, which will avoid the [] brackets while reading data from a data frame.
// ids is a DataFrame with columns {id, name}
val idRDDs = ids.rdd.map(x => x.getAs[String](0))
idRDDs.collect().foreach(id => println(id))
The above approach should solve your issue.

Convert Json WrappedArray to String using spark sql

I'm working in a Zeppelin notebook and trying to load data from a table using SQL.
In the table, each row has one column which is a JSON blob. For example, [{'timestamp':12345,'value':10},{'timestamp':12346,'value':11},{'timestamp':12347,'value':12}]
I want to select the JSON blob as a string, like the original string. But Spark automatically loads it as a WrappedArray.
It seems that I have to write a UDF to convert the WrappedArray to a string. The following is my code.
I first define a Scala function and then register the function. And then use the registered function on the column.
val unwraparr = udf ((x: WrappedArray[(Int, Int)]) => x.map { case Row(val1: String) => + "," + val2 })
sqlContext.udf.register("fwa", unwraparr)
It doesn't work. I would really appreciate if anyone can help.
The following is the schema of the part I'm working on. There will be many amount and timeStamp pairs.
|-- targetColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: long (nullable = true)
| | |-- timeStamp: string (nullable = true)
UPDATE:
I come up with the following code:
val f = (x: Seq[Row]) => x.map { case Row(val1: Long, val2: String) => x.mkString("+") }
I need it to concatenate the objects/structs/rows (I'm not sure what to call the struct) into a single string.
If your data, loaded as a dataframe/dataset in Spark, is as below with the schema
+------------------------------------+
|targetColumn |
+------------------------------------+
|[[12345,10], [12346,11], [12347,12]]|
|[[12345,10], [12346,11], [12347,12]]|
+------------------------------------+
root
|-- targetColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timeStamp: string (nullable = true)
| | |-- value: long (nullable = true)
Then you can write the dataframe as JSON to a temporary JSON file, read it back as a text file, parse each String line, and convert it to a dataframe as below (/home/testing/test.json is the temporary JSON file location):
df.write.mode(SaveMode.Overwrite).json("/home/testing/test.json")
val data = sc.textFile("/home/testing/test.json")
val rowRdd = data.map(jsonLine => Row(jsonLine.split(":\\[")(1).replace("]}", "")))
val stringDF = sqlContext.createDataFrame(rowRdd, StructType(Array(StructField("targetColumn", StringType, true))))
This should leave you with the following dataframe and schema:
+--------------------------------------------------------------------------------------------------+
|targetColumn |
+--------------------------------------------------------------------------------------------------+
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
+--------------------------------------------------------------------------------------------------+
root
|-- targetColumn: string (nullable = true)
I hope the answer is helpful.
Read initially as text, not as a dataframe: you can use the second phase of my answer (reading the JSON file back and parsing it) in place of the first phase of getting a dataframe.
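As an alternative sketch that avoids the file round trip, assuming Spark 2.2 or later (where to_json accepts array-of-struct columns), the column can be serialized directly:
import org.apache.spark.sql.functions.{col, to_json}

// Serializes the array<struct> column to its JSON string form in one pass
val stringDF = df.withColumn("targetColumn", to_json(col("targetColumn")))
Note that this keeps the surrounding square brackets, unlike the bracket-stripped output of the parsing approach above.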

Get elements of type structure of row by name in SPARK SCALA

In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract structure values by name?
I am using the below code to extract by name, but I am facing a problem with how to read the struct value.
If values had been of type string then we could have done this:
val resultDF = joinedDF.rdd.map { row =>
  val id = row.getAs[Long]("id")
  val values = row.getAs[String]("slotSize")
  val feilds = row.getAs[String](values)
  (id, values, feilds)
}.toDF("id", "values", "feilds")
But in my case values has the below schema
|-- v1: struct (nullable = true)
| |-- level1: string (nullable = true)
| |-- level2: string (nullable = true)
| |-- level3: string (nullable = true)
| |-- level4: string (nullable = true)
| |-- level5: string (nullable = true)
What shall I replace this line with to make the code work, given that values has the above structure?
row.getAs[String](values)
You can access the struct elements by first extracting another Row from the top-level Row (structs are modeled as another Row in Spark), like this:
Scala Implementation
val level1 = row.getAs[Row]("struct").getAs[String]("level1")
Java Implementation
String level1 = f.<Row>getAs("struct").getAs("level1").toString();
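Applied to the question's code, a sketch could look like this (assuming the struct column is named v1 as in the schema shown, and that spark.implicits._ is in scope for toDF):
import org.apache.spark.sql.Row

val resultDF = joinedDF.rdd.map { row =>
  val id = row.getAs[Long]("id")
  // The struct column comes back as a nested Row
  val v1 = row.getAs[Row]("v1")
  // Its fields can then be read by name
  val level1 = v1.getAs[String]("level1")
  (id, level1)
}.toDF("id", "level1")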