I'm working in a Zeppelin notebook and trying to load data from a table using SQL.
In the table, each row has one column that is a JSON blob. For example, [{'timestamp':12345,'value':10},{'timestamp':12346,'value':11},{'timestamp':12347,'value':12}]
I want to select the JSON blob as a string, like the original string, but Spark automatically loads it as a WrappedArray.
It seems that I have to write a UDF to convert the WrappedArray to a string. The following is my code.
I first define a Scala function, then register it, and then use the registered function on the column.
val unwraparr = udf((x: WrappedArray[Row]) => x.map { case Row(value: Long, timeStamp: String) => value + "," + timeStamp }.mkString(","))
sqlContext.udf.register("fwa", unwraparr)
It doesn't work. I would really appreciate it if anyone could help.
The following is the schema of the part I'm working on. There will be many value and timeStamp pairs.
|-- targetColumn: array (nullable = true)
|-- element: struct (containsNull = true)
| |-- value: long (nullable = true)
| |-- timeStamp: string (nullable = true)
UPDATE:
I came up with the following code:
val f = (x: Seq[Row]) => x.map { case Row(val1: Long, val2: String) => val1 + "+" + val2 }.mkString(",")
I need it to concatenate the objects/structs/rows (not sure what to call the struct) into a single string.
If the data you loaded as a dataframe/dataset in Spark is as below, with this schema:
+------------------------------------+
|targetColumn |
+------------------------------------+
|[[12345,10], [12346,11], [12347,12]]|
|[[12345,10], [12346,11], [12347,12]]|
+------------------------------------+
root
|-- targetColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timeStamp: string (nullable = true)
| | |-- value: long (nullable = true)
Then you can write the dataframe out as JSON to a temporary location, read it back as a text file, parse each String line, and convert it to a dataframe as below (/home/testing/test.json is the temporary JSON output location):
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

df.write.mode(SaveMode.Overwrite).json("/home/testing/test.json")

val data = sc.textFile("/home/testing/test.json")
// keep everything after ":[" and drop the trailing "]}" of each JSON line
val rowRdd = data.map(jsonLine => Row(jsonLine.split(":\\[")(1).replace("]}", "")))
val stringDF = sqlContext.createDataFrame(rowRdd, StructType(Array(StructField("targetColumn", StringType, true))))
This should leave you with the following dataframe and schema:
+--------------------------------------------------------------------------------------------------+
|targetColumn |
+--------------------------------------------------------------------------------------------------+
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
+--------------------------------------------------------------------------------------------------+
root
|-- targetColumn: string (nullable = true)
I hope the answer is helpful
What about reading the input initially as text, not as a dataframe?
You can use the second phase of my answer, i.e. reading from the json file and parsing, in your first phase of getting the dataframe.
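As a quick sanity check of the string surgery (pure Scala, no Spark needed), here is the same split/replace applied to a hypothetical line of df.write.json output:

```scala
// Hypothetical single output line of df.write.json for one row:
val jsonLine = """{"targetColumn":[{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11}]}"""

// Same parsing as in the answer: keep everything after ":[" and drop the trailing "]}".
val parsed = jsonLine.split(":\\[")(1).replace("]}", "")

println(parsed)
// {"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11}
```

Note that this is fragile: it assumes the column values themselves never contain ":[" or "]}".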
I am currently having some problems creating a Spark Row object and converting it to a Spark dataframe. What I am trying to achieve is this:
I have two lists of custom types that look more or less like the classes below.
case class MyObject(name:String,age:Int)
case class MyObject2(xyz:String,abc:Double)
val listOne = List(MyObject("aaa", 22), MyObject("sss", 223))
val listTwo = List(MyObject2("bbb", 23), MyObject2("abc", 2332))
Using these two lists I want to create a dataframe that has one row and two fields (fieldOne and fieldTwo):
fieldOne --> a list of structs (shaped like MyObject)
fieldTwo --> a list of structs (shaped like MyObject2)
To achieve this I created custom StructTypes for MyObject, MyObject2 and my result type.
val myObjSchema = StructType(List(
StructField("name",StringType),
StructField("age",IntegerType)
))
val myObjSchema2 = StructType(List(
StructField("xyz",StringType),
StructField("abc",DoubleType)
))
val myRecType = StructType(
List(
StructField("myField",ArrayType(myObjSchema)),
StructField("myField2",ArrayType(myObjSchema2))
)
)
I populated my data in a Spark Row object and created a dataframe:
val data = Row(
List(MyObject("aaa",22),MyObject("sss",223)),
List(MyObject2("bbb",23),MyObject2("abc",2332))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(data)),myRecType
)
When I call printSchema on the dataframe, the output is exactly what I would expect:
root
|-- myField: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
|-- myField2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- xyz: string (nullable = true)
| | |-- abc: double (nullable = true)
However, when I do a show, I get a runtime exception:
Caused by: java.lang.RuntimeException: spark_utilities.example.MyObject is not a valid external type for schema of struct<name:string,age:int>
It looks like something is wrong with the Row object; can you please explain what is going wrong here?
Thanks a lot for your help!
PS: I know I could create a custom case class like case class PH(ls: List[MyObject], ls2: List[MyObject2]), populate it, and convert it to a Dataset. But due to some limitations I cannot use this approach and would like to solve it in the way mentioned above.
You cannot simply put your case class objects inside a Row; you need to convert those objects to Rows themselves:
val data = Row(
List(Row("aaa", 22), Row("sss", 223)),
List(Row("bbb", 23d), Row("abc", 2332d))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(data)),myRecType
)
I'm trying to read data from a JSON file that has an array holding lat/long values, something like [48.597315,-43.206085], and I want to parse them in Spark SQL as a single string. Is there a way I can do that?
My JSON input will look something like the below:
{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
I'm trying to push this to an RDBMS store, and when I try to cast position.coordinates to string it gives me:
Can't get JDBC type for array<string>
since the destination datatype is nvarchar. Any kind of help is appreciated!
You can read your JSON file into a DataFrame, then (1) use concat_ws to stringify your lat/lon array into a single column, and (2) use struct to re-assemble the position struct-type column, as follows:
// jsonfile:
// {"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
import org.apache.spark.sql.functions._
val df = spark.read.json("/path/to/jsonfile")
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = true)
// | |-- coordinates: array (nullable = true)
// | | |-- element: double (containsNull = true)
// | |-- type: string (nullable = true)
df.withColumn("coordinates", concat_ws(",", $"position.coordinates")).
select($"id", struct($"coordinates", $"position.type").as("position")).
show(false)
// +-----+----------------------------+
// |id |position |
// +-----+----------------------------+
// |11700|[48.597315,-43.206085,Point]|
// +-----+----------------------------+
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = false)
// | |-- coordinates: string (nullable = false)
// | |-- type: string (nullable = true)
[UPDATE]
Using Spark SQL:
df.createOrReplaceTempView("position_table")
spark.sql("""
select id, concat_ws(',', position.coordinates) as position_coordinates
from position_table
""").
show(false)
//+-----+--------------------+
//|id |position_coordinates|
//+-----+--------------------+
//|11700|48.597315,-43.206085|
//|11800|49.611254,-43.90223 |
//+-----+--------------------+
You have to transform the given column into a string before loading it into the target datasource. For example, the following code creates a new column position.coordinates whose value is the joined string of the given array of doubles, by casting the array to its string form and removing the brackets afterward:
df.withColumn("position.coordinates", regexp_replace($"position.coordinates".cast("string"), "\\[|\\]", ""))
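The bracket-stripping part can be checked outside Spark. The same pattern applied to a string of the shape the cast produces (the exact rendering of the casted array is version-dependent, so treat the input here as an assumption):

```scala
// Assumed string form of the casted array column; the regex drops "[" and "]",
// as regexp_replace does above.
val asString = "[48.597315,-43.206085]"
val stripped = asString.replaceAll("\\[|\\]", "")

println(stripped)   // 48.597315,-43.206085
```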
Alternatively, you can use a UDF to create a custom transformation function on Row objects; that way you can maintain the nested structure of the column. The following source (answer number 2) can give you some idea of how to use a UDF for your case: Spark UDF with nested structure as input parameter.
I have a dataframe without a schema, and every column is stored as StringType, such as:
ID | LOG_IN_DATE | USER
1 | 2017-11-01 | Johns
Now I created the schema as a list, [("ID","double"),("LOG_IN_DATE","date"),("USER","string")], and I would like to apply it to the above dataframe in Spark 2.0.2 with Scala 2.11.
I already tried:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
There's no error while running this, but afterwards when I call df.schema, nothing has changed.
Any idea how I could programmatically apply the schema to df? My friend told me I can use the foldLeft method, but I don't think this is a method in Spark 2.0.2, either on df or on rdd.
If you already have the list [("ID","double"),("LOG_IN_DATE","date"),("USER","string")], you can use select, casting each column to its type from the list.
Your dataframe
val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "jons2")).toDF("ID", "LOG_IN_DATE", "USER")
Your schema
val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))
Cast each column to its type from the list:
val newColumns = schema.map(c => col(c._1).cast(c._2))
Select all the cast columns:
val newDF = df.select(newColumns:_*)
Print Schema
newDF.printSchema()
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Show Dataframe
newDF.show()
Output:
+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |jons2|
+---+-----------+-----+
My friend told me I can use foldLeft method but I don't think this is a method in Spark 2.0.2 neither in df nor rdd
Yes, foldLeft is the way to go
This is the schema before using foldLeft
root
|-- ID: string (nullable = true)
|-- LOG_IN_DATE: string (nullable = true)
|-- USER: string (nullable = true)
Using foldLeft
val schema = List(("ID","double"),("LOG_IN_DATE","date"),("USER","string"))
import org.apache.spark.sql.functions._
schema.foldLeft(df) { case (tempdf, x) => tempdf.withColumn(x._1, col(x._1).cast(x._2)) }.printSchema()
and this is the schema after foldLeft
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
I hope the answer is helpful
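For what it's worth, foldLeft is a plain Scala collections method, not Spark API: it threads an accumulator (here the DataFrame) through the list. A Spark-free sketch of the same shape, folding the (name, type) pairs over a Map standing in for the row:

```scala
// The accumulator plays the role of the DataFrame; each step "casts" one field,
// mirroring tempdf.withColumn(name, col(name).cast(tpe)).
val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))
val row: Map[String, Any] = Map("ID" -> "1", "LOG_IN_DATE" -> "2017-11-01", "USER" -> "Johns")

val casted = schema.foldLeft(row) { case (acc, (name, tpe)) =>
  val v = acc(name).toString
  val newValue: Any = tpe match {
    case "double" => v.toDouble
    case _        => v          // dates/strings left as strings in this sketch
  }
  acc.updated(name, newValue)
}

println(casted("ID"))   // 1.0
```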
Spark DataFrames are immutable: any transformation returns new data, so you cannot change the data types of an existing dataframe's schema in place.
Below is code that creates a new dataframe with the modified schema by casting columns.
1. Create a new DataFrame
val df=Seq((1,"2017-11-01","Johns"),(2,"2018-01-03","Alice")).toDF("ID","LOG_IN_DATE","USER")
2. Register the DataFrame as a temp view
df.createOrReplaceTempView("user")
3. Now create a new DataFrame by casting the column data types
val new_df=spark.sql("""SELECT ID,TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE,USER from user""")
4. Display the schema
new_df.printSchema
root
|-- ID: integer (nullable = false)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Actually what you did:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
could work, but you need to define your dataframe as a var and do it like this (note that type is a reserved word in Scala, so bind it under another name):
for ((name, dataType) <- schema) {
  df = df.withColumn(name, col(name).cast(dataType))
}
Also, you could try reading your dataframe like this:
import java.sql.Date
import spark.implicits._

case class MyClass(ID: Int, LOG_IN_DATE: Date, USER: String)
// Suppose you are reading from json
val df = spark.read.json(path).as[MyClass]
Hope this helps!
I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field in these JSON objects is a JSON-escaped string. Example:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}
When I create a data frame by reading the json file, it creates the data frame as below:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]
As we can see, "escapedJsonPayload" is a String, and I need it to be a Struct.
Note: I found a similar question on StackOverflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?), but it gives me "[_corrupt_record: string]".
I have tried the steps below:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") // works fine
val escapedJsons: RDD[String] = sc.parallelize(Seq("""df"""))
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) // this results in [_corrupt_record: string]
Any help would be appreciated
First of all, the JSON you have provided is syntactically malformed. The corrected JSON is as follows:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}
Next, to parse the JSON correctly from the above, use the following code:
val rdd = spark.read.textFile("file:///home/akaspate/sample.json")
  .toJSON
  .map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}"))
  .rdd

val df = spark.read.json(rdd)
The above code will give you the following output:
df.show(false)
+----------------+-------------------------------------+
|clientAttributes|escapedJsonPayload |
+----------------+-------------------------------------+
|[null,abc] |[WrappedArray([abc,xyz]),Akash,Patel]|
+----------------+-------------------------------------+
With the following schema:
df.printSchema
root
|-- clientAttributes: struct (nullable = true)
| |-- backfillId: string (nullable = true)
| |-- clientPrimaryKey: string (nullable = true)
|-- escapedJsonPayload: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemId: string (nullable = true)
| | | |-- itemName: string (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
I hope this helps !
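The replace chain can be sanity-checked in plain Scala on a minimal toJSON-style line (the payload here is a simplified stand-in, not the full record from the question):

```scala
// read.textFile(...).toJSON wraps each raw text line as {"value":"...\"-escaped..."}.
val wrapped = """{"value":"{\"id\":\"11700\"}"}"""

// Same chain as in the answer: drop backslashes, the {"value":" prefix, and the }"} suffix.
val unwrapped = wrapped
  .replace("\\", "")
  .replace("{\"value\":\"", "")
  .replace("}\"}", "}")

println(unwrapped)   // {"id":"11700"}
```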
With a DataFrame called lastTail, I can iterate like this:
import scalikejdbc._
// ...
// Do Kafka Streaming to create DataFrame lastTail
// ...
lastTail.printSchema
lastTail.foreachPartition(iter => {
// open database connection from connection pool
// with scalikeJDBC (to PostgreSQL)
while(iter.hasNext) {
val item = iter.next()
println("****")
println(item.getClass)
println(item.getAs("fileGid"))
println("Schema: "+item.schema)
println("String: "+item.toString())
println("Seqnce: "+item.toSeq)
// convert this item into an XXX format (like JSON)
// write row to DB in the selected format
}
})
This outputs "something like" (with redaction):
root
|-- fileGid: string (nullable = true)
|-- eventStruct: struct (nullable = false)
| |-- eventIndex: integer (nullable = true)
| |-- eventGid: string (nullable = true)
| |-- eventType: string (nullable = true)
|-- revisionStruct: struct (nullable = false)
| |-- eventIndex: integer (nullable = true)
| |-- eventGid: string (nullable = true)
| |-- eventType: string (nullable = true)
and (with just one iteration item - redacted, but hopefully with good enough syntax as well)
****
class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
12345
Schema: StructType(StructField(fileGid,StringType,true), StructField(eventStruct,StructType(StructField(eventIndex,IntegerType,true), StructField(eventGid,StringType,true), StructField(eventType,StringType,true)), StructField(revisionStruct,StructType(StructField(eventIndex,IntegerType,true), StructField(eventGid,StringType,true), StructField(eventType,StringType,true), StructField(editIndex,IntegerType,true)),false))
String: [12345,[1,4,edit],[1,4,revision]]
Seqnce: WrappedArray(12345, [1,4,edit], [1,4,revision])
Note: I'm doing something like the val metric = iter.sum part of https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartition.scala, but with DataFrames instead. I am also following "Design Patterns for using foreachRDD" from http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning.
How can I convert this
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
(see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala)
iteration item into something that is easily written (as JSON or ...? I'm open) to PostgreSQL? (If not JSON, please suggest how to read this value back into a DataFrame for use at another point.)
Well, I figured out a different way to do this, as a workaround.
val ltk = lastTail.select($"fileGid").rdd.map(fileGid => fileGid.toString)
val ltv = lastTail.toJSON
val kvPair = ltk.zip(ltv)
Then I simply iterate over the RDD instead of the DataFrame.
kvPair.foreachPartition(iter => {
while(iter.hasNext) {
val item = iter.next()
println(item.getClass)
println(item)
}
})
The data aside, I get class scala.Tuple2, which makes for an easier way to store KV pairs via JDBC / PostgreSQL.
I'm sure there are still other ways that are not workarounds.
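The zip pairing itself is ordinary collections behaviour; a plain-Scala sketch (values hypothetical) of the key/JSON Tuple2s being produced:

```scala
// Keys from one column, values from toJSON; zip pairs them positionally into Tuple2s.
val keys  = List("[12345]", "[67890]")   // Row.toString renders a one-field row as "[value]"
val jsons = List("""{"fileGid":"12345"}""", """{"fileGid":"67890"}""")

val kvPair = keys.zip(jsons)

println(kvPair.head)   // ([12345],{"fileGid":"12345"})
```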