I have the following schemas that I read from CSV:
val PersonSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("Name",StringType,true)))
val AddressSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("StreetNumber",StringType,true), StructField("StreetName",StringType,true)))
One person can have multiple addresses; the two are related through PersonID.
Can someone help me transform these records into PersonAddress records, as in the following case class definitions?
case class Address(StreetNumber:String, StreetName:String)
case class PersonAddress(PersonID:String, Name:String, Addresses:Array[Address])
I have tried the following, but it throws an exception in the last step:
val results = personData.join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list(struct("StreetNumber", "StreetName")) as "Addresses")

val personAddresses = results.map(data =>
  PersonAddress(data.getAs("PersonID"), data.getAs("Name"), data.getAs("Addresses")))
personAddresses.show
Gives an error:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to $line26.$read$$iw$$iw$Address
getAs("Addresses") returns the collected structs as a WrappedArray[Row], which cannot be cast to Array[Address]; hence the ClassCastException. The easiest solution in this situation would be to use a UDF. First collect the street numbers and names as two separate lists, then use the UDF to convert everything into a dataframe of PersonAddress.
val convertToCase = udf((id: String, name: String, streetNumbers: Seq[String], streetNames: Seq[String]) => {
  // zip the numbers with the names, in the same order they are passed in below
  val addresses = streetNumbers.zip(streetNames)
  PersonAddress(id, name, addresses.map(t => Address(t._1, t._2)).toArray)
})
val results = personData.join(addressData, Seq("PersonID"), "left_outer")
.groupBy("PersonID","Name")
.agg(collect_list($"StreetNumber").as("StreetNumbers"),
collect_list($"StreetName").as("StreetNames"))
val personAddresses = results.select(convertToCase($"PersonID", $"Name", $"StreetNumbers", $"StreetNames").as("Person"))
This will give you the following schema.
root
|-- Person: struct (nullable = true)
| |-- PersonID: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Addresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- StreetNumber: string (nullable = true)
| | | |-- StreetName: string (nullable = true)
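As an aside, if you are on Spark 2.x with the implicits in scope, here is a sketch of an alternative that avoids the UDF entirely: keep your original collect_list(struct(...)) aggregation (the struct field names already match the Address case class) and let the built-in encoders do the conversion with as[PersonAddress].
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._ // assumes a SparkSession named `spark`

// PersonID/Name/Addresses and StreetNumber/StreetName line up with the case
// class fields, so the encoder can map each row directly to PersonAddress.
val typedResults = personData
  .join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list(struct("StreetNumber", "StreetName")) as "Addresses")
  .as[PersonAddress]

typedResults.show(false)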
I have the following schema and I called a UDF on the column referenceTypes:
|-- referenceTypes: struct (nullable = true)
| |-- map: map (nullable = true)
| | |-- key: string
| | |-- value: long (valueContainsNull = true)
The UDF:
val mapFilter = udf[Map[String, Long], Map[String, Long]](map => {
  map.keySet.exists(_ != "Family") // note: the result of this check is never used
  val newMap = map.updated("Family", 1L)
  newMap
})
Now, after the UDF is applied, my schema becomes:
|-- referenceTypes: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = false)
What do I do to get referenceTypes back as a struct with map one level below it? In other words, how do I convert it back to the original schema at the top, with the struct as the parent and the map nested underneath? The bottom schema has to look like the top one again, but I don't know what changes to make to the UDF.
I tried toArray (thinking it might become a struct) and toMap as well.
Basically I need to bring back the outer []:
Actual: Map(Family -> 1)
Expected: [Map(Family -> 1)]
You have to add struct:
import org.apache.spark.sql.functions.struct
df.withColumn(
"referenceTypes",
struct(mapFilter($"referenceTypes.map").alias("map")))
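Applying that and calling printSchema on the result should restore the struct wrapper. A minimal check, assuming the input dataframe is called df (the exact nullability flags may differ slightly from the original):
val restored = df.withColumn(
  "referenceTypes",
  struct(mapFilter($"referenceTypes.map").alias("map")))

restored.printSchema()
// root
//  |-- referenceTypes: struct (nullable = false)
//  | |-- map: map (nullable = true)
//  | | |-- key: string
//  | | |-- value: long (valueContainsNull = false)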
I have an RDD that has been created from some JSON; each record in the RDD contains key/value pairs. My RDD looks like:
myRdd.foreach(println)
{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1],
{"sequence":153,"id":8697389197662617,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637852762},1],
{"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1],
{"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1],
{"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1],
{"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}]
I would like to convert each record to a row in a Spark dataframe; the nested fields in trackingInfo should be their own columns, and the type list should be its own column as well.
So far I've tried to split it using a case class:
case class Event(
sequence: String,
id: String,
trackingInfo:String,
location:String,
row:String,
trackId: String,
listrequestId: String,
videoId:String,
rank: String,
requestId: String,
`type`:String,
time: String)
val dataframeRdd = myRdd.map(line => line.split(",")).
map(array => Event(
array(0).split(":")(1),
array(1).split(":")(1),
array(2).split(":")(1),
array(3).split(":")(1),
array(4).split(":")(1),
array(5).split(":")(1),
array(6).split(":")(1),
array(7).split(":")(1),
array(8).split(":")(1),
array(9).split(":")(1),
array(10).split(":")(1),
array(11).split(":")(1)
))
However, I keep getting java.lang.ArrayIndexOutOfBoundsException: 1 errors.
What is the best way to do this? As you can see, record number 5 has a slightly different ordering of some attributes. Is it possible to parse based on attribute names instead of splitting on "," etc.?
I'm using Spark 1.6.x
Your JSON RDD seems to contain invalid JSON strings. You need to convert them to valid JSON first:
val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}"))
Then you can use the sqlContext to read the valid JSON RDD into a dataframe:
val df = sqlContext.read.json(validJsonRdd)
which should give you the following dataframe (I used the invalid JSON you provided in the question):
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|id |sequence|time |trackingInfo |type |
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|8697344444103393|89 |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389197662617|153 |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389381205360|155 |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697374208897843|136 |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697413135394406|189 |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830] |[Play, Action, Session]|
|8697373887446384|130 |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
and the schema for the dataframe is
root
|-- id: long (nullable = true)
|-- sequence: long (nullable = true)
|-- time: long (nullable = true)
|-- trackingInfo: struct (nullable = true)
| |-- listId: string (nullable = true)
| |-- location: string (nullable = true)
| |-- rank: long (nullable = true)
| |-- requestId: string (nullable = true)
| |-- row: long (nullable = true)
| |-- trackId: long (nullable = true)
| |-- videoId: long (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
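To get the nested trackingInfo fields as their own columns, as asked in the question, you can select them explicitly from the struct. A small follow-up sketch (column names are taken from the schema above; df is the dataframe read above):
import sqlContext.implicits._

// Pull each trackingInfo field up to the top level; `type` stays an array column.
val flatDf = df.select(
  $"id", $"sequence", $"time",
  $"trackingInfo.listId".as("listId"),
  $"trackingInfo.location".as("location"),
  $"trackingInfo.rank".as("rank"),
  $"trackingInfo.requestId".as("requestId"),
  $"trackingInfo.row".as("row"),
  $"trackingInfo.trackId".as("trackId"),
  $"trackingInfo.videoId".as("videoId"),
  $"type")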
I hope the answer is helpful
You can use sqlContext.read.json(myRDD.map(_._2)) to read json into a dataframe
I'm working in a Zeppelin notebook and trying to load data from a table using SQL.
In the table, each row has one column which is a JSON blob. For example, [{'timestamp':12345,'value':10},{'timestamp':12346,'value':11},{'timestamp':12347,'value':12}]
I want to select the JSON blob as a string, like the original string, but Spark automatically loads it as a WrappedArray.
It seems that I have to write a UDF to convert the WrappedArray to a string. The following is my code: I first define a Scala function, register it, and then use the registered function on the column.
val unwraparr = udf ((x: WrappedArray[(Int, Int)]) => x.map { case Row(val1: String) => + "," + val2 })
sqlContext.udf.register("fwa", unwraparr)
It doesn't work. I would really appreciate it if anyone could help.
The following is the schema of the part I'm working on. There will be many value/timeStamp pairs.
|-- targetColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: long (nullable = true)
| | |-- timeStamp: string (nullable = true)
UPDATE:
I came up with the following code:
val f = (x: Seq[Row]) => x.map { case Row(val1: Long, val2: String) => x.mkString("+") }
I need it to concatenate the objects/structs/rows (not sure what to call them) into a single string.
If the data you loaded as a dataframe/dataset in Spark looks like the following, with this schema:
+------------------------------------+
|targetColumn |
+------------------------------------+
|[[12345,10], [12346,11], [12347,12]]|
|[[12345,10], [12346,11], [12347,12]]|
+------------------------------------+
root
|-- targetColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timeStamp: string (nullable = true)
| | |-- value: long (nullable = true)
then you can write the dataframe out as JSON to a temporary location, read it back as a text file, parse each String line, and convert it to a dataframe as below (/home/testing/test.json is the temporary JSON location):
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

df.write.mode(SaveMode.Overwrite).json("/home/testing/test.json")

val data = sc.textFile("/home/testing/test.json")
val rowRdd = data.map(jsonLine => Row(jsonLine.split(":\\[")(1).replace("]}", "")))
val stringDF = sqlContext.createDataFrame(rowRdd, StructType(Array(StructField("targetColumn", StringType, true))))
This should leave you with the following dataframe and schema:
+--------------------------------------------------------------------------------------------------+
|targetColumn |
+--------------------------------------------------------------------------------------------------+
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
+--------------------------------------------------------------------------------------------------+
root
|-- targetColumn: string (nullable = true)
I hope the answer is helpful
Could I read it initially as text, not as a dataframe?
You can apply the second phase of my answer, i.e. reading from the JSON file and parsing, to your first phase of getting the dataframe.
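As an alternative to the temporary-file round trip, here is a sketch that builds the string directly with a UDF over the struct rows, looking the fields up by name (this assumes the rows arrive with their schema attached, which is normally the case for struct columns passed to a UDF):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Serialize each {timeStamp, value} struct by hand and join the pieces with commas.
val stringify = udf((xs: Seq[Row]) =>
  xs.map(r => s"""{"timeStamp":"${r.getAs[String]("timeStamp")}","value":${r.getAs[Long]("value")}}""")
    .mkString(","))

val stringDF2 = df.withColumn("targetColumn", stringify($"targetColumn"))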
I have a UDF that converts a Map (in this case String -> String) to an Array of Struct using the Scala built-in toArray function
val toArray = udf((vs: Map[String, String]) => vs.toArray)
The field names of the resulting structs are _1 and _2.
How can I change the UDF definition so that the key field is named "key" and the value field is named "value"?
[{"_1":"aKey","_2":"aValue"}]
to
[{"key":"aKey","value":"aValue"}]
You can use a case class:
case class KV(key:String, value: String)
val toArray = udf((vs: Map[String, String]) => vs.map {
  case (k, v) => KV(k, v)
}.toArray)
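A quick usage check on a hypothetical single-column dataframe (spark.implicits assumed to be in scope; the nullability flags may vary):
val df = Seq(Map("aKey" -> "aValue")).toDF("mapColumn")

df.select(toArray($"mapColumn").as("mapColumn")).printSchema()
// root
//  |-- mapColumn: array (nullable = true)
//  | |-- element: struct (containsNull = true)
//  | | |-- key: string (nullable = true)
//  | | |-- value: string (nullable = true)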
Spark 3.0+
map_entries($"col_name")
This converts a map to an array of struct with struct field names key and value.
Example:
val df = Seq((Map("aKey"->"aValue", "bKey"->"bValue"))).toDF("col_name")
val df2 = df.withColumn("col_name", map_entries($"col_name"))
df2.printSchema()
// root
// |-- col_name: array (nullable = true)
// | |-- element: struct (containsNull = false)
// | | |-- key: string (nullable = false)
// | | |-- value: string (nullable = true)
For custom field names, just cast the column to a new schema:
val new_schema = "array<struct<k2:string,v2:string>>"
val df2 = df.withColumn("col_name", map_entries($"col_name").cast(new_schema))
df2.printSchema()
// root
// |-- col_name: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- k2: string (nullable = true)
// | | |-- v2: string (nullable = true)