How to pass a dataframe column to scala function - scala

I've written a scala function which will convert time(HH:mm:ss.SSS) to seconds. First it will ignore milliseconds and will take only (HH:mm:ss) and convert into seconds(int). It works fine when testing in spark-shell.
def hoursToSeconds(a: Any): Int = {
val sec = a.toString.split('.')
val fields = sec(0).split(':')
val creationSeconds = fields(0).toInt*3600 + fields(1).toInt*60 + fields(2).toInt
return creationSeconds
}
print(hoursToSeconds("03:51:21.2550000"))
13881
I would need to pass this function to one of the dataframe column(running), which i was trying with the withColumn method, but getting error Type mismatch, expected: column, actual String. Any help would be appreciated, is there a way i can pass the scala function to udf and then use udf in df.withColumn.
df.printSchema
root
|-- vin: string (nullable = true)
|-- BeginOfDay: string (nullable = true)
|-- Timezone: string (nullable = true)
|-- Version: timestamp (nullable = true)
|-- Running: string (nullable = true)
|-- Idling: string (nullable = true)
|-- Stopped: string (nullable = true)
|-- dlLoadDate: string (nullable = false)
sample running column values.
df.withColumn("running", hoursToSeconds(df("Running")

You can create a udf for the hoursToSeconds function by using the following sytax :
val hoursToSecUdf = udf(hoursToSeconds _)
Further to use it on a particular column the following sytax can be used :
df.withColumn("TimeInSeconds",hoursToSecUdf(col("running")))

Related

Change Spark Dataframe Array[String] to Array[Double]

I am very new to scala and I have the following issue.
I have a spark dataframe with the following schema:
df.printSchema()
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: string (containsNull = true)
I need to convert this to the following schema:
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: double (containsNull = true)
I do not want to specify the schema before hand, but instead change the existing one.
I have tried the following
df.withColumn("vector", col("vector").cast("array<element: double>"))
I have also tried converting it into an RDD to use map to change the elements and then turn it back into a dataframe, but I get the following data type Array[WrappedArray] and I am not sure how to handle it.
Using pyspark and numpy, I could do this by df.select("vector").rdd.map(lambda x: numpy.asarray(x)).
Any help would be greatly appreciated.
You're close. Try this code:
val df2 = df.withColumn("vector", col("vector").cast("array<double>"))

How to dynamically infer a schema using SparkSession

I have just started learning Spark. I am aware of the fact that if we set inferSchema option to true, the schema is automatically inferred. I am reading a simple csv file. How do i dynamically infer a schema without specifying any custom schema in my code. The code should be able to build schema for any incoming dataset.
Is it possible to do so?
I tried using readStream and specified my format as csv skipping the inferschema option altogether but it seems i need to provide that option in any case.
val ds1: DataFrame = spark
.readStream
.format("csv")
.load("/home/vaibha/Downloads/C2ImportCalEventSample.csv")
println(ds1.show(2))
You can dynamically infer schema but might get bit tedious in some cases of csv format. More read here. Referring to CSV file in your code sample and assuming it is same as the one here, something like below will give what you need:
scala> val df = spark.read.
| option("header", "true").
| option("inferSchema", "true").
| option("timestampFormat","MM/dd/yyyy").
| csv("D:\\texts\\C2ImportCalEventSample.csv")
df: org.apache.spark.sql.DataFrame = [Start Date : timestamp, Start Time: string ... 15 more fields]
scala> df.printSchema
root
|-- Start Date : timestamp (nullable = true)
|-- Start Time: string (nullable = true)
|-- End Date: timestamp (nullable = true)
|-- End Time: string (nullable = true)
|-- Event Title : string (nullable = true)
|-- All Day Event: string (nullable = true)
|-- No End Time: string (nullable = true)
|-- Event Description: string (nullable = true)
|-- Contact : string (nullable = true)
|-- Contact Email: string (nullable = true)
|-- Contact Phone: string (nullable = true)
|-- Location: string (nullable = true)
|-- Category: integer (nullable = true)
|-- Mandatory: string (nullable = true)
|-- Registration: string (nullable = true)
|-- Maximum: integer (nullable = true)
|-- Last Date To Register: timestamp (nullable = true)

Convert Struct data type to Map data type in Scala

How can I convert a column with the data type of struct to Map or String. This is the schema:
root
|-- Col1: string (nullable = true)
|-- Col2: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: integer (nullable = false)
The second column makes the problem when I want to dump the dataframe into a file. I have tried many different ways such as casting to string but it changed the values in the second column. I also tried to convert the Col2 to a map but i was not successful.
I tried to get the first value in struct(_1) through a udf but it has error:
Failed to execute user defined function($anonfun$1: (struct<_1:string,_2:int>) => string)
Select Col1, Col2._1, Col2._2 from <your table>
By spark.sql, you can try this and save it to another dataframe and then write to CSV.
In Scala we could do in this way:
val df_new = df_old.select($"Col1", $"Col2._1", $"Col3._2")
You can also * notation to expand all the columns from Struct data type.
Schema
root
|-- address: struct (nullable = false)
| |-- street: string (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
Expansion SQL
val df1 = df.select("address.*").show(false)
df1.printSchema
root
|-- street: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)

Spark-xml creating `_VALUE` column which colflicts with other column with _value

I am using Spark to process some datas stored in an XML file.
I successfuly loaded my datas and printed the schema :
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.load(myPath+"/myfile.xml")
df.printSchema
Which give me a result that look like this :
root
|-- _id: string (nullable = true)
|-- _type: string (nullable = true)
|-- creationDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
|-- lastUpdateDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
From this datas, I want to extract only certain fields , which should be easy with a 'select'. So I am doing the folowing request :
df.select("_id","creationDate._value","lastUpdateDate._value")
But I get the error :
org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(_VALUE,StringType,true), StructField(_value,StringType,true);
My problem is that spark sql is not case sensitive and my file contains field _value and _VALUE and I can't change my input file.
Is there a way to solve this probleme with Spark?
Spark-xml creates _VALUE there is no child in a xml tag which cause conflict with other.
You can change default value _VALUE by adding option while reading xml as
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.option("valueTag", "anyName")
.load(myPath+"/myfile.xml")
Hope this helps!

How to convert RDD of JSONs to Dataframe?

I have an RDD that has been created from some JSON, each record in the RDD contains key/value pairs. My RDD looks like:
myRdd.foreach(println)
{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1],
{"sequence":153,"id":8697389197662617,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637852762},1],
{"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1],
{"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1],
{"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1],
{"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}]
I would to convert each record to a row in a spark dataframe, the nested fields in trackingInfo should be there own columns and the type list should be its own column also.
So far I've tired to split it using a case class :
case class Event(
sequence: String,
id: String,
trackingInfo:String,
location:String,
row:String,
trackId: String,
listrequestId: String,
videoId:String,
rank: String,
requestId: String,
`type`:String,
time: String)
val dataframeRdd = myRdd.map(line => line.split(",")).
map(array => Event(
array(0).split(":")(1),
array(1).split(":")(1),
array(2).split(":")(1),
array(3).split(":")(1),
array(4).split(":")(1),
array(5).split(":")(1),
array(6).split(":")(1),
array(7).split(":")(1),
array(8).split(":")(1),
array(9).split(":")(1),
array(10).split(":")(1),
array(11).split(":")(1)
))
However I keep getting java.lang.ArrayIndexOutOfBoundsException: 1 errors.
What is the best way to do this ? As you can see record number 5 has a slight difference in the ordering of some attributes. Is it possible to parse based on attribute names instead of splitting on "," etc.
I'm using Spark 1.6.x
Your json rdd seems to be invalid jsons. You need to convert them to valid jsons as
val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}"))
then you can use the sqlContext to read the valid rdd jsons into a dataframe as
val df = sqlContext.read.json(validJsonRdd)
which should give you dataframe ( i used the invalid json you provided in the question)
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|id |sequence|time |trackingInfo |type |
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|8697344444103393|89 |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389197662617|153 |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389381205360|155 |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697374208897843|136 |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697413135394406|189 |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830] |[Play, Action, Session]|
|8697373887446384|130 |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
and the schema for the dataframe is
root
|-- id: long (nullable = true)
|-- sequence: long (nullable = true)
|-- time: long (nullable = true)
|-- trackingInfo: struct (nullable = true)
| |-- listId: string (nullable = true)
| |-- location: string (nullable = true)
| |-- rank: long (nullable = true)
| |-- requestId: string (nullable = true)
| |-- row: long (nullable = true)
| |-- trackId: long (nullable = true)
| |-- videoId: long (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
I hope the answer is helpful
You can use sqlContext.read.json(myRDD.map(_._2)) to read json into a dataframe