Good evening.
I am doing some comparative work on the performance of RDDs, DataFrames and Datasets in Spark 2.1.0 (using the built-in Scala 2.11.8). I have downloaded some freely available data from https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households and run the script shown further down on it. To give you a preview, the interrogated data looks as follows:
LCLid,stdorToU,DateTime,KWH/hh (per half hour) ,Acorn,Acorn_grouped
MAC000002,Std,2012-10-12 00:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 01:00:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 01:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 02:00:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 02:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 03:00:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 03:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 04:00:00.0000000, 0 ,ACORN-A,Affluent
For this comparison, I time Spark at different stages of the import and transformation [String, String, Timestamp, Double, String, String] of the 6 variables shown above. I have successfully managed to map the data into a DataFrame and a Dataset, but cannot quite achieve the same with an RDD. Every time I try to convert the file into an RDD, I get the following error:
ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
I am very confused, since the 'DateTime' variable is already expressed in the 'yyyy-mm-dd hh:mm:ss[.fffffffff]' timestamp format. I have read posts such as Convert Date to Timestamp in Scala, How to convert unix timestamp to date in Spark, and Spark SQL: parse timestamp without seconds, but they do not satisfy my needs.
It's even more confusing because the case class 'londonDataSchemaDS' I defined works for my Dataset conversion but not for my RDD one.
This is the script I have used:
import java.io.File
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
val sparkSession = SparkSession.builder.appName("SmartData London").master("local[*]").getOrCreate()
val LCLid = StructField("LCLid", DataTypes.StringType)
val stdorToU = StructField("stdorToU", DataTypes.StringType)
val DateTime = StructField("DateTime", DataTypes.TimestampType)
val KWHhh = StructField("KWH/hh (per half hour) ", DataTypes.DoubleType)
val Acorn = StructField("Acorn", DataTypes.StringType)
val Acorn_grouped = StructField("Acorn_grouped", DataTypes.StringType)
val fields = Array(LCLid,stdorToU,DateTime,KWHhh,Acorn,Acorn_grouped)
val londonDataSchemaDF = StructType(fields)
import sparkSession.implicits._
case class londonDataSchemaDS(LCLid: String, stdorToU: String, DateTime: java.sql.Timestamp, KWHhh: Double, Acorn: String, Acorn_grouped: String)
val t0 = System.nanoTime()
val loadFileRDD=sparkSession.sparkContext.textFile("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv")
.map(_.split(","))
.map(r=>londonDataSchemaDS(r(0), r(1), Timestamp.valueOf(r(2)), r(3).toDouble, r(4), r(5)))
val t1 = System.nanoTime()
val loadFileDF=sparkSession.read.schema(londonDataSchemaDF).option("header", true)
.csv("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv")
val t2=System.nanoTime()
val loadFileDS=sparkSession.read.option("header", "true")
.csv("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv")
.withColumn("DateTime", $"DateTime".cast("timestamp"))
.withColumnRenamed("KWH/hh (per half hour) ", "KWHhh")
.withColumn("KWHhh", $"KWHhh".cast("double"))
.as[londonDataSchemaDS]
val t3 = System.nanoTime()
loadFileRDD.take(10)
loadFileDF.show(10, false)
loadFileDF.printSchema()
loadFileDS.show(10, false)
loadFileDS.printSchema()
println("Time Elapsed to implement RDD: " + (t1 - t0) * 1E-9 + " seconds")
println("Time Elapsed to implement DataFrame: " + (t2 - t1) * 1E-9 + " seconds")
println("Time Elapsed to implement Dataset: " + (t3 - t2) * 1E-9 + " seconds")
Any help on this, or a nudge in the right direction, would be most appreciated.
Many thanks,
Christian
I know what I did wrong. I was so caught up in the DataFrame and Dataset conversions, which have a built-in option to skip the header, that I forgot to remove the header in the RDD conversion.
By adding the lines below, I remove the header and successfully convert my CSV to an RDD (this explains why I was getting the timestamp formatting error: the header row could not be parsed as a timestamp):
val loadFileRDDwH=sparkSession.sparkContext.textFile("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv").map(_.split(","))
val header=loadFileRDDwH.first()
val loadFileRDD=loadFileRDDwH.filter(_(0) != header(0)).map(r=>londonDataSchemaDS(r(0), r(1), Timestamp.valueOf(r(2)), r(3).split("\\s+").mkString.toDouble, r(4), r(5)))
Thanks for reading
Christian
Related
I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date         Count of records
02-10-2017   4
04-10-2017   3
03-10-2017   5
Here is the code I tried for the group by:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I am getting the below exception.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You would need to change the date column, which seems to be a long, to the date data type. This can be done using the from_unixtime built-in function. Then it's just groupBy and agg calls using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The above answer uses a udf function, which should be avoided as much as possible, since a udf is a black box to Spark and requires serialization and deserialization of the columns.
Updated
Thanks to @philantrovert for his suggestion to divide by 1000.
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.
I am getting the error:
org.apache.spark.sql.AnalysisException: cannot resolve 'year'
My input data:
1,2012-07-21,2014-04-09
My code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class c (id:Int,start:String,end:String)
val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r=>(c(r(0).toInt,r(1).toString,r(2).toString)))
val c3 = c2.toDF();
c3.registerTempTable("c4")
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
What can I do to resolve the above error?
I have tried the following code, but I got the output in days and I need it in years:
val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")
Please advise whether I can use any function like to_date to get the year difference.
Another simple way is to cast the string columns to DateType in Spark SQL and apply the SQL date and time functions on them, like the following:
import org.apache.spark.sql.types._
val c4 = c3.select(col("id"),col("start").cast(DateType),col("end").cast(DateType))
c4.withColumn("dateDifference", datediff(col("end"),col("start")))
.withColumn("monthDifference", months_between(col("end"),col("start")))
.withColumn("yearDifference", year(col("end"))-year(col("start")))
.show()
One of the above answers doesn't return the right year when the number of days between the two dates is less than 365. The example below provides the right year and rounds the month and year differences to 2 decimal places.
Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()
Outputs:
+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate| endDate|datesDiff| montsDiff|montsDiff_round| yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24| 23| 0.74193548| 0.74| 0.06182795666666666| 0.06|
|2019-06-24|2019-07-24| 30| 1.0| 1.0| 0.08333333333333333| 0.08|
|2019-08-24|2019-07-24| -31| -1.0| -1.0|-0.08333333333333333| -0.08|
|2018-12-23|2019-07-24| 213| 7.03225806| 7.03| 0.586021505| 0.59|
|2018-07-20|2019-07-24| 369|12.12903226| 12.13| 1.0107526883333333| 1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
You can find a complete working example at the URL below:
https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/
Hope this helps.
Happy Learning !!
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
In the above code, "year" is not a column in the DataFrame, i.e. it is not a valid column in the table "c4". That is why the AnalysisException is thrown: the query is invalid because it cannot find a "year" column.
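For completeness, here is a hedged sketch of one way to rewrite that query (it reuses the sqlContext and the c4 temp table from the question; months_between has been available in Spark SQL since 1.5). The year difference can be derived from months_between, or by comparing calendar years with year():

// Sketch only: Spark SQL has no SQL-Server-style datediff(year, ...) overload,
// so compute the year difference from months_between or year() instead.
val r = sqlContext.sql(
  """select id,
    |       months_between(to_date(end), to_date(start)) / 12 as yearsDiff,
    |       year(to_date(end)) - year(to_date(start)) as calendarYearsDiff
    |from c4""".stripMargin)
r.show()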
Use a Spark user-defined function (UDF); that will be a more robust approach.
Since datediff only returns the difference in days, I prefer to use my own UDF.
import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame
def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long = {
chronoUnit.between(
Instant.ofEpochMilli(dateA.getTime),
Instant.ofEpochMilli(dateB.getTime)
)
}
def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}
Then I call it as a dataframe transformation.
df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))
I'm noodling around with Spark, using union to build up a suitably large test dataset. This works OK:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
df.union(df).union(df).count()
But I'd like to do something like this:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10){
df = df.union(df)
}
That barfs with the error:
<console>:27: error: reassignment to val
df = df.union(df)
^
I know this technique would work in Python, but this is my first time using Scala, so I'm unsure of the syntax.
How can I recursively union a dataframe with itself n times?
If you use val for the dataset, it becomes an immutable reference, which means you can't reassign it. If you change your definition to var df, your code should work.
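For example, here is a minimal sketch of the var-based version (assuming the same spark session and people.json path as in the question):

// var allows the reference to be reassigned inside the loop.
var df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10) {
  df = df.union(df) // each pass doubles the row count, as in the original loop
}
df.count()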
A functional approach without mutable data is:
val df = List(1,2,3,4,5).toDF
val bigDf = ( for (a <- 1 until 10) yield df ) reduce (_ union _)
The for comprehension will create an IndexedSeq of the specified length containing your DataFrame, and the reduce function will union the first DataFrame with the second and continue using the result.
Even shorter without the for loop:
val df = List(1,2,3,4,5).toDF
val bigDf = 1 until 10 map (_ => df) reduce (_ union _)
You could also do this with tail recursion using an arbitrary range:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame

@tailrec
def bigUnion(rng: Range, df: DataFrame): DataFrame = {
  if (rng.isEmpty) df
  else bigUnion(rng.tail, df.union(df)) // each recursive step doubles the rows, like the loop in the question
}
val resultingBigDF = bigUnion(1.to(10), myDataFrame)
Please note this is untested code based on similar things I have done.
I have two dataframes,
val df1 = sqlContext.csvFile("/data/testData.csv")
val df2 = sqlContext.csvFile("/data/someValues.csv")
df1=
startTime name cause1 cause2
15679 CCY 5 7
15683 2 5
15685 1 9
15690 9 6
df2=
cause description causeType
3 Xxxxx cause1
1 xxxxx cause1
3 xxxxx cause2
4 xxxxx
2 Xxxxx
and I want to apply a complex function, getTimeCust, to both cause1 and cause2 to determine a final cause, then match the description of this final cause code in df2. I need a new df (or rdd) with the following columns:
startTime name cause descriptionCause
My attempted solution was:
val rdd2 = df1.map(row => {
val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), df2)
Row (row(0),row(1),cause,descriptionCause)
})
If I run the code above, I get a NullPointerException because df2 is not visible inside the map.
The function getTimeCust(Int, Int, DataFrame) works well outside the map.
Use df1.join(df2, <join condition>) to join your DataFrames together, then select the fields you need from the joined DataFrame.
You can't use Spark's distributed structures (RDD, DataFrame, etc.) in code that runs on an executor (like inside a map).
Try something like this:
def f1(cause1: Int, cause2: Int): Int = ??? // your logic to calculate the cause
import org.apache.spark.sql.functions.udf
val dfCause = df1.withColumn("df1_cause", udf(f1 _)($"cause1", $"cause2"))
val dfJoined = dfCause.join(df2, dfCause("df1_cause") === df2("cause"))
dfJoined.select("cause", "description").show()
Thank you @Assaf. Thanks to your answer and the Spark udf with DataFrame approach, I have resolved this problem. The solution is:
val getTimeCust = udf((cause1: Any, cause2: Any) => {
  var lastCause = 0
  var categoryCause = ""
  var descCause = ""
  lastCause = .............
  categoryCause = ........
  (lastCause, categoryCause)
})
and then call the udf as:
val dfWithCause = df1.withColumn("df1_cause", getTimeCust( $"cause1", $"cause2"))
And finally the join:
val dfFinale = dfWithCause.join(df2, dfWithCause.col("df1_cause._1") === df2.col("cause") and dfWithCause.col("df1_cause._2") === df2.col("causeType"), "outer")
I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried
val test = myDF.select("my_column").rdd.map(r => getTimestamp(r))
but this returns an RDD and just with the transformed column.
If you really need to use your function, I can suggest two options:
Using map / toDF:
import org.apache.spark.sql.Row
import sqlContext.implicits._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val test = myDF.select("my_column").rdd.map {
case Row(string_val: String) => (string_val, getTimestamp(string_val))
}.toDF("my_column", "new_column")
Using UDFs (UserDefinedFunction):
import org.apache.spark.sql.functions._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
Alternatively,
If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5:
val test = myDF
.withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm")
.cast("timestamp"))
Note: for Spark 1.5.x, it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:
val test = myDF
.withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") *1000L)
.cast("timestamp"))
Edit: Added udf option