How to group by an epoch timestamp field in Scala Spark - scala

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
**date Count of records**
02-10-2017 4
04-10-2017 3
03-10-2017 5
Here is the code I tried for the group by:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I am getting the exception below.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.

You would need to change the date column, which seems to be a long, to the date data type. This can be done using the from_unixtime built-in function. Then it's just groupBy and agg calls using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The above answer uses a udf function, which should be avoided as much as possible, since a udf is a black box to the optimizer and requires serialization and deserialization of the columns.
Updated
Thanks to @philantrovert for the suggestion to divide by 1000.
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.
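If you prefer to keep a proper DateType column instead of a formatted string, here is a minimal sketch of a third variant (assuming the same df with the long millisecond column date and a Col1 column):
import org.apache.spark.sql.functions._
// epoch millis -> seconds -> DateType, then count records per day
df.withColumn("date", to_date(from_unixtime(col("date") / 1000)))
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)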

Related

Convert timestamp to day-of-week string with date_format in Spark/Scala

I keep getting an error that I pass too many arguments, and I'm not sure why, as I am following the exact examples from:
command-629173529675356:9: error: too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class Dataset
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
My code:
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Thank you for the help!
dayofweek is the function you're looking for, so something like this:
import org.apache.spark.sql.functions.dayofweek
date_format.withColumn("day_of_week", dayofweek(col("date")))
You get your error because you named your first dataframe date_format, which is the same name as the Spark built-in function you want to use. So when you call date_format, you're referencing your dataframe instead of the date_format built-in function.
To solve this, you should either rename your first dataframe:
val df_1 = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = df_1.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Or ensure that you're calling the right date_format by importing functions and then calling functions.date_format when extracting the day of week:
import org.apache.spark.sql.functions
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", functions.date_format(col("date"), "EEEE"))
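Note that the two functions return different things: dayofweek gives an integer from 1 (Sunday) to 7 (Saturday), while date_format with "EEEE" gives the day name as a string. A small sketch showing both side by side (assuming a df_1 with a date column as above):
import org.apache.spark.sql.functions.{dayofweek, date_format, col}
df_1.withColumn("day_of_week_num", dayofweek(col("date")))            // integer 1..7
    .withColumn("day_of_week_name", date_format(col("date"), "EEEE")) // e.g. "Thursday"
    .show(false)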

Combining two columns, casting to timestamp and selecting from the df causes no error, but casting one column to timestamp and selecting causes an error

Description
When I try to select a column that is cast to unix_timestamp and then to timestamp from a dataframe, there is an AnalysisException. See the link below.
However, when I combine two columns, cast the combination to unix_timestamp and then to the timestamp type, and then select from the df, there is no error.
Disparate Cases
Error:
How to extract year from a date string?
No Error
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder().
appName("myapp").master("local").getOrCreate()
case class Person(id: Int, date: String, time:String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1,"9/16/13", "11:11:11")).toDF()
//solution.show()
//column modification
val datecol: Column = mydf("date")
val timecol: Column = mydf("time")
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
mydf.select(newcol).show()
Results
Expected:
An AnalysisException: can't find unix_timestamp(concat(....)) in mydf
Actual:
+------------------------------------------------------------------+
|CAST(unix_timestamp(concat(date, , time), MM/dd/yy) AS TIMESTAMP)|
+------------------------------------------------------------------+
| 2013-09-16 00:00:...|
These do not seem to be disparate cases. In the erroneous case, you had a new dataframe with changed column names. See below:
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
Here, the select_df dataframe's column name has changed from date to something like cast(unix_timestamp(mydf("date"),"MM/dd/yy")) as Timestamp.
While in the case mentioned above, you are just defining a new column when you say:
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
And then you use this to select from your dataframe, and thus it gives the expected results.
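As an aside, aliasing the derived column would also make the first case work, because year can then resolve a stable column name (a minimal sketch, assuming the same mydf; the ts alias is my own name):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val aliased = mydf.select(unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType).as("ts"))
aliased.select(year(col("ts"))).show()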
Hope this makes things clearer.

How to get the months, years difference between two dates in Spark SQL

I am getting the error:
org.apache.spark.sql.AnalysisException: cannot resolve 'year'
My input data:
1,2012-07-21,2014-04-09
My code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class c (id:Int,start:String,end:String)
val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r=>(c(r(0).toInt,r(1).toString,r(2).toString)))
val c3 = c2.toDF();
c3.registerTempTable("c4")
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
What can I do to resolve the above error?
I have tried the following code, but I got the output in days and I need it in years:
val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")
Please advise me if I can use any function like to_date to get the year difference.
Another simple way is to cast the strings to DateType in Spark SQL and apply the SQL date and time functions on the columns, like the following:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val c4 = c3.select(col("id"),col("start").cast(DateType),col("end").cast(DateType))
c4.withColumn("dateDifference", datediff(col("end"),col("start")))
.withColumn("monthDifference", months_between(col("end"),col("start")))
.withColumn("yearDifference", year(col("end"))-year(col("start")))
.show()
One of the above answers doesn't return the right year when the days between two dates are less than 365. The example below provides the right year and rounds the months and years to 2 decimals.
Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()
Outputs:
+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate| endDate|datesDiff| montsDiff|montsDiff_round| yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24| 23| 0.74193548| 0.74| 0.06182795666666666| 0.06|
|2019-06-24|2019-07-24| 30| 1.0| 1.0| 0.08333333333333333| 0.08|
|2019-08-24|2019-07-24| -31| -1.0| -1.0|-0.08333333333333333| -0.08|
|2018-12-23|2019-07-24| 213| 7.03225806| 7.03| 0.586021505| 0.59|
|2018-07-20|2019-07-24| 369|12.12903226| 12.13| 1.0107526883333333| 1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
You can find a complete working example at the URL below:
https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/
Hope this helps.
Happy Learning !!
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
In the above code, "year" is not a column in the data frame, i.e. it is not a valid column in table "c4". That is why the analysis exception is thrown: the query is invalid because it cannot find the "year" column.
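If you want to stay in SQL, a minimal sketch of an alternative query could look like this (an assumption: months_between and to_date are available in your Spark SQL version, and dividing by 12 approximates the year difference):
val r = sqlContext.sql("select id, round(months_between(to_date(end), to_date(start)) / 12, 2) as yearsDiff from c4")
r.show()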
Use a Spark User Defined Function (UDF); that will be a more robust approach. Since datediff only returns the difference in days, I prefer to use my own UDF.
import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame
def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long = {
  chronoUnit.between(
    Instant.ofEpochMilli(dateA.getTime),
    Instant.ofEpochMilli(dateB.getTime)
  )
}

def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
  val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
  df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}
Then I call it as a dataframe transformation.
df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))
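Applied to the data from the question, a sketch could look like this (assumptions: the c3 dataframe from above, the start and end string columns are first cast to timestamp, and ChronoUnit.DAYS is used because Instant only supports time-based units up to DAYS; for MONTHS or YEARS the UDF would have to work with LocalDate instead):
val withTs = c3
  .withColumn("start", col("start").cast("timestamp"))
  .withColumn("end", col("end").cast("timestamp"))
withTs.transform(withTimeDiff("start", "end", "daysBetween", ChronoUnit.DAYS)).show()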

Spark Scala: How to transform a column in a DF

I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried
val test = myDF.select("my_column").rdd.map(r => getTimestamp(r))
but this returns an RDD and just with the transformed column.
If you really need to use your function, I can suggest two options:
Using map / toDF:
import org.apache.spark.sql.Row
import sqlContext.implicits._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val test = myDF.select("my_column").rdd.map {
  case Row(string_val: String) => (string_val, getTimestamp(string_val))
}.toDF("my_column", "new_column")
Using UDFs (UserDefinedFunction):
import org.apache.spark.sql.functions._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
Alternatively,
If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5:
val test = myDF
.withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm")
.cast("timestamp"))
Note: For Spark 1.5.x, it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:
val test = myDF
.withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") *1000L)
.cast("timestamp"))
Edit: Added udf option
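For completeness, a hypothetical getTimestamp for the first two options could look like the following (a sketch; the "yyyy-MM-dd HH:mm" pattern is an assumption about the input format):
def getTimestamp: (String => java.sql.Timestamp) = s =>
  new java.sql.Timestamp(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm").parse(s).getTime)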

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
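As a usage note (a small sketch; col1 and col2 are just the placeholder names used above), the resulting single-row DataFrame can be read back like this:
val row = df.groupBy().agg(sums.head, sums.tail: _*).first
val sumCol1 = row.getDouble(0) // value of sum_col1
val sumCol2 = row.getDouble(1) // value of sum_col2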
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum on your column:
df.groupBy().sum("steps").show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked but:
df.describe("columnName").show()
gives count, mean, stddev, min, and max stats on a column. I think it returns them for all columns if you just call describe() with no arguments.
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._ // needed for toDF

val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")

val stepsSum = spark.sql("select sum(steps) as stepsSum from steps").first.getAs[Long]("stepsSum")
println("steps sum = " + stepsSum) //prints 28