[Spark][DataFrame] Most elegant way to calculate 1st dayofWeek - scala

I have data in a DataFrame with a DateTime column. Now I wish to find the first day of the week, where the first day means MONDAY.
So I have thought of the following ways to achieve this -
With an arithmetic calculation:
import org.apache.spark.sql.functions._
val df1 = Seq((1, "2020-05-12 10:23:45", 5000), (2, "2020-11-11 12:12:12", 2000)).toDF("id", "DateTime", "miliseconds")
val new_df1 = df1.withColumn("week", date_sub(next_day(col("DateTime"), "monday"), 7))
Result -
Creating a UDF which does a similar thing (this is of least priority).
If there are any more methods to achieve this, I would like to see more implementations.
Thanks in advance.

Another option is date_trunc, which truncates a timestamp to the start of its week (Monday):
df1.withColumn("week", date_trunc("week", $"DateTime"))
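For comparison, here is a minimal sketch (assuming a spark-shell style session with spark.implicits._ in scope; column names are only illustrative) that computes the week start with both approaches side by side:
import org.apache.spark.sql.functions._
import spark.implicits._

// sample data mirroring the question's df1
val sample = Seq((1, "2020-05-12 10:23:45", 5000), (2, "2020-11-11 12:12:12", 2000))
  .toDF("id", "DateTime", "miliseconds")

sample
  .withColumn("week_arithmetic", date_sub(next_day(col("DateTime"), "monday"), 7)) // DateType, e.g. 2020-05-11
  .withColumn("week_truncated", date_trunc("week", col("DateTime")))               // TimestampType, e.g. 2020-05-11 00:00:00
  .show(false)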

Related

Subquery vs Dataframe filter function in spark

I am running the Spark SQL below with a subquery.
val df = spark.sql("""select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)""")
df.count()
I also ran the same thing in the DataFrame functional way, like below. Let's assume we read the employee table and department table as DataFrames named empDF and DepDF respectively,
val depidList = DepDF.map(x => x.getString(0)).collect().toList
val empdf2 = empDF.filter(col("dep_id").isin(depidList:_*))
empdf2.count
Of these two scenarios, which one gives better performance and why? Please help me understand these scenarios in Spark Scala.
I can give you the classic answer: it depends :D
Let's take a look at the first case. I prepared a similar example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
val data2 = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
val df1 = data.toDF("name", "dep_id")
val df2 = data2.toDF("name", "dep_id")
df1.createOrReplaceTempView("employeesTableTempview")
df2.createOrReplaceTempView("departmentTableTempview")
val result = spark.sql("select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")
result.count
I am setting autoBroadcastJoinThreshold to -1 because I assume that your datasets are going to be bigger than this parameter's default of 10 MB.
This SQL query generates this plan:
As you can see, Spark is performing an SMJ (sort-merge join), which will be the case most of the time for datasets bigger than 10 MB. This requires the data to be shuffled and then sorted, so it's quite a heavy operation.
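If you want to reproduce this locally, a quick way to inspect the physical plan (a sketch, reusing the result variable from above) is:
result.explain()
// With broadcasting disabled, the printed physical plan should contain a SortMergeJoin
// (left semi) feeding the final count aggregation.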
Now let's check option 2 (the first lines of code are the same as above):
val depidList = df2.map(x => x.getString(1)).collect().toList
val empdf2 = df1.filter(col("dep_id").isin(depidList:_*))
empdf2.count
For this option the plan is different. You don't have the join, obviously, but there are two separate SQL jobs. The first reads the DepDF dataset and collects one column as a list. In the second, this list is used to filter the data in the empDF dataset.
When DepDF is relatively small this should be fine, but if you need a more generic solution you may want to stick to the sub-query, which is going to resolve to a join. You can also use a join directly on your DataFrames with the Spark DataFrame API, as sketched below.
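A minimal sketch of that direct join, reusing the question's empDF/DepDF names (assumptions here, not verified against your schema); a left_semi join keeps only the employee rows whose dep_id exists in DepDF, without adding any department columns:
import org.apache.spark.sql.functions._

// empDF / DepDF are the question's DataFrame names (assumed here)
val empJoined = empDF.join(DepDF.select("dep_id").distinct(), Seq("dep_id"), "left_semi")
empJoined.count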

How to calculate an hourly count (grouped by a timeStamp type) in Spark dataframe?

For a dataframe df1 where col1 is of type DateType, I do the following to get the daily count.
val df1_new=df1.groupBy("col1").count()
However, for my dataframe df2 where col2 is of type TimestampType, I want to get the count on a per-hour basis. But replicating the above code for this results in a separate count for every timestamp that differs by even a second.
What should I be doing to achieve the count on an hourly-basis for df2?
You can use date_trunc to truncate the timestamps to the hour level:
val df2_new = df2.groupBy(date_trunc("hour", col("col2"))).count()
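A slightly fuller sketch (assuming df2 and the TimestampType column col2 from the question), aliasing the truncated hour so the output column has a readable name:
import org.apache.spark.sql.functions._

// group every timestamp into its hour bucket, then count rows per bucket
val df2_new = df2
  .groupBy(date_trunc("hour", col("col2")).alias("hour"))
  .count()
  .orderBy("hour")
df2_new.show(false)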

Spark Scala Convert Int Column into Datetime

I have a datetime stored in the following format - YYYYMMDDHHMMSS (data type - long int).
Sample Data -
This Temp View - ingestionView comes from a DataFrame.
Now I want to introduce a new column newingestiontime in the dataframe which is of the format YYYY-MM-DD HH:MM:SS.
One of the ways I have tried is below, but it didn't work either -
val res = ingestiondatetimeDf.select(col("ingestiontime"), unix_timestamp(col("newingestiontime"), "yyyyMMddHHmmss").cast(TimestampType).as("timestamp"))
Output -
Please help me here, and if there is a better way to achieve this, I will be delighted to learn something new.
Thanks in advance.
Use from_unixtime & unix_timestamp.
Check the code below.
scala> df
  .withColumn(
    "newingestiontime",
    from_unixtime(unix_timestamp($"ingestiontime".cast("string"), "yyyyMMddHHmmss"))
  )
  .show(false)
+--------------+-------------------+
|ingestiontime |newingestiontime   |
+--------------+-------------------+
|20200501230000|2020-05-01 23:00:00|
+--------------+-------------------+
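If you would rather end up with a proper TimestampType column instead of a formatted string, an alternative sketch (Spark 2.2+) uses to_timestamp on the same column:
import org.apache.spark.sql.functions._

// assumes df has a long ingestiontime column with values like 20200501230000
df.withColumn(
    "newingestiontime",
    to_timestamp($"ingestiontime".cast("string"), "yyyyMMddHHmmss")
  )
  .show(false)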

How to get integer value with leading zero in Spark (Scala)

I have a Spark dataframe and am trying to add Year, Month and Day columns to it.
But the problem is that after adding these columns, the Day and Month values do not keep their leading zero.
val cityDF= Seq(("Delhi","India"),("Kolkata","India"),("Mumbai","India"),("Nairobi","Kenya"),("Colombo","Srilanka"),("Tibet","China")).toDF("City","Country")
val dateString = "2020-01-01"
val dateCol = org.apache.spark.sql.functions.to_date(lit(dateString))
val finaldf = cityDF.select($"*", year(dateCol).alias("Year"), month(dateCol).alias("Month"), dayofmonth(dateCol).alias("Day"))
I want to keep the leading zero in the Month and Day columns, but it is giving me the result as 1 instead of 01.
I am using the year, month and day columns for Spark partition creation, so I want to keep the leading zeros intact.
So my question is: how do I keep the leading zero in my dataframe columns?
The integer type can be converted to a string type, where leading zeroes are possible, with the format_string function:
val finaldf = cityDF
  .select(
    $"*",
    year(dateCol).alias("Year"),
    format_string("%02d", month(dateCol)).alias("Month"),
    format_string("%02d", dayofmonth(dateCol)).alias("Day")
  )
Why not simply use date_format for that?
val finaldf = cityDF.select(
  $"*",
  year(dateCol).alias("Year"),
  date_format(dateCol, "MM").alias("Month"),
  date_format(dateCol, "dd").alias("Day")
)
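Since the question mentions partition creation, here is a small sketch of how these string-typed Month/Day columns behave on write (the output path below is hypothetical); the zero-padded values appear directly in the partition directory names, e.g. .../Year=2020/Month=01/Day=01/:
// write partitioned by the string-typed columns; hypothetical output path
finaldf.write
  .partitionBy("Year", "Month", "Day")
  .mode("overwrite")
  .parquet("/tmp/city_partitioned")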

How to group by on epoch timestamp field in Scala Spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date          Count of records
02-10-2017    4
04-10-2017    3
03-10-2017    5
Here is the code I tried for the group by,
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I am getting the below exception.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You would need to change the date column, which seems to be a long, to a date data type. This can be done using the from_unixtime built-in function. Then it's just groupBy and agg function calls, using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The above answer uses a udf function, which should be avoided as much as possible, since a udf is a black box and requires serialization and deserialization of columns.
Updated
Thanks to #philantrovert for his suggestion to divide by 1000
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.
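If you prefer to keep the grouping key as a real DateType instead of a formatted string, here is an alternative sketch (assuming the date column holds epoch milliseconds, as in the sample data):
import org.apache.spark.sql.functions._

// divide by 1000 to get seconds, convert to a timestamp string, then to DateType
df.withColumn("date", to_date(from_unixtime($"date" / 1000)))
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)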