Scala date format - scala

I have a data_date that comes out in yyyyMMdd format:
beginDate = Some(LocalDate.of(startYearMonthDay(0), startYearMonthDay(1), startYearMonthDay(2)))
var Date = beginDate.get
.......
val data_date = Date.toString().replace("-", "")
This will give me a result of '20180202'
However, I need the result to be 201802 (yyyyMM) for my use case. I don't want to change the value of beginDate; I just want to change the data_date value to fit my use case. How do I do that? Is there a split function I can use?
Thanks!

It's not clear from the code snippet that you're using Spark, but the tags imply it, so I'll give an answer using Spark's built-in functions. Suppose your DataFrame is called df with a date column my_date_column. Then you can simply use date_format with the pattern yyyyMM (lowercase yyyy for the year; uppercase YYYY is the week-based year and can be wrong around year boundaries):
scala> import org.apache.spark.sql.functions.date_format
import org.apache.spark.sql.functions.date_format
scala> df.withColumn("my_new_date_column", date_format($"my_date_column", "yyyyMM")).
| select($"my_new_date_column").limit(1).show
// for example:
+------------------+
|my_new_date_column|
+------------------+
| 201808|
+------------------+

You can do it with DateTimeFormatter:
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ofPattern("yyyyMM")
val data_date = Date.format(formatter)
I recommend reading through the DateTimeFormatter docs so you can format dates any way you want.

You can do this by only taking the first 6 characters of the resulting string.
i.e.
val s = "20180202"
s.substring(0, 6) // returns "201802"
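Applied to the question's variables, assuming Date is the java.time.LocalDate from the snippet above, a minimal sketch:
val data_date = Date.toString.replace("-", "").substring(0, 6) // "201802" for 2018-02-02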

Related

convert a string type (MM/dd/YYYY hh:mm:ss AM/PM) to date format in PySpark?

I have a string in the format 05/26/2021 11:31:56 AM and I want to convert it to a date format like 05-26-2021 in PySpark.
I have tried the things below, but while they convert the column type to date, they make the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
Another one I have tried is
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried the below method
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
but in every method, while the column type is converted to date, the values become null.
Any suggestion is appreciated.
# Create data frame like below
df = spark.createDataFrame(
[("Test", "05/26/2021 11:31:56 AM")],
("user_name", "login_date"))
# Import functions
from pyspark.sql import functions as f
# Create a data frame with a new column new_date holding the data in the desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date",'MM/dd/yyyy hh:mm:ss a'),'yyyy-MM-dd').cast('date'))
The above answer posted by #User12345 works, and the method below also works:
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/yyyy hh:mm:ss aa").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
Use this
df = data.withColumn("Date", to_date(to_timestamp("Date", "M/d/yyyy hh:mm:ss a")))

Spark Scala Convert Int Column into Datetime

I have a Datetime stored in the following format - YYYYMMDDHHMMSS. (Data type - Long Int)
Sample data - ingestiontime values such as 20200501230000 (table not reproduced here).
This Temp View - ingestionView comes from a DataFrame.
Now I want to introduce a new column newingestiontime in the dataframe which is of the format YYYY-MM-DD HH:MM:SS.
One of the ways I have tried is below, but it didn't work either -
val res = ingestiondatetimeDf.select(col("ingestiontime"), unix_timestamp(col("newingestiontime"), "yyyyMMddHHmmss").cast(TimestampType).as("timestamp"))
Please help me here, and if there is a better way to accomplish this, I will be delighted to learn something new.
Thanks in advance.
Use from_unixtime & unix_timestamp.
Check the code below.
scala> df
  .withColumn(
    "newingestiontime",
    from_unixtime(
      unix_timestamp($"ingestiontime".cast("string"), "yyyyMMddHHmmss")
    )
  )
  .show(false)
+--------------+-------------------+
|ingestiontime |newingestiontime |
+--------------+-------------------+
|20200501230000|2020-05-01 23:00:00|
+--------------+-------------------+
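If an actual timestamp column is wanted instead of the formatted string (an assumption beyond what the question asked), the same expression can simply be cast; a minimal sketch:
df.withColumn(
  "newingestiontime",
  from_unixtime(unix_timestamp($"ingestiontime".cast("string"), "yyyyMMddHHmmss")).cast("timestamp") // only the trailing cast is new
)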

How to get week start date in scala

I wrote the code below to get the Monday date for the date passed. Basically, I created a UDF that takes a date and returns its Monday date.
def calculate_weekstartUDF = udf((pro_rtc:String)=>{
val df = new SimpleDateFormat("yyyy-MM-dd").parse(pro_rtc)
val cal = Calendar.getInstance()
cal.setTime(df)
cal.set(Calendar.DAY_OF_WEEK, Calendar.MONDAY)
//Get this Monday date
val Period=cal.getTime()
})
Using the above UDF in below code
flattendedJSON.withColumn("weekstartdate",calculate_weekstartUDF($"pro_rtc")).show()
Is there any better way to achieve this?
Try this approach using the date_sub and next_day functions in Spark.
Explanation:
date_sub(
  next_day('dt, "monday"), // get next Monday's date
  7)                       // subtract a week from that date
Example:
val df = Seq("2019-08-06").toDF("dt")
import org.apache.spark.sql.functions._
df.withColumn("week_strt_day",date_sub(next_day('dt,"monday"),7)).show()
Result:
+----------+-------------+
| dt|week_strt_day|
+----------+-------------+
|2019-08-06| 2019-08-05|
+----------+-------------+
You could use the Java 8 date-time API:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.{TemporalField, WeekFields}
import java.util.Locale
def calculate_weekstartUDF = (pro_rtc: String) => {
  val localDate = LocalDate.parse(pro_rtc) // by default parses a string in ISO yyyy-MM-dd format
  val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
  localDate.`with`(dayOfWeekField, 1)
}
Of course, specify something other than Locale.getDefault if you want to use another Locale.
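For instance, to always treat Monday as day 1 regardless of the default Locale, WeekFields.ISO can be used; a minimal sketch:
import java.time.LocalDate
import java.time.temporal.WeekFields

// WeekFields.ISO defines Monday as day-of-week 1, independent of the default Locale
val monday = LocalDate.parse("2019-08-06").`with`(WeekFields.ISO.dayOfWeek(), 1) // 2019-08-05, a Monday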
tl;dr
LocalDate
  .parse( "2019-01-23" )
  .with(
    TemporalAdjusters.previous( DayOfWeek.MONDAY )
  )
  .toString()
2019-01-21
Avoid legacy date-time classes
You are using terrible date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310.
java.time
Your input string is in standard ISO 8601 format. The java.time classes use these standard formats by default when parsing/generating strings. So no need to specify a formatting pattern.
Here is Java-syntax example code. (I don't know Scala)
LocalDate ld = LocalDate.parse( "2019-01-23" ) ;
To move from that date to another, use a TemporalAdjuster. You can find several in the TemporalAdjusters class.
Specify a day-of-week using the DayOfWeek enum, which predefines seven objects, one for each day of the week.
TemporalAdjuster ta = TemporalAdjusters.previous( DayOfWeek.MONDAY ) ;
LocalDate previousMonday = ld.with( ta ) ;
See this code run live at IdeOne.com.
Monday, January 21, 2019
If the starting date happened to be a Monday, and you want to stay with that, use the alternate adjuster, previousOrSame.
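In Scala, the question's language, the same idea with previousOrSame might look like this minimal sketch:
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

// previousOrSame keeps the date itself when it is already a Monday
val monday = LocalDate.parse("2019-01-21").`with`(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY)) // stays 2019-01-21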
Try this:
In my example, 'pro_rtc' is in seconds. Adjust if needed.
import org.apache.spark.sql.functions._
dataFrame
.withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
.withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
That way, you're also utilizing Spark's query engine and avoiding UDF latency.
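If pro_rtc is instead a yyyy-MM-dd string, as the question's UDF suggests, a minimal variant of the same idea (a sketch against an assumed dataFrame with that column):
import org.apache.spark.sql.functions._

dataFrame
  .withColumn("Date", to_date(col("pro_rtc")))
  .withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))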
The spark-daria beginningOfWeek and endOfWeek functions are the easiest way to solve this problem. They're also the most flexible because they can easily be configured for different week end dates.
Suppose you have this dataset:
+----------+
| some_date|
+----------+
|2020-12-27|
|2020-12-28|
|2021-01-03|
|2020-12-12|
| null|
+----------+
Here's how to compute the beginning of the week and the end of the week, assuming the week ends on a Wednesday:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_week", endOfWeek(col("some_date"), "Wed"))
.withColumn("beginning_of_week", beginningOfWeek(col("some_date"), "Wed"))
.show()
Here are the results:
+----------+-----------+-----------------+
| some_date|end_of_week|beginning_of_week|
+----------+-----------+-----------------+
|2020-12-27| 2020-12-30| 2020-12-24|
|2020-12-28| 2020-12-30| 2020-12-24|
|2021-01-03| 2021-01-06| 2020-12-31|
|2020-12-12| 2020-12-16| 2020-12-10|
| null| null| null|
+----------+-----------+-----------------+
See this file for the underlying implementations. This post explains these functions in greater detail.

How to extract a single (column/row) value from a dataframe using PySpark?

Here's my spark code. It works fine and returns 2517. All I want to do is to print "2517 degrees"...but I'm not sure how to extract that 2517 into a variable. I can only display the dataframe but not extract values from it. Sounds super easy but unfortunately I'm stuck! Any help will be appreciated. Thanks!
df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df.createOrReplaceTempView("MyTable")
df = spark.sql("SELECT COUNT (DISTINCT AP) FROM MyTable")
display(df)
Here is an alternative:
df.first()['column name']
It will give you the desired output, and you can store it in a variable.
I think you're looking for collect. Something like this should get you the value:
df.collect()[0]['count(DISTINCT AP)']
assuming the column name is 'count(DISTINCT AP)'
If you want to extract a value at a specific row and column:
df.select('column name').collect()[row number][0]
for example df.select('eye color').collect()[20][0]

How to group by an epoch timestamp field in Scala Spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date          Count of records
02-10-2017    4
04-10-2017    3
03-10-2017    5
Here is the code I tried for the group by:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I am getting the exception below.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You would need to change the date column, which seems to be a long, to the date data type. This can be done with the from_unixtime built-in function; then it's just groupBy and agg calls using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The answer above uses a udf, which should be avoided as much as possible, since a udf is a black box to Spark and requires serialization and deserialization of the column data.
Updated
Thanks to #philantrovert for his suggestion to divide by 1000.
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.
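If the output should match the question's dd-MM-yyyy layout exactly, only the format pattern in the updated answer needs to change; a minimal variation:
import org.apache.spark.sql.functions._

df.withColumn("date", from_unixtime($"date" / 1000, "dd-MM-yyyy")) // same logic, dd-MM-yyyy instead of yyyy-MM-dd
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)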