Convert timestamp to string without losing milliseconds in Spark (Scala) - scala

I'm using the following code in order to convert a date/timestamp into a string with a specific format:
when(to_date($"timestamp", fmt).isNotNull, date_format(to_timestamp($"timestamp", fmt), outputFormat))
The "fmt" is coming from a list of possible formats because we have different formats in the source data.
The issue is that when we apply the "to_timestamp" function, the milliseconds part is lost. Is there another (not overly complicated) way to do this without losing the milliseconds?
Thanks,
BR

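I remember having to deal with this a while back. This will work as well (PySpark):
from pyspark.sql.functions import col, substring, unix_timestamp
(
spark
.createDataFrame(['2021-07-19 17:29:36.123',
'2021-07-18 17:29:36.123'], "string").toDF("ts")
# unix_timestamp() only has second precision, so add the last three
# characters (the milliseconds) back as a fraction of a second
.withColumn('ts_with_mili',
(unix_timestamp(col('ts'), "yyyy-MM-dd HH:mm:ss.SSS")
+ substring(col('ts'), -3, 3).cast('float')/1000).cast('timestamp'))
.show(truncate=False)
)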
# +-----------------------+-----------------------+
# |ts |ts_with_mili |
# +-----------------------+-----------------------+
# |2021-07-19 17:29:36.123|2021-07-19 17:29:36.123|
# |2021-07-18 17:29:36.123|2021-07-18 17:29:36.123|
# +-----------------------+-----------------------+
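To check the idea outside Spark, here is a minimal plain-Python sketch (standard library only, not Spark code) of the scenario from the question: try each candidate input format in turn and re-format the first successful parse while keeping the milliseconds. The format list, the output layout, and the helper name reformat_keep_millis are assumptions made up for illustration.

```python
from datetime import datetime

# Hypothetical candidate input formats; %f accepts 1 to 6 fractional digits.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S.%f",
    "%d/%m/%Y %H:%M:%S.%f",
    "%Y-%m-%d %H:%M:%S",
]


def reformat_keep_millis(value, formats=CANDIDATE_FORMATS):
    """Return value re-formatted as 'yyyyMMdd HHmmss.SSS', or None if no format matches."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not match, try the next one
        # microsecond holds 6 digits of precision; integer-divide to get milliseconds
        millis = parsed.microsecond // 1000
        return parsed.strftime("%Y%m%d %H%M%S") + ".%03d" % millis
    return None


print(reformat_keep_millis("2021-07-19 17:29:36.123"))
print(reformat_keep_millis("19/07/2021 17:29:36.5"))
print(reformat_keep_millis("not a timestamp"))
```

Returning None when nothing matches mirrors the isNotNull guard in the question, where a format that fails to parse simply yields no value.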

Related

cast big number in human readable format

I'm working with databricks on a notebook.
I have a column with numbers like this 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them to IntegerType() returns null values, and casting them to double makes them unreadable.
How can I convert them into a human-readable format on which .sort() still works? Do I need to create two separate columns?
To make the column sortable, you could convert your column to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). With this data type you choose the possible value range through the two arguments; your 30-digit values fit, since the maximum precision is 38.
from pyspark.sql import SparkSession, Row, types as T, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(string_column='103503119090884718216391506040'),
Row(string_column='103503119090884718216391506039'),
Row(string_column='103503119090884718216391506041'),
Row(string_column='90'),
])
(
df
.withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30,0)))
.sort('decimal_column')
.show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
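As a rough illustration of why an exact decimal type helps here, the following plain-Python sketch (standard library only, not Spark code) compares sorting the values as text, as exact decimals, and as floats. Python's decimal.Decimal stands in for Spark's DecimalType, and the thousands-separator rendering is just one assumed interpretation of "human readable".

```python
from decimal import Decimal

values = [
    "103503119090884718216391506040",
    "103503119090884718216391506039",
    "90",
]

# Sorting the raw strings compares character by character, so "90" ends up last.
sorted_as_text = sorted(values)

# Sorting by exact decimal value compares numerically, so 90 comes first.
sorted_as_decimal = sorted(values, key=Decimal)

# A float keeps only about 15 to 17 significant digits, so the two large values collide.
collide_as_float = float(values[0]) == float(values[1])

# Thousands separators are one possible "human readable" rendering.
readable = format(Decimal(values[0]), ",")

print(sorted_as_text)
print(sorted_as_decimal)
print(collide_as_float)
print(readable)
```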

PySpark string column to timestamp conversion

I am currently learning pyspark and I need to convert a COLUMN of strings in format 13/09/2021 20:45 into a timestamp of just the hour 20:45.
Now I figured that I can do this with q1.withColumn("timestamp",to_timestamp("ts")) \ .show() (where q1 is my dataframe, and ts is a column we are speaking about) to convert my input into a DD/MM/YYYY HH:MM format, however values returned are only null. I therefore realised that I need an input in PySpark timestamp format (MM-dd-yyyy HH:mm:ss.SSSS) to convert it to a proper timestamp. Hence now my question:
How can I convert the column of strings dd/mm/yyyy hh:mm into a format understandable for pyspark so that I can convert it to timestamp format?
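There are different ways you can do that:
from pyspark.sql import functions as F
# use substring: the time starts at position 12 (1-based) and is 5 characters long
df.withColumn('hour', F.substring('A', 12, 5)).show()
# use regex (raw string, so \d is not treated as a string escape)
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()
# use datetime: parse the full string, then format only the hour and minute
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()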
# Output
# +----------------+-----+
# | A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
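The third variant (parse with an explicit pattern, then format only the part you need) can be tried without a Spark session. Below is a small plain-Python sketch of the same two steps; note that Python uses %-style directives rather than Spark's dd/MM/yyyy letters, and the function name is made up for this example.

```python
from datetime import datetime


def extract_hour_minute(value):
    """Parse a 'dd/MM/yyyy HH:mm' string and return only the 'HH:mm' part."""
    # Day comes first in the input, so the pattern has to say so explicitly.
    parsed = datetime.strptime(value, "%d/%m/%Y %H:%M")
    return parsed.strftime("%H:%M")


print(extract_hour_minute("13/09/2021 20:45"))
```

Unlike Spark, which returns null for a string that does not match the pattern, strptime raises a ValueError.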
unix_timestamp may help with your problem.
Just try this:
Convert pyspark string to date format

Convert time string into timestamp/date time in scala

I am receiving time data in my source as a CSV file in the format (HHMMSSHS). I am not sure what HS in the format stands for. Example data looks like 15110708.
I am creating a table in Databricks with the received columns and data. I want to convert this field to a time while processing in Scala.
I am using a UDF to format data on the go, but I am totally stuck writing a UDF that parses only a time.
The final output should be 15:11:07:08 or any time format suitable for this string.
I tried java.text.SimpleDateFormat and ran into an unparseable string error.
Is there any way to convert the string given above to a time format?
I am storing this value as a column in a Databricks notebook table. Is there any format other than string for saving only time values?
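Have you tried this?
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
// SS = two fraction-of-second digits, i.e. hundredths of a second
val dtf : DateTimeFormatter = DateTimeFormatter.ofPattern("HHmmssSS")
val localTime = udf { str : String =>
LocalTime.parse(str, dtf).toString
}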
that gives:
+---------+------------+
|Timestamp|converted |
+---------+------------+
|15110708 |15:11:07.080|
|15110708 |15:11:07.080|
+---------+------------+
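The same parse can be sanity-checked in plain Python. This sketch assumes, as the pattern above does, that HS means hundredths of a second; the function name is hypothetical.

```python
from datetime import datetime


def parse_hhmmss_hundredths(value):
    """Parse an 8-digit 'HHMMSS' + hundredths-of-a-second string into a datetime.time."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError("expected exactly 8 digits, got %r" % value)
    # %f right-pads the fraction with zeros, so "08" is read as 0.08 s (80 ms)
    return datetime.strptime(value, "%H%M%S%f").time()


t = parse_hhmmss_hundredths("15110708")
print(t.isoformat(timespec="milliseconds"))
```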

Spark Scala Convert Int Column into Datetime

I have Datetime stored in the following format - YYYYMMDDHHMMSS. (Data Type -Long Int)
Sample Data -
This Temp View - ingestionView comes from a DataFrame.
Now I want to introduce a new column newingestiontime in the dataframe which is of the format YYYY-MM-DD HH:MM:SS.
One of the ways I tried is shown below, but it didn't work either -
val res = ingestiondatetimeDf.select(col("ingestiontime"), unix_timestamp(col("newingestiontime"), "yyyyMMddHHmmss").cast(TimestampType).as("timestamp"))
Output -
Please help me here, and if there is a better way to do this, I will be delighted to learn something new.
Thanks in advance.
Use from_unixtime and unix_timestamp.
Check the code below.
scala> df
.withColumn(
"newingestiontime",
from_unixtime(
unix_timestamp($"ingestiontime".cast("string"),
"yyyyMMddHHmmss")
)
)
.show(false)
+--------------+-------------------+
|ingestiontime |newingestiontime |
+--------------+-------------------+
|20200501230000|2020-05-01 23:00:00|
+--------------+-------------------+
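The key step is casting the number to a string before parsing it with the yyyyMMddHHmmss pattern. A minimal plain-Python sketch of the same round trip (standard library only, hypothetical function name) looks like this:

```python
from datetime import datetime


def reformat_ingestion_time(value):
    """Turn an integer such as 20200501230000 into 'yyyy-MM-dd HH:mm:ss' text."""
    # The pattern describes text, so the integer is converted to a string first.
    parsed = datetime.strptime(str(value), "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(reformat_ingestion_time(20200501230000))
```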

How to get week start date in scala

I wrote the code below to get the Monday for a given date. Basically, I created a UDF that takes a date and returns its Monday.
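def calculate_weekstartUDF = udf((pro_rtc:String)=>{
val df = new SimpleDateFormat("yyyy-MM-dd").parse(pro_rtc)
val cal = Calendar.getInstance()
cal.setTime(df)
cal.set(Calendar.DAY_OF_WEEK, Calendar.MONDAY)
//Get this Monday's date and return it as a yyyy-MM-dd string
//(ending on a val definition would make the UDF return Unit)
new SimpleDateFormat("yyyy-MM-dd").format(cal.getTime())
})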
The UDF is used in the code below:
flattendedJSON.withColumn("weekstartdate",calculate_weekstartUDF($"pro_rtc")).show()
Is there a better way to achieve this?
Try this approach using the date_sub and next_day functions in Spark.
Explanation:
date_sub(
next_day('dt,"monday"), //get the next Monday (strictly after the date)
7) //subtract a week from that date
Example:
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the 'dt syntax (assumes a SparkSession named spark)
val df = Seq("2019-08-06").toDF("dt")
df.withColumn("week_strt_day",date_sub(next_day('dt,"monday"),7)).show()
Result:
+----------+-------------+
| dt|week_strt_day|
+----------+-------------+
|2019-08-06| 2019-08-05|
+----------+-------------+
You could use the Java 8 date API:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.{TemporalField, WeekFields}
import java.util.Locale
def calculate_weekstartUDF =
(pro_rtc:String)=>{
val localDate = LocalDate.parse(pro_rtc); // By default parses a string in YYYY-MM-DD format.
val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
localDate.`with`(dayOfWeekField, 1)
}
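Of course, specify something other than Locale.getDefault if you want to use another Locale. Note that the value 1 means the first day of the week for that locale (for example, Sunday for Locale.US), so use WeekFields.ISO if the week must start on Monday.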
tl;dr
LocalDate
.parse( "2019-01-23" )
.with(
TemporalAdjusters.previous( DayOfWeek.MONDAY )
)
.toString()
2019-01-21
Avoid legacy date-time classes
You are using terrible date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310.
java.time
Your input string is in standard ISO 8601 format. The java.time classes use these standard formats by default when parsing/generating strings. So no need to specify a formatting pattern.
Here is Java-syntax example code. (I don't know Scala)
LocalDate ld = LocalDate.parse( "2019-01-23" ) ;
To move from that date to another, use a TemporalAdjuster. You can find several in the TemporalAdjusters class.
Specify a day-of-week using the DayOfWeek enum, predefining seven objects, one for each day of the week.
TemporalAdjuster ta = TemporalAdjusters.previous( DayOfWeek.MONDAY ) ;
LocalDate previousMonday = ld.with( ta ) ;
See this code run live at IdeOne.com.
Monday, January 21, 2019
If the starting date happened to be a Monday, and you want to stay with that, use the alternate adjuster, previousOrSame.
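To make the difference between the two adjusters concrete, here is a small plain-Python sketch of the same date arithmetic. It is an illustration only, not the java.time implementation, and the function names are invented to echo previous and previousOrSame.

```python
from datetime import date, timedelta

MONDAY = 0  # date.weekday() numbers Monday as 0 and Sunday as 6


def previous_or_same_monday(d):
    """Return d itself when it is a Monday, otherwise the closest earlier Monday."""
    return d - timedelta(days=(d.weekday() - MONDAY) % 7)


def previous_monday(d):
    """Return the closest Monday strictly before d."""
    return previous_or_same_monday(d - timedelta(days=1))


wednesday = date(2019, 1, 23)
monday = date(2019, 1, 21)
print(previous_monday(wednesday), previous_or_same_monday(wednesday))
print(previous_monday(monday), previous_or_same_monday(monday))
```

The two functions agree for every day except a Monday, where the strict version jumps back a full week.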
Try this:
In my example, 'pro_rtc' is in seconds. Adjust if needed.
import org.apache.spark.sql.functions._
dataFrame
.withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
.withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
That way, you also use Spark's query engine and avoid the overhead of a UDF.
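One thing worth checking with this expression is how it treats Sundays. Spark's dayofweek numbers Sunday as 1 and Saturday as 7, so dayofweek(Date) - 2 is -1 on a Sunday, and date_sub with a negative count moves forward to the following Monday. The plain-Python sketch below re-creates that arithmetic so the edge case can be seen without a cluster; all function names are made up for this illustration.

```python
from datetime import date, timedelta


def spark_style_dayofweek(d):
    """Number the days the way Spark's dayofweek does: 1 = Sunday ... 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def monday_via_dayofweek(d):
    """Mirror date_sub(Date, dayofweek(Date) - 2) from the snippet above."""
    return d - timedelta(days=spark_style_dayofweek(d) - 2)


def monday_on_or_before(d):
    """Alternative that always returns the Monday on or before d."""
    return d - timedelta(days=d.weekday())


tuesday = date(2019, 8, 6)
sunday = date(2020, 12, 27)
print(monday_via_dayofweek(tuesday), monday_on_or_before(tuesday))
print(monday_via_dayofweek(sunday), monday_on_or_before(sunday))
```

Whether a Sunday should map to the following or the preceding Monday depends on when your week starts, so pick the variant that matches that convention.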
The spark-daria beginningOfWeek and endOfWeek functions are the easiest way to solve this problem. They're also the most flexible because they can easily be configured for different week end dates.
Suppose you have this dataset:
+----------+
| some_date|
+----------+
|2020-12-27|
|2020-12-28|
|2021-01-03|
|2020-12-12|
| null|
+----------+
Here's how to compute the beginning of the week and the end of the week, assuming the week ends on a Wednesday:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_week", endOfWeek(col("some_date"), "Wed"))
.withColumn("beginning_of_week", beginningOfWeek(col("some_date"), "Wed"))
.show()
Here are the results:
+----------+-----------+-----------------+
| some_date|end_of_week|beginning_of_week|
+----------+-----------+-----------------+
|2020-12-27| 2020-12-30| 2020-12-24|
|2020-12-28| 2020-12-30| 2020-12-24|
|2021-01-03| 2021-01-06| 2020-12-31|
|2020-12-12| 2020-12-16| 2020-12-10|
| null| null| null|
+----------+-----------+-----------------+
See this file for the underlying implementations. This post explains these functions in greater detail.
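For intuition, the results in the table can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic, not spark-daria's implementation; in particular, treating a date that already falls on the end day as its own end of week is an assumption the table above does not confirm.

```python
from datetime import date, timedelta

# Index positions match date.weekday(): Monday is 0, Sunday is 6.
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def end_of_week(d, last_day):
    """Return the first date on or after d that falls on last_day (None stays None)."""
    if d is None:
        return None
    days_ahead = (WEEKDAYS.index(last_day) - d.weekday()) % 7
    return d + timedelta(days=days_ahead)


def beginning_of_week(d, last_day):
    """Return the date six days before the end of d's week (None stays None)."""
    end = end_of_week(d, last_day)
    return None if end is None else end - timedelta(days=6)


for d in [date(2020, 12, 27), date(2020, 12, 28), date(2021, 1, 3), date(2020, 12, 12), None]:
    print(d, end_of_week(d, "Wed"), beginning_of_week(d, "Wed"))
```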