I am receiving time data in a CSV file in the format HHMMSSHS (I am not sure what HS in the format stands for). Example data looks like 15110708.
I am creating a Databricks table with the received columns and data, and I want to convert this field to a time value while processing it in Scala.
I am using UDFs to format data on the fly, but I am stuck writing a UDF that parses only the time.
The final output should be 15:11:07:08 or any time format suitable for this string.
I tried java.text.SimpleDateFormat and got an unparseable-string error.
Is there any way to convert the above string to a time format?
I am storing this value as a column in a Databricks notebook table. Is there any format other than string for saving time-only values?
Have you tried this?
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf

val dtf: DateTimeFormatter = DateTimeFormatter.ofPattern("HHmmssSS")
val localTime = udf { str: String =>
  LocalTime.parse(str, dtf).toString
}
that gives:
+---------+------------+
|Timestamp|converted |
+---------+------------+
|15110708 |15:11:07.080|
|15110708 |15:11:07.080|
+---------+------------+
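For context, a minimal end-to-end sketch of how that UDF could be applied (assuming an active SparkSession named spark, as in a Databricks notebook, and a hypothetical string column named Timestamp matching the output above):
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
import spark.implicits._

// The formatter is built inside the function so the closure stays serializable;
// HHmmssSS = hours, minutes, seconds, then two fraction-of-second digits.
val localTime = udf { str: String =>
  LocalTime.parse(str, DateTimeFormatter.ofPattern("HHmmssSS")).toString
}

// Hypothetical input column named "Timestamp", matching the sample value above.
val df = Seq("15110708", "15110708").toDF("Timestamp")
df.withColumn("converted", localTime($"Timestamp")).show(false)
Note that Spark SQL (and Delta tables) has no time-only column type, so the usual options are keeping the value as a formatted string like this, or storing it as an integer such as milliseconds since midnight and formatting it on read.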
Related
I am currently learning PySpark and I need to convert a column of strings in the format 13/09/2021 20:45 into a timestamp of just the hour, 20:45.
I figured I could do this with q1.withColumn("timestamp", to_timestamp("ts")).show() (where q1 is my dataframe and ts is the column in question) to convert my input, which is in DD/MM/YYYY HH:MM format, but the returned values are all null. I realised that to_timestamp expects an input in PySpark's default timestamp format (MM-dd-yyyy HH:mm:ss.SSSS) to produce a proper timestamp. Hence my question:
How can I convert this column of dd/mm/yyyy hh:mm strings into a format PySpark understands, so that I can convert it to a timestamp?
There are different ways you can do that:
from pyspark.sql import functions as F

# assuming df has a string column A with values like '13/09/2021 20:45'

# use substring (the hour starts at character 12)
df.withColumn('hour', F.substring('A', 12, 5)).show()

# use regex
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()

# use datetime functions
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()
# Output
# +----------------+-----+
# | A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
unix_timestamp may help with your problem.
Just try the approach from this question:
Convert pyspark string to date format
I have a column of timestamp type in a Spark dataframe with values like '2019-06-13T11:39:10.244Z'.
My goal is to convert this column to EST time (subtracting 4 hours) while keeping the same format.
I tried the from_utc_timestamp API, but it seems to convert the UTC time to my local timezone (+5:30), add that to the timestamp, and then subtract 4 hours from it. I also tried Joda-Time, but for some reason it adds 33 days to the EST time.
input = 2019-06-13T11:39:10.244Z
using from_utc_timestamp api:
val tDf = df.withColumn("newTimeCol", to_utc_timestamp(col("timeCol"), "America/New_York"))
output = 2019-06-13T13:09:10.244Z+5:30
using Joda time package:
val coder: (String => String) = (arg: String) => {
  new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-mm-dd'T'HH:mm:s.SS'Z'")
}
val sqlfunc = udf(coder)
val tDf = df.withColumn("newTime", sqlfunc(col("_c20")))
output = 2019-39-13T07:39:10.244Z
desired output = 2019-06-13T07:39:10.244Z
Kindly advise how I should proceed. Thanks in advance.
There is a typo in your format string when creating the output.
Your format string should be yyyy-MM-dd'T'HH:mm:s.SS'Z' but it is yyyy-mm-dd'T'HH:mm:s.SS'Z'.
mm is the format char for minutes, while MM is the format char for months. You can check all format chars here.
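For illustration, a minimal sketch of the question's UDF with the corrected pattern (assuming Joda-Time is on the classpath as in the question; the ss.SSS part is an extra tweak beyond the mm/MM fix so the full seconds and milliseconds come through):
import org.joda.time.{DateTime, DateTimeZone}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// mm -> MM so months (not minutes) are written; ss.SSS keeps full seconds and milliseconds
val coder: String => String = (arg: String) =>
  new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val sqlfunc = udf(coder)

val df = Seq("2019-06-13T11:39:10.244Z").toDF("_c20")   // sample value from the question
df.withColumn("newTime", sqlfunc(col("_c20"))).show(false)
// expected newTime: 2019-06-13T07:39:10.244Z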
I have data in a Hive table in the below format:
2019-11-21 18:19:15.817
I wrote a SQL query as below to get the above column value into epoch format:
val newDF = spark.sql(f"""select TRIM(id) as ID, unix_timestamp(sig_ts) as SIG_TS from table""")
And I am getting the output column SIG_TS as 1574360296, which does not have milliseconds.
How can I get the epoch timestamp of a date with milliseconds?
Simple way: create a UDF, since Spark's built-in function truncates at seconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{udf, unix_timestamp}
import spark.implicits._

val fullTimestampUDF = udf { t: Timestamp => t.getTime }
val df = Seq("2019-11-21 18:19:15.817").toDF("sig_ts")
  .withColumn("sig_ts_ut", unix_timestamp($"sig_ts"))
  .withColumn("sig_ts_ut_long", fullTimestampUDF($"sig_ts"))
df.show(false)
+-----------------------+----------+--------------+
|sig_ts |sig_ts_ut |sig_ts_ut_long|
+-----------------------+----------+--------------+
|2019-11-21 18:19:15.817|1574356755|1574356755817 |
+-----------------------+----------+--------------+
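If you would rather avoid a UDF, here is an alternative sketch: casting the timestamp to a double yields epoch seconds with a fractional part, which can be scaled to milliseconds (newer Spark versions, 3.1+, also ship a built-in unix_millis function). Column names follow the example above.
import org.apache.spark.sql.functions.{col, round}
import spark.implicits._

val df2 = Seq("2019-11-21 18:19:15.817").toDF("sig_ts")
  .withColumn(
    "sig_ts_ms",
    // timestamp -> double gives epoch seconds with a fraction; round after scaling
    // to avoid floating-point truncation when casting back to long
    round(col("sig_ts").cast("timestamp").cast("double") * 1000).cast("long")
  )
df2.show(false)   // sig_ts_ms should be 1574356755817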
Since I am a newbie to Apache Spark and Scala, I want to do the following:
- Read a specific column from a Parquet file (a 13-digit timestamp).
- Convert the timestamp to an ordinary date format (yyyy-MM-dd HH:mm:ss).
- Store the result as another column in the dataset.
I can read the timestamp using the following code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TEST_APP").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlcon = new SQLContext(sc)
    val Testdata = sqlcon.read.parquet("D:\\TestData.parquet")
    val data_eve_type_end = Testdata
      .select(Testdata.col("heading.timestamp"))
      .where(Testdata.col("status").equalTo("Success"))
      .toDF("13digitTime")
  }
}
and I tried to convert the timestamp using the reference link below:
[https://stackoverflow.com/a/54354790/9493078]
But it is not working for me. I don't know whether I have fetched the data into a dataframe correctly or not. Anyway, it produces a table with the column name 13digitTime and 13-digit numbers as values.
When I try the code from the link mentioned above, it shows this error:
WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '(`13digitTime` / 1000000)' due to data type mismatch:
I am expecting a data frame with 2 columns, one containing the 13-digit timestamp and the other containing the converted time in a general date format (yyyy-MM-dd HH:mm:ss).
Any help would be appreciated. Thanks in advance.
sqlcon.read.parquet returns a dataframe itself. All you need to do is add a new column using the withColumn method. This should work:
val data_eve_type_end = Testdata.withColumn("13digitTime", from_unixtime($"heading.timestamp"))
I updated my code like this, converting the 13-digit Unix time to 10 digits by dividing by 1000 and casting it to a timestamp:
val date_conv = data_eve_type_end.select(
  col("timestamp_value").as("UNIX TIME"),
  from_unixtime(col("timestamp_value") / 1000).cast("timestamp").as("GENERAL TIME")
)
and the output is like:
+-------------+-------------------+
|    UNIX TIME|       GENERAL TIME|
+-------------+-------------------+
|1551552902793|  2019-03-0 6:55:02|
+-------------+-------------------+
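Putting the update together, a minimal self-contained sketch (the sample value is taken from the output above; note that from_unixtime renders the result in the session timezone, so the GENERAL TIME shown can differ between environments):
import org.apache.spark.sql.functions.{col, from_unixtime}
import spark.implicits._

val data_eve_type_end = Seq(1551552902793L).toDF("timestamp_value")

val date_conv = data_eve_type_end.select(
  col("timestamp_value").as("UNIX TIME"),
  // divide by 1000 to turn milliseconds into seconds before converting
  from_unixtime(col("timestamp_value") / 1000).cast("timestamp").as("GENERAL TIME")
)
date_conv.show(false)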
I have a data_date that gives a format of yyyymmdd:
beginDate = Some(LocalDate.of(startYearMonthDay(0), startYearMonthDay(1), startYearMonthDay(2)))
var Date = beginDate.get
.......
val data_date = Date.toString().replace("-", "")
This gives me a result of '20180202'.
However, I need the result to be 201802 (yyyymm) for my use case. I don't want to change the value of beginDate, I just want to change the data_date value to fit my use case. How do I do that? Is there a split function I can use?
Thanks!
It's not clear from the code snippet that you're using Spark, but the tags imply that, so I'll give an answer using Spark built-in functions. Suppose your DataFrame is called df with date column my_date_column. Then, you can simply use date_format
scala> import org.apache.spark.sql.functions.date_format
import org.apache.spark.sql.functions.date_format
scala> df.withColumn("my_new_date_column", date_format($"my_date_column", "yyyyMM")).
| select($"my_new_date_column").limit(1).show
// for example:
+------------------+
|my_new_date_column|
+------------------+
| 201808|
+------------------+
The way to do it with DateTimeFormatter:
val formatter = DateTimeFormatter.ofPattern("yyyyMM")
val data_date = Date.format(formatter)
I recommend reading through the DateTimeFormatter docs so you can format the date the way you want.
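For reference, a self-contained sketch of that approach (the concrete start date here is hypothetical):
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val beginDate = Some(LocalDate.of(2018, 2, 2))    // hypothetical start date
val formatter = DateTimeFormatter.ofPattern("yyyyMM")
val data_date = beginDate.get.format(formatter)   // "201802"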
You can do this by only taking the first 6 characters of the resulting string.
i.e.
val s = "20180202"
s.substring(0, 6) // returns "201802"