I have a CSV file which I read in with Scala and Spark. In this data is a time column, which contains time values as strings of the form
val myTimestamp = "2021-05-24 18:44:22.127631600+02:00"
I now need to parse this timestamp. Since I am working with a DataFrame, I want to use the .withColumn and to_timestamp functionality.
Sample code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}
val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
val myTimestamp: String = "2021-05-24 18:44:22.127631600+02:00"
val myFormat: String = "yyyy-MM-dd HH:mm:ss"
import spark.sqlContext.implicits._
Seq(myTimestamp)
.toDF("theTimestampColumn")
.withColumn("parsedTime", to_timestamp(col("theTimestampColumn"),fmt = myFormat))
.show()
Output:
+--------------------+-------------------+
| theTimestampColumn| parsedTime|
+--------------------+-------------------+
|2021-05-24 18:44:...|2021-05-24 18:44:22|
+--------------------+-------------------+
Running this code works fine, but it restricts my timestamps to second precision. I want the whole precision with 9 fractional digits of the second. Therefore I read the documentation under https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html, but I wasn't able to set up the right number of S's (tried with 1 to 9) and X's for specifying the fractions of the second or the timezone, respectively. The parsedTime column of the DataFrame becomes null. How do I parse this timestamp given the tools above?
For example, I have also tried
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSZ"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSX"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXXXXX"
with the original timestamp or
val myTimestamp: String = "2021-05-24 18:44:22.127631600"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSS"
but converting yields a null value.
Update: I just read that fmt is optional. Leaving it out and calling to_timestamp(col("theTimestampColumn")) automatically parses the timestamp with 6 fractional digits (microsecond precision).
If your zone offset has a colon, your format pattern should have three X's or x's, depending on whether your data writes a zero offset as Z or +00:00. Use five X's or Z's to also include optional seconds.
From the documentation:
Offset X and x: [...] Three letters outputs the hour and minute, with a colon, such as ‘+01:30’ [...] Five letters outputs the hour and minute and optional second, with a colon, such as ‘+01:30:15’ [...] Pattern letter ‘X’ (upper case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case) will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset Z: [...] Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. [...]
So probably one of these should work for you:
val formatA = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSxxx"
val formatB = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXX"
Note that the docs also say
Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.
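Putting that together with the sample from the question, a minimal sketch (assuming Spark 3.x and spark.implicits._ in scope, as above):

import org.apache.spark.sql.functions.{col, to_timestamp}

// Nine S's cover the nanosecond digits, which Spark truncates to
// microseconds on parsing; three X's match a colon-separated offset
// such as +02:00 (a zero offset would be written as Z).
val myFormat = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXX"

Seq("2021-05-24 18:44:22.127631600+02:00")
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), myFormat))
  .show(false)

parsedTime should then come back non-null, truncated to microsecond precision and rendered in the session time zone.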
Related
I have a dataframe in PySpark that has columns time1 and time2. They both appear as strings like the below:
Time1                       Time2
1990-03-18 22:50:09.693159  2022-04-23 17:30:22-07:00
1990-03-19 22:57:09.433159  2022-04-23 16:11:12-06:00
1990-03-20 22:04:09.437359  2022-04-23 17:56:33-05:00
I am trying to convert these into timestamps (preferably UTC).
I am trying the below code:
Newtime1 = Function.to_timestamp(Function.col('time1'),'yyyy-MM-dd HH:mm:ss.SSSSSS')
Newtime2 = Function.to_timestamp(Function.col('time2'),'yyyy-MM-dd HH:mm:ss Z')
When applying to a dataframe like below:
mydataframe = mydataframe.withColumn('time1',Newtime1)
mydataframe = mydataframe.withColumn('time2',Newtime2)
This yields 'None' to be displayed in the data. How can I get the desired timestamps?
The format for timezone is a little tricky. Read the docs carefully.
"The count of pattern letters determines the format."
And there is a difference between X vs x vs Z.
...
Offset X and x: This formats the offset based on the number of pattern letters. One letter outputs just the hour, such as ‘+01’, unless the minute is non-zero in which case the minute is also output, such as ‘+0130’. Two letters outputs the hour and minute, without a colon, such as ‘+0130’. Three letters outputs the hour and minute, with a colon, such as ‘+01:30’. Four letters outputs the hour and minute and optional second, without a colon, such as ‘+013015’. Five letters outputs the hour and minute and optional second, with a colon, such as ‘+01:30:15’. Six or more letters will fail. Pattern letter ‘X’ (upper case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case) will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset Z: This formats the offset based on the number of pattern letters. One, two or three letters outputs the hour and minute, without a colon, such as ‘+0130’. The output will be ‘+0000’ when the offset is zero. Four letters outputs the full form of localized offset, equivalent to four letters of Offset-O. The output will be the corresponding localized offset text if the offset is zero. Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. Six or more letters will fail.
>>> from pyspark.sql import functions as F
>>>
>>> df = spark.createDataFrame([
... ('1990-03-18 22:50:09.693159', '2022-04-23 17:30:22-07:00'),
... ('1990-03-19 22:57:09.433159', '2022-04-23 16:11:12Z'),
... ('1990-03-20 22:04:09.437359', '2022-04-23 17:56:33+00:00')
... ],
... ('time1', 'time2')
... )
>>>
>>> df2 = (df
... .withColumn('t1', F.to_timestamp(df.time1, 'yyyy-MM-dd HH:mm:ss.SSSSSS'))
... .withColumn('t2_lower_xxx', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssxxx'))
... .withColumn('t2_upper_XXX', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssXXX'))
... .withColumn('t2_ZZZZZ', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssZZZZZ'))
... )
>>>
>>> df2.select('time2', 't2_lower_xxx', 't2_upper_XXX', 't2_ZZZZZ', 'time1', 't1').show(truncate=False)
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
|time2 |t2_lower_xxx |t2_upper_XXX |t2_ZZZZZ |time1 |t1 |
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
|2022-04-23 17:30:22-07:00|2022-04-23 19:30:22|2022-04-23 19:30:22|2022-04-23 19:30:22|1990-03-18 22:50:09.693159|1990-03-18 22:50:09.693159|
|2022-04-23 16:11:12Z |null |2022-04-23 11:11:12|2022-04-23 11:11:12|1990-03-19 22:57:09.433159|1990-03-19 22:57:09.433159|
|2022-04-23 17:56:33+00:00|2022-04-23 12:56:33|2022-04-23 12:56:33|2022-04-23 12:56:33|1990-03-20 22:04:09.437359|1990-03-20 22:04:09.437359|
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
>>>
For the column 'time2' the pattern should be like below (note: there is no space before the offset, and XXX is needed instead of xxx if any values spell UTC as Z, as the demo shows):
yyyy-MM-dd HH:mm:ssxxx
Tested in PySpark v3.2.3; both columns parse after making the above change.
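As for getting the values in UTC: to_timestamp renders timestamps in Spark's session time zone, so one option (my addition, not part of the answer above) is to set that zone to UTC before parsing. The call is the same from Scala and PySpark:

// Parsed timestamps are then displayed in UTC instead of
// the JVM's default time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")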
Beginner learner here, trying to add an array of integers (which are meant to be seconds) to an array of Epochs:
Sample input:
AddSeconds = [3,4]
TimeEpoch = [1575165652000, 1576424223000] // Which are 2019-12-01 02:00:52 and 2019-12-15 15:37:03
Desired output:
endDate = [2019-12-01 02:00:55, 2019-12-15 15:37:07]
I need to convert the TimeEpoch to dates with "yyyy-MM-dd hh:mm:ss" format
I need to add "AddSeconds" to the obtained dates
Thanks!
You can do this (I changed your variables to start with a lower-case letter, because Groovy assumes that variables starting with an upper-case letter are actually class names, which can cause confusion):
addSeconds = [3,4]
timeEpoch = [1575165652000, 1576424223000]
import java.time.*
import java.time.format.*
def formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss") // HH for 24-hour times such as 15:37:07; hh would be 12-hour
def datesAsStrings = [addSeconds, timeEpoch]
.transpose()
.collect { a, t -> Instant.ofEpochMilli(t).plusSeconds(a).atZone(ZoneId.systemDefault()).toLocalDateTime() }
.collect { d -> d.format(formatter) }
datesAsStrings.each { println it }
That takes your two lists, and joins them together with transpose():
[ [3, 1575165652000], [4, 1576424223000] ]
Then for each of these, we create an instant, add the seconds, and convert it to a LocalDateTime using the current system timezone -- You need to consider timezones 😉
Then we convert them to the String format you wanted, and print each of them out.
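Since the java.time calls are identical from Scala, here is a minimal sketch of the same idea with an explicit zone (UTC is an assumption of mine; the sample epochs happen to match the desired outputs in UTC):

import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

val addSeconds = List(3L, 4L)
val timeEpoch  = List(1575165652000L, 1576424223000L)
val formatter  = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Pair each epoch with the seconds to add, shift the instant,
// and format it in an explicit zone rather than the system default.
val endDate = timeEpoch.zip(addSeconds).map { case (t, s) =>
  Instant.ofEpochMilli(t).plusSeconds(s)
    .atZone(ZoneId.of("UTC"))
    .format(formatter)
}
endDate.foreach(println)  // 2019-12-01 02:00:55 and 2019-12-15 15:37:07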
I have a 6-digit value from which I have to get the date in Scala. For example, if the value is 119003 then the output should be
1 = 20th century
19 = year 2019
003 = January 3
The output should be 2019/01/03.
I have tried to split the value first and then get the date, but I am not sure how to proceed as I am new to Scala.
I think you'll have to do the century calculations manually. After that you can let the java.time library do all the rest.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val in = "119003"
val cent = in.head.asDigit + 19 // leading digit 1 -> century prefix 20
val res = LocalDate.parse(cent.toString + in.tail, DateTimeFormatter.ofPattern("yyyyDDD"))
.format(DateTimeFormatter.ofPattern("yyyy/MM/dd"))
//res: String = 2019/01/03
The Date class of Java 1.0 used 1900-based years, so 119 would mean 2019, for example. This use was deprecated already in Java 1.1 more than 20 years ago, so it’s surprising to see it survive into Scala.
When you say 6 digit value, I take it to be a number (not a string).
The answer by jwvh is correct. My variant would be like (sorry about the Java code, please translate yourself):
int value = 119003;
int year1900based = value / 1000;  // 119
int dayOfYear = value % 1000;      // 3
LocalDate date = LocalDate.ofYearDay(year1900based + 1900, dayOfYear);
System.out.println(date);
2019-01-03
If you’ve got a string, I would slice it into two parts only, 119 and 003 (not three parts as in your comment). Parse each into an int and proceed as above.
If you need 2019/01/03 format in your output, use a DateTimeFormatter for that. Inside your program, do keep the LocalDate, not a String.
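For illustration, a minimal Scala sketch of that two-part slicing (my own, following the advice above):

import java.time.LocalDate

val in = "119003"
// Slice into the 1900-based year ("119") and the day of year ("003").
val year1900Based = in.take(3).toInt
val dayOfYear     = in.drop(3).toInt
val date = LocalDate.ofYearDay(year1900Based + 1900, dayOfYear)
// date.toString == "2019-01-03"; keep the LocalDate and only format on output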
I have data in a file as shown below:
7373743343333444.
7373743343333432.
This data should be converted to decimal values with a precision of 8.7, where 8 is the number of digits before the decimal point and 7 the number of digits after it.
I am trying to read the data file as below:
val readDataFile = Initialize.spark.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").schema(***SCHEMA*****).load(****DATA FILE PATH******)
I have tried this:
val changed = dataFileWithSchema.withColumn("COLUMN NAME", dataFileWithSchema.col("COLUMN NAME").cast(new DecimalType(38,3)))
println(changed.show(5))
but it only gives me zeros at the end of the number, like this:
7373743343333444.0000
But I want the digits formatted as described above, how can I achieve this?
A simple combination of the regexp_replace, trim and format_number built-in functions should get you what you desire:
import org.apache.spark.sql.functions._
df.withColumn("column", regexp_replace(format_number(trim(regexp_replace(col("column"), "\\.", "")).cast("long")/100000000, 7), ",", ""))
Divide the column by 10^8; this will move the decimal point 8 steps. After that, cast to DecimalType to get the correct number of decimals. Since there are 16 digits to begin with, this means the last one is removed.
import org.apache.spark.sql.types.{DecimalType, DoubleType}

df.withColumn("col", (col("col").cast(DoubleType)/math.pow(10,8)).cast(DecimalType(38,7)))
What I am encountering is quite peculiar.
My Code:
val aa = "2017-01-17 01:33:00"
val bb = "04:33"
val hour = bb.substring(0, bb.indexOf(":"))
val mins = bb.substring(bb.indexOf(":") + 1, bb.length())
val negatedmins = "-" + mins
val ecoffsethour = hour.toLong
val ecoffsetmins = negatedmins.toLong
println(aa)
val datetimeformatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val txn_post_date_hkt_date_parsed = LocalDateTime.parse(aa, datetimeformatter)
println(txn_post_date_hkt_date_parsed)
val minushours = txn_post_date_hkt_date_parsed.minusHours(ecoffsethour)
println(minushours)
val minusmins = minushours.minusMinutes(ecoffsetmins)
println(minusmins)
val offsetPostdateDiff = minusmins.toString().replace("T", " ")
println(offsetPostdateDiff)
Output:
2017-01-17 01:33:00
2017-01-17T01:33
2017-01-16T21:33
2017-01-16T22:06
2017-01-16 22:06
In the same code I am changing only the "aa" value to ==> 2017-01-17 01:33:44
Now the output is :
2017-01-17 01:33:44
2017-01-17T01:33:44
2017-01-16T21:33:44
2017-01-16T22:06:44
2017-01-16 22:06:44
Why is the seconds field not taken into consideration in the first case?
My requirement is: the output should come in "yyyy-MM-dd HH:mm:ss" format.
I'm quite new to Scala. Please enlighten me.
Default format is ISO 8601
The java.time classes use the standard ISO 8601 formats by default when parsing/generating strings to represent date-time value.
The standard format for a local date-time is what you are seeing with the T in the middle: YYYY-MM-DDTHH:MM:SS.SSSSSSSSS.
LocalDateTime ldt = LocalDateTime.now( ZoneId.of( "America/Montreal" ) ) ;
String output = ldt.toString() ;
2017-01-23T12:34:56.789
Your call println( txn_post_date_hkt_date_parsed ) is implicitly calling the built-in toString method on the LocalDateTime object, and thereby asking for the standard ISO 8601 format with the T.
println( txn_post_date_hkt_date_parsed.toString() )
Offsets
On an unrelated note, you are working too hard. The java.time classes handle offsets. I do not understand why you want an offset of such an odd number (four hours and thirty-three minutes), but so be it.
Here is your code revised, but in Java syntax.
String input = "2017-01-17 01:33:00" ;
DateTimeFormatter f = DateTimeFormatter.ofPattern( "yyyy-MM-dd HH:mm:ss" ) ;
LocalDateTime ldt = LocalDateTime.parse( input , f ) ;
OffsetDateTime utc = ldt.atOffset( ZoneOffset.UTC ) ;
ZoneOffset offset = ZoneOffset.of( "-04:33" ) ; // Behind UTC by four hours and thirty-three minutes.
OffsetDateTime odt = utc.withOffsetSameInstant( offset ) ;
You can see this code run live at IdeOne.com. Notice how the wall-clock time of your offset-from-UTC is on the previous date. Same moment in history, same point on the timeline, but viewed through two different wall-clock times (UTC, and four hours and thirty three minutes behind).
The Z on the end is standard ISO 8601 notation, short for Zulu and meaning UTC.
input: 2017-01-17 01:33:00
ldt.toString(): 2017-01-17T01:33
utc.toString(): 2017-01-17T01:33Z
odt.toString(): 2017-01-16T21:00-04:33
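To get the "yyyy-MM-dd HH:mm:ss" output the question asks for, reuse a formatter instead of doing string surgery on toString (a small Scala addition of mine):

import java.time.format.DateTimeFormatter

val outputFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
// Formats the shifted OffsetDateTime directly, seconds included,
// e.g. "2017-01-16 21:00:00" for the odt above.
val offsetPostdateDiff = odt.format(outputFormatter)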
It's usually better to explicitly specify the format in which you want the output.
So, instead of
println datetime
You can do something like this:
println datetimeformat.print(datetime)
Good luck!
Edit: Change made to make the 2 expressions exactly equivalent