pyspark sql to_timestamp function formatting

I have a dataframe in PySpark that has columns time1 and time2. They both appear as strings, like below:
Time1                       Time2
1990-03-18 22:50:09.693159  2022-04-23 17:30:22-07:00
1990-03-19 22:57:09.433159  2022-04-23 16:11:12-06:00
1990-03-20 22:04:09.437359  2022-04-23 17:56:33-05:00
I am trying to convert these into timestamps (preferably in UTC), using the code below:
Newtime1 = Function.to_timestamp(Function.col('time1'),'yyyy-MM-dd HH:mm:ss.SSSSSS')
Newtime2 = Function.to_timestamp(Function.col('time2'),'yyyy-MM-dd HH:mm:ss Z')
When applying to a dataframe like below:
mydataframe = mydataframe.withColumn('time1',Newtime1)
mydataframe = mydataframe.withColumn('time2',Newtime2)
This yields None in the resulting data. How can I get the desired timestamps?

The format pattern for the timezone offset is a little tricky. Read the docs carefully:
"The count of pattern letters determines the format."
And there is a difference between X, x, and Z.
...
Offset X and x: This formats the offset based on the number of pattern letters. One letter outputs just the hour, such as ‘+01’, unless the minute is non-zero in which case the minute is also output, such as ‘+0130’. Two letters outputs the hour and minute, without a colon, such as ‘+0130’. Three letters outputs the hour and minute, with a colon, such as ‘+01:30’. Four letters outputs the hour and minute and optional second, without a colon, such as ‘+013015’. Five letters outputs the hour and minute and optional second, with a colon, such as ‘+01:30:15’. Six or more letters will fail. Pattern letter ‘X’ (upper case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case) will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset Z: This formats the offset based on the number of pattern letters. One, two or three letters outputs the hour and minute, without a colon, such as ‘+0130’. The output will be ‘+0000’ when the offset is zero. Four letters outputs the full form of localized offset, equivalent to four letters of Offset-O. The output will be the corresponding localized offset text if the offset is zero. Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. Six or more letters will fail.
>>> from pyspark.sql import functions as F
>>>
>>> df = spark.createDataFrame([
... ('1990-03-18 22:50:09.693159', '2022-04-23 17:30:22-07:00'),
... ('1990-03-19 22:57:09.433159', '2022-04-23 16:11:12Z'),
... ('1990-03-20 22:04:09.437359', '2022-04-23 17:56:33+00:00')
... ],
... ('time1', 'time2')
... )
>>>
>>> df2 = (df
... .withColumn('t1', F.to_timestamp(df.time1, 'yyyy-MM-dd HH:mm:ss.SSSSSS'))
... .withColumn('t2_lower_xxx', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssxxx'))
... .withColumn('t2_upper_XXX', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssXXX'))
... .withColumn('t2_ZZZZZ', F.to_timestamp(df.time2, 'yyyy-MM-dd HH:mm:ssZZZZZ'))
... )
>>>
>>> df2.select('time2', 't2_lower_xxx', 't2_upper_XXX', 't2_ZZZZZ', 'time1', 't1').show(truncate=False)
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
|time2 |t2_lower_xxx |t2_upper_XXX |t2_ZZZZZ |time1 |t1 |
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
|2022-04-23 17:30:22-07:00|2022-04-23 19:30:22|2022-04-23 19:30:22|2022-04-23 19:30:22|1990-03-18 22:50:09.693159|1990-03-18 22:50:09.693159|
|2022-04-23 16:11:12Z |null |2022-04-23 11:11:12|2022-04-23 11:11:12|1990-03-19 22:57:09.433159|1990-03-19 22:57:09.433159|
|2022-04-23 17:56:33+00:00|2022-04-23 12:56:33|2022-04-23 12:56:33|2022-04-23 12:56:33|1990-03-20 22:04:09.437359|1990-03-20 22:04:09.437359|
+-------------------------+-------------------+-------------------+-------------------+--------------------------+--------------------------+
>>>

For column 'time2' the pattern should be:
yyyy-MM-dd HH:mm:ssxxx
Tested in PySpark v3.2.3; both columns parse correctly after making the above change.
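Putting it together with the question's code, a minimal sketch (one sample row; setting spark.sql.session.timeZone to UTC is one way to get the parsed values displayed in UTC, as preferred):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.session.timeZone', 'UTC')  # parsed timestamps display in UTC

mydataframe = spark.createDataFrame(
    [('1990-03-18 22:50:09.693159', '2022-04-23 17:30:22-07:00')],
    ['time1', 'time2'])

mydataframe = (mydataframe
    .withColumn('time1', F.to_timestamp('time1', 'yyyy-MM-dd HH:mm:ss.SSSSSS'))
    .withColumn('time2', F.to_timestamp('time2', 'yyyy-MM-dd HH:mm:ssxxx')))
mydataframe.show(truncate=False)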

Related

Parse a specific timestamp in Scala

I have a CSV file which I read in with Scala and Spark. The data contains a time column, which holds time values as strings of the form
val myTimestamp = "2021-05-24 18:44:22.127631600+02:00"
I now need to parse this timestamp. Since I am working with a DataFrame, I want to use .withColumn and to_timestamp.
Sample code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}
val spark:SparkSession = SparkSession.builder().master("local").getOrCreate()
val myTimestamp: String = "2021-05-24 18:44:22.127631600+02:00"
val myFormat: String = "yyyy-MM-dd HH:mm:ss"
import spark.sqlContext.implicits._
Seq(myTimestamp)
.toDF("theTimestampColumn")
.withColumn("parsedTime", to_timestamp(col("theTimestampColumn"),fmt = myFormat))
.show()
Output:
+--------------------+-------------------+
| theTimestampColumn| parsedTime|
+--------------------+-------------------+
|2021-05-24 18:44:...|2021-05-24 18:44:22|
+--------------------+-------------------+
Running this code works fine, but it restricts my timestamps to second precision. I want the full precision with 9 fractional digits. Therefore I read the documentation at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html, but I wasn't able to find the right number of S's (tried 1 to 9) for the fractional seconds, or of X's for the timezone. The parsedTime column of the DataFrame becomes null. How do I parse this timestamp with the tools above?
I have for example also tried
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSZ"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSX"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXXXXX"
with the original timestamp or
val myTimestamp: String = "2021-05-24 18:44:22.127631600"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSS"
but converting still yields a null value.
Update: I just read that fmt is optional. Leaving it out and calling to_timestamp(col("theTimestampColumn")) parses the timestamp with up to 6 fractional digits.
If your zone offset contains a colon, your format pattern should have three X's or x's, depending on whether your data uses 'Z' or '+00:00' for a zero offset; or five X's or Z's to also match optional seconds.
From the documentation:
Offset X and x: [...] Three letters outputs the hour and minute, with a colon, such as ‘+01:30’ [...] Five letters outputs the hour and minute and optional second, with a colon, such as ‘+01:30:15’ [...] Pattern letter ‘X’ (upper case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case) will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset Z: [...] Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. [...]
So probably one of these should work for you:
val formatA = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSxxx"
val formatB = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXX"
Note that the docs also say:
Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.
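As a cross-check, a minimal PySpark sketch of the same parse (the pattern letters are identical in the Scala API; digits beyond microsecond precision are truncated):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('2021-05-24 18:44:22.127631600+02:00',)], ['theTimestampColumn'])
df.withColumn('parsedTime',
              F.to_timestamp('theTimestampColumn', 'yyyy-MM-dd HH:mm:ss.SSSSSSSSSxxx')
              ).show(truncate=False)
# the nanosecond digits beyond microsecond precision are truncated on parse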

Handling decimal values in spark scala

I have data in a file as shown below:
7373743343333444.
7373743343333432.
This data should be converted to decimal values in an 8.7 layout, i.e. 8 digits before the decimal point and 7 digits after.
I am trying to read the data file as below:
val readDataFile = Initialize.spark.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").schema(***SCHEMA*****).load(****DATA FILE PATH******)
I have tried this:
val changed = dataFileWithSchema.withColumn("COLUMN NAME", dataFileWithSchema.col("COLUMN NAME").cast(new DecimalType(38,3)))
println(changed.show(5))
but it only gives me zeros at the end of the number, like this:
7373743343333444.0000
But I want the digits formatted as described above; how can I achieve this?
A simple combination of the regexp_replace, trim and format_number built-in functions should get you what you desire:
import org.apache.spark.sql.functions._
df.withColumn("column", regexp_replace(format_number(trim(regexp_replace(col("column"), "\\.", "")).cast("long")/100000000, 7), ",", ""))
Divide the column by 10^8; this will move the decimal point 8 steps. After that, cast to DecimalType to get the correct number of decimals. Since there are 16 digits to begin with, this means the last one is removed.
import org.apache.spark.sql.types.{DecimalType, DoubleType}
df.withColumn("col", (col("col").cast(DoubleType) / math.pow(10, 8)).cast(DecimalType(38,7)))

subtracting one from another in matlab

My time comes back from a database query as follows:
kdbstrbegtime =
09:15:00
kdbstrendtime =
15:00:00
or rather this is what it looks like in the command window.
I want to create a matrix with the number of rows equal to the number of seconds between the two timestamps. Are there time functions that make this easily possible?
Use datenum to convert both timestamps into serial numbers, and then subtract them to get the amount of seconds:
secs = fix((datenum(kdbstrendtime) - datenum(kdbstrbegtime)) * 86400)
Since the serial date number is measured in days, the result is multiplied by 86400 (the number of seconds in one day). Then you can create a matrix with the number of rows equal to secs, e.g.:
A = zeros(secs, 1)
I chose the number of columns to be 1, but this can be modified, of course.
First you have to convert kdbstrendtime and kdbstrbegtime to character arrays with the datestr command, then:
time = datenum(kdbstrendtime )-datenum(kdbstrbegtime )
t = datestr(time,'HH:MM:SS')
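For comparison, a minimal Python sketch of the same computation (assuming the times arrive as plain strings):
from datetime import datetime

beg = datetime.strptime('09:15:00', '%H:%M:%S')
end = datetime.strptime('15:00:00', '%H:%M:%S')
secs = int((end - beg).total_seconds())  # 20700 seconds, i.e. 5 h 45 min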

Vectorising Date Array Calculations

I simply want to generate a series of dates spaced 1 year apart, starting today.
I tried this
CurveLength=30;
t=zeros(CurveLength);
t(1)=datestr(today);
x=2:CurveLength-1;
t=addtodate(t(1),x,'year');
I am getting two errors so far.
??? In an assignment A(I) = B, the number of elements in B and
which I am guessing is related to the fact that the date is a string; but when I modified the string to be the same length as the date format dd-mmm-yyyy (i.e. 11 characters), I still get the same error.
Lastly I get the error
??? Error using ==> addtodate at 45
Quantity must be a numeric scalar.
which seems to suggest that the function can't be vectorised. If this is true, is there any way to tell in advance which functions can be vectorised and which cannot?
To add n years to a date x, you do this:
y = addtodate(x, n, 'year');
However, addtodate requires the following:
x must be a scalar number, not a string.
n must be a scalar number, not a vector.
Hence the errors you get.
I suggest you use a loop to do this:
CurveLength = 30;
t = zeros(CurveLength, 1);
t(1) = today; % whatever today evaluates to...
for ii = 2:CurveLength
    t(ii) = addtodate(t(1), ii - 1, 'year');
end
Now that you have all your date values, you can convert them to strings with:
datestr(t);
And here's a neat one-liner using arrayfun:
datestr(arrayfun(@(n) addtodate(today, n, 'year'), 0:CurveLength-1))
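For reference, a rough Python equivalent of the same series (a sketch only; note that date.replace raises ValueError for Feb 29 in a non-leap year):
from datetime import date

curve_length = 30
start = date.today()
# one date per year, starting today
t = [start.replace(year=start.year + n) for n in range(curve_length)]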
If your sequence has a constant known start, you can use datenum in the following way:
t = datenum( startYear:endYear, 1, 1)
This also works fine with months, days, hours, etc., as long as the sequence doesn't run into negative numbers (like 1:-1:-10); beyond that, months and days behave in a non-standard way.
Here is a solution without a loop (possibly faster):
CurveLength=30;
t=datevec(repmat(now(),CurveLength,1));
x=[0:CurveLength-1]';
t(:,1)=t(:,1)+x;
t=datestr(t)
datevec splits the date into six columns [year, month, day, hour, min, sec]. So if you want to change e.g. the year you can just add or subtract from it.
If you want to change the month just add to t(:,2). You can even add numbers > 12 to the month and it will increase the year and month correctly if you transfer it back to a datenum or datestr.
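For illustration, here is the same month-overflow normalization expressed as a small Python helper (a hypothetical function, not part of MATLAB):
def add_months(year, month, delta):
    # months past 12 roll over into the year, mirroring datevec -> datenum
    total = (month - 1) + delta
    return year + total // 12, total % 12 + 1

print(add_months(2024, 11, 3))  # (2025, 2): November 2024 plus 3 months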

find mean or median date of event

I have a dataset from which I have extracted the date at which an event occurred. The date is in the format MMDDYY, although MATLAB does not show leading zeros, so it often appears as MDDYY.
Is there a method to find the mean or median (I could use either) date? median works fine when there is an odd number of days, but for an even number I believe it averages the two middle ones, which doesn't produce sensible values. I've been trying to convert the dates to a MATLAB date format with regexp and put them back together, but I haven't gotten it to work. Thanks.
dates=[32381 41081 40581 32381 32981 41081 40981 40581];
You can use datenum to convert dates to a serial date number (1 at 01/01/0000, 2 at 02/01/0000, 367 at 01/01/0001, etc.):
strDate='27112011';
numDate = datenum(strDate,'ddmmyyyy')
Any arithmetic operation can then be performed on these date numbers, like taking a mean or median:
mean(numDates)
median(numDates)
The only problem here is that you don't have your dates as strings, but as numbers. Luckily datenum also accepts numeric input, but you'll have to give the day, month and year separated in a vector:
numDate = datenum([year month day])
or as rows in a matrix if you have multiple timestamps.
So for your specified example data:
dates=[32381 41081 40581 32381 32981 41081 40981 40581];
years = mod(dates,100);
dates = (dates-years)./100;
days = mod(dates,100);
months = (dates-days)./100;
years = years + 1900; % set the years to the 20th century
numDates = datenum([years(:) months(:) days(:)]);
fprintf('The mean date is %s\n', datestr(mean(numDates)));
fprintf('The median date is %s\n', datestr(median(numDates)));
In this example I converted the resulting mean and median back to a readable date format using datestr, which takes the serial date number as input.
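For comparison, a Python sketch of the same computation (reusing the 20th-century assumption from the answer above; median_low avoids averaging the two middle values):
from datetime import date
from statistics import mean, median_low

dates = [32381, 41081, 40581, 32381, 32981, 41081, 40981, 40581]
ordinals = []
for d in dates:
    yy, mmdd = d % 100, d // 100
    dd, mm = mmdd % 100, mmdd // 100
    ordinals.append(date(1900 + yy, mm, dd).toordinal())  # MMDDYY -> ordinal day

print(date.fromordinal(round(mean(ordinals))))  # mean date
print(date.fromordinal(median_low(ordinals)))   # median date (no averaging)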
Try this:
dates = [32381 41081 40581 32381 32981 41081 40981 40581];
d = zeros(1, length(dates));
for i = 1:length(dates)
    d(i) = datenum(num2str(dates(i), '%06d'), 'mmddyy'); % zero-pad to 6 digits; the format is MMDDYY
end
m = mean(d);
m_str = datestr(m, 'dd.mm.yy')
I hope this info is useful. Regards
Store the dates as YYMMDD, rather than as MMDDYY. This has the useful side effect that the numeric order of the dates is also the chronological order.
Here is a sketch of that conversion, written as runnable MATLAB rather than pseudo-code:
newdates = zeros(size(dates));
for i = 1:numel(dates)
    d = dates(i); % MMDDYY as a number
    year = mod(d, 100);
    d = (d - year) / 100; % now MMDD
    day = mod(d, 100);
    month = (d - day) / 100;
    newdates(i) = year * 10000 + month * 100 + day; % YYMMDD
end
Once you have the dates in YYMMDD format, then find the median (numerically), and this is also the median chronologically.
You have seen above how to represent dates as numbers.
I will add a note on the issue of finding the median of the list. The default MATLAB median function averages the two middle values when there is an even number of values.
But you can do it yourself! Try this:
dates; % is your array of dates in numeric form
sdates = sort(dates);
mediandate = sdates(round((length(sdates)+1)/2));