Optimal use of datetime-related objects in python-polars: should I be using pl.date or other alternatives?

I am working with many reasonably large polars DataFrames (1M to 200M records), and many of them are time-indexed using dates.
To clarify, I never care about time zones or the time component of the date; all the data is daily.
So far I've been using the pl.Date() column type as this seems optimal, but now I am second-guessing myself.
As of polars 1.6 it's no longer possible to compare pl.Date() columns to date strings (e.g. '2020-01-01'), so I'll be refactoring a fair amount of my code to explicitly use either np.datetime64 objects, or pl.date or pl.datetime objects.
Is there any good reason to prefer one of these over the others? They are all fairly comparable in the benchmarking I've done for .filter operations.

You can just compare with a datetime.date; pl.Date should be fine.
e.g.
import datetime as dt
df.filter(pl.col('date') > dt.date(2000, 1, 1))
instead of
df.filter(pl.col('date') > '2000-01-01')
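A minimal runnable sketch of the same idea (the sample data and the "value" column are made up for illustration):
import datetime as dt
import polars as pl

# Tiny frame with a pl.Date column; polars infers Date from datetime.date values
df = pl.DataFrame({
    "date": [dt.date(1999, 12, 31), dt.date(2000, 6, 1), dt.date(2001, 1, 1)],
    "value": [1.0, 2.0, 3.0],
})

# Comparing a pl.Date column against a datetime.date works directly
print(df.filter(pl.col("date") > dt.date(2000, 1, 1)))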

Related

Webi: How to use double slider input control with dates?

My boss asked me to add a double slider input control for the date information on a report on Webi.
We have several columns and two of them display a date (the start and end date of a procedure), so I need two double sliders, one for each date.
I've been searching for a whole day now and I know that it is not directly possible to use a double slider with dates, because double sliders only work with values (measures), and dates don't have that.
But I managed to create two more variables on the business layer. I used the following SQL function:
CAST(currentDate as Bigint)
These variables display the date as a number (e.g. 1090101 for 01.01.2009; the display format is "dd-MM-yyyy").
And it works great! But it displays the dates as numbers, which is not usable as such: no one will know which date 1090101 is. I could perhaps leave both columns (date as date and date as number) in the report, so people could look up the date they want to filter on and select the right number on the slider. That could be a workaround, but not a clean one, I think.
I tried to change the format of the number variable back to a date format, but then I could not use the slider anymore (even though the underlying variable is a number).
I looked for a way to change the formatting of the values displayed on the slider, but with no luck.
So I'm asking for your help. Does anyone know how I could make this work?
Is there really no solution for such a useful way of filtering data? I mean, filtering data by an interval of dates is surely something people want to do quite often, I assume.
Thank you in advance for your time.
(Version Webi : SAP BusinessObjects BI Platform 4.2 Support Pack 8 Patch 6,
Version: 14.2.8.3671)
You could format your date value as year, month, day and then convert it to a number so the value you are filtering on makes a little more sense. Like this...
=ToNumber(FormatDate([Your Date];"yyyyMMdd"))
It will be better than just an arbitrary number, but certainly not perfect since you will have large chunks of your range for which there never will be any corresponding data (e.g. 20211232 through 20220100).
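The point is simply that yyyyMMdd integers sort in the same order as the dates they encode. A quick Python illustration of that encoding (the sample dates are made up, and this is outside Webi entirely):
import datetime as dt
d1, d2 = dt.date(2009, 1, 1), dt.date(2013, 6, 5)
n1, n2 = int(d1.strftime("%Y%m%d")), int(d2.strftime("%Y%m%d"))
print(n1, n2, n1 < n2)  # 20090101 20130605 True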
What is wrong with just Minimum and Maximum input controls? They are more intuitive and simpler to create. Sometimes what your user or boss asks for is a bad idea and/or just not possible.

Spark: computationally efficient way to compare dates?

I have a huge data set which needs to be filtered by date (dates are stored as yyyy-MM-dd format). Which of the following options is the most computationally efficient way to do that (and why)?
df.filter("unix_timestamp(dt_column,'yyyy-MM-dd') >= unix_timestamp('2017-02-03','yyyy-MM-dd')")
OR
df.filter("cast(dt_column as date) >= cast('2017-02-03' as date)")
Since dt_column is already in yyyy-MM-dd format, there is no need to cast or unix_timestamp it again. Internally, Spark does a lexicographic comparison on strings for all date types (as of Spark 2.1); there is no date type at the low level when the comparison happens.
Now, cast('2017-02-03' as date) and unix_timestamp('2017-02-03','yyyy-MM-dd') shouldn't cause a performance issue, as they are constants. I'd recommend using the Dataset functions to catch syntax issues at compile time:
//These two should be the same
df.filter(df("dt_column") >= lit("2017-02-03"))
df.filter(df("dt_column") >= lit("2017-02-03").cast(DataTypes.DateType))
cast and unix_timestamp both generate dates from strings, but unix_timestamp gives you the option to parse dates in different formats. Apart from that, there shouldn't be any difference in terms of performance.
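A rough PySpark sketch of the two filter styles (the sample rows are made up; this assumes a local Spark installation and is not a benchmark):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2017-01-15",), ("2017-02-10",)], ["dt_column"])

# Lexicographic comparison on the yyyy-MM-dd string vs. an explicit cast to date
df.filter(F.col("dt_column") >= F.lit("2017-02-03")).show()
df.filter(F.col("dt_column") >= F.lit("2017-02-03").cast("date")).show()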

What does the @> operator in postgres do?

I came across a query in postgres here which uses the @> operator on earth objects.
I've searched everywhere, but have come up empty on the meaning of this operator (and likely others like it, e.g. @<, etc.).
> is obvious. I also found that @ will take the absolute value of something. So my best guess is that this does an absolute greater-than comparison of two values?
Is that correct? Is this documented somewhere in the postgres docs? I'm even more curious to understand what the operator does on earth objects.
Thanks!
In general @> is the "contains" operator.
It is defined for several data types.
arrays: http://www.postgresql.org/docs/current/static/functions-array.html
range types: http://www.postgresql.org/docs/current/static/functions-range.html
geometric types: http://www.postgresql.org/docs/current/static/functions-geometry.html
JSON (and JSONB): http://www.postgresql.org/docs/current/static/functions-json.html
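A small sketch of what "contains" means in practice, using Python with psycopg2 (this assumes a reachable PostgreSQL instance; the connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder DSN
cur = conn.cursor()

# Array containment: does {1,2,3} contain {2}?
cur.execute("SELECT ARRAY[1,2,3] @> ARRAY[2]")
print(cur.fetchone()[0])  # True

# Range containment: does [10,20) contain 15?
cur.execute("SELECT int4range(10, 20) @> 15")
print(cur.fetchone()[0])  # True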
According to the PostgreSQL official documentation, interval values can be written using the following verbose syntax:
[@] quantity unit [quantity unit...] [direction]
where quantity is a number (possibly signed); unit is microsecond, millisecond, second, minute, hour, day, week, month, year, decade, century, millennium, or abbreviations or plurals of these units; direction can be ago or empty. The at sign (@) is optional noise. The amounts of the different units are implicitly added with appropriate sign accounting. ago negates all the fields. This syntax is also used for interval output, if IntervalStyle is set to postgres_verbose.

Date in calculated field not separating data

What I'm trying to do is list all of the data in a specific table. I tried the code snippets shown below, and I still get data from outside the range (2005, etc.).
This was the first one I tried
=IIF((Fields!APP_CREATE_DATETIME.Value >="{ts '2014-01-01 00:00:00'}") AND (Fields!APP_CREATE_DATETIME.Value > "{ts '2014-01-31 00:00:00'}"), Switch(Fields!DLR_NAME.Value, "JAN"), nothing)
Then this
=IIF((Fields!APP_CREATE_DATETIME.Value >="2014-01-01 00:00:00") AND (Fields!APP_CREATE_DATETIME.Value > "2014-01-31 00:00:00"), Switch(Fields!DLR_NAME.Value, "JAN"), nothing)
The SQL column in the table itself is of type DATETIME.
I'm not sure what you're trying to do there, but that isn't the right syntax for Switch():
=SWITCH(Conditional1, Result1,
Conditional2, Result2,
Conditional3, Result3
)
I think it would be easier to set two filters on the table: one for greater than the start date and the other for less than the end date. That would be much easier to understand and maintain, mostly because this looks like simple date-range filtering and the SWITCH() doesn't actually do anything.
I'd recommend using the DateSerial() function to generate the date rather than relying on converting a string value properly. Otherwise you'll need the CDATE() function to convert the string, but string dates always feel a little unreliable to me.
=DateSerial(1970,1,1)
or
=Cdate("1970-01-01")

How did this number become this datetime?

We have a vendor application where one field in the database is a number, and somehow the app interface shows it as a date.
I'm trying to figure out how this conversion works.
Here is the data:
the number 15862 produces the date 06/05/2013.
I have no idea how. The vendor told us it is NOT custom conversion logic; a T-SQL function was used, although I can't figure out which one.
I tried using "convert" without success.
I don't think that comes from a T-SQL function, considering it's derived using the Unix epoch. Basically it's the number of days since 1969-12-31.
But you could get it using tsql like so:
select datediff(d,'1969-12-31','2013-06-05')
It looks like it's using a base-date of 1/1/1970 (actually 12/31/1969) and the number represents the number of days after that.
Most probably this is saved as an offset in days since 01/01/1970:
date('m/d/Y', 15862*3600*24) gives 06/06/2013, and
date('m/d/Y', 15862*3600*24 - (3600*24)) gives exactly 06/05/2013.
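The same arithmetic checks out in Python, treating 15862 as the number of days since 1969-12-31 (one day before the Unix epoch):
from datetime import date, timedelta

print(date(1969, 12, 31) + timedelta(days=15862))     # 2013-06-05
print((date(2013, 6, 5) - date(1969, 12, 31)).days)   # 15862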