Spark: computationally efficient way to compare dates? - scala

I have a huge data set that needs to be filtered by date (dates are stored in yyyy-MM-dd format). Which of the following options is the most computationally efficient way to do that, and why?
df.filter("unix_timestamp(dt_column,'yyyy-MM-dd') >= unix_timestamp('2017-02-03','yyyy-MM-dd')")
OR
df.filter("cast(dt_column as date) >= cast('2017-02-03' as date)")

Since dt_column is already in yyyy-MM-dd format, there is no need to cast it or run unix_timestamp on it again. Internally, Spark compares these values lexicographically as strings for all date types (as of Spark 2.1); no date type is involved at the low level when the comparison happens.
cast('2017-02-03' as date) and unix_timestamp('2017-02-03','yyyy-MM-dd') are unlikely to cause a performance issue, because they are constants. I'd recommend using the Dataset functions so that syntax errors are caught at compile time:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DataTypes
// These two should be the same
df.filter(df("dt_column") >= lit("2017-02-03"))
df.filter(df("dt_column") >= lit("2017-02-03").cast(DataTypes.DateType))
cast and unix_timestamp both parse the string, but unix_timestamp (which returns epoch seconds rather than a date) lets you specify other input formats. Apart from that, there shouldn't be any difference in performance.
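The point about lexicographic comparison can be checked outside Spark. The following is a minimal Python sketch, assuming the dates are zero-padded yyyy-MM-dd strings (the sample values are made up for illustration): filtering by plain string comparison selects the same rows as filtering by parsed dates.

```python
from datetime import date

# Made-up sample values in zero-padded yyyy-MM-dd form.
samples = ["2016-12-31", "2017-02-03", "2017-02-04", "2017-11-01"]
cutoff = "2017-02-03"

# Filter by plain string comparison.
by_string = [s for s in samples if s >= cutoff]

# Filter by parsing to real dates first.
by_date = [
    s for s in samples
    if date.fromisoformat(s) >= date.fromisoformat(cutoff)
]

# Both approaches select the same values.
same_result = by_string == by_date
```

This only holds because the format puts year, month and day in that order with fixed widths; a format such as dd-MM-yyyy would not sort correctly as text.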

Related

Optimal use of datetime related objects in python-polars - should I be using pl.date or other alternatives?

I am working with many reasonably large Polars dataframes (1m - 200m records), and many of these are time-indexed using dates.
To clarify, I never care about timezones or the time component of the date, all the data is daily.
So far I've been using the pl.Date() column type as this seems optimal, but now I am second guessing myself.
As of polars 1.6 it's no longer possible to compare pl.Date() columns to date strings (e.g. '2020-01-01').
So I'll be refactoring a fair amount of my code to explicitly use either np.datetime64 objects, or pl.date or pl.datetime objects.
Is there any good reason to prefer one of these over another? They are all fairly comparable in the benchmarking I've done for .filter operations.
You can just compare with datetime.date; pl.Date should be fine.
e.g.
import datetime as dt
import polars as pl
df.filter(pl.col('date') > dt.date(2000, 1, 1))
instead of
df.filter(pl.col('date') > '2000-01-01')

Compare some date predicate based on 1st of the month

I would like to get a DATE in the format from the title (yyyy-mm) from getdate(), in order to use it in a where clause with < and > comparisons against the date I formatted. So far I have found that almost everybody uses convert(varchar(10),getdate(),120), but that's a varchar and it can't be compared with < and >, right? Can someone help me make a date in yyyy-mm format, or is it impossible?
Why can't VARCHAR be compared using > and < operators? So long as you have a character string in an appropriate format, you can compare it just fine. For instance, CONVERT(VARCHAR(10), GETDATE(), 120) returns an ODBC Canonical format date, as "YYYY-MM-DD". This can obviously be compared using > and < to obtain correct results.
However, you would generally not want to do this in a database. Predicates such as this:
WHERE CONVERT(VARCHAR(10), [DateColumn], 120) >= '2018-02-26'
are considered "non-SARGable", and cannot use an index. This means that the server will apply brute force conversions to the underlying columns prior to the comparison, resulting in Index Scans or Table Scans, depending upon your schema.
For the vast majority of situations, you want the column to be used as an operand without any kind of conversion beforehand. Thus, the predicate should be expressed as:
WHERE [DateColumn] >= '2018-02-26'
This will result in an implicit cast of the '2018-02-26' operand into Date or DateTime (whatever the column type is), and this can use an index.
The absolute best would be an explicit cast such as this:
WHERE [DateColumn] >= CAST('2018-02-26' AS DATETIME)
This way, there is no room for mistakes, implicit conversions, or non-SARGable predicates.
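The effect of wrapping the column in a conversion can be observed in a query plan. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in that runs anywhere; SQL Server's plans and functions differ, and the table and index names (orders, ix_orders_date) are hypothetical. It only illustrates the general idea that a bare column can use an index while a wrapped column cannot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and index, used only for this illustration.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT)")
conn.execute("CREATE INDEX ix_orders_date ON orders (order_date)")

def plan(where_clause):
    """Return the query plan text SQLite reports for the given WHERE clause."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM orders WHERE " + where_clause
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)

# Bare column: the index can be used for a range search.
bare_plan = plan("order_date >= '2018-02-26'")

# Column wrapped in a function: every row has to be evaluated.
wrapped_plan = plan("substr(order_date, 1, 10) >= '2018-02-26'")
```

Printing bare_plan and wrapped_plan shows a search on the index for the first predicate and a scan for the second.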
Put simply, you do not want to do this in the way you are asking.
To look for records that match a specific year and month, simply use two where criteria in this manner:
declare @SomeDate date = '20180114'; -- This is any date.
-- This gets the first day of the month of the date above.
declare @MonthStart date = dateadd(month,datediff(month,0,@SomeDate),0);
-- This gets the first day of the following month of the date above.
declare @NextMonthStart date = dateadd(month,datediff(month,0,@SomeDate)+1,0);
select cols
from tables
where DateCol >= @MonthStart
and DateCol < @NextMonthStart;
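The same half-open range (start of month inclusive, start of next month exclusive) can also be computed in application code and passed in as parameters. This is a minimal Python sketch; the helper names month_bounds and in_month are made up for illustration.

```python
from datetime import date

def month_bounds(some_date):
    """Return (first day of the month, first day of the next month)."""
    month_start = some_date.replace(day=1)
    if month_start.month == 12:
        # December rolls over into January of the following year.
        next_month_start = date(month_start.year + 1, 1, 1)
    else:
        next_month_start = month_start.replace(month=month_start.month + 1)
    return month_start, next_month_start

month_start, next_month_start = month_bounds(date(2018, 1, 14))

def in_month(d):
    """Same test as the WHERE clause: start inclusive, end exclusive."""
    return month_start <= d < next_month_start
```

Using an exclusive upper bound avoids having to know the last day of the month and still works if the column carries a time component.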
If you have an existing datetime, you should compare it against another datetime, using BETWEEN for example. Why do you want to do a string comparison?

Operating with datetimes in SQLite

I'm interested in knowing the different possibilities to operate with datetimes in SQLite and understand its pros and cons. I did not find anywhere a detailed explanation of all the alternatives.
So far I have learned that
SQLite doesn't actually have a native storage class for timestamps /
dates. It stores these values as NUMERIC or TEXT typed values
depending on input format. Date manipulation is done using the builtin
date/time functions, which know how to convert inputs from the other
formats.
(quoted from here)
When any operation between datetimes is needed, I have seen two different approaches:
julianday function
SELECT julianday(OneDatetime) - julianday(AnotherDatetime) FROM MyTable;
Number of days is returned, but this can be fractional.
Therefore, you can also get some other measures of time with some extra operations. For instance, to get minutes:
SELECT CAST ((
julianday(OneDatetime) - julianday(AnotherDatetime)
) * 24 * 60 AS INTEGER)
Apparently julianday could cause some problems:
Bear in mind that julianday returns the (fractional) number of 'days',
i.e. 24-hour periods, since noon UTC on the origin date. That's
usually not what you need, unless you happen to live 12 hours west of
Greenwich. E.g. if you live in London, this morning is on the same
julianday as yesterday afternoon.
More information in this post.
strftime function
SELECT strftime('%s', OneDatetime) - strftime('%s', AnotherDatetime) FROM MyTable;
Number of seconds is returned. Similarly, you can also get some other measures of time with some extra operations. For instance, to get minutes:
SELECT (strftime('%s', OneDatetime) - strftime('%s', AnotherDatetime)) / 60 FROM MyTable;
More information here.
My conclusion so far is: julianday seems easier to use, but can cause some problems. strftime seems more verbose, but also safer. Both return the result in a single unit (days, hours, minutes or seconds), not in a combination of several.
Question
1) Is there any other possibility to operate with datetimes?
2) What would be the best way to get directly the difference of two datetimes in time format (or date or datetime), where datetime would be formatted as 'YYYY-mm-dd HH:MM:SS', and the result would be something in the same format?
I would have imagined that something like the following would work, but it does not:
SELECT DATETIME('2016-11-04 08:05:00') - DATETIME('2016-11-04 07:00:00') FROM MyTable;
> 01:05:00
Julian day numbers are perfectly safe when computing differences.
The only problem would be if you tried to convert them into a date by truncating any fractional digits; this would result in noon, not midnight. (The same could happen if you tried to store them in integer variables.) But that is not what you're doing here.
SQLite has no built-in function to compute date/time differences; you have to convert date/time values into some number first. Whether you use (Julian) days or seconds does not really matter from a technical point of view; use whatever is easier in your program.
If you started with a different format, you might want to convert the resulting difference back into that format, e.g.:
time(difference_value, 'unixepoch') -- from seconds to hh:mm:ss
time(0.5 + difference_value) -- from Julian days to hh:mm:ss
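Both conversions can be tried against the values from the question. The following is a small sketch using Python's built-in sqlite3 module with an in-memory database; the variable names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
start, end = "2016-11-04 07:00:00", "2016-11-04 08:05:00"

# Difference in seconds, then formatted as hh:mm:ss.
via_seconds = conn.execute(
    "SELECT time(strftime('%s', ?) - strftime('%s', ?), 'unixepoch')",
    (end, start),
).fetchone()[0]

# Difference in Julian days, then formatted as hh:mm:ss.
via_julian = conn.execute(
    "SELECT time(0.5 + julianday(?) - julianday(?))",
    (end, start),
).fetchone()[0]
```

Because time() returns only a time of day, this formatting is meaningful only for differences shorter than 24 hours.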

Date in calculated field not separating data

What I'm trying to do is list all of the data in a specific table. I tried the expressions shown below, and I still get data from outside the range (2005, etc.).
This was the first one I tried:
=IIF((Fields!APP_CREATE_DATETIME.Value >="{ts '2014-01-01 00:00:00'}") AND (Fields!APP_CREATE_DATETIME.Value > "{ts '2014-01-31 00:00:00'}"), Switch(Fields!DLR_NAME.Value, "JAN"), nothing)
Then this
=IIF((Fields!APP_CREATE_DATETIME.Value >="2014-01-01 00:00:00") AND (Fields!APP_CREATE_DATETIME.Value > "2014-01-31 00:00:00"), Switch(Fields!DLR_NAME.Value, "JAN"), nothing)
The SQL column in the table itself is of type DATETIME.
I'm not sure what you're trying to do there, but that isn't the right syntax for Switch():
=SWITCH(Conditional1, Result1,
Conditional2, Result2,
Conditional3, Result3
)
I think it would be easier to set two filters on the table: one for greater than the start date and the other for less than the end date. That would be much easier to understand and maintain, mostly because this appears to be simple date range filtering, given that the SWITCH() doesn't actually do anything.
I'd recommend using the DateSerial() function in order to generate the date rather than relying on trying to properly convert a string value. Otherwise you'll need to use the CDATE() function to convert the string, but string dates always just feel a little unreliable to me.
=DateSerial(1970,1,1)
or
=Cdate("1970-01-01")

Postgresql Ethiopian Date Format

Is there a way to store a date in a PostgreSQL db using the Ethiopian date format? I'm trying to store 29th or 30th of February but it throws an error, because in the Julian calendar there's no such thing. Any inputs?
I am not sure that I'll tell you something new, but...
Databases are used by programs or through interfaces; I have never seen a database used by end users in a console with psql.
If you are developing an application that must display dates in a specific calendar, you can store the date in PostgreSQL as a TIMESTAMP. All date operations will then work correctly in the database, but you have to implement the conversion between TIMESTAMP and the string representation in your application yourself. If this is the most important thing for your application, it is worth doing.
All queries that must return a date should convert it to DOUBLE PRECISION, e.g.
SELECT EXTRACT(EPOCH FROM timestamp_field)
This returns a DOUBLE PRECISION value that represents the timestamp in numeric form.
All date parameters in queries have to be converted from the numeric representation to TIMESTAMP using the built-in function to_timestamp:
update table_name set
timestamp_field = to_timestamp(1384852375.46666)
The other solution is to write PostgreSQL functions that do this for you directly in queries, but either way you need to handle every input and output of date fields in your queries.
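On the application side, the numeric value can be turned into a date object and back before any calendar-specific formatting is applied. This is a minimal Python sketch, assuming the epoch value is interpreted as UTC; the helper names from_epoch and to_epoch are made up, and the Ethiopian calendar formatting itself is left out.

```python
from datetime import datetime, timezone

def from_epoch(seconds):
    """Epoch seconds (e.g. from EXTRACT(EPOCH FROM ...)) to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def to_epoch(moment):
    """Timezone-aware datetime to epoch seconds, usable with to_timestamp()."""
    return moment.timestamp()

# The value from the update statement above.
moment = from_epoch(1384852375.46666)
round_trip = to_epoch(moment)
```

The calendar-specific string conversion would then operate on moment rather than on the raw number.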