Parse nanoseconds with polars

Is it possible to parse dates with nanoseconds in polars?
>>> pl.Series(['2020-01-01 00:00:00.123456789']).str.strptime(pl.Datetime)
shape: (1,)
Series: '' [datetime[μs]]
[
2020-01-01 00:00:00.123456
]
Notice how the nanoseconds were stripped off.
The equivalent call in pandas preserves them:
>>> pd.to_datetime(['2020-01-01 00:00:00.123456789'])
DatetimeIndex(['2020-01-01 00:00:00.123456789'], dtype='datetime64[ns]', freq=None)

I would have expected specifying ns as the time_unit to work:
>>> pl.Series(['2020-01-01 00:00:00.123456789']).str.strptime(pl.Datetime(time_unit="ns"))
shape: (1,)
Series: '' [datetime[ns]]
[
2020-01-01 00:00:00.123456
]
Perhaps this is a "bug"?
[UPDATE]: This was fixed in polars 0.15.15
>>> pl.Series(['2020-01-01 00:00:00.123456789']).str.strptime(pl.Datetime("ns"))
shape: (1,)
Series: '' [datetime[ns]]
[
2020-01-01 00:00:00.123456789
]
Explicitly passing a format seems to work:
>>> pl.Series(['2020-01-01 00:00:00.123456789']).str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S%.f")
shape: (1,)
Series: '' [datetime[ns]]
[
2020-01-01 00:00:00.123456789
]
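The same workaround carries over to a DataFrame; a minimal sketch (the raw and ts column names are just illustrative):
import polars as pl

df = pl.DataFrame({"raw": ["2020-01-01 00:00:00.123456789"]})

# Requesting Datetime("ns") and spelling out the fractional-seconds format
# keeps the full nanosecond precision
df = df.with_columns(
    pl.col("raw")
    .str.strptime(pl.Datetime("ns"), "%Y-%m-%d %H:%M:%S%.f")
    .alias("ts")
)
print(df)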

Related

How can I achieve functionality similar to pandas.reindex(new_index, method="ffill") with a datetime column in polars?

In pandas I can add new rows by their index and forward-fill their values without filling any other nulls in the DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={"a": [1.0, 2.0, np.nan, 3.0]}, index=pd.date_range("2020", periods=4, freq="T"))
print(df)
df = df.reindex(index=df.index.union(pd.date_range("2020-01-01 00:01:30", periods=2, freq="T")), method="ffill")
print(df)
Giving output:
                       a
2020-01-01 00:00:00  1.0
2020-01-01 00:01:00  2.0
2020-01-01 00:02:00  NaN
2020-01-01 00:03:00  3.0
                       a
2020-01-01 00:00:00  1.0
2020-01-01 00:01:00  2.0
2020-01-01 00:01:30  2.0
2020-01-01 00:02:00  NaN
2020-01-01 00:02:30  NaN
2020-01-01 00:03:00  3.0
Is it possible to achieve something similar using Polars? I am using Polars mainly because it has better performance for my data so far, so performance matters.
I can think of a concat -> sort -> ffill approach, something like:
let new_index_values = new_index_values.into_series().into_frame();
let new_index_values_len = new_index_values.height();
let mut cols = vec![new_index_values];
let col_names = source.get_column_names();
for col_name in col_names.clone() {
    if col_name != index_column {
        cols.push(
            Series::full_null(
                col_name,
                new_index_values_len,
                source.column(col_name)?.dtype(),
            )
            .into_frame(),
        )
    }
}
let range_frame = hor_concat_df(&cols)?.select(col_names)?;
concat([source.clone().lazy(), range_frame.lazy()], true, true)?
    .sort(
        index_column,
        SortOptions {
            descending: false,
            nulls_last: true,
        },
    )
    .collect()?
    .fill_null(FillNullStrategy::Forward(Some(1)))?
    .unique(Some(&[index_column.into()]), UniqueKeepStrategy::Last)
but this will also fill nulls other than the ones that were added. I need to preserve the nulls in the original data, so that does not work for me.
I'm not familiar with Rust, so here is the Python way to do it (or at least how I would approach it).
Starting with:
from datetime import datetime
import polars as pl

pldf = pl.DataFrame({
    "dt": pl.date_range(datetime(2020, 1, 1), datetime(2020, 1, 1, 0, 3), "1m"),
    "a": [1.0, 2.0, None, 3.0]
})
and then you want to add
new_rows = pl.DataFrame({
    "dt": pl.date_range(datetime(2020, 1, 1, 0, 1, 30), datetime(2020, 1, 1, 0, 2, 30), "1m")
})
All I've done is convert the pandas date_range syntax to the polars one.
To put those together, use a join_asof. Since these frames were constructed with date_range, they're already in order, but if real data is constructed a different way, ensure you sort them first.
new_rows = new_rows.join_asof(pldf, on='dt')
This just gives you the actual new_rows and then you can concat them together to get to your final answer.
pldf = pl.concat([pldf, new_rows]).sort('dt')
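Putting the pieces together, here is a minimal end-to-end sketch (the reindex_ffill helper name is mine, and newer polars versions need eager=True on date_range to get a Series):
from datetime import datetime

import polars as pl

def reindex_ffill(df: pl.DataFrame, new_index: pl.DataFrame, on: str = "dt") -> pl.DataFrame:
    # join_asof fills each new row from the last existing row at or before its timestamp,
    # so nulls that already exist in df are left untouched
    new_rows = new_index.join_asof(df, on=on)
    return pl.concat([df, new_rows]).sort(on)

pldf = pl.DataFrame({
    "dt": pl.date_range(datetime(2020, 1, 1), datetime(2020, 1, 1, 0, 3), "1m", eager=True),
    "a": [1.0, 2.0, None, 3.0],
})
new_rows = pl.DataFrame({
    "dt": pl.date_range(datetime(2020, 1, 1, 0, 1, 30), datetime(2020, 1, 1, 0, 2, 30), "1m", eager=True),
})
print(reindex_ffill(pldf, new_rows))  # 6 rows, matching the pandas result above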

Extract date from pySpark timestamp column (no UTC timezone) in Palantir

I have a timestamp of this type: 2022-11-09T23:19:32.000Z
When I cast it to date, my output is "2022-11-10", but I want "2022-11-09". Is there a way to force UTC+0 (not +1), or to extract the date directly with a regex so I get only the date without considering the timezone?
I have also tried substring('2022-11-09T23:19:32.000Z', 1, 10) and other string-extraction functions, but my output is the same: "2022-11-10".
Example:
Input:
id    start_date
123   2020-04-10T23:55:19.000Z
My code:
df_output = df_input.withColumn('date', F.regexp_extract(F.col('start_date'), '(\\d{4})-(\\d{2})-(\\d{2})', 0))
Wrong output:
id    start_date                 date
123   2020-04-10T23:55:19.000Z  2020-04-11
Desired output (I want to extract the date string from the timestamp without considering the timezone):
id    start_date                 date
123   2020-04-10T23:55:19.000Z  2020-04-10
Can't you use the to_date function? This here works for me:
from datetime import datetime
from pyspark.sql.functions import to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

df = spark.createDataFrame(
    [
        (
            "123",
            datetime.strptime("2020-04-10T23:55:19.000Z", '%Y-%m-%dT%H:%M:%S.%fZ')
        )
    ],
    StructType([
        StructField("id", StringType()),
        StructField("start_date", TimestampType()),
    ]))

# to_date on a TimestampType column needs no format string
df.withColumn("date", to_date("start_date")).show()
Output:
+---+-------------------+----------+
| id| start_date| date|
+---+-------------------+----------+
|123|2020-04-10 23:55:19|2020-04-10|
+---+-------------------+----------+
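If the column actually arrives as the raw ISO string (as in the question), another option is to cut the date out of the string before casting, or to pin the session timezone to UTC so the cast doesn't shift the day. A sketch, reusing the df_input and start_date names from the question:
from pyspark.sql import functions as F

# Option 1: keep the literal date portion of the string, no timezone involved
df_output = df_input.withColumn("date", F.to_date(F.substring("start_date", 1, 10), "yyyy-MM-dd"))

# Option 2: make casts interpret timestamps in UTC instead of the local zone
spark.conf.set("spark.sql.session.timeZone", "UTC")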

How to convert Date to timezone aware datetime in polars

Let's say I have
df = pl.DataFrame({
    "date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Date, "%Y-%m-%d")
})
How do I localize that to a specific timezone and make it a datetime?
I tried:
df.select(pl.col('date').cast(pl.Datetime(time_zone='America/New_York')))
but that gives me:
shape: (2, 1)
┌────────────────────────────────┐
│ date                           │
│ ---                            │
│ datetime[μs, America/New_York] │
╞════════════════════════════════╡
│ 2021-12-31 19:00:00 EST        │
│ 2022-01-01 19:00:00 EST        │
└────────────────────────────────┘
so it looks like it's starting from the presumption that the naïve datetimes are UTC and then applying the conversion. I set os.environ['TZ']='America/New_York' but I got the same result.
I looked through the polars config options in the API guide to see if there's something else to set but couldn't find anything about default timezone.
As of polars 0.16.3, you can do:
df.select(
    pl.col('date').cast(pl.Datetime).dt.replace_time_zone("America/New_York")
)
In previous versions (after 0.14.24), the syntax was
df.select(
    pl.col('date').cast(pl.Datetime).dt.tz_localize("America/New_York")
)
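For reference, a minimal end-to-end sketch of the 0.16.3+ form (the exact table rendering varies by polars version):
import polars as pl

df = pl.DataFrame({
    "date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Date, "%Y-%m-%d")
})

# replace_time_zone keeps the wall-clock time and attaches the zone,
# whereas a plain cast to a zone-aware Datetime converts from UTC
out = df.select(
    pl.col("date").cast(pl.Datetime).dt.replace_time_zone("America/New_York")
)
print(out)  # 2022-01-01 00:00:00 EST, 2022-01-02 00:00:00 EST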

24h clock with from_unixtime

I need to transform a dataframe with a column of Unix timestamps (LongType) into an actual TimestampType column.
According to epochconverter.com:
1646732321 = 8 March 2022 10:38:41 GMT+1
1646768324 = 8 March 2022 20:38:44 GMT+1
However, when I use from_unixtime on the dataframe, I get a 12-hour clock and it basically subtracts 12 hours from my second timestamp for some reason? How can I tell PySpark to use a 24h clock?
The output of the code below is:
+---+----------+-------------------+
|id |mytime |mytime_new |
+---+----------+-------------------+
|ABC|1646732321|2022-03-08 10:38:41|
|DFG|1646768324|2022-03-08 08:38:44|
+---+----------+-------------------+
The second line should be 2022-03-08 20:38:44.
Reproducible code example:
from pyspark.sql.functions import from_unixtime
from pyspark.sql.types import StructType, StructField, StringType, LongType

data = [
    ("ABC", 1646732321),
    ("DFG", 1646768324),
]
schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("mytime", LongType(), True),
    ]
)
df = spark.createDataFrame(data, schema)
df = df.withColumn(
    "mytime_new",
    from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
)
df.show(10, False)
Found my mistake 3 minutes later... the issue was my timestamp-format string for the hour (hh):
Instead of:
from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
I needed:
from_unixtime(df["mytime"], "yyyy-MM-dd HH:mm:ss"),
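In the datetime patterns that Spark uses, hh is the clock-hour of AM/PM (01-12) while HH is the hour of day (00-23), so the corrected column ends up as (sketch):
df = df.withColumn(
    "mytime_new",
    from_unixtime(df["mytime"], "yyyy-MM-dd HH:mm:ss"),
)
df.show(10, False)
# |DFG|1646768324|2022-03-08 20:38:44|  (with a GMT+1 session timezone)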

pyspark substring and aggregation

I am new to Spark and I've got a CSV file with data like this:
date, accidents, injured
2015/20/03 18:00, 15, 5
2015/20/03 18:30, 25, 4
2015/20/03 21:10, 14, 7
2015/20/02 21:00, 15, 6
I would like to aggregate this data by the hour in which it happened. My idea is to substring the date to 'year/month/day hh' with no minutes so I can use it as a key. I want the average of accidents and injured for each hour. Maybe there is a different, smarter way to do this with pyspark?
Thanks guys!
Well, it depends on what you're going to do afterwards, I guess.
The simplest way would be to do as you suggest: substring the date string and then aggregate:
data = [('2015/20/03 18:00', 15, 5),
        ('2015/20/03 18:30', 25, 4),
        ('2015/20/03 21:10', 14, 7),
        ('2015/20/02 21:00', 15, 6)]

df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])

df.withColumn('date_hr',
              df['date'].substr(1, 13)) \
  .groupby('date_hr') \
  .agg({'accidents': 'avg', 'injured': 'avg'}) \
  .show()
If, however, you want to do more computation later on, you can parse the date into a TimestampType() and then extract the date and hour from it.
import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime

parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())

df.withColumn('date_parsed', parseString(col('date'))) \
  .withColumn('date_only', getDate(col('date_parsed'))) \
  .withColumn('hour', getHour(col('date_parsed'))) \
  .groupby('date_only', 'hour') \
  .agg({'accidents': 'avg', 'injured': 'avg'}) \
  .show()
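If you'd rather avoid Python UDFs, the same grouping can be done with native functions, which usually run faster. A sketch (the yyyy/dd/MM HH:mm pattern follows the day-before-month layout in the sample data, and the alias names are just illustrative):
from pyspark.sql import functions as F

df.withColumn('ts', F.to_timestamp('date', 'yyyy/dd/MM HH:mm')) \
  .groupby(F.to_date('ts').alias('date_only'), F.hour('ts').alias('hour')) \
  .agg(F.avg('accidents').alias('avg_accidents'), F.avg('injured').alias('avg_injured')) \
  .show()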