Using a grouped z-score over a rolling window - python-polars

I would like to calculate a z-score over a bin based on the data of a rolling look-back period.
Example
Today's visitor count during [9:30-9:35) should be z-score normalized based on the (mean, std) of the last 3 days of visitors during [9:30-9:35).
My current attempts both raise InvalidOperationError. Is there a way in polars to calculate this?
import pandas as pd
import polars as pl
def z_score(col: str, over: str, alias: str):
    # calculate z-score normalized `col` over `over`
    return (
        (pl.col(col)-pl.col(col).over(over).mean()) / pl.col(col).over(over).std()
    ).alias(alias)
df = pl.from_dict(
    {
        "timestamp": pd.date_range("2019-12-02 9:30", "2019-12-02 12:30", freq="30s").union(
            pd.date_range("2019-12-03 9:30", "2019-12-03 12:30", freq="30s")
        ),
        "visitors": [(e % 2) + 1 for e in range(722)]
    }
# 5 minute bins for grouping [9:30-9:35) -> 930
).with_column(
    pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M").cast(pl.Int32).alias("five_minute_bin")
).with_column(
    pl.col("timestamp").dt.truncate(every="3d").alias("daytrunc")
)
# normalize visitor amount for each 5 min bin over the rolling 3 day window using z-score.
# not rolling, and it also won't work (InvalidOperationError: window expression not allowed in aggregation)
# df.with_column(
# z_score("visitors", "five_minute_bin", "normalized").over("daytrunc")
# )
# won't work either (InvalidOperationError: window expression not allowed in aggregation)
#df.groupby_rolling(index_column="daytrunc", period="3i").agg(z_score("visitors", "five_minute_bin", "normalized"))
For an example of 4 days of data with four data-points each lying in two time-bins ({0,0} - {0,1}), ({1,0} - {1,1})
Input:
Day 0: x_d0_{0,0}, x_d0_{0,1}, x_d0_{1,0}, x_d0_{1,1}
Day 1: x_d1_{0,0}, x_d1_{0,1}, x_d1_{1,0}, x_d1_{1,1}
Day 2: x_d2_{0,0}, x_d2_{0,1}, x_d2_{1,0}, x_d2_{1,1}
Day 3: x_d3_{0,0}, x_d3_{0,1}, x_d3_{1,0}, x_d3_{1,1}
Output:
Day 0: norm_x_d0_{0,0} = nan, norm_x_d0_{0,1} = nan, norm_x_d0_{1,0} = nan, norm_x_d0_{1,1} = nan
Day 1: norm_x_d1_{0,0} = nan, norm_x_d1_{0,1} = nan, norm_x_d1_{1,0} = nan, norm_x_d1_{1,1} = nan
Day 2: norm_x_d2_{0,0} = nan, norm_x_d2_{0,1} = nan, norm_x_d2_{1,0} = nan, norm_x_d2_{1,1} = nan
Day 3: norm_x_d3_{0,0} = (x_d3_{0,0} - np.mean([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}])) / np.std([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}]), ...

The key here is to use over to restrict your calculations to the five-minute bins and then use the rolling functions to get the rolling mean and standard deviation over days, restricted by those five-minute bin keys. five_minute_bin works as in your code, and I believe that a truncated day_bin is necessary so that, for example, 9:33 on one day will include both 9:31 and 9:34 on the same day as well as 9:31 from 2 days ago.
from datetime import datetime
import polars as pl
days = 5
pl.DataFrame(
    {
        "timestamp": pl.concat(
            [
                pl.date_range(
                    datetime(2019, 12, d, 9, 30), datetime(2019, 12, d, 12, 30), "30s"
                )
                for d in range(2, days + 2)
            ]
        ),
        "visitors": [(e % 2) + 1 for e in range(days * 361)],
    }
).with_columns(
    # five-minute bin of the day and calendar day of each timestamp
    five_minute_bin=pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M"),
    day_bin=pl.col("timestamp").dt.truncate(every="1d"),
).with_columns(
    # rolling 3-day mean and std, computed separately within each five-minute bin
    standardized_visitors=(
        (
            pl.col("visitors")
            - pl.col("visitors").rolling_mean("3d", by="day_bin", closed="right")
        )
        / pl.col("visitors").rolling_std("3d", by="day_bin", closed="right")
    ).over("five_minute_bin")
)
That said, when trying out this code, I found that polars doesn't correctly handle non-unique values in the by-column of the rolling functions, so identical values in the same 5-minute bin don't end up with the same standardized value. I opened a bug report here: https://github.com/pola-rs/polars/issues/6691. For large amounts of real-world data this shouldn't matter much, unless your data systematically differs in distribution within the 5-minute bins.
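As a cross-check for the polars result, the same calculation can be sketched in plain Python. This is only a reference sketch under stated assumptions: the function name `rolling_bin_zscores` is made up for illustration, the window is taken to be the current calendar day plus the two preceding days, the sample standard deviation (ddof=1) is used, and rows with fewer than two pooled values or zero spread get None rather than NaN.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def rolling_bin_zscores(rows, window_days=3):
    """rows: iterable of (timestamp, value). Returns one z-score (or None) per row."""
    rows = list(rows)
    # Group values by (five-minute bin of the day, calendar day).
    by_bin_day = defaultdict(list)
    for ts, value in rows:
        five_minute_bin = (ts.hour, ts.minute - ts.minute % 5)
        by_bin_day[(five_minute_bin, ts.date())].append(value)

    scores = []
    for ts, value in rows:
        five_minute_bin = (ts.hour, ts.minute - ts.minute % 5)
        # Pool the same bin over the current day and the preceding days.
        pooled = []
        for back in range(window_days):
            day = ts.date() - timedelta(days=back)
            pooled.extend(by_bin_day.get((five_minute_bin, day), []))
        if len(pooled) < 2:
            scores.append(None)
            continue
        spread = stdev(pooled)  # sample standard deviation (ddof=1)
        scores.append((value - mean(pooled)) / spread if spread > 0 else None)
    return scores


rows = [
    (datetime(2019, 12, 2, 9, 30), 1.0),
    (datetime(2019, 12, 2, 9, 31), 3.0),
    (datetime(2019, 12, 3, 9, 32), 5.0),
    (datetime(2019, 12, 3, 9, 36), 7.0),  # different bin, no history
]
scores = rolling_bin_zscores(rows)
```

Because every row in a given bin and day is standardized against the same pooled values, identical inputs in the same bin always receive the same score here, which makes it a convenient yardstick for the behaviour described in the bug report.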

Related

In Julia, how do I set DateFormat year for 19 meaning 2019?

I have dates that look like "17-JAN-19", "18-FEB-20". When I attempt to use the Dates package, Date("17-JAN-19", "d-u-yy") reasonably gives me 0019-01-17. I could do Date("17-JAN-19", "d-u-yy") + Year(2000), but that introduces the possibility of new errors (I was going to give the example of leap years, but that generally works, though there is the very rare error Date("29-FEB-00", "d-u-yy")+Year(1900)).
Is there a date format that embeds known information about century?
As mentioned in https://github.com/JuliaLang/julia/issues/30002 there are multiple heuristics for assigning the century to a date. I would recommend being explicit and handling it through a helper function.
using Dates
const NOCENTURYDF = DateFormat("d-u-y")
"""
    parse_date(obj::AbstractString,
               breakpoint::Integer = year(now()) - 2000,
               century::Integer = 20)
Parses a date according to DateFormat("d-u-y") after attaching century information.
If the year portion is greater than `breakpoint` (by default the current two-digit year),
it is assumed to belong to the previous century.
"""
function parse_date(obj::AbstractString,
                    breakpoint::Integer = year(now()) - 2000,
                    century::Integer = 20)
    # breakpoint = year(now()) - 2000
    # century = year(now()) ÷ 100
    @assert 0 ≤ breakpoint ≤ 99
    # lpad (not rpad) so that a year such as "05" stays "05" instead of becoming "50"
    yy = lpad(parse(Int, match(r"\d{2}$", obj).match), 2, '0')
    Date(string(obj[1:7],
                century - (parse(Int, yy) > breakpoint),
                yy),
         NOCENTURYDF)
end
parse_date("17-JAN-19")
parse_date("29-FEB-00")
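For comparison, the same breakpoint heuristic can be sketched in Python. This is an illustration only: the name `parse_two_digit_year`, the `pivot` parameter (playing the role of `breakpoint` above), and the hard-coded English month table are assumptions made for the example.

```python
from datetime import date

# Month abbreviations as they appear in the input, e.g. "17-JAN-19".
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_two_digit_year(text, pivot, century=20):
    """Parse 'DD-MON-YY'; two-digit years above `pivot` go to the previous century."""
    if not 0 <= pivot <= 99:
        raise ValueError("pivot must be in 0..99")
    day, month, yy = text.split("-")
    year = int(yy)
    # (year > pivot) is a bool, which subtracts as 0 or 1.
    full_year = (century - (year > pivot)) * 100 + year
    return date(full_year, MONTHS[month.upper()], int(day))


parsed = parse_two_digit_year("17-JAN-19", pivot=25)
```

Attaching the century before constructing the date means a leap day such as "29-FEB-00" is validated against the intended year (2000) rather than against year 0 or 1900.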

Python: add zeroes in single digit numbers without using .zfill

I'm currently using MicroPython, and it does not have the .zfill method.
What I'm trying to get is the UTC time as YYMMDDhhmmss.
The time that it gives me is, for example,
t = (2019, 10, 11, 3, 40, 8, 686538, None)
I'm able to access the fields that I need by using t[:6]. The problem is with the single-digit numbers, the 3 and the 8. I was able to get it to show 1910113408, but I need 191011034008, with zeroes in front of those two digits. I used
t = "".join(map(str,t))
t = t[2:]
So my idea was to iterate over t and check whether each number is less than 10. If it is, I add a zero in front of it, replacing the number. This is what I came up with.
t = (2019, 1, 1, 2, 40, 0)
t = list(t)
for i in t:
    if t[i] < 10:
        t[i] = 0+t[i]
        t[i] = t[i]
print(t)
However, this gives me IndexError: list index out of range
Please help, I'm pretty new to coding/python.
When you use
for i in t:
i is not an index; it is each item in turn.
>>> for i in t:
...     print(i)
...
2019
10
11
3
40
8
686538
None
If you want to use the index, do the following:
>>> for i, v in enumerate(t):
...     print("{} is {}".format(i,v))
...
0 is 2019
1 is 10
2 is 11
3 is 3
4 is 40
5 is 8
6 is 686538
7 is None
Another way to create '191011034008':
>>> t = (2019, 10, 11, 3, 40, 8, 686538, None)
>>> "".join(map(lambda x: "%02d" % x, t[:6]))
'20191011034008'
>>> "".join(map(lambda x: "%02d" % x, t[:6]))[2:]
'191011034008'
Note that:
%02d adds a leading zero when the argument is less than 10; otherwise (10 or greater) the number is printed as is. So the year is still a 4-digit string.
This lambda does not handle a None argument.
I tested this code at https://micropython.org/unicorn/
Edit:
str.format method version:
"".join(map(lambda x: "{:02d}".format(x), t[:6]))[2:]
or
"".join(map(lambda x: "{0:02d}".format(x), t[:6]))[2:]
The 0 in the second example is the parameter index.
You can use a parameter index when you want to specify it explicitly (e.g. the positions in the format string and the parameters do not match, or you want to write the same parameter multiple times).
>>> print("arg 0: {0}, arg 2: {2}, arg 1: {1}, arg 0 again: {0}".format(1, 11, 111))
arg 0: 1, arg 2: 111, arg 1: 11, arg 0 again: 1
I'd recommend using Python's string formatting syntax.
>>> t = (2019, 10, 11, 3, 40, 8, 686538, None)
>>> r = ("%d%02d%02d%02d%02d%02d" % t[:-2])[2:]
>>> print(r)
191011034008
Let's see what's going on here:
%d means "display a number"
%2d means "display a number, at least 2 characters wide, padded with spaces"
%02d means "display a number, at least 2 characters wide, padded with zeroes"
So we feed in all the relevant numbers, pad them as needed, and cut the "20" off "2019".
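The formatting approach above can be wrapped in a small helper. This is a sketch, and the name `utc_stamp` is made up for illustration; it uses only `%` formatting, which the answers above rely on, so it does not need .zfill.

```python
def utc_stamp(t):
    """Return YYMMDDhhmmss from a time tuple such as (2019, 10, 11, 3, 40, 8, ...)."""
    year, month, day, hour, minute, second = t[:6]
    # year % 100 keeps the last two digits; %02d pads each field to two digits.
    return "%02d%02d%02d%02d%02d%02d" % (year % 100, month, day, hour, minute, second)


stamp = utc_stamp((2019, 10, 11, 3, 40, 8, 686538, None))
```

Taking `year % 100` instead of slicing off the first two characters of the result keeps the year field two digits wide even for years such as 2005.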

Aggregate in Julia like R or pandas

I want to aggregate a monthly series at the quarterly frequency, for which R has ts and aggregate() (see the first answer on this thread) and pandas has df.resample("Q").sum() (see this question). Does Julia offer something similar?
Appendix: my current solution uses a function to convert a date to the first day of its quarter, followed by split-apply-combine:
"""
month_to_quarter(date)
Returns the date corresponding to the first day of the quarter enclosing date
# Examples
```jldoctest
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 1, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 25))
true
```
"""
function month_to_quarter(date::Date)
    new_month = 1 + 3 * floor((Dates.month(date) - 1) / 3)
    return Date(Dates.year(date), new_month, 1)
end
"""
monthly_to_quarterly(monthly_df)
Aggregates a monthly data frame to the quarterly frequency. The data frame should have a :DATE column.
# Examples
```jldoctest
julia> monthly = convert(DataFrame, hcat(collect([Dates.Date(1990, m, 1) for m in 1:3]), [1; 2; 3]));
julia> rename!(monthly, :x1 => :DATE);
julia> rename!(monthly, :x2 => :value);
julia> quarterly = RED.monthly_to_quarterly(monthly);
julia> quarterly[:value][1]
2.0
julia> length(quarterly[:value])
1
```
"""
function monthly_to_quarterly(monthly::DataFrame)
    # quarter months: 1, 4, 7, 10
    quarter_months = collect(1:3:10)
    # Deep copy the data frame
    monthly_copy = deepcopy(monthly)
    # Drop initial rows until it starts on a quarter
    while !in(Dates.month(monthly_copy[:DATE][1]), quarter_months)
        # Verify that something is left to pop
        @assert 1 <= length(monthly_copy[:DATE])
        monthly_copy = monthly_copy[2:end, :]
    end
    # Drop end rows until it finishes before a quarter
    while !in(Dates.month(monthly_copy[:DATE][end]), 2 + quarter_months)
        monthly_copy = monthly_copy[1:end-1, :]
    end
    # Change month of each date to the nearest quarter
    monthly_copy[:DATE] = month_to_quarter.(monthly_copy[:DATE])
    # Split-apply-combine
    quarterly = by(monthly_copy, :DATE, df -> mean(df[:value]))
    # Rename
    rename!(quarterly, :x1 => :value)
    return quarterly
end
I couldn't find such a function in the docs. Here's a more DataFrames.jl-ish and more succinct version of your own answer:
using DataFrames
using Dates
# copy-pasted your own function
function month_to_quarter(date::Date)
    new_month = 1 + 3 * floor((Dates.month(date) - 1) / 3)
    return Date(Dates.year(date), new_month, 1)
end
# the data
r=collect(1:6)
monthly = DataFrame(date=[Dates.Date(1990, m, 1) for m in r],
                    val=r);
# the functionality
monthly[:quarters] = month_to_quarter.(monthly[:date])
_aggregated = by(monthly, :quarters, df -> DataFrame(S = sum(df[:val])))
@show monthly
@show _aggregated
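The split-apply-combine idea used in both versions above can be sketched without any data-frame library. The following plain-Python illustration mirrors `month_to_quarter` and sums values per quarter; the function name `monthly_to_quarterly_sums` and the list-of-tuples input are assumptions made for the example.

```python
from datetime import date


def month_to_quarter(d):
    """Return the first day of the quarter containing d."""
    return date(d.year, 1 + 3 * ((d.month - 1) // 3), 1)


def monthly_to_quarterly_sums(rows):
    """rows: iterable of (date, value). Returns {quarter_start: sum of values}."""
    totals = {}
    for d, value in rows:
        key = month_to_quarter(d)
        totals[key] = totals.get(key, 0) + value
    return totals


# Same toy data as above: months 1..6 of 1990 with values 1..6.
monthly = [(date(1990, m, 1), m) for m in range(1, 7)]
quarterly = monthly_to_quarterly_sums(monthly)
```

Using integer floor division for the month avoids the floating-point intermediate that `floor(... / 3)` produces in the Julia version.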

PySpark: how to split data without randomizing

There is a function that randomly splits data:
trainingRDD, validationRDD, testRDD = RDD.randomSplit([6, 2, 2], seed=0)
I'm curious whether there is a way to produce the same partition (train 60 / validation 20 / test 20) without randomizing: that is, use the data in its current order, with the first 60% as training data, the next 20% as validation data, and the last 20% as test data.
Is there a way to split the data like randomSplit does, but without randomizing?
The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" and "next rows" in your RDD, it's just an unordered set. If you have an integer index column you could do something like this:
train = RDD.filter(lambda r: r['index'] % 5 <= 2)       # remainders 0, 1, 2 -> 60%
validation = RDD.filter(lambda r: r['index'] % 5 == 3)  # remainder 3 -> 20%
test = RDD.filter(lambda r: r['index'] % 5 == 4)        # remainder 4 -> 20%
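The modulo filters interleave the three sets. If contiguous blocks are wanted instead (first 60%, next 20%, last 20%), the cut points can be computed from the row count and compared against the index. The sketch below is plain Python, not Spark code, and the name `split_bounds` is made up for illustration; in PySpark the index could come from RDD.zipWithIndex(), which pairs each element with its position as (element, index), and the same range checks could then be used inside filter.

```python
def split_bounds(n_rows, weights=(6, 2, 2)):
    """Return [(start, stop), ...] index ranges, one per weight, covering 0..n_rows."""
    total = sum(weights)
    bounds, start, running = [], 0, 0
    for w in weights:
        running += w
        stop = n_rows * running // total  # cumulative cut point
        bounds.append((start, stop))
        start = stop
    return bounds


rows = list(range(10))             # stand-in for ordered data
indexed = list(enumerate(rows))    # (index, row) pairs
(tr_lo, tr_hi), (va_lo, va_hi), (te_lo, te_hi) = split_bounds(len(rows))
train = [r for i, r in indexed if tr_lo <= i < tr_hi]
validation = [r for i, r in indexed if va_lo <= i < va_hi]
test = [r for i, r in indexed if te_lo <= i < te_hi]
```

Computing each cut point from the cumulative weight keeps the ranges gap-free even when the row count is not divisible by the sum of the weights.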

Compare dates in Lua

I have a variable with a date table that looks like this
* table:
[day]
* number: 15
[year]
* number: 2015
[month]
* number: 2
How do I get the days between the current date and the date above? Many thanks!
You can use os.time() to convert your table to seconds, get the current time the same way, and then use os.difftime() to compute the difference. See the Lua Wiki for more details.
reference = os.time{day=15, year=2015, month=2}
daysfrom = os.difftime(os.time(), reference) / (24 * 60 * 60) -- seconds in a day
wholedays = math.floor(daysfrom)
print(wholedays) -- today it prints "1"
As @barnes53 pointed out, this could be off by one day for a few seconds, so it's not ideal, but it may be good enough for your needs.
You can use the algorithms gathered here:
chrono-Compatible Low-Level Date Algorithms
The algorithms are shown using C++, but they can be easily implemented in Lua if you like, or you can implement them in C or C++ and then just provide Lua bindings.
The basic idea using these algorithms is to compute a day number for the two dates and then just subtract them to give you the number of days.
--[[
http://howardhinnant.github.io/date_algorithms.html
Returns number of days since civil 1970-01-01. Negative values indicate
days prior to 1970-01-01.
Preconditions: y-m-d represents a date in the civil (Gregorian) calendar
m is in [1, 12]
d is in [1, last_day_of_month(y, m)]
y is "approximately" in
[numeric_limits<Int>::min()/366, numeric_limits<Int>::max()/366]
Exact range of validity is:
[civil_from_days(numeric_limits<Int>::min()),
civil_from_days(numeric_limits<Int>::max()-719468)]
]]
function days_from_civil(y, m, d)
    if m <= 2 then
        y = y - 1
        m = m + 9
    else
        m = m - 3
    end
    local era = math.floor(y/400)
    local yoe = y - era * 400 -- [0, 399]
    local doy = math.modf((153*m + 2)/5) + d-1 -- [0, 365]
    local doe = yoe * 365 + math.modf(yoe/4) - math.modf(yoe/100) + doy -- [0, 146096]
    return era * 146097 + doe - 719468
end
local reference_date = {year=2001, month = 1, day = 1}
local date = os.date("*t")
local reference_days = days_from_civil(reference_date.year, reference_date.month, reference_date.day)
local days = days_from_civil(date.year, date.month, date.day)
print(string.format("Today is %d days into the 21st century.",days-reference_days))
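To sanity-check the day-number arithmetic, here is the same algorithm transcribed into Python and compared against the standard library's date type. This is a sketch for verification only; the sample dates are arbitrary.

```python
from datetime import date


def days_from_civil(y, m, d):
    """Days since 1970-01-01 for a Gregorian date (negative before the epoch)."""
    # Shift the year so that it starts in March; Jan and Feb belong to the prior year.
    if m <= 2:
        y -= 1
        m += 9
    else:
        m -= 3
    era = y // 400                                 # 400-year cycle number
    yoe = y - era * 400                            # year of era, [0, 399]
    doy = (153 * m + 2) // 5 + d - 1               # day of shifted year, [0, 365]
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy  # day of era, [0, 146096]
    return era * 146097 + doe - 719468


epoch = date(1970, 1, 1)
samples = [date(1970, 1, 1), date(2000, 2, 29), date(2001, 1, 1), date(2015, 2, 15)]
matches = [days_from_civil(s.year, s.month, s.day) == (s - epoch).days for s in samples]
```

Subtracting two such day numbers gives the number of days between the dates, which is exactly what the Lua snippet prints.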
os.time (under Windows, at least) is limited to years from 1970 and up. If, for example, you need a general solution that also finds ages in days for people born before 1970, this won't work. You can use a Julian date conversion instead and subtract the two numbers (today and your target date).
A sample Julian date function that will work for practically any date AD is given below (Lua v5.3 because of //, but you could adapt it to earlier versions):
local function div(n,d)
    local a, b = 1, 1
    if n < 0 then a = -1 end
    if d < 0 then b = -1 end
    return a * b * (math.abs(n) // math.abs(d))
end
--------------------------------------------------------------------------------
-- Convert a YYMMDD date to Julian since 1/1/1900 (negative answer possible)
--------------------------------------------------------------------------------
function julian(year, month, day)
    local temp
    if (year < 0) or (month < 1) or (month > 12)
        or (day < 1) or (day > 31) then
        return
    end
    temp = div(month - 14, 12)
    return (
        day - 32075 +
        div(1461 * (year + 4800 + temp), 4) +
        div(367 * (month - 2 - temp * 12), 12) -
        div(3 * div(year + 4900 + temp, 100), 4)
    ) - 2415021
end