Aggregate in Julia like R or pandas - aggregate

I want to aggregate a monthly series at the quarterly frequency, for which R has ts and aggregate() (see the first answer on this thread) and pandas has df.resample("Q").sum() (see this question). Does Julia offer something similar?
Appendix: my current solution uses a function that maps a date to the first day of its quarter, followed by split-apply-combine:
"""
month_to_quarter(date)
Returns the date corresponding to the first day of the quarter enclosing date
# Examples
```jldoctest
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 1, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 25))
true
```
"""
function month_to_quarter(date::Date)
new_month = 1 + 3 * div(Dates.month(date) - 1, 3)
return Date(Dates.year(date), new_month, 1)
end
"""
monthly_to_quarterly(monthly_df)
Aggregates a monthly data frame to the quarterly frequency. The data frame should have a :DATE column.
# Examples
```jldoctest
julia> monthly = convert(DataFrame, hcat(collect([Dates.Date(1990, m, 1) for m in 1:3]), [1; 2; 3]));
julia> rename!(monthly, :x1 => :DATE);
julia> rename!(monthly, :x2 => :value);
julia> quarterly = RED.monthly_to_quarterly(monthly);
julia> quarterly[:value][1]
2.0
julia> length(quarterly[:value])
1
```
"""
function monthly_to_quarterly(monthly::DataFrame)
# quarter months: 1, 4, 7, 10
quarter_months = collect(1:3:10)
# Deep copy the data frame
monthly_copy = deepcopy(monthly)
# Drop initial rows until it starts on a quarter
while !in(Dates.month(monthly_copy[:DATE][1]), quarter_months)
# Verify that something is left to pop
@assert 1 <= length(monthly_copy[:DATE])
monthly_copy = monthly_copy[2:end, :]
end
# Drop end rows until it finishes before a quarter
while !in(Dates.month(monthly_copy[:DATE][end]), 2 .+ quarter_months)
monthly_copy = monthly_copy[1:end-1, :]
end
# Change month of each date to the nearest quarter
monthly_copy[:DATE] = month_to_quarter.(monthly_copy[:DATE])
# Split-apply-combine
quarterly = by(monthly_copy, :DATE, df -> mean(df[:value]))
# Rename
rename!(quarterly, :x1 => :value)
return quarterly
end

I couldn't find such a function in the docs. Here's a more DataFrames.jl-ish and more succinct version of your own solution:
using DataFrames
# copy-pasted your own function
function month_to_quarter(date::Date)
new_month = 1 + 3 * div(Dates.month(date) - 1, 3)
return Date(Dates.year(date), new_month, 1)
end
# the data
r=collect(1:6)
monthly = DataFrame(date=[Dates.Date(1990, m, 1) for m in r],
val=r);
# the functionality
monthly[:quarters] = month_to_quarter.(monthly[:date])
_aggregated = by(monthly, :quarters, df -> DataFrame(S = sum(df[:val])))
@show monthly
@show _aggregated
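For readers comparing across languages, the same split-apply-combine idea can be sketched in plain Python. This is only an illustration of the logic, not Julia code; the helper name aggregate_quarterly is made up for this sketch, and it sums values as in the answer above (the question's version takes the mean).

```python
from collections import defaultdict
from datetime import date


def month_to_quarter(d: date) -> date:
    """Return the first day of the quarter that contains d."""
    return date(d.year, 1 + 3 * ((d.month - 1) // 3), 1)


def aggregate_quarterly(rows):
    """Sum (date, value) pairs per quarter; hypothetical helper for this sketch."""
    totals = defaultdict(float)
    for d, value in rows:
        totals[month_to_quarter(d)] += value
    # Sort by quarter start so the result reads chronologically.
    return dict(sorted(totals.items()))


# Same toy data as the answer: months 1..6 of 1990 with values 1..6.
monthly = [(date(1990, m, 1), m) for m in range(1, 7)]
quarterly = aggregate_quarterly(monthly)
print(quarterly)
```

Grouping on the quarter's first day keeps the key a real date, which is the same design choice the Julia code makes.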

Related

Using a grouped z-score over a rolling window

I would like to calculate a z-score over a bin based on the data of a rolling look-back period.
Example
Today's visitor count during [9:30-9:35) should be z-score normalized using the (mean, std) of the visitor counts during [9:30-9:35) over the last 3 days.
Both of my current attempts raise InvalidOperationError. Is there a way to calculate this in polars?
import pandas as pd
import polars as pl
def z_score(col: str, over: str, alias: str):
# calculate z-score normalized `col` over `over`
return (
(pl.col(col)-pl.col(col).over(over).mean()) / pl.col(col).over(over).std()
).alias(alias)
df = pl.from_dict(
{
"timestamp": pd.date_range("2019-12-02 9:30", "2019-12-02 12:30", freq="30s").union(
pd.date_range("2019-12-03 9:30", "2019-12-03 12:30", freq="30s")
),
"visitors": [(e % 2) + 1 for e in range(722)]
}
# 5 minute bins for grouping [9:30-9:35) -> 930
).with_column(
pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M").cast(pl.Int32).alias("five_minute_bin")
).with_column(
pl.col("timestamp").dt.truncate(every="3d").alias("daytrunc")
)
# normalize visitor amount for each 5 min bin over the rolling 3 day window using z-score.
# not rolling but also wont work (InvalidOperationError: window expression not allowed in aggregation)
# df.with_column(
# z_score("visitors", "five_minute_bin", "normalized").over("daytrunc")
# )
# won't work either (InvalidOperationError: window expression not allowed in aggregation)
#df.groupby_rolling(index_column="daytrunc", period="3i").agg(z_score("visitors", "five_minute_bin", "normalized"))
As an example, take 4 days of data with four data points per day, falling into two time bins ({0,0} - {0,1}) and ({1,0} - {1,1}):
Input:
Day 0: x_d0_{0,0}, x_d0_{0,1}, x_d0_{1,0}, x_d0_{1,1}
Day 1: x_d1_{0,0}, x_d1_{0,1}, x_d1_{1,0}, x_d1_{1,1}
Day 2: x_d2_{0,0}, x_d2_{0,1}, x_d2_{1,0}, x_d2_{1,1}
Day 3: x_d3_{0,0}, x_d3_{0,1}, x_d3_{1,0}, x_d3_{1,1}
Output:
Day 0: norm_x_d0_{0,0} = nan, norm_x_d0_{0,1} = nan, norm_x_d0_{1,0} = nan, norm_x_d0_{1,1} = nan
Day 1: norm_x_d1_{0,0} = nan, norm_x_d1_{0,1} = nan, norm_x_d1_{1,0} = nan, norm_x_d1_{1,1} = nan
Day 2: norm_x_d2_{0,0} = nan, norm_x_d2_{0,1} = nan, norm_x_d2_{1,0} = nan, norm_x_d2_{1,1} = nan
Day 3: norm_x_d3_{0,0} = (x_d3_{0,0} - np.mean([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}])) / np.std([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}]), ...
The key here is to use over to restrict your calculations to the five-minute bins, and then use the rolling functions to get the rolling mean and standard deviation over days within each five-minute bin. five_minute_bin works as in your code, and I believe a truncated day_bin is necessary so that, for example, 9:33 on one day will include both 9:31 and 9:34 on the same day as well as 9:31 from 2 days ago.
from datetime import datetime
import polars as pl
days = 5
pl.DataFrame(
{
"timestamp": pl.concat(
[
pl.date_range(
datetime(2019, 12, d, 9, 30), datetime(2019, 12, d, 12, 30), "30s"
)
for d in range(2, days + 2)
]
),
"visitors": [(e % 2) + 1 for e in range(days * 361)],
}
).with_columns(
five_minute_bin=pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M"),
day_bin=pl.col("timestamp").dt.truncate(every="1d"),
).with_columns(
standardized_visitors=(
(
pl.col("visitors")
- pl.col("visitors").rolling_mean("3d", by="day_bin", closed="right")
)
/ pl.col("visitors").rolling_std("3d", by="day_bin", closed="right")
).over("five_minute_bin")
)
That said, when trying out this code, I found that polars doesn't handle non-unique values in the by-column of the rolling functions correctly, so identical values in the same 5-minute bin don't end up with the same standardized value. I opened a bug report here: https://github.com/pola-rs/polars/issues/6691. For large amounts of real-world data this shouldn't matter much, unless your data systematically differs in distribution within the 5-minute bins.
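To make the intended calculation concrete, here is a minimal pure-Python sketch of a per-bin rolling z-score. It is not polars code. It assumes the window is the current day plus the two preceding days, uses the sample standard deviation, and returns None when the window has fewer than two values or zero spread; the function name rolling_bin_zscore and the (day, bin, value) row layout are made up for this sketch.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, stdev


def rolling_bin_zscore(rows, window_days=3):
    """rows: list of (day, bin_key, value). Returns one z-score (or None) per row."""
    # Index values by (bin, day) so each window is assembled from a few lookups.
    by_bin_day = defaultdict(list)
    for day, bin_key, value in rows:
        by_bin_day[(bin_key, day)].append(value)

    out = []
    for day, bin_key, value in rows:
        window = []
        # Assumed window: the current day and the window_days - 1 days before it.
        for offset in range(window_days):
            window.extend(by_bin_day.get((bin_key, day - timedelta(days=offset)), []))
        if len(window) < 2:
            out.append(None)
            continue
        sd = stdev(window)
        out.append((value - mean(window)) / sd if sd > 0 else None)
    return out


rows = [
    (date(2019, 12, 2), "0930", 1.0),
    (date(2019, 12, 3), "0930", 3.0),
    (date(2019, 12, 4), "0930", 5.0),
]
print(rolling_bin_zscore(rows))
```

Because every row in a bin on a given day sees the same window, identical values in the same bin always get the same standardized value here.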

Polars Dataframe: Apply MinMaxScaler to a column with condition

I am trying to perform the following operation in Polars.
Values in column B below 80 should be scaled to between 1 and 4, whereas values of 80 or above should be set to 5.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df_pandas = pd.DataFrame(
{
"A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"B": [50, 300, 80, 12, 105, 78, 66, 42, 61.5, 35],
}
)
test_scaler = MinMaxScaler(feature_range=(1,4)) # from sklearn.preprocessing
df_pandas.loc[df_pandas['B']<80, 'Test'] = test_scaler.fit_transform(df_pandas.loc[df_pandas['B']<80, "B"].values.reshape(-1,1))
df_pandas = df_pandas.fillna(5)
This is what I did with Polars:
# dt is a dictionary
dt = df.filter(
pl.col('B')<80
).to_dict(as_series=False)
below_80 = list(dt.keys())
dt_scale = list(
test_scaler.fit_transform(
np.array(dt['B']).reshape(-1,1)
).reshape(-1) # reshape back to one dimensional
)
# reassign to dictionary dt
dt['B'] = dt_scale
dt_scale_df = pl.DataFrame(dt)
dt_scale_df
dummy = df.join(
dt_scale_df, how="left", on="A"
).fill_null(5)
dummy = dummy.rename({"B_right": "Test"})
Result:
A    B      Test
1    50.0   2.727273
2    300.0  5.000000
3    80.0   5.000000
4    12.0   1.000000
5    105.0  5.000000
6    78.0   4.000000
7    66.0   3.454545
8    42.0   2.363636
9    61.5   3.250000
10   35.0   2.045455
Is there a better approach for this?
Alright, I have three examples for you; the last one should be preferred.
Because you only want to apply your scaler to part of a column, we should make sure that only that part of the data is sent to the scaler. This can be done with:
1. a window function over a partition
2. partition_by
3. when -> then -> otherwise + min_max expression
Window function over a partition
This requires a Python function that is applied over the partitions. Inside the function we have to check which partition we are in and handle it accordingly.
df = pl.from_pandas(df_pandas)
min_max_sc = MinMaxScaler((1, 4))
def my_scaler(s: pl.Series) -> pl.Series:
if s.len() > 0 and s[0] >= 80:
out = (s * 0 + 5)
else:
out = pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
# ensure all types are the same
return out.cast(pl.Float64)
df.with_column(
pl.col("B").apply(my_scaler).over(pl.col("B") < 80).alias("Test")
)
partition_by
This partitions the original dataframe into a dictionary holding the different partitions. We then modify only the partitions that need it.
parts = (df
.with_column((pl.col("B") < 80).alias("part"))
.partition_by("part", as_dict=True)
)
parts[True] = parts[True].with_column(
pl.col("B").map(
lambda s: pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
).alias("Test")
)
parts[False] = parts[False].with_column(
pl.lit(5.0).alias("Test")
)
pl.concat([df for df in parts.values()]).select(pl.all().exclude("part"))
when -> then -> otherwise + min_max expression
This one I like best. We can write a function that builds a polars expression for the min-max scaling you need. This should have the best performance.
def min_max_scaler(col: str, predicate: pl.Expr):
x = pl.col(col)
x_min = x.filter(predicate).min()
x_max = x.filter(predicate).max()
# * 3 + 1 to set scale between 1 - 4
return (x - x_min) / (x_max - x_min) * 3 + 1
predicate = pl.col("B") < 80
df.with_column(
pl.when(predicate)
.then(min_max_scaler("B", predicate))
.otherwise(5).alias("Test")
)
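The arithmetic behind that expression can be checked without polars or sklearn. The following is a minimal pure-Python sketch; the function name scale_below_threshold and its parameters are made up for illustration, and it assumes at least two distinct values fall below the threshold (otherwise the span is zero).

```python
def scale_below_threshold(values, threshold=80, low=1.0, high=4.0, fill=5.0):
    """Min-max scale values below threshold into [low, high]; set the rest to fill."""
    selected = [v for v in values if v < threshold]
    v_min, v_max = min(selected), max(selected)
    span = v_max - v_min  # assumed non-zero, see the note above
    return [
        low + (v - v_min) / span * (high - low) if v < threshold else fill
        for v in values
    ]


b = [50, 300, 80, 12, 105, 78, 66, 42, 61.5, 35]
scaled = scale_below_threshold(b)
print([round(x, 6) for x in scaled])
```

With the data from the question, the minimum and maximum are taken only over the values below 80 (12 and 78), which is why 78 maps to 4 and 80 is set to 5.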

How to implement Date.add(date, n, :month) in Elixir

Would be nice to have this in the standard Elixir library, but we don't.
Date.add(date, n, :month) # where n could be +/-
How would you implement this?
This looks like a good starting point: https://stackoverflow.com/a/53407676/44080
Date.utc_today() |> Timex.shift(months: -1)
You could use the Timex implementation:
defp shift_by(%NaiveDateTime{:year => year, :month => month} = datetime, value, :months) do
m = month + value
shifted =
cond do
m > 0 ->
years = div(m - 1, 12)
month = rem(m - 1, 12) + 1
%{datetime | :year => year + years, :month => month}
m <= 0 ->
years = div(m, 12) - 1
month = 12 + rem(m, 12)
%{datetime | :year => year + years, :month => month}
end
# If the shift fails, it's because it's a high day number, and the month
# shifted to does not have that many days. This will be handled by always
# shifting to the last day of the month shifted to.
case :calendar.valid_date({shifted.year,shifted.month,shifted.day}) do
false ->
last_day = :calendar.last_day_of_the_month(shifted.year, shifted.month)
cond do
shifted.day <= last_day ->
shifted
:else ->
%{shifted | :day => last_day}
end
true ->
shifted
end
end
Timex uses the MIT license, so you should be able to incorporate this in pretty much any project.
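The underlying idea (shift the month, then clamp the day to the last day of the target month) is language-independent. Here is a minimal Python sketch of it for illustration only; it is not the Timex code, and the name shift_months is made up.

```python
import calendar
from datetime import date


def shift_months(d: date, n: int) -> date:
    """Shift d by n months (n may be negative), clamping to the month's last day."""
    # Count months from year 0 with a zero-based month so that floor division
    # handles negative shifts across year boundaries.
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


print(shift_months(date(2019, 3, 31), -1))
```

Using a single zero-based month count avoids the separate branches for positive and non-positive month sums.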
ex_cldr_calendars can also do basic date math for adding and subtracting years, quarters, months, weeks and days for any calendar that implements the Calendar behaviour.
iex> Cldr.Calendar.plus ~D[2019-03-31], :months, -1
~D[2019-02-28]
# The :coerce option determines whether to force an end
# of month date when the result of the operation is an invalid date
iex> Cldr.Calendar.plus ~D[2019-03-31], :months, -1, coerce: false
{:error, :invalid_date}
Without adding a dependency like Timex, the following works for adding/subtracting Gregorian months without too much trouble - assuming you only need the first of each month. Shifting to a day of the month directly may be best served through a library, given how many calendrical fallacies there are.
defmodule DateUtils do
@doc """
Shift a given date forward or back n months
"""
def shift_n_months(date, n) when n < 0, do: subtract_n_months(date, -1 * n)
def shift_n_months(date, n), do: add_n_months(date, n)
def add_n_months(date, 0), do: Date.beginning_of_month(date)
def add_n_months(date, n) do
date
|> Date.end_of_month()
|> Date.add(1)
|> add_n_months(n - 1)
end
def subtract_n_months(date, 0), do: Date.beginning_of_month(date)
def subtract_n_months(date, n) do
date
|> Date.beginning_of_month()
|> Date.add(-1)
|> subtract_n_months(n - 1)
end
end
Elixir has the function Date.add/2. Give it any date and a number of days (positive or negative), and it will add them for you. Note that it works in days, not months.
iex> Date.add(~D[2000-01-03], -2)
~D[2000-01-01]
If you need to create the date first, I suggest you use Date.new/4:
iex>{:ok, date} = Date.new(year, month, day)
iex>date |> Date.add(n)

How to make MATLAB read only the dates from one year when the files span several years?

So, I have four excel files with dates, that I read out and convert.
num = xlsread('1.xlsx', 1, 'A:B')
num2 = xlsread('2.xlsx', 1, 'A:B');
num3 = xlsread('3.xlsx', 1, 'A:B');
num4 = xlsread('4.xlsx', 1, 'A:B');
dnum = datetime(num(:,1),1,1) + caldays(num(:,2));
dnum2= datetime(num2(:,1),1,1) + caldays(num2(:,2));
dnum3= datetime(num3(:,1),1,1) + caldays(num3(:,2));
dnum4=datetime(num4(:,1),1,1) + caldays(num4(:,2));
plot(dnum, 1*ones(size(dnum)), 'x-','linewidth', 1)
plot(dnum2, 2*ones(size(dnum2)), 'x-','linewidth', 1 )
plot(dnum3, 3*ones(size(dnum3)), 'x-', 'linewidth', 1)
plot(dnum4, 4*ones(size(dnum4)), 'x-', 'linewidth', 1)
These files contain dates from many years. If I only want the dates from 2016, what can I do?
Create a logical filter array with year() and index with it:
FilterYears = year(dnum) == 2016;
FilteredData = dnum(FilterYears);
Hope this helps.

Compare dates in Lua

I have a variable with a date table that looks like this
* table:
[day]
* number: 15
[year]
* number: 2015
[month]
* number: 2
How do I get the days between the current date and the date above? Many thanks!
You can use os.time() to convert your table to seconds, get the current time the same way, and then use os.difftime() to compute the difference. See the Lua wiki for more details.
reference = os.time{day=15, year=2015, month=2}
daysfrom = os.difftime(os.time(), reference) / (24 * 60 * 60) -- seconds in a day
wholedays = math.floor(daysfrom)
print(wholedays) -- today it prints "1"
As @barnes53 pointed out, this could be off by one day for a few seconds, so it's not ideal, but it may be good enough for your needs.
You can use the algorithms gathered here:
chrono-Compatible Low-Level Date Algorithms
The algorithms are shown using C++, but they can be easily implemented in Lua if you like, or you can implement them in C or C++ and then just provide Lua bindings.
The basic idea using these algorithms is to compute a day number for the two dates and then just subtract them to give you the number of days.
--[[
http://howardhinnant.github.io/date_algorithms.html
Returns number of days since civil 1970-01-01. Negative values indicate
days prior to 1970-01-01.
Preconditions: y-m-d represents a date in the civil (Gregorian) calendar
m is in [1, 12]
d is in [1, last_day_of_month(y, m)]
y is "approximately" in
[numeric_limits<Int>::min()/366, numeric_limits<Int>::max()/366]
Exact range of validity is:
[civil_from_days(numeric_limits<Int>::min()),
civil_from_days(numeric_limits<Int>::max()-719468)]
]]
function days_from_civil(y, m, d)
if m <= 2 then
y = y - 1
m = m + 9
else
m = m - 3
end
local era = math.floor(y/400)
local yoe = y - era * 400 -- [0, 399]
local doy = math.modf((153*m + 2)/5) + d-1 -- [0, 365]
local doe = yoe * 365 + math.modf(yoe/4) - math.modf(yoe/100) + doy -- [0, 146096]
return era * 146097 + doe - 719468
end
local reference_date = {year=2001, month = 1, day = 1}
local date = os.date("*t")
local reference_days = days_from_civil(reference_date.year, reference_date.month, reference_date.day)
local days = days_from_civil(date.year, date.month, date.day)
print(string.format("Today is %d days into the 21st century.",days-reference_days))
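As a cross-check of the day-number approach, here is the same algorithm sketched in Python, where the result can be compared against the standard library's date.toordinal(). This is an illustrative port, not part of the original answer.

```python
from datetime import date


def days_from_civil(y: int, m: int, d: int) -> int:
    """Days since 1970-01-01 in the proleptic Gregorian calendar."""
    # Treat March as the first month so the leap day falls at the end of the year.
    if m <= 2:
        y -= 1
        m += 9
    else:
        m -= 3
    era = y // 400                                   # floor division, also for y < 0
    yoe = y - era * 400                              # [0, 399]
    doy = (153 * m + 2) // 5 + d - 1                 # [0, 365]
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy    # [0, 146096]
    return era * 146097 + doe - 719468


reference = days_from_civil(2001, 1, 1)
today = date.today()
print(days_from_civil(today.year, today.month, today.day) - reference)
```

Subtracting two day numbers gives the difference in whole days without going through seconds, so time zones and daylight saving do not enter the calculation.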
os.time (under Windows, at least) is limited to years from 1970 and up. If, for example, you need a general solution to also find ages in days for people born before 1970, this won't work. You can use a julian date conversion and subtract between the two numbers (today and your target date).
A sample julian date function that will work for practically any date AD is given below (Lua v5.3 because of // but you could adapt to earlier versions):
local function div(n,d)
local a, b = 1, 1
if n < 0 then a = -1 end
if d < 0 then b = -1 end
return a * b * (math.abs(n) // math.abs(d))
end
--------------------------------------------------------------------------------
-- Convert a YYMMDD date to Julian since 1/1/1900 (negative answer possible)
--------------------------------------------------------------------------------
function julian(year, month, day)
local temp
if (year < 0) or (month < 1) or (month > 12)
or (day < 1) or (day > 31) then
return
end
temp = div(month - 14, 12)
return (
day - 32075 +
div(1461 * (year + 4800 + temp), 4) +
div(367 * (month - 2 - temp * 12), 12) -
div(3 * div(year + 4900 + temp, 100), 4)
) - 2415021
end