I would like to calculate a z-score over a bin based on the data of a rolling look-back period.
Example
Today's visitor count during [9:30-9:35) should be z-score normalized based on the (mean, std) of the visitors during [9:30-9:35) over the last 3 days.
My current attempts both raise an InvalidOperationError. Is there a way in Polars to calculate this?
import pandas as pd
import polars as pl

def z_score(col: str, over: str, alias: str):
    # calculate z-score of `col`, normalized within each `over` group
    return (
        (pl.col(col) - pl.col(col).mean().over(over)) / pl.col(col).std().over(over)
    ).alias(alias)

df = pl.from_dict(
    {
        "timestamp": pd.date_range("2019-12-02 9:30", "2019-12-02 12:30", freq="30s").union(
            pd.date_range("2019-12-03 9:30", "2019-12-03 12:30", freq="30s")
        ),
        "visitors": [(e % 2) + 1 for e in range(722)],
    }
    # 5 minute bins for grouping: [9:30-9:35) -> 930
).with_columns(
    pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M").cast(pl.Int32).alias("five_minute_bin")
).with_columns(
    pl.col("timestamp").dt.truncate(every="3d").alias("daytrunc")
)
# normalize visitor amount for each 5 min bin over the rolling 3 day window using z-score

# not rolling, but also won't work (InvalidOperationError: window expression not allowed in aggregation)
# df.with_columns(
#     z_score("visitors", "five_minute_bin", "normalized").over("daytrunc")
# )

# won't work either (InvalidOperationError: window expression not allowed in aggregation)
# df.groupby_rolling(index_column="daytrunc", period="3i").agg(
#     z_score("visitors", "five_minute_bin", "normalized")
# )
As an example, take 4 days of data with four data points per day, lying in two time bins: ({0,0}, {0,1}) in bin 0 and ({1,0}, {1,1}) in bin 1.
Input:
Day 0: x_d0_{0,0}, x_d0_{0,1}, x_d0_{1,0}, x_d0_{1,1}
Day 1: x_d1_{0,0}, x_d1_{0,1}, x_d1_{1,0}, x_d1_{1,1}
Day 2: x_d2_{0,0}, x_d2_{0,1}, x_d2_{1,0}, x_d2_{1,1}
Day 3: x_d3_{0,0}, x_d3_{0,1}, x_d3_{1,0}, x_d3_{1,1}
Output:
Day 0: norm_x_d0_{0,0} = nan, norm_x_d0_{0,1} = nan, norm_x_d0_{1,0} = nan, norm_x_d0_{1,1} = nan
Day 1: norm_x_d1_{0,0} = nan, norm_x_d1_{0,1} = nan, norm_x_d1_{1,0} = nan, norm_x_d1_{1,1} = nan
Day 2: norm_x_d2_{0,0} = nan, norm_x_d2_{0,1} = nan, norm_x_d2_{1,0} = nan, norm_x_d2_{1,1} = nan
Day 3: norm_x_d3_{0,0} = (x_d3_{0,0} - np.mean([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}])) / np.std([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}]), ...
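For reference, here is the computation I'm after for a single bin in plain numpy (a sketch; the window span and ddof choices are my assumptions):

import numpy as np

# one five-minute bin: two data points per day, 4 days (matches the example above)
bin_values = [np.array([1.0, 2.0]), np.array([2.0, 1.0]),
              np.array([1.0, 1.0]), np.array([2.0, 2.0])]

# pool today plus the look-back days, then standardize today's values
window = np.concatenate(bin_values)  # days 0..3 for day 3
norm_day3 = (bin_values[3] - window.mean()) / window.std(ddof=1)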
The key here is to use over to restrict your calculations to the five-minute bins, and then use the rolling functions to get the rolling mean and standard deviation over days, restricted by those five-minute-bin keys. five_minute_bin works as in your code, and I believe a truncated day_bin is necessary so that, for example, 9:33 on one day will include both 9:31 from the same day and 9:31 from 2 days ago.
from datetime import datetime

import polars as pl

days = 5

pl.DataFrame(
    {
        "timestamp": pl.concat(
            [
                pl.date_range(
                    datetime(2019, 12, d, 9, 30), datetime(2019, 12, d, 12, 30), "30s"
                )
                for d in range(2, days + 2)
            ]
        ),
        "visitors": [(e % 2) + 1 for e in range(days * 361)],
    }
).with_columns(
    five_minute_bin=pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M"),
    day_bin=pl.col("timestamp").dt.truncate(every="1d"),
).with_columns(
    standardized_visitors=(
        (
            pl.col("visitors")
            - pl.col("visitors").rolling_mean("3d", by="day_bin", closed="right")
        )
        / pl.col("visitors").rolling_std("3d", by="day_bin", closed="right")
    ).over("five_minute_bin")
)
Now, that said, when trying out the code for this, I found that polars doesn't handle non-unique values in the by column of the rolling functions correctly, so identical values within the same 5-minute bin don't end up with the same standardized values. I opened a bug report here: https://github.com/pola-rs/polars/issues/6691. For large amounts of real-world data this shouldn't matter much, unless your data systematically differs in distribution within the 5-minute bins.
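Until that's fixed, one possible workaround (my own sketch, assuming the frame above is bound to df with the five_minute_bin/day_bin columns; the column names s, s2, n are made up, and group_by may be groupby in older Polars) is to pre-aggregate to one row per (five_minute_bin, day_bin) so the rolling by column is unique, roll sums and counts over 3 days, and reconstruct mean/std from those moments before joining back:

import polars as pl

# one row per (bin, day): sums, sums of squares, counts
daily = (
    df.group_by("five_minute_bin", "day_bin")
    .agg(
        s=pl.col("visitors").sum(),
        s2=(pl.col("visitors") ** 2).sum(),
        n=pl.col("visitors").count(),
    )
    .sort("day_bin")
)

# 3-day rolling totals per bin, then mean/std recovered via
# var = (S2 - S^2/N) / (N - 1), matching polars' default ddof=1
stats = daily.with_columns(
    S=pl.col("s").rolling_sum("3d", by="day_bin", closed="right").over("five_minute_bin"),
    S2=pl.col("s2").rolling_sum("3d", by="day_bin", closed="right").over("five_minute_bin"),
    N=pl.col("n").rolling_sum("3d", by="day_bin", closed="right").over("five_minute_bin"),
).with_columns(
    mean=pl.col("S") / pl.col("N"),
    std=((pl.col("S2") - pl.col("S") ** 2 / pl.col("N")) / (pl.col("N") - 1)).sqrt(),
)

df.join(stats, on=["five_minute_bin", "day_bin"]).with_columns(
    standardized_visitors=(pl.col("visitors") - pl.col("mean")) / pl.col("std")
)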
I have a table that looks like:
Time  ID  Value1  Value2
1     a   1       4
2     a   2       3
3     a   5       9
1     b   6       2
2     b   4       2
3     b   9       1
4     b   2       5
1     c   4       7
2     c   2       0
Here are the tasks and requirements:
I want to set the column ID as the key, not the column Time, but I don't want to delete the column Time. Is there a way in Spark to set a primary key?
The aggregation function is non-linear, which means you cannot just use "reduceByKey". All the data with the same key must be shuffled to a single node before calculation. For example, the aggregation function may look like the Nth root of the sum of the values, where N is the number of records (count) for each ID:
output = root(sum(value1), count(*)) + root(sum(value2), count(*))
To make it clear, for ID="a", the aggregated output value should be
output = root(1 + 2 + 5, 3) + root(4 + 3 + 9, 3)
The latter 3 is because we have 3 records for "a". For ID="b", it is:
output = root(6 + 4 + 9 + 2, 4) + root(2 + 2 + 1 + 5, 4)
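(Numerically: for "a", 8^(1/3) + 16^(1/3) = 2 + 2.52 ≈ 4.52; for "b", 21^(1/4) + 10^(1/4) ≈ 2.14 + 1.78 ≈ 3.92.)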
The combination is non-linear. Therefore, in order to get correct results, all the data with the same "ID" must be in one executor.
I checked UDFs and Aggregators in Spark 2.0. Based on my understanding, they all assume a "linear combination".
Is there a way to handle such non-linear combination calculations, especially taking advantage of parallel computing with Spark?
The function you use doesn't require any special treatment. You can use a plain aggregation with a join:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, lit, pow, sum}
import spark.implicits._  // for the $"..." column syntax (already in scope in spark-shell)

def root(l: Column, r: Column) = pow(l, lit(1) / r)

val out = root(sum($"value1"), count("*")) + root(sum($"value2"), count("*"))

df.groupBy("id").agg(out.alias("outcome")).join(df, Seq("id"))
or window functions:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("id")
val outw = root(sum($"value1").over(w), count("*").over(w)) +
  root(sum($"value2").over(w), count("*").over(w))
df.withColumn("outcome", outw)
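For anyone on PySpark, the window-function version might look like this (a sketch, assuming a DataFrame df with columns id, value1 and value2):

from pyspark.sql import Window
from pyspark.sql import functions as F

def root(l, r):
    # N-th root as pow(x, 1/N), mirroring the Scala helper above
    return F.pow(l, F.lit(1.0) / r)

w = Window.partitionBy("id")

out = (
    root(F.sum("value1").over(w), F.count("*").over(w))
    + root(F.sum("value2").over(w), F.count("*").over(w))
)

df.withColumn("outcome", out)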
There are functions that can randomly split data:
trainingRDD, validationRDD, testRDD = RDD.randomSplit([6, 2, 2], seed=0)
I'm curious whether there is a way to generate the same partitions (train 60 / valid 20 / test 20) but without randomizing: just use the data in its current order, so the first 60% is train, the next 20% is validation, and the last 20% is test.
Is there a way to split the data like randomSplit does, but without randomizing?
The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" and "next rows" in your RDD; it's just an unordered set. If you have an integer index column, you could do something like this:
train = RDD.filter(lambda r: r['index'] % 10 < 6)             # ~60%
validation = RDD.filter(lambda r: r['index'] % 10 in (6, 7))  # ~20%
test = RDD.filter(lambda r: r['index'] % 10 >= 8)             # ~20%
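Note that this gives an interleaved split rather than a contiguous one. If there is no index column, a contiguous first-60/next-20/last-20 split can be built with zipWithIndex, which assigns positions in the RDD's current order (a sketch; it assumes the current ordering is the one you want):

n = RDD.count()

# pair every record with its position: (record, index)
indexed = RDD.zipWithIndex()

train = indexed.filter(lambda x: x[1] < 0.6 * n).map(lambda x: x[0])
validation = indexed.filter(lambda x: 0.6 * n <= x[1] < 0.8 * n).map(lambda x: x[0])
test = indexed.filter(lambda x: x[1] >= 0.8 * n).map(lambda x: x[0])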
I recently encountered the below scenario in Drools. I want to know how to proceed with the rule design for this.
class Emp {
    Date beginDate;
    Date endDate;
}
Rule to determine annual income for the employee based on the given dates:
For dates before 3/5/2003, the hourly rate is $3.5 and the annual multiplier is 2100.
For dates after 3/5/2003, the hourly rate changes every year (given data) and the annual multiplier is 2092.
There might be scenarios where the begin date is before 3/5/2003 and the end date is after 3/5/2003.
What is the best way to design rules for this scenario?
Update: added an example for more clarity.
If the object is
empObj={
beginDate=10/8/2001,
endDate=5/10/2005
}
The rule should give the sum of below:
3.5 * (no. of days in 2001 starting 10/8/2001) / (total no. of days in 2001) * 2100
3.5 * 2100 ==> This is for year 2002
3.5 * (no. of days in 2003 before 3/5/2003) / (total no. of days in 2003) * 2100
(2003 hourly rate) * (no. of days in 2003 after 3/5/2003) / (total no. of days in 2003) * 2092 ==> note the change in yearly multiplier..
(2004 hourly rate) * 2092
(2005 hourly rate) * (no. of days in 2005 before 5/10/2005) / (total no. of days in 2005) * 2092
One way to do this is to have one rule per year. So it would look something like this:
rule "2001"
when:
e : Emp( beginDate < "01-Jan-2002" )
then:
// 1. Get the number of days worked in 2001, probably easiest to do with some Java helper method
// 2. Calculate the sum
// 3. Add the sum to some Fact, could be the same Emp fact even
end
rule "2002"
when:
e : Emp( beginDate < "01-Jan-2003" )
then:
// As with 2001
end
The rest of the rules are very similar; just change the yearly multiplier accordingly. If you decide to use the Emp object to hold the sum, add a method like:
class Emp {
    long sum = 0;
    void addToSum( long value ) { sum += value; }
}
Then, on the RHS of each rule, call that method and update the fact.
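For the per-year helper the RHS comments refer to, the proration arithmetic could look roughly like this (a Python sketch of the logic, not Drools; the post-2003 rates in RATE_AFTER are made-up placeholders, and "3/5/2003" is read here as May 3, 2003, so adjust for your date format):

from datetime import date, timedelta

CUTOFF = date(2003, 5, 3)                        # "3/5/2003" from the question
RATE_BEFORE, MULT_BEFORE = 3.5, 2100
RATE_AFTER = {2003: 4.0, 2004: 4.2, 2005: 4.5}   # hypothetical yearly rates
MULT_AFTER = 2092

def year_fraction(start: date, end: date) -> float:
    # fraction of start's calendar year covered by [start, end], end inclusive
    days_in_year = (date(start.year + 1, 1, 1) - date(start.year, 1, 1)).days
    return ((end - start).days + 1) / days_in_year

def annual_income(begin: date, end: date) -> float:
    total = 0.0
    for year in range(begin.year, end.year + 1):
        y_start = max(begin, date(year, 1, 1))
        y_end = min(end, date(year, 12, 31))
        if y_end < CUTOFF:                       # entirely before the cutoff
            total += RATE_BEFORE * MULT_BEFORE * year_fraction(y_start, y_end)
        elif y_start >= CUTOFF:                  # entirely after the cutoff
            total += RATE_AFTER[year] * MULT_AFTER * year_fraction(y_start, y_end)
        else:                                    # the year straddling the cutoff
            total += RATE_BEFORE * MULT_BEFORE * year_fraction(y_start, CUTOFF - timedelta(days=1))
            total += RATE_AFTER[year] * MULT_AFTER * year_fraction(CUTOFF, y_end)
    return total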
Hope this helps.
Being new to Crystal, I am unable to figure out how to compute rows 3 and 4 below.
Rows 1 and 2 are simple percentages of the sum of the data.
Row 3 is a computed value (see below).
Row 4 is a sum of the data points (NOT a percentage as in rows 1 and 2).
Can someone give me some pointers on how to generate the display below?
My data:
2010/01/01 A 10
2010/01/01 B 20
2010/01/01 C 30
2010/02/01 A 40
2010/02/01 B 50
2010/02/01 C 60
2010/03/01 A 70
2010/03/01 B 80
2010/03/01 C 90
I want to display
                      2010/01/01   2010/02/01   2010/03/01
                      ==========   ==========   ==========
[ B/(A + B + C) ]     20/60        50/150       80/240      <=== percentage of sum
[ C/(A + B + C) ]     30/60        60/150       90/240      <=== percentage of sum
[ 1 - A/(A + B + C) ] 1 - 10/60    1 - 40/150   1 - 70/240  <=== computed
[ (A + B + C) ]       60           150          240         <=== sum
Assuming you are using a SQL data source, I suggest deriving each of the output rows' values (i.e. [B/(A + B + C)], [C/(A + B + C)], [1 - A/(A + B + C)], and [(A + B + C)]) per date in the SQL query, then using Crystal's crosstab feature to pivot them into the desired output format.
Crystal's crosstabs aren't particularly suited to deriving different calculations on different rows of output.