Polars Dataframe: Apply MinMaxScaler to a column with condition - python-polars

I am trying to perform the following operation in Polars.
Values in column B below 80 should be scaled between 1 and 4, whereas anything at or above 80 should be set to 5.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_pandas = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "B": [50, 300, 80, 12, 105, 78, 66, 42, 61.5, 35],
    }
)
test_scaler = MinMaxScaler(feature_range=(1, 4))
df_pandas.loc[df_pandas['B'] < 80, 'Test'] = test_scaler.fit_transform(
    df_pandas.loc[df_pandas['B'] < 80, "B"].values.reshape(-1, 1)
)
df_pandas = df_pandas.fillna(5)
This is what I did with Polars:
import numpy as np
import polars as pl

# dt is a dictionary of column name -> list of values
dt = df.filter(
    pl.col('B') < 80
).to_dict(as_series=False)
dt_scale = list(
    test_scaler.fit_transform(
        np.array(dt['B']).reshape(-1, 1)
    ).reshape(-1)  # reshape back to one dimension
)
# reassign the scaled values to dictionary dt
dt['B'] = dt_scale
dt_scale_df = pl.DataFrame(dt)
dt_scale_df
dummy = df.join(
    dt_scale_df, how="left", on="A"
).fill_null(5)
dummy = dummy.rename({"B_right": "Test"})
Result:

A     B      Test
1     50.0   2.727273
2     300.0  5.000000
3     80.0   5.000000
4     12.0   1.000000
5     105.0  5.000000
6     78.0   4.000000
7     66.0   3.454545
8     42.0   2.363636
9     61.5   3.250000
10    35.0   2.045455
Is there a better approach for this?

Alright, I have three examples for you, of which the last should be preferred.
Because you only want to apply your scaler to part of a column, we should ensure we only send that part of the data to the scaler. This can be done by:
1. a window function over a partition
2. partition_by
3. when -> then -> otherwise + a min_max expression

Window function over partition
This requires a python function that will be applied over the partitions. In the function itself we then have to check which partition we are in and deal with it accordingly.
df = pl.from_pandas(df_pandas)
min_max_sc = MinMaxScaler((1, 4))

def my_scaler(s: pl.Series) -> pl.Series:
    # ">= 80" so that a partition that happens to start with exactly 80 is also set to 5
    if s.len() > 0 and s[0] >= 80:
        out = (s * 0 + 5)  # every value in this partition becomes 5
    else:
        out = pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    # ensure all types are the same
    return out.cast(pl.Float64)

df.with_column(
    pl.col("B").apply(my_scaler).over(pl.col("B") < 80).alias("Test")
)
partition_by
This partitions the original dataframe into a dictionary holding the different partitions. We then only modify the partitions as needed.
parts = (df
    .with_column((pl.col("B") < 80).alias("part"))
    .partition_by("part", as_dict=True)
)
parts[True] = parts[True].with_column(
    pl.col("B").map(
        lambda s: pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    ).alias("Test")
)
parts[False] = parts[False].with_column(
    pl.lit(5.0).alias("Test")
)

pl.concat([df for df in parts.values()]).select(pl.all().exclude("part"))
when -> then -> otherwise + min_max expression
This one I like best. We can make a function that creates a polars expression implementing the min-max scaling you need. This will have the best performance.
def min_max_scaler(col: str, predicate: pl.Expr):
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    # * 3 + 1 to set the scale between 1 and 4
    return (x - x_min) / (x_max - x_min) * 3 + 1

predicate = pl.col("B") < 80
df.with_column(
    pl.when(predicate)
    .then(min_max_scaler("B", predicate))
    .otherwise(5).alias("Test")
)
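If you need a feature range other than (1, 4), the hard-coded * 3 + 1 generalizes to * (b - a) + a. A minimal sketch of that generalization (the feature_range parameter and the renamed function are my additions, not part of the original answer):

def min_max_scaler_ranged(col: str, predicate: pl.Expr, feature_range=(1, 4)):
    # scales the values selected by predicate into [a, b]
    a, b = feature_range
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    return (x - x_min) / (x_max - x_min) * (b - a) + a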


summary row with gtsummary

I am trying to create a table of events with gtsummary, and I would like to obtain a final row counting the events of the previous rows. add_overall() and add_n() do add the total, but in a column, counting the same event across groups rather than the overall events.
I created this example.
x1 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.85, 0.15))
x2 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.9, 0.1))
x3 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.75, 0.25))
y <- sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))
df <- data.frame(as_factor(x1), as_factor(x2), as_factor(x3), as_factor(y))
colnames(df) <- c("event_1", "event_2", "event_3", "group")
tbl_summary(df, by = group, statistic = all_categorical() ~ "{n}")
I tried using the summary_rows() function from the gt package after converting the table to a gt object, but there is an error when summarising because these variables are factors.
Any other ideas?
You can do this by adding a new variable to your data frame that is the row sum of each of the events. Then you can display that variable's sum in the summary table. Example below!
library(gtsummary)
library(tidyverse)

df <-
  data.frame(
    event_1 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.85, 0.15)),
    event_2 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.9, 0.1)),
    event_3 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.75, 0.25)),
    group = sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))
  ) |>
  rowwise() |>
  mutate(Total = sum(event_1, event_2, event_3))

tbl_summary(
  df,
  by = group,
  type = Total ~ "continuous",
  statistic =
    list(all_categorical() ~ "{n}",
         all_continuous() ~ "{sum}")
) |>
  as_kable() # convert to kable to display on stack overflow
Characteristic    A, N = 16    B, N = 14
event_1           4            4
event_2           1            2
event_3           7            6
Total             12           12
Created on 2023-01-12 with reprex v2.0.2
Thank you so much (great package, gtsummary). That works! I had some trouble summing over factors. If the variables are factors, the code

mutate(Total = sum(event_1 == "Yes", event_2 == "Yes", event_3 == "Yes"))

does it.

Google OR-Tools doesn't find solution on VRPtw problem

I'm tackling a VRPTW problem and struggling: the solver finds no solution with any data except an artificial small set.
The setting is as below.
There are several depots and locations to visit. Each location has a time window. Each vehicle has a break time and a work time. Also, the locations have some constraints, and only the vehicles that satisfy a location's demand can visit it.
Based on this experiment setting, I wrote the code below.
As I wrote, it appears to work with small artificial data, but with real data it never finds a solution. I tried with 5 different data sets.
Although I set a 7200-second time limit, I previously ran it for longer than 10 hours and the outcome was the same.
The data's scale is 40~50 vehicles and 200~300 locations.
Does this code have a problem? If not, in what order should I change the approach (such as initialization, search method, and so on)?
(Edited to use integer for time matrix)
from dataclasses import dataclass
from typing import List, Tuple

from ortools.constraint_solver import pywrapcp
from ortools.constraint_solver import routing_enums_pb2

# TODO: Refactor
BIG_ENOUGH = 100000000
TIME_DIMENSION = 'Time'
TIME_LIMIT = 7200


@dataclass
class DataSet:
    time_matrix: List[List[int]]
    locations_num: int
    vehicles_num: int
    vehicles_break_time_window: List[Tuple[int, int, int]]
    vehicles_work_time_windows: List[Tuple[int, int]]
    location_time_windows: List[Tuple[int, int]]
    vehicles_depots_indices: List[int]
    possible_vehicles: List[List[int]]


def execute(data: DataSet):
    manager = pywrapcp.RoutingIndexManager(data.locations_num,
                                           data.vehicles_num,
                                           data.vehicles_depots_indices,
                                           data.vehicles_depots_indices)
    routing_parameters = pywrapcp.DefaultRoutingModelParameters()
    routing_parameters.solver_parameters.trace_propagation = True
    routing_parameters.solver_parameters.trace_search = True
    routing = pywrapcp.RoutingModel(manager, routing_parameters)

    def time_callback(source_index, dest_index):
        from_node = manager.IndexToNode(source_index)
        to_node = manager.IndexToNode(dest_index)
        return data.time_matrix[from_node][to_node]

    transit_callback_index = routing.RegisterTransitCallback(time_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
    routing.AddDimension(
        transit_callback_index,
        BIG_ENOUGH,
        BIG_ENOUGH,
        False,
        TIME_DIMENSION)
    time_dimension = routing.GetDimensionOrDie(TIME_DIMENSION)

    # set time window for locations start time
    # set condition restrictions
    possible_vehicles = data.possible_vehicles
    for location_idx, time_window in enumerate(data.location_time_windows):
        index = manager.NodeToIndex(location_idx + data.vehicles_num)
        time_dimension.CumulVar(index).SetRange(time_window[0], time_window[1])
        routing.SetAllowedVehiclesForIndex(possible_vehicles[location_idx], index)

    solver = routing.solver()
    for i in range(data.vehicles_num):
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.Start(i)))
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.End(i)))

    # set work time window for vehicles
    for vehicle_index, work_time_window in enumerate(data.vehicles_work_time_windows):
        start_index = routing.Start(vehicle_index)
        time_dimension.CumulVar(start_index).SetRange(work_time_window[0],
                                                      work_time_window[0])
        end_index = routing.End(vehicle_index)
        time_dimension.CumulVar(end_index).SetRange(work_time_window[1],
                                                    work_time_window[1])

    # set break time for vehicles
    node_visit_transit = {}
    for n in range(routing.Size()):
        if n >= data.locations_num:
            node_visit_transit[n] = 0
        else:
            node_visit_transit[n] = 1

    break_intervals = {}
    for v in range(data.vehicles_num):
        vehicle_break = data.vehicles_break_time_window[v]
        break_intervals[v] = [
            solver.FixedDurationIntervalVar(vehicle_break[0],
                                            vehicle_break[1],
                                            vehicle_break[2],
                                            True,
                                            'Break for vehicle {}'.format(v))
        ]
        time_dimension.SetBreakIntervalsOfVehicle(
            break_intervals[v], v, node_visit_transit
        )

    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
    search_parameters.local_search_metaheuristic = (
        routing_enums_pb2.LocalSearchMetaheuristic.GREEDY_DESCENT)
    search_parameters.time_limit.seconds = TIME_LIMIT
    search_parameters.log_search = True
    solution = routing.SolveWithParameters(search_parameters)
    return solution


if __name__ == '__main__':
    data = DataSet(
        time_matrix=[[0, 0, 4, 5, 5, 6],
                     [0, 0, 6, 4, 5, 5],
                     [1, 3, 0, 6, 5, 4],
                     [2, 1, 6, 0, 5, 4],
                     [2, 2, 5, 5, 0, 6],
                     [3, 2, 4, 4, 6, 0]],
        locations_num=6,
        vehicles_num=2,
        vehicles_depots_indices=[0, 1],
        vehicles_work_time_windows=[(720, 1080), (720, 1080)],
        vehicles_break_time_window=[(720, 720, 15), (720, 720, 15)],
        location_time_windows=[(735, 750), (915, 930), (915, 930), (975, 990)],
        possible_vehicles=[[0], [1], [0], [1]]
    )
    solution = execute(data)
    if solution is not None:
        print("solution is found")

For the series 1, 1, 2, 2, 4, 2, 6, what are the next terms in the sequence? What is the nth term?

I want to know the pattern for the above series in order to write code for it.
I am thinking that the above series is a mix of two different series, 1, 2, 4, 6, ... and 1, 2, 2, ...
Please help me with this sequence, and also tell me whether I am thinking in the correct way or not.
Logic:
series 1 -> prime - 1, i.e. [1, 2, 4, 6, 10, 12, 16, 18, 22, 28, 30, 36, ...]
series 2 -> number series, i.e. [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, ...]
final output -> alternate the two series, i.e. [1 1 2 2 4 2 6 3 10 3 12 3 16 4 18 4 22 4 28 4 ...]
Note: there might be another logic, but given the question, this series can be generated by the program below.
Please do not use this for any competition test/exam.
import math

li_prime = []
li_series = []

def prime(size):
    # fill li_prime with (p - 1) for each prime p: 1, 2, 4, 6, 10, 12, ...
    global li_prime
    count = 2
    while len(li_prime) < size:
        isprime = True
        for x in range(2, int(math.sqrt(count)) + 1):
            if count % x == 0:
                isprime = False
                break
        if isprime:
            li_prime.append(count - 1)
        count += 1

def series(size):
    # fill li_series with 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, ...
    global li_series
    for i in range(size + 1):
        for j in range(i):
            li_series.append(i)
        if len(li_series) > size:
            break

def main():
    global li_prime
    global li_series
    testcase = int(input(''))
    for _ in range(testcase):
        li_series = []
        li_prime = []
        size = int(input(''))
        prime(size)
        series(size)
        li_prime = li_prime[:size]
        li_series = li_series[:size]
        lc = []
        for i in range(size // 2 + 1):
            lc.append(str(li_prime[i]))
            lc.append(str(li_series[i]))
        lc = lc[:size]
        print(' '.join(lc))  # the interleaved series

main()
It is the series counting, for each n, the numbers whose greatest common divisor (gcd) with n is 1, also known as Euler's totient function.
series format = {1 1 2 2 4 2 6 32 ..... 168 80 216 120 164 100}
Code:
public static void main(String[] args) {
    int n = 20; // n is the input for the size of the series (example value)
    for (int j = 1; j <= n; j++) {
        System.out.print(calSeriesVal(j) + " ");
    }
}

// Euclid's algorithm for the gcd
private static int calDivisor(int a, int b) {
    if (a == 0)
        return b;
    return calDivisor(b % a, a);
}

// Euler's totient of n: count of 1 <= i <= n with gcd(i, n) == 1
private static int calSeriesVal(int n) {
    int val = 1;
    for (int i = 2; i < n; i++)
        if (calDivisor(i, n) == 1)
            val++;
    return val;
}
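As a quick sanity check of the totient reading (my addition, not part of the original answer), a few lines of Python reproduce the terms given in the question:

from math import gcd

def totient(n):
    # count 1 <= k <= n with gcd(k, n) == 1 (Euler's totient)
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

print([totient(n) for n in range(1, 8)])  # [1, 1, 2, 2, 4, 2, 6] -- matches the series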

Aggregate in Julia like R or pandas

I want to aggregate a monthly series at the quarterly frequency, for which R has ts and aggregate() (see the first answer on this thread) and pandas has df.resample("Q").sum() (see this question). Does Julia offer something similar?
Appendix: my current solution uses a function to convert a date to the first day of its quarter, plus split-apply-combine:
"""
month_to_quarter(date)
Returns the date corresponding to the first day of the quarter enclosing date
# Examples
```jldoctest
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 1, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 25))
true
```
"""
function month_to_quarter(date::Date)
new_month = 1 + 3 * floor((Dates.month(date) - 1) / 3)
return Date(Dates.year(date), new_month, 1)
end
"""
monthly_to_quarterly(monthly_df)
Aggregates a monthly data frame to the quarterly frequency. The data frame should have a :DATE column.
# Examples
```jldoctest
julia> monthly = convert(DataFrame, hcat(collect([Dates.Date(1990, m, 1) for m in 1:3]), [1; 2; 3]));
julia> rename!(monthly, :x1 => :DATE);
julia> rename!(monthly, :x2 => :value);
julia> quarterly = RED.monthly_to_quarterly(monthly);
julia> quarterly[:value][1]
2.0
julia> length(quarterly[:value])
1
```
"""
function monthly_to_quarterly(monthly::DataFrame)
# quarter months: 1, 4, 7, 10
quarter_months = collect(1:3:10)
# Deep copy the data frame
monthly_copy = deepcopy(monthly)
# Drop initial rows until it starts on a quarter
while !in(Dates.month(monthly_copy[:DATE][1]), quarter_months)
# Verify that something is left to pop
#assert 1 <= length(monthly_copy[:DATE])
monthly_copy = monthly_copy[2:end, :]
end
# Drop end rows until it finishes before a quarter
while !in(Dates.month(monthly_copy[:DATE][end]), 2 + quarter_months)
monthly_copy = monthly_copy[1:end-1, :]
end
# Change month of each date to the nearest quarter
monthly_copy[:DATE] = month_to_quarter.(monthly_copy[:DATE])
# Split-apply-combine
quarterly = by(monthly_copy, :DATE, df -> mean(df[:value]))
# Rename
rename!(quarterly, :x1 => :value)
return quarterly
end
I couldn't find such a function in the docs. Here's a more DataFrames.jl-ish and more succinct version of your own answer:
using DataFrames

# copy-pasted your own function
function month_to_quarter(date::Date)
    new_month = 1 + 3 * floor((Dates.month(date) - 1) / 3)
    return Date(Dates.year(date), new_month, 1)
end

# the data
r = collect(1:6)
monthly = DataFrame(date=[Dates.Date(1990, m, 1) for m in r],
                    val=r);

# the functionality
monthly[:quarters] = month_to_quarter.(monthly[:date])
_aggregated = by(monthly, :quarters, df -> DataFrame(S = sum(df[:val])))
@show monthly
@show _aggregated
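For reference, the pandas behaviour the question points to can be sketched like this (my addition; the sample data is hypothetical):

import pandas as pd

# monthly series: Jan-Jun 1990, values 1..6
idx = pd.date_range("1990-01-01", periods=6, freq="MS")
monthly = pd.DataFrame({"val": range(1, 7)}, index=idx)

# aggregate to quarterly sums; "QS" labels each quarter by its first day
quarterly = monthly.resample("QS").sum()
print(quarterly)  # 1990-01-01 -> 6, 1990-04-01 -> 15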

Can't assign a big number to a variable out of the while loop in scala

I want to write a program that can find the N-th number which only contains the factors 2, 3, or 5.
import scala.collection.mutable

def method3(n: Int): Int = {
  var q2 = mutable.Queue[Int](2)
  var q3 = mutable.Queue[Int](3)
  var q5 = mutable.Queue[Int](5)
  var count = 1
  var x: Int = 0
  while (count != n) {
    val minVal = Seq(q2, q3, q5).map(_.head).min
    if (minVal == q2.head) {
      x = q2.dequeue()
      q2.enqueue(2 * x)
      q3.enqueue(3 * x)
      q5.enqueue(5 * x)
    } else if (minVal == q3.head) {
      x = q3.dequeue()
      q3.enqueue(3 * x)
      q5.enqueue(5 * x)
    } else {
      x = q5.dequeue()
      q5.enqueue(5 * x)
    }
    count += 1
  }
  return x
}

println(method3(1000))
println(method3(10000))
println(method3(100000))
The results
51200000
0
0
When the input number gets larger , I get 0 from the function.
But if I change the function to
def method3(n:Int):Int = {
...
q5.enqueue(5*x)
}
if(x > 1000000000) println(('-',x)) //note here!!!
count+=1
}
return x
}
The results
51200000
(-,1006632960)
(-,1007769600)
(-,1012500000)
(-,1019215872)
(-,1020366720)
(-,1024000000)
(-,1025156250)
(-,1033121304)
(-,1036800000)
(-,1048576000)
(-,1049760000)
(-,1054687500)
(-,1061683200)
(-,1062882000)
(-,1073741824)
0
.....
So I don't know why the result equals 0 when the input number grows larger.
An Int is only 32 bits (4 bytes). You're hitting the limits of what an Int can hold.
Take that last number you encounter: 1073741824. Multiply that by 2 and the result is negative (-2147483648). Multiply it by 4 and the result is zero.
BTW, if you're working with numbers "which only contains factor 2, 3 or 5", in other words the numbers 2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 15, ... etc., then the 1,000th number in that sequence shouldn't be that big. By my calculations the result should only be 1365.
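To see how quickly the 2-3-5-smooth values outgrow a 32-bit Int, here is a quick check (my addition) using Python's arbitrary-precision integers and the classic three-pointer construction of the sequence:

def hamming(n):
    # n-th number of the form 2^a * 3^b * 5^c, 1-indexed with hamming(1) == 1
    h = [1]
    i2 = i3 = i5 = 0
    while len(h) < n:
        nxt = min(2 * h[i2], 3 * h[i3], 5 * h[i5])
        h.append(nxt)
        # advance every pointer that produced nxt, so duplicates are skipped
        if nxt == 2 * h[i2]: i2 += 1
        if nxt == 3 * h[i3]: i3 += 1
        if nxt == 5 * h[i5]: i5 += 1
    return h[n - 1]

print(hamming(1000))               # 51200000, matching method3(1000) above
print(hamming(10000) > 2**31 - 1)  # True: the 10000th value overflows a 32-bit Int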