I am trying to perform the following operation in Polars.
Values in column B below 80 should be scaled between 1 and 4, whereas anything at 80 or above should be set to 5.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_pandas = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "B": [50, 300, 80, 12, 105, 78, 66, 42, 61.5, 35],
    }
)
test_scaler = MinMaxScaler(feature_range=(1, 4))
df_pandas.loc[df_pandas['B'] < 80, 'Test'] = test_scaler.fit_transform(
    df_pandas.loc[df_pandas['B'] < 80, "B"].values.reshape(-1, 1)
)
df_pandas = df_pandas.fillna(5)
This is what I did with Polars:
import numpy as np
import polars as pl

df = pl.from_pandas(df_pandas)

# dt is a dictionary
dt = df.filter(
    pl.col('B') < 80
).to_dict(as_series=False)
below_80 = list(dt.keys())
dt_scale = list(
    test_scaler.fit_transform(
        np.array(dt['B']).reshape(-1, 1)
    ).reshape(-1)  # reshape back to one dimension
)
# reassign to dictionary dt
dt['B'] = dt_scale
dt_scale_df = pl.DataFrame(dt)
dt_scale_df

dummy = df.join(
    dt_scale_df, how="left", on="A"
).fill_null(5)
dummy = dummy.rename({"B_right": "Test"})
Result:

 A    B      Test
 1    50.0   2.727273
 2    300.0  5.000000
 3    80.0   5.000000
 4    12.0   1.000000
 5    105.0  5.000000
 6    78.0   4.000000
 7    66.0   3.454545
 8    42.0   2.363636
 9    61.5   3.250000
10    35.0   2.045455
Is there a better approach for this?
Alright, I have three examples for you that should help; the last one is the one to prefer.
Because you only want to apply your scaler to a part of a column, we should ensure we only send that part of the data to the scaler. This can be done by:
window function over a partition
partition_by
when -> then -> otherwise + min_max expression
Window function over partition
This requires a Python function that is applied over the partitions. Inside the function we then have to check which partition we are in (here the partitions are the two groups defined by the expression B < 80) and deal with it accordingly.
df = pl.from_pandas(df_pandas)
min_max_sc = MinMaxScaler((1, 4))

def my_scaler(s: pl.Series) -> pl.Series:
    # the partition where the predicate is False holds the values >= 80
    if s.len() > 0 and s[0] >= 80:
        out = (s * 0 + 5)
    else:
        out = pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    # ensure all types are the same
    return out.cast(pl.Float64)

df.with_column(
    pl.col("B").apply(my_scaler).over(pl.col("B") < 80).alias("Test")
)
partition_by
This partitions the original dataframe into a dictionary holding the different partitions. We then modify only the partitions that need it.
parts = (df
    .with_column((pl.col("B") < 80).alias("part"))
    .partition_by("part", as_dict=True)
)
parts[True] = parts[True].with_column(
    pl.col("B").map(
        lambda s: pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    ).alias("Test")
)
parts[False] = parts[False].with_column(
    pl.lit(5.0).alias("Test")
)
pl.concat([df for df in parts.values()]).select(pl.all().exclude("part"))
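Note that pl.concat stitches the partitions back together one after the other, so the original row order is not preserved; if the order matters, sorting on the key column restores it, e.g.:

pl.concat([df for df in parts.values()]).select(pl.all().exclude("part")).sort("A")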
when -> then -> otherwise + min_max expression
This one I like best. We can write a function that creates a Polars expression implementing the min-max scaling you need. This will give the best performance.
def min_max_scaler(col: str, predicate: pl.Expr):
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    # * 3 + 1 to set scale between 1 - 4
    return (x - x_min) / (x_max - x_min) * 3 + 1

predicate = pl.col("B") < 80
df.with_column(
    pl.when(predicate)
    .then(min_max_scaler("B", predicate))
    .otherwise(5).alias("Test")
)
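The same expression generalizes to any feature range (a, b): replace * 3 + 1 with * (b - a) + a. A minimal sketch (the name min_max_scaler_range is just for illustration):

def min_max_scaler_range(col: str, predicate: pl.Expr, a: float, b: float):
    # scale the values selected by `predicate` into the range (a, b),
    # mirroring sklearn's MinMaxScaler(feature_range=(a, b))
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    return (x - x_min) / (x_max - x_min) * (b - a) + a

With a=1 and b=4 this reduces to the expression above.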
I am new to constraint programming and OR-Tools. A brief description of the problem: there are 8 positions, and for each position I need to decide which move of type A (move_A) and which move of type B (move_B) should be selected such that the value achieved from the combination of the two moves (at each position) is maximized. (This is only a part of a bigger problem, though.) I want to use the AddElement approach to do the subsetting.
Please see the attempt below:
from ortools.sat.python import cp_model
model = cp_model.CpModel()
# value achieved from combination of different moves of type A
# (moves_A (rows)) and different moves of type B (moves_B (columns))
# for e.g. 2nd move of type A and 3rd move of type B will give value = 2
value = [
[ -1, 5, 3, 2, 2],
[ 2, 4, 2, -1, 1],
[ 4, 4, 0, -1, 2],
[ 5, 1, -1, 2, 2],
[ 0, 0, 0, 0, 0],
[ 2, 1, 1, 2, 0]
]
# 6 moves of type A
num_moves_A = len(value)
# 5 moves of type B
num_moves_B = len(value[0])
num_positions = 8
type_move_A_position = [model.NewIntVar(0, num_moves_A - 1, f"move_A[{i}]") for i in range(num_positions)]
type_move_B_position = [model.NewIntVar(0, num_moves_B - 1, f"move_B[{i}]") for i in range(num_positions)]
value_position = [model.NewIntVar(0, 10, f"value_position[{i}]") for i in range(num_positions)]
# I am getting an error when I run the below
objective_terms = []
for i in range(num_positions):
    model.AddElement(type_move_B_position[i], value[type_move_A_position[i]], value_position[i])
    objective_terms.append(value_position[i])
The error is as follows:
Traceback (most recent call last):
File "<ipython-input-65-3696379ce410>", line 3, in <module>
model.AddElement(type_move_B_position[i], value[type_move_A_position[i]], value_position[i])
TypeError: list indices must be integers or slices, not IntVar
In MiniZinc the code below would have worked:
var int: obj = sum(i in 1..num_positions)(value[type_move_A_position[i], type_move_B_position[i]]);
I know that in OR-Tools we will have to create some intermediate variables to store results first, so the above MiniZinc approach will not work, but I am struggling to do so.
I can always create two matrices of binary variables, one of size num_moves_A * num_positions and the other of size num_moves_B * num_positions, add the relevant constraints and achieve the purpose.
But I want to learn how to do the same thing via AddElement constraint
Any help on how to re-write the AddElement snippet is highly appreciated. Thanks.
AddElement is 1D only.
The way it is translated from MiniZinc to CP-SAT is to create an intermediate variable p == index1 * num_index2_values + index2 (where num_index2_values is the number of columns) and use it in an element constraint with a flattened matrix.
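In code, that translation looks roughly like this (a minimal sketch; matrix, row, col, and target are hypothetical placeholders rather than variables from the question):

# matrix: a Python list of lists of constants
# row, col: IntVars indexing a row and a column of the matrix
# target: IntVar that should equal matrix[row][col]
num_cols = len(matrix[0])
flat = [v for row_values in matrix for v in row_values]
p = model.NewIntVar(0, len(flat) - 1, "p")
model.Add(p == row * num_cols + col)
model.AddElement(p, flat, target)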
Following Laurent's suggestion (using AddElement constraint):
from ortools.sat.python import cp_model
model = cp_model.CpModel()
# value achieved from combination of different moves of type A
# (moves_A (rows)) and different moves of type B (moves_B (columns))
# for e.g. 2 move of type A and 3 move of type B will give value = 2
value = [
[-1, 5, 3, 2, 2],
[2, 4, 2, -1, 1],
[4, 4, 0, -1, 2],
[5, 1, -1, 2, 2],
[0, 0, 0, 0, 0],
[2, 1, 1, 2, 0],
]
min_value = min([min(i) for i in value])
max_value = max([max(i) for i in value])
# 6 moves of type A
num_moves_A = len(value)
# 5 moves of type B
num_moves_B = len(value[0])
# number of positions
# (reduced from 8 to 5: with the AddAllDifferent constraints below and only
# 5 moves of type B, more than 5 positions would make the model infeasible)
num_positions = 5
# flattened matrix of values
value_flat = [value[i][j] for i in range(num_moves_A) for j in range(num_moves_B)]
# flattened indices
flatten_indices = [
    index1 * len(value[0]) + index2
    for index1 in range(len(value))
    for index2 in range(len(value[0]))
]
type_move_A_position = [
    model.NewIntVar(0, num_moves_A - 1, f"move_A[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_A_position)
type_move_B_position = [
    model.NewIntVar(0, num_moves_B - 1, f"move_B[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_B_position)
# below intermediate decision variable is created which
# will store index corresponding to the selected move of type A and
# move of type B for each position
# this will act as index in the AddElement constraint
flatten_index_num = [
    model.NewIntVar(0, len(flatten_indices) - 1, f"flatten_index_num[{i}]")
    for i in range(num_positions)
]
# another intermediate decision variable is created which
# will store value corresponding to the selected move of type A and
# move of type B for each position
# this will act as the target in the AddElement constraint
value_position_index_num = [
    model.NewIntVar(min_value, max_value, f"value_position_index_num[{i}]")
    for i in range(num_positions)
]
objective_terms = []
for i in range(num_positions):
    model.Add(
        flatten_index_num[i]
        == (type_move_A_position[i] * len(value[0])) + type_move_B_position[i]
    )
    model.AddElement(flatten_index_num[i], value_flat, value_position_index_num[i])
    objective_terms.append(value_position_index_num[i])
model.Maximize(sum(objective_terms))
# Solve
solver = cp_model.CpSolver()
status = solver.Solve(model)
print(solver.ObjectiveValue())
for i in range(num_positions):
    print(
        str(i)
        + "--"
        + str(solver.Value(type_move_A_position[i]))
        + "--"
        + str(solver.Value(type_move_B_position[i]))
        + "--"
        + str(solver.Value(value_position_index_num[i]))
    )
The version below uses the AddAllowedAssignments constraint to achieve the same purpose (per Laurent's alternate approach):
from ortools.sat.python import cp_model
model = cp_model.CpModel()
# value achieved from combination of different moves of type A
# (moves_A (rows)) and different moves of type B (moves_B (columns))
# for e.g. 2 move of type A and 3 move of type B will give value = 2
value = [
[-1, 5, 3, 2, 2],
[2, 4, 2, -1, 1],
[4, 4, 0, -1, 2],
[5, 1, -1, 2, 2],
[0, 0, 0, 0, 0],
[2, 1, 1, 2, 0],
]
min_value = min([min(i) for i in value])
max_value = max([max(i) for i in value])
# 6 moves of type A
num_moves_A = len(value)
# 5 moves of type B
num_moves_B = len(value[0])
# number of positions
# (5 rather than 8, for the same AddAllDifferent feasibility reason as above)
num_positions = 5
type_move_A_position = [
    model.NewIntVar(0, num_moves_A - 1, f"move_A[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_A_position)
type_move_B_position = [
    model.NewIntVar(0, num_moves_B - 1, f"move_B[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_B_position)
value_position = [
    model.NewIntVar(min_value, max_value, f"value_position[{i}]")
    for i in range(num_positions)
]
tuples_list = []
for i in range(num_moves_A):
    for j in range(num_moves_B):
        tuples_list.append((i, j, value[i][j]))

for i in range(num_positions):
    model.AddAllowedAssignments(
        [type_move_A_position[i], type_move_B_position[i], value_position[i]],
        tuples_list,
    )
model.Maximize(sum(value_position))
# Solve
solver = cp_model.CpSolver()
status = solver.Solve(model)
print(solver.ObjectiveValue())
for i in range(num_positions):
    print(
        str(i)
        + "--"
        + str(solver.Value(type_move_A_position[i]))
        + "--"
        + str(solver.Value(type_move_B_position[i]))
        + "--"
        + str(solver.Value(value_position[i]))
    )
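A note on the trade-off between the two versions: the AddElement encoding needs the intermediate flattened-index variables plus a flattened copy of the matrix, while AddAllowedAssignments states the (move_A, move_B, value) relation directly as a table constraint, which stays closer to the original MiniZinc formulation.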
I am trying to run linear regressions by group and add the results to a new column in the dataframe.
Here is what I'm trying to do.
import pandas as pd
import statsmodels.api as sm

df2 = pd.DataFrame.from_dict({'case': ['foo', 'foo', 'foo', 'bar', 'bar'],
                              'cluster': [1, 1, 1, 1, 1],
                              'conf': [1, 2, 3, 1, 4],
                              'conf_1': [11, 12, 13, 11, 14]})

def ols_res(df, xcols, ycol):
    results = sm.OLS(df[ycol], sm.add_constant(df[xcols])).fit()
    return results.get_influence().cooks_distance[0]

df3 = df2.groupby(['case', 'cluster'])
df3.apply(ols_res, xcols='conf', ycol='conf_1')
The output I got is:
case cluster
bar 1 [nan, nan]
foo 1 [0.42857142857143005, 0.09642857142857146, 10....
dtype: object
The size of the result for each group is the same as the number of rows in the group.
I need the above output in the following format. Can someone please help me?
case cluster conf conf_1 result
0 foo 1 1 11 0.42857142857143005
1 foo 1 2 12 0.09642857142857146
2 foo 1 3 13 10....
4 bar 1 1 11 nan
5 bar 1 4 14 nan
The following worked for me:
def ols_res_mod(df, xcols, ycol):
    results = sm.OLS(df[ycol], sm.add_constant(df[xcols])).fit()
    df['distance'] = results.get_influence().cooks_distance[0]
    return df
I am not sure whether this is an efficient way or not.
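For reference, a sketch of how the modified function can be applied per group to produce the requested frame (using the df2 from the question; group_keys=False keeps the original row index):

result = (df2
    .groupby(['case', 'cluster'], group_keys=False)
    .apply(ols_res_mod, xcols='conf', ycol='conf_1'))
print(result)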
I'm currently using MicroPython and it does not have the .zfill method.
What I'm trying to get is the YYMMDDhhmmss form of the UTC time.
The time that it gives me for example is
t = (2019, 10, 11, 3, 40, 8, 686538, None)
I'm able to access the parts I need by using t[:6]. Now the problem is with the single-digit numbers, the 3 and the 8. I was able to get it to show 1910113408, but I need 191011034008, so I need the zeros in front of those two digits. I used
t = "".join(map(str,t))
t = t[2:]
So my idea was to iterate over t and check if the number is less than 10. If it is, I will add a zero in front of it, replacing the number. This is what I came up with:
t = (2019, 1, 1, 2, 40, 0)
t = list(t)
for i in t:
    if t[i] < 10:
        t[i] = 0+t[i]
    t[i] = t[i]
print(t)
However, this gives me IndexError: list index out of range
Please help, I'm pretty new to coding/python.
When you use
for i in t:
i is not the index; it is each item:
>>> for i in t:
... print(i)
...
2019
10
11
3
40
8
686538
None
If you want to use the index, do it like the following:
>>> for i, v in enumerate(t):
... print("{} is {}".format(i,v))
...
0 is 2019
1 is 10
2 is 11
3 is 3
4 is 40
5 is 8
6 is 686538
7 is None
Another way to create '191011034008':
>>> t = (2019, 10, 11, 3, 40, 8, 686538, None)
>>> "".join(map(lambda x: "%02d" % x, t[:6]))
'20191011034008'
>>> "".join(map(lambda x: "%02d" % x, t[:6]))[2:]
'191011034008'
Note that:
%02d adds a leading zero when the argument is lower than 10; otherwise (greater than or equal to 10) the value is used as-is, so the year is still a 4-digit string.
This lambda does not expect the argument to be None.
I tested this code at https://micropython.org/unicorn/
Edit:
str.format method version:
"".join(map(lambda x: "{:02d}".format(x), t[:6]))[2:]
or
"".join(map(lambda x: "{0:02d}".format(x), t[:6]))[2:]
The second example's 0 is a parameter index.
You can use a parameter index if you want to specify it explicitly (e.g., when there is a position mismatch between the format string and the parameters, or when you want to write the same parameter multiple times, and so on):
>>> print("arg 0: {0}, arg 2: {2}, arg 1: {1}, arg 0 again: {0}".format(1, 11, 111))
arg 0: 1, arg 2: 111, arg 1: 11, arg 0 again: 1
I'd recommend using Python's string formatting syntax.
>>> t = (2019, 10, 11, 3, 40, 8, 686538, None)
>>> r = ("%d%02d%02d%02d%02d%02d" % t[:-2])[2:]
>>> print(r)
191011034008
Let's see what's going on here:
%d means "display a number"
%2d means "display a number, at least 2 digits"
%02d means "display a number, at least 2 digits, pad with zeroes"
So we're feeding in all the relevant numbers, padding them as needed, and cutting the "20" off "2019".
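If you do want a zfill-style helper, a minimal substitute is easy to write (a sketch in plain Python; it should run on MicroPython too, but I have not tested it there):

def zfill(s, width):
    # pad string s with leading zeros until it is `width` characters long
    return "0" * (width - len(s)) + s

print(zfill(str(3), 2))  # '03'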
I am fairly new to Python. I am working on multi-label classification and need to prepare target data for multi-hot encoding. It has taken way more time than I initially thought. This is not real data (I can't post that). The data has ID and Category, which together make a row unique, so there are multiple rows for each ID; in reality there are 10 categories, but I am mocking 4. There are a bunch of other columns (predictors). Below is the best I got out of a couple of hours of trying many approaches:
import pandas as pd

test_data = pd.DataFrame()
test_data["Category"] = ['A', 'B', 'C', 'D', 'A', 'C']
test_data["ID"] = [1, 1, 3, 4, 5, 6]
test_data = test_data.pivot(index='ID', columns="Category",
                            values='Category').reset_index()
test_data = test_data.fillna('0')
test_data = test_data.reset_index(drop=True).rename_axis(None, axis=1)
data = test_data.drop(['ID'], axis=1)
print(data)
   A  B  C  D
0  A  B  0  0
1  0  0  C  0
2  0  0  0  D
3  A  0  0  0
4  0  0  C  0
As you can see I am filling the categories that are not present with dummy '0'.
data = data.astype(str).values
data
array(
[['A', 'B', '0', '0'],
['0', '0', 'C', '0'],
['0', '0', '0', 'D'],
['A', '0', '0', '0'],
['0', '0', 'C', '0']], dtype=object)
from sklearn.preprocessing import MultiLabelBinarizer

cat = ['A', 'B', 'C', 'D', '0']
mlb = MultiLabelBinarizer(classes=cat)
mlb.fit_transform(data)
array(
[[1, 1, 0, 0, 1],
[0, 0, 1, 0, 1],
[0, 0, 0, 1, 1],
[1, 0, 0, 0, 1],
[0, 0, 1, 0, 1]])
There are two things I am looking for help with:
How do I get rid of my dummy category encoding ('0')?
It appears I am hacking my way through it. Is there a better way of doing this?
Just for curious minds, I am using a fully connected neural network for this classification.
Thanks for your help.