Python pandas: how to unpack the statsmodels results and create a column in a group-by dataframe

I am trying to run linear regressions by group and add the results to a new column in the dataframe.
Here is what I'm trying to do.
import pandas as pd
import statsmodels.api as sm

df2 = pd.DataFrame.from_dict({'case': ['foo', 'foo', 'foo', 'bar', 'bar'],
                              'cluster': [1, 1, 1, 1, 1],
                              'conf': [1, 2, 3, 1, 4],
                              'conf_1': [11, 12, 13, 11, 14]})

def ols_res(df, xcols, ycol):
    results = sm.OLS(df[ycol], sm.add_constant(df[xcols])).fit()
    return results.get_influence().cooks_distance[0]

df3 = df2.groupby(['case', 'cluster'])
df3.apply(ols_res, xcols='conf', ycol='conf_1')
The output I got is:
case cluster
bar 1 [nan, nan]
foo 1 [0.42857142857143005, 0.09642857142857146, 10....
dtype: object
The size of the result for each group is the same as the number of rows in the group.
I need the above output in the following format. Can someone please help me?
  case cluster conf conf_1 result
0  foo       1    1     11 0.42857142857143005
1  foo       1    2     12 0.09642857142857146
2  foo       1    3     13 10....
4  bar       1    1     11 nan
5  bar       1    4     14 nan

The following worked for me.
def ols_res_mod(df, xcols, ycol):
    results = sm.OLS(df[ycol], sm.add_constant(df[xcols])).fit()
    df['distance'] = results.get_influence().cooks_distance[0]
    return df
I'm not sure whether this is an efficient way or not.
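A tidier variant is to return the distances as a pandas Series aligned to the group's index, so the per-group results can be assigned straight back to a new column. A sketch (ols_res_series is a made-up name, and the exact behavior of groupby().apply with Series returns can vary across pandas versions):

import pandas as pd
import statsmodels.api as sm

def ols_res_series(df, xcols, ycol):
    # fit OLS within the group and return Cook's distance,
    # indexed by the group's original row labels
    results = sm.OLS(df[ycol], sm.add_constant(df[xcols])).fit()
    return pd.Series(results.get_influence().cooks_distance[0], index=df.index)

df2['result'] = df2.groupby(['case', 'cluster'], group_keys=False).apply(
    ols_res_series, xcols='conf', ycol='conf_1')

Because each returned Series carries the original row labels, pandas can stitch the group results back into one column without copying the whole group dataframe the way ols_res_mod does.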


Polars DataFrame: Apply MinMaxScaler to a column with a condition

I am trying to perform the following operation in Polars.
Values in column B below 80 should be scaled between 1 and 4, whereas anything else (80 or above) should be set to 5.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_pandas = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "B": [50, 300, 80, 12, 105, 78, 66, 42, 61.5, 35],
    }
)
test_scaler = MinMaxScaler(feature_range=(1, 4))
df_pandas.loc[df_pandas['B'] < 80, 'Test'] = test_scaler.fit_transform(
    df_pandas.loc[df_pandas['B'] < 80, "B"].values.reshape(-1, 1))
df_pandas = df_pandas.fillna(5)
This is what I did with Polars:
import numpy as np
import polars as pl

df = pl.from_pandas(df_pandas)

# dt is a dictionary
dt = df.filter(
    pl.col('B') < 80
).to_dict(as_series=False)
below_80 = list(dt.keys())
dt_scale = list(
    test_scaler.fit_transform(
        np.array(dt['B']).reshape(-1, 1)
    ).reshape(-1)  # reshape back to one dimension
)
# reassign to dictionary dt
dt['B'] = dt_scale
dt_scale_df = pl.DataFrame(dt)
dummy = df.join(
    dt_scale_df, how="left", on="A"
).fill_null(5)
dummy = dummy.rename({"B_right": "Test"})
Result:
 A     B      Test
 1     50.0   2.727273
 2     300.0  5.000000
 3     80.0   5.000000
 4     12.0   1.000000
 5     105.0  5.000000
 6     78.0   4.000000
 7     66.0   3.454545
 8     42.0   2.363636
 9     61.5   3.250000
10     35.0   2.045455
Is there a better approach for this?
Alright, I have three examples for you that should help, of which the last should be preferred.
Because you only want to apply your scaler to a part of a column, we should ensure we only send that part of the data to the scaler. This can be done by:
window function over a partition
partition_by
when -> then -> otherwise + min_max expression
Window function over partition
This requires a Python function that will be applied over the partitions. In the function itself we then have to check which partition we are in and deal with it accordingly.
df = pl.from_pandas(df_pandas)
min_max_sc = MinMaxScaler((1, 4))

def my_scaler(s: pl.Series) -> pl.Series:
    if s.len() > 0 and s[0] >= 80:
        out = (s * 0 + 5)
    else:
        out = pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    # ensure all types are the same
    return out.cast(pl.Float64)

df.with_column(
    pl.col("B").apply(my_scaler).over(pl.col("B") < 80).alias("Test")
)
partition_by
This partitions the original dataframe into a dictionary holding the different partitions. We then only modify the partitions as needed.
parts = (df
    .with_column((pl.col("B") < 80).alias("part"))
    .partition_by("part", as_dict=True)
)

parts[True] = parts[True].with_column(
    pl.col("B").map(
        lambda s: pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    ).alias("Test")
)

parts[False] = parts[False].with_column(
    pl.lit(5.0).alias("Test")
)

pl.concat([df for df in parts.values()]).select(pl.all().exclude("part"))
when -> then -> otherwise + min_max expression
This one I like best. We can write a function that creates a Polars expression implementing the min-max scaling you need. This will have the best performance.
def min_max_scaler(col: str, predicate: pl.Expr):
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    # * 3 + 1 rescales the [0, 1] result to the 1 - 4 range
    return (x - x_min) / (x_max - x_min) * 3 + 1
predicate = pl.col("B") < 80
df.with_column(
    pl.when(predicate)
    .then(min_max_scaler("B", predicate))
    .otherwise(5)
    .alias("Test")
)
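As a side note, hedged on your Polars version: newer Polars releases renamed with_column to with_columns (and the expression-level apply/map to map_elements/map_batches), so on a current version the preferred approach would read:

df.with_columns(
    pl.when(predicate)
    .then(min_max_scaler("B", predicate))
    .otherwise(5.0)
    .alias("Test")
)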

MiniZinc: var array with variable size

I want to solve this problem: I have a number n and I want an array with all pairs
(i,j) for all i,j in [1,n]
I wrote this to solve the problem:
include "globals.mzn";
int:n=2;
var 1..:ArrayLenght;
array[1..ArrayLenght,1..2] of var 1..n:X;
constraint forall(i,j in 1..n)(exists(r in 1..ArrayLenght) (X[r,..] == [i,j]));
solve minimize ArrayLenght;
but I get a type error: type error: type-inst must be par set but is 'var set of int' on the line array[1..ArrayLenght,1..2] of var 1..size:X
So how can I get an array with a variable size? (I don't see anything about this in the official documentation.)
NB: for this specific example it would be better to set ArrayLenght to n*n, but this is a minimal example; I have to add constraints that mean the size of the array cannot be fixed.
MiniZinc does not support variable-length arrays. The dimensions of all arrays must be known at compile time. One common approach to handle this is to create a multi-dimensional array of dimension, say, n x m (where n and m are the largest possible values in each dimension) and then use 0 (or some other value) as a dummy value for the "invalid" cells.
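For your example, a minimal sketch of that idea (assuming n*n is a safe upper bound on the length; the names max_len and Len are made up here):

int: n = 2;
int: max_len = n * n;                     % largest length the array can need
var 1..max_len: Len;                      % the "effective" length
array[1..max_len, 1..2] of var 0..n: X;   % 0 marks an unused cell
% rows beyond Len hold only the dummy value 0; real constraints go on rows 1..Len
constraint forall(r in 1..max_len, c in 1..2)((r > Len) <-> (X[r, c] == 0));
solve minimize Len;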
However, it seems that what you want here is just to create an array of pairs of numbers. It's quite easy to create this without using any decision variables (i.e. without var ...).
Here are two different approaches to generate the pairs:
pairs1 contains the pairs with i < j
pairs2 contains all possible pairs
The loop variable k in the list comprehensions is used as a counter to select either the i value or the j value in the appropriate places.
int: n = 5;
int: num_pairs1 = n*(n-1) div 2;
int: num_pairs2 = n*n;
array[1..num_pairs1,1..2] of int: pairs1 = array2d(1..num_pairs1,1..2, [ if k == 1 then i else j endif | i,j in 1..n, k in 1..2 where i < j]);
array[1..num_pairs2,1..2] of int: pairs2 = array2d(1..num_pairs2,1..2, [ if k == 1 then i else j endif | i,j in 1..n, k in 1..2]);
output ["pairs1:\n", show2d(pairs1)];
output ["\n\npairs2:\n", show2d(pairs2)];
The output is
pairs1:
[| 1, 2
| 1, 3
| 1, 4
| 1, 5
| 2, 3
| 2, 4
| 2, 5
| 3, 4
| 3, 5
| 4, 5
|]
pairs2:
[| 1, 1
| 1, 2
| 1, 3
| 1, 4
| 1, 5
| 2, 1
| 2, 2
| 2, 3
| 2, 4
| 2, 5
| 3, 1
| 3, 2
| 3, 3
| 3, 4
| 3, 5
| 4, 1
| 4, 2
| 4, 3
| 4, 4
| 4, 5
| 5, 1
| 5, 2
| 5, 3
| 5, 4
| 5, 5
|]
----------
==========
Hope this helps. If not, please describe in more detail what you are looking for.

Python: add zeroes to single-digit numbers without using .zfill

I'm currently using MicroPython and it does not have the .zfill method.
What I'm trying to get is the YYMMDDhhmmss of the UTC time.
The time that it gives me, for example, is
t = (2019, 10, 11, 3, 40, 8, 686538, None)
I'm able to access the parts that I need by using t[:6]. Now the problem is with the single-digit numbers, the 3 and the 8. I was able to get it to show 1910113408, but I need to get 191011034008, so I would need to get the zeroes in front of those two. I used
t = "".join(map(str,t))
t = t[2:]
So my idea was to iterate over t and check if the number is less than 10. If it is, I will add a zero in front of it, replacing the number. And this is what I came up with.
t = (2019, 1, 1, 2, 40, 0)
t = list(t)
for i in t:
    if t[i] < 10:
        t[i] = 0+t[i]
    t[i] = t[i]
print(t)
However, this gives me IndexError: list index out of range
Please help, I'm pretty new to coding/python.
When you use
for i in t:
i is not the index; it is each item.
>>> for i in t:
... print(i)
...
2019
10
11
3
40
8
686538
None
If you want to use the index, do it like the following:
>>> for i, v in enumerate(t):
... print("{} is {}".format(i,v))
...
0 is 2019
1 is 10
2 is 11
3 is 3
4 is 40
5 is 8
6 is 686538
7 is None
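Applying that to your original idea, a corrected version of your loop could look like this sketch; it replaces each number with a zero-padded string rather than a number, because an int cannot carry a leading zero:

t = (2019, 1, 1, 2, 40, 0)
t = list(t)
for i, v in enumerate(t):
    # replace each entry with its zero-padded string form
    t[i] = "%02d" % v
print("".join(t)[2:])  # prints 190101024000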
Another way to create '191011034008':
>>> t = (2019, 10, 11, 3, 40, 8, 686538, None)
>>> "".join(map(lambda x: "%02d" % x, t[:6]))
'20191011034008'
>>> "".join(map(lambda x: "%02d" % x, t[:6]))[2:]
'191011034008'
Note that:
%02d adds a leading zero when the argument is lower than 10; otherwise (greater than or equal to 10) the value is used as-is, so the year is still a 4-digit string.
This lambda does not expect the argument to be None.
I tested this code at https://micropython.org/unicorn/
Edited:
str.format method version:
"".join(map(lambda x: "{:02d}".format(x), t[:6]))[2:]
or
"".join(map(lambda x: "{0:02d}".format(x), t[:6]))[2:]
The 0 in the second example is a parameter index.
You can use a parameter index when you want to specify it explicitly (e.g., when positions mismatch between the format string and the parameters, or when you want to write the same parameter multiple times, and so on).
>>> print("arg 0: {0}, arg 2: {2}, arg 1: {1}, arg 0 again: {0}".format(1, 11, 111))
arg 0: 1, arg 2: 111, arg 1: 11, arg 0 again: 1
I'd recommend using Python's string formatting syntax.
>>> t = (2019, 10, 11, 3, 40, 8, 686538, None)
>>> r = ("%d%02d%02d%02d%02d%02d" % t[:-2])[2:]
>>> print(r)
191011034008
Let's see what's going on here:
%d means "display a number"
%2d means "display a number, at least 2 digits"
%02d means "display a number, at least 2 digits, pad with zeroes"
so we're feeding in all the relevant numbers, padding them as needed, and cutting the "20" off of "2019".
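If your MicroPython build is recent enough to support f-strings (an assumption; support arrived in later MicroPython releases), the same padding can also be written as:

t = (2019, 10, 11, 3, 40, 8, 686538, None)
r = "".join(f"{x:02d}" for x in t[:6])[2:]
print(r)  # 191011034008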

How to compare 2 sets of different dates which contain 2 different sets of data?

I have 2 sets of dates; their first and last dates are the same, respectively, but the dates in between might not match each other. DateA and DateB each contain a different value on each of their dates, held in arrays A and B.
DateA= '2016-01-01'
'2016-01-02'
'2016-01-04'
'2016-01-05'
'2016-01-06'
'2016-01-07'
'2016-01-08'
'2016-01-09'
'2016-01-10'
'2016-01-12'
'2016-01-13'
'2016-01-14'
'2016-01-16'
'2016-01-17'
'2016-01-18'
'2016-01-19'
'2016-01-20'
DateB= '2016-01-01'
'2016-01-02'
'2016-01-03'
'2016-01-04'
'2016-01-05'
'2016-01-09'
'2016-01-10'
'2016-01-11'
'2016-01-12'
'2016-01-13'
'2016-01-15'
'2016-01-16'
'2016-01-17'
'2016-01-19'
'2016-01-20'
A = [5, 2, 3, 4, 6, 1, 7, 9, 3, 6, 1, 7, 9, 2, 1, 4, 6]
B = [4, 2, 7, 1, 8, 4, 9, 5, 3, 9, 3, 6, 7, 2, 9]
I have converted the dates into date numbers, i.e.
datenumberA= 736330
736331
736333
736334
736335
736336
736337
736338
736339
736341
736342
736343
736345
736346
736347
736348
736349
datenumberB= 736330
736331
736332
736333
736334
736338
736339
736340
736341
736342
736344
736345
736346
736348
736349
Now I want to compare the value of A on DateA(n) to the value of B on the DateB that is closest to, and before, DateA(n).
For example,
comparing the value of A on DateA '2016-01-12' to that of B on DateB '2016-01-11'.
Please help and thanks a lot.
It'll get you the desired output!
all_k = 0;
out(1) = 1; % not comparing the first index, as you mentioned
for n = 2:size(datenumberA,1)
    j = 0;
    while 1
        k = find(datenumberB+j == datenumberA(n)-1); % finding the index of the DateB closest to and before DateA(n)
        if size(k,1) == 1, break; end % if found, come out of the while loop
        j = j+1; % otherwise keep adding 1 to the values of datenumberB until found
    end
    if size(find(all_k==k),2) ~= 1 % to avoid any DateB that is already compared
        out(end+1) = A(n) > B(k); % comparing the value in A with the corresponding value in B
        all_k(end+1) = k; % storing which indices of DateB are already compared
    end
end
out' % output
Output:-
ans =
1
0
0
1
0
0
1
0
0
1
0
0
1
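If you ever need the same matching in Python, this "closest date strictly before" lookup is what pandas' merge_asof does; a sketch with your data (the column names are made up here):

import pandas as pd

dfa = pd.DataFrame({
    'date': pd.to_datetime([
        '2016-01-01', '2016-01-02', '2016-01-04', '2016-01-05', '2016-01-06',
        '2016-01-07', '2016-01-08', '2016-01-09', '2016-01-10', '2016-01-12',
        '2016-01-13', '2016-01-14', '2016-01-16', '2016-01-17', '2016-01-18',
        '2016-01-19', '2016-01-20']),
    'A': [5, 2, 3, 4, 6, 1, 7, 9, 3, 6, 1, 7, 9, 2, 1, 4, 6],
})
dfb = pd.DataFrame({
    'date': pd.to_datetime([
        '2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05',
        '2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12', '2016-01-13',
        '2016-01-15', '2016-01-16', '2016-01-17', '2016-01-19', '2016-01-20']),
    'B': [4, 2, 7, 1, 8, 4, 9, 5, 3, 9, 3, 6, 7, 2, 9],
})
# for each DateA row, take B from the closest DateB strictly before it
merged = pd.merge_asof(dfa, dfb, on='date',
                       direction='backward', allow_exact_matches=False)
merged['A_gt_B'] = merged['A'] > merged['B']
print(merged)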

I need to eliminate alternate rows of an array

I need to eliminate alternate rows of an array. For example, I have an array of 23847x1 and I need the odd rows, finally making it 11924x1. It is in a .mat file and I want the result in a .mat file as well.
Try yourMatrix(1:2:end). (Note that for your 23847x1 column vector the rows run along the first dimension, so the bound is end or size(yourMatrix, 1); size(yourMatrix, 2) would be 1 there. For a row vector, size(yourMatrix, 2) works, as in the example below.)
The 1:2:N selects all elements from 1 to N with step 2.
A more complete example:
> M=[1, 2, 3, 4, 5, 6, 7]
M =
1 2 3 4 5 6 7
> OddM = M(1:2:size(M, 2))
OddM =
1 3 5 7
To load / store data in data.mat, follow H.Muster's advice:
load('data.mat'); x = x(1:2:end,:); save('data.mat', 'x')
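For completeness, the same odd-row selection in Python; a sketch that assumes the file is called data.mat and the variable inside it is named x:

from scipy.io import loadmat, savemat

data = loadmat('data.mat')       # assumed file name
x = data['x']                    # assumed variable name, shape (23847, 1)
x_odd = x[::2, :]                # keep rows 1, 3, 5, ... (1-based): shape (11924, 1)
savemat('data_odd.mat', {'x': x_odd})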