How to encode a column of lists for CatBoost?

I have a dataset where some columns contain lists:
import pandas as pd

df = pd.DataFrame(
    {'var1': [1, 2, 3, 1, 2, 3],
     'var2': [1, 1, 1, 2, 2, 2],
     'var3': [["A", "B", "C"], ["A", "C"], None, ["A", "B"], ["C", "A"], ["D", "A"]]
    }
)
   var1  var2       var3
0     1     1  [A, B, C]
1     2     1     [A, C]
2     3     1       None
3     1     2     [A, B]
4     2     2     [C, A]
5     3     2     [D, A]
As the values within the lists of var3 can be shuffled and we can't assume any specific order, the only way I can think of to prepare the column for modelling is one-hot encoding. It can be done quite easily:
df["var3"] = df["var3"].apply(lambda x: [str(x)] if type(x) is not list else x)
mlb = MultiLabelBinarizer()
mlb.fit_transform(df["var3"])
resulting in:
array([[1, 1, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 0, 0, 0, 1],
       [1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 0, 1, 0]])
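For readability, the binarized array can also be attached back to the data frame with named columns via the encoder's classes_ attribute; a small sketch, reusing the df built above:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# df is the frame from above, with var3 already normalized to lists
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df["var3"]),
                       columns=mlb.classes_, index=df.index)
df_encoded = pd.concat([df.drop(columns="var3"), encoded], axis=1)
print(df_encoded)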
However, quoting catboost documentation:
Attention. Do not use one-hot encoding during preprocessing. This
affects both the training speed and the resulting quality.
Therefore, I'd like to ask: is there any other way I could encode this column for modelling with CatBoost?


How to invoke nested loops one loop at a time?

I want to compare each element against all others, like the following. The number of variables like a, b, c is dynamic; however, each variable's array size is uniform.
let a = [1, 2, 3]
let b = [3, 4, 5]
let c = [4, 5, 6]
for i in a {
    for j in b {
        for k in c {
            /// comparison
        }
    }
}
Instead of looping from start to finish all at once, what would be a way to make one comparison per call? For example:
compare(iteration: 0)
/// compares a[0], b[0], c[0]
compare(iteration: 1)
/// compares a[0], b[0], c[1]
/// all the way to
/// compares a[2], b[2], c[2]
Or it could even be like the following:
next()
/// compares a[0], b[0], c[0]
next()
/// compares a[0], b[0], c[1]
almost like an iterator stepping through each cycle dictated by my invocation.
Let the number of arrays be n. And let the number of elements in each array, which is guaranteed the same for all of them, be k.
Then create an array consisting of the integers 0 through k-1, repeated n times. For example, in your case, n is 3, and k is 3, so generate the array
[0, 1, 2, 0, 1, 2, 0, 1, 2]
Now obtain all combinations of n elements of that array. You can do this using the algorithm at https://github.com/apple/swift-algorithms/blob/main/Guides/Combinations.md. Unique the result (by, for example, coercing to a Set and then back to an Array). This will give you a result equivalent, in some order or other, to
[[0, 1, 2], [0, 1, 0], [0, 1, 1], [0, 2, 0], [0, 2, 1], [0, 2, 2], [0, 0, 1], [0, 0, 2], [0, 0, 0], [1, 2, 0], [1, 2, 1], [1, 2, 2], [1, 0, 1], [1, 0, 2], [1, 0, 0], [1, 1, 2], [1, 1, 0], [1, 1, 1], [2, 0, 1], [2, 0, 2], [2, 0, 0], [2, 1, 2], [2, 1, 0], [2, 1, 1], [2, 2, 0], [2, 2, 1], [2, 2, 2]]
You can readily see that those are all 27 possible combinations of the numbers 0, 1, and 2. But that is exactly what you were doing with your for loops! So now, use those subarrays as indexes into each of your original arrays respectively.
So for instance, using my result and your original example, the first subarray [0, 1, 2] yields [1, 4, 6] — the first value from the first array, the second value from the second array, and the third value from the third array. And so on.
In this way you will have generated all possible n-tuples by choosing one value from each of your original arrays, which is the desired result; and we are in no way bound to fixed values of n and k, which was what you wanted to achieve. You will then be able to "compare" the elements of each n-tuple, whatever that may mean to you (you did not say in your question what it means).
In the case of your original values, we will get these n-tuples (expressed as arrays):
[1, 4, 6]
[1, 4, 4]
[1, 4, 5]
[1, 5, 4]
[1, 5, 5]
[1, 5, 6]
[1, 3, 5]
[1, 3, 6]
[1, 3, 4]
[2, 5, 4]
[2, 5, 5]
[2, 5, 6]
[2, 3, 5]
[2, 3, 6]
[2, 3, 4]
[2, 4, 6]
[2, 4, 4]
[2, 4, 5]
[3, 3, 5]
[3, 3, 6]
[3, 3, 4]
[3, 4, 6]
[3, 4, 4]
[3, 4, 5]
[3, 5, 4]
[3, 5, 5]
[3, 5, 6]
Those are precisely the triples of values you are after.
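As a quick sanity check, separate from the Swift solution: those 27 triples are exactly what a Cartesian product of the three arrays yields. A minimal Python sketch, purely to verify the count and contents:

from itertools import product

a = [1, 2, 3]
b = [3, 4, 5]
c = [4, 5, 6]

triples = list(product(a, b, c))  # every way of picking one value per array
print(len(triples))               # 27
print((1, 4, 6) in triples)       # True: the first triple from the listing above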
Actual code:
import Algorithms  // swift-algorithms package (linked above) provides combinations(ofCount:)

// your original conditions
let a = [1, 2, 3]
let b = [3, 4, 5]
let c = [4, 5, 6]
let originals = [a, b, c]

// The actual solution starts here. Note that I never use any hard
// coded numbers.
let n = originals.count
let k = originals[0].count
var indices = [Int]()
for _ in 0..<n {
    for i in 0..<k {
        indices.append(i)
    }
}
let combos = Array(indices.combinations(ofCount: n))
var combosUniq = [[Int]]()
var combosSet = Set<[Int]>()
for combo in combos {
    let success = combosSet.insert(combo)
    if success.inserted {
        combosUniq.append(combo)
    }
}

// And here's how to generate your actual desired values.
for combo in combosUniq {
    var tuple = [Int]()
    for (outerIndex, innerIndex) in combo.enumerated() {
        tuple.append(originals[outerIndex][innerIndex])
    }
    print(tuple) // in real life, do something useful here
}

Efficient replacement of x < i values in sparse array

How would I replace values less than 4 with 0 in this array without triggering a SparseEfficiencyWarning and without reducing its sparsity?
from scipy import sparse

x = sparse.csr_matrix(
    [[0, 1, 2, 3, 4],
     [1, 2, 3, 4, 5],
     [0, 0, 0, 2, 5]])

x[x < 4] = 0
x.toarray()  # verifies that this works
Note also that the initial version of x has 11 stored elements, which rises to 15 stored elements after doing the masking.
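For reference, the stored-element counts mentioned above can be checked with the matrix's nnz attribute (which counts stored values, including explicit zeros); a small sketch reproducing those numbers:

from scipy import sparse

x = sparse.csr_matrix(
    [[0, 1, 2, 3, 4],
     [1, 2, 3, 4, 5],
     [0, 0, 0, 2, 5]])

print(x.nnz)   # 11 stored values initially

x[x < 4] = 0   # triggers the SparseEfficiencyWarning mentioned above

print(x.nnz)   # 15 stored values after the masking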
Manipulate the data array directly
from scipy import sparse

x = sparse.csr_matrix(
    [[0, 1, 2, 3, 4],
     [1, 2, 3, 4, 5],
     [0, 0, 0, 2, 5]])

x.data[x.data < 4] = 0

>>> x.toarray()
array([[0, 0, 0, 0, 4],
       [0, 0, 0, 4, 5],
       [0, 0, 0, 0, 5]])
>>> x.data
array([0, 0, 0, 4, 0, 0, 0, 4, 5, 0, 5])
Note that the sparsity structure is unchanged, and the explicit zero values remain stored unless you run x.eliminate_zeros():
x.eliminate_zeros()
>>> x.data
array([4, 4, 5, 5])
If for some reason you don't want to use a boolean mask & fancy indexing in numpy, you can loop over the array with numba:
import numba

@numba.jit(nopython=True)
def _set_array_less_than_to_zero(array, value):
    for i in range(len(array)):
        if array[i] < value:
            array[i] = 0
This should also be faster than the numpy indexing by a fairly substantial degree.
import numpy as np

array = np.arange(10)
_set_array_less_than_to_zero(array, 5)

>>> array
array([0, 0, 0, 0, 0, 5, 6, 7, 8, 9])
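Applied back to the sparse matrix from the question, the same routine can be run directly on the matrix's data buffer, followed by eliminate_zeros() to drop the newly created explicit zeros; a small self-contained sketch:

import numba
from scipy import sparse

@numba.jit(nopython=True)
def _set_array_less_than_to_zero(array, value):
    # zero out every stored entry below the threshold, in place
    for i in range(len(array)):
        if array[i] < value:
            array[i] = 0

x = sparse.csr_matrix(
    [[0, 1, 2, 3, 4],
     [1, 2, 3, 4, 5],
     [0, 0, 0, 2, 5]])

_set_array_less_than_to_zero(x.data, 4)  # only touches stored values; no structure change
x.eliminate_zeros()                      # optional: drop the explicit zeros afterwards
print(x.toarray())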

Floyd algorithm shortest path

I have written the code below; it works for the shortest distances, but not for the shortest paths.
import math

def floyd(dist_mat):
    n = len(dist_mat)
    p = [[0] * n] * n
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist_mat[i][j] > dist_mat[i][k] + dist_mat[k][j]:
                    dist_mat[i][j] = dist_mat[i][k] + dist_mat[k][j]
                    p[i][j] = k + 1
    return p

if __name__ == '__main__':
    print(floyd([[0, 5, 9999, 9999],
                 [50, 0, 15, 5],
                 [30, 9999, 0, 15],
                 [15, 9999, 5, 0]]))
The result of this code is: [[4, 1, 4, 2], [4, 1, 4, 2], [4, 1, 4, 2], [4, 1, 4, 2]]
The true result should be: [[0, 0, 4, 2], [4, 0, 4, 0], [0, 1, 0, 0], [0, 1, 0, 0]].
I would be happy to receive your ideas about why it goes wrong.
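One thing worth checking, offered as an observation rather than a full answer: p = [[0]*n]*n creates n references to the same inner row, so every assignment p[i][j] = k+1 writes into all rows at once, which would explain why all four rows of the output are identical. A minimal demonstration of the difference:

n = 4

p_aliased = [[0] * n] * n                    # n references to one shared row
p_aliased[0][1] = 99
print(p_aliased)                             # the change shows up in every "row"

p_independent = [[0] * n for _ in range(n)]  # n distinct rows
p_independent[0][1] = 99
print(p_independent)                         # only the first row changes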

MiniZinc Gecode not printing all solutions to CSP with "all solutions" enabled

The Issue
With solve minimize I only get one solution, even though there are multiple optimal solutions. I have enabled printout of multiple solutions in the solver configurations. The other optimal solutions are found with solve satisfy, along with non-optimal solutions.
Possible causes
Could it be that the cardinality function card() ranks by enum value when the sizes of two sets are equal? In other words, that card({A, B}) > card({B, C})? If so, do I have to switch the representation of my vertices?
The Program
I am creating a MiniZinc program for finding the minimum vertex cover of a given graph. The graph in this example is given by the edge incidence matrix in the data file below, and its minimal vertex cover solutions are:
[{A, B, C, E}, {A, B, E, F}, {A, C, D, E}, {B, C, D, E}, {B, C, D, F}, {B, D, E, F}]. My code only outputs {A, B, C, E}.
Data file:
VERTEX = {A, B, C, D, E, F};
edges = [|1, 0, 1, 0, 0, 0, 0, 0, 0
|1, 1, 0, 1, 1, 0, 0, 0, 0
|0, 1, 0, 0, 0, 1, 1, 0, 0
|0, 0, 1, 1, 0, 0, 0, 1, 0
|0, 0, 0, 0, 1, 1, 0, 1, 1
|0, 0, 0, 0, 0, 0, 1, 0, 1|];
Solver program:
% Vertices in graph
enum VERTEX;
% Edges between vertices
array[VERTEX, int] of int: edges;
int: num_edges = (length(edges) div card(VERTEX));
% Set of vertices to find
var set of VERTEX: span;
% Number of vertices connected to edge resulting from span
array[1..num_edges] of var 0..num_edges: conn;
% All edges must be connected with at least one vertex from span
constraint forall(i in 1..num_edges)
(conn[i] >= 1);
% The number of connections to each edge is the number of vertices
% in span with a connection to that edge
constraint forall(i in 1..num_edges)
(conn[i] = sum([edges[vert,i]| vert in span]));
% Minimize the number of vertices in span
solve minimize card(span);
solve minimize only shows one optimal solution (in some cases, intermediate solutions might also be printed along the way).
If you want all optimal solutions, you must use solve satisfy and add a constraint fixing the optimal value:
constraint card(span) = 4;
Then the model outputs all the 6 optimal solutions:
card(span): 4
span: {A, B, C, E}
conn: [2, 2, 1, 1, 2, 2, 1, 1, 1]
----------
card(span): 4
span: {B, C, D, F}
conn: [1, 2, 1, 2, 1, 1, 2, 1, 1]
----------
card(span): 4
span: {A, C, D, E}
conn: [1, 1, 2, 1, 1, 2, 1, 2, 1]
----------
card(span): 4
span: {B, C, D, E}
conn: [1, 2, 1, 2, 2, 2, 1, 2, 1]
----------
card(span): 4
span: {A, B, E, F}
conn: [2, 1, 1, 1, 2, 1, 1, 1, 2]
----------
card(span): 4
span: {B, D, E, F}
conn: [1, 1, 1, 2, 2, 1, 1, 2, 2]
----------
==========
Note: I added the output section to show all the values:
output [
  "card(span): \(card(span))\n",
  "span: \(span)\n",
  "conn: \(conn)"
];
An alternative solution is to use OptiMathSAT (v. 1.6.3).
When asking for all solutions in optimization mode, the solver returns all solutions (with respect to the output variables) with the same optimal value.
Example:
~$ mzn2fzn test.mzn test.dzn # your instance
~$ optimathsat -input=fzn -opt.fzn.all_solutions=True < test.fzn
% allsat model
span = {2, 4, 5, 6};
conn = array1d(1..9, [1, 1, 1, 2, 2, 1, 1, 2, 2]);
----------
% allsat model
span = {1, 3, 4, 5};
conn = array1d(1..9, [1, 1, 2, 1, 1, 2, 1, 2, 1]);
----------
% allsat model
span = {1, 2, 3, 5};
conn = array1d(1..9, [2, 2, 1, 1, 2, 2, 1, 1, 1]);
----------
% allsat model
span = {1, 2, 5, 6};
conn = array1d(1..9, [2, 1, 1, 1, 2, 1, 1, 1, 2]);
----------
% allsat model
span = {2, 3, 4, 5};
conn = array1d(1..9, [1, 2, 1, 2, 2, 2, 1, 2, 1]);
----------
% allsat model
span = {2, 3, 4, 6};
conn = array1d(1..9, [1, 2, 1, 2, 1, 1, 2, 1, 1]);
----------
=========
The main advantage with respect to the approach presented in the accepted answer is that OptiMathSAT is incremental: the tool searches for other solutions without being restarted, so it can re-use useful information generated earlier to speed up the search (e.g. theory lemmas). [CAVEAT: this may not be relevant for small instances; also, other MiniZinc solvers may still be faster depending on the input problem.]
Note: OptiMathSAT does not print the labels of each VERTEX, because the mzn2fzn compiler removes these labels when compiling the file. However, the mapping between numbers and labels should be obvious.
Disclosure: I am one of the developers of this tool.

Python quicksort only sorting first half

I'm taking Princeton's algorithms-divide-conquer course (3rd week) and trying to implement quicksort.
Here's my current implementation with some tests ready to run:
import unittest

def quicksort(x):
    if len(x) <= 1:
        return x
    pivot = x[0]
    xLeft, xRight = partition(x)
    print(xLeft, xRight)
    quicksort(xLeft)
    quicksort(xRight)
    return x

def partition(x):
    j = 0
    print('partition', x)
    for i in range(0, len(x)):
        if x[i] < x[0]:
            n = x[j + 1]
            x[j + 1] = x[i]
            x[i] = n
            j += 1
    p = x[0]
    x[0] = x[j]
    x[j] = p
    return x[:j + 1], x[j + 1:]

class Test(unittest.TestCase):
    def test_partition_pivot_first(self):
        arrays = [
            [3, 1, 2, 5],
            [3, 8, 2, 5, 1, 4, 7, 6],
            [10, 100, 3, 4, 2, 101]
        ]
        expected = [
            [[2, 1, 3], [5]],
            [[1, 2, 3], [5, 8, 4, 7, 6]],
            [[2, 3, 4, 10], [100, 101]]
        ]
        for i in range(0, len(arrays)):
            xLeft, xRight = partition(arrays[i])
            self.assertEqual(xLeft, expected[i][0])
            self.assertEqual(xRight, expected[i][1])

    def test_quicksort(self):
        arrays = [
            [1, 2, 3, 4, 5, 6],
            [3, 5, 6, 10, 2, 4]
        ]
        expected = [
            [1, 2, 3, 4, 5, 6],
            [1, 2, 3, 4, 6, 10]
        ]
        for i in range(0, len(arrays)):
            arr = arrays[i]
            quicksort(arr)
            self.assertEqual(arr, expected[i])

if __name__ == "__main__":
    unittest.main()
So for array = [3, 5, 6, 10, 2, 4] I get [2, 3, 6, 10, 5, 4] as a result... I can't figure out what's wrong with my code. It partitions just fine, but the results are off.
Can anyone chip in? :) Thank you!
It's actually such a minor problem that you'll laugh.
The problem resides in the quicksort function.
The correct one is:
def quicksort(x):
    if len(x) <= 1:
        return x
    xLeft, xRight = partition(x)
    print(xLeft, xRight)
    xLeft = quicksort(xLeft)    # keep the recursive results instead of discarding them
    xRight = quicksort(xRight)
    x = xLeft + xRight          # this one! reassemble the sorted halves
    return x
What happens is that Python creates new list objects for xLeft and xRight, so the recursion never sorts the original list in place; you have to capture the recursive results and concatenate them.
So this is one solution (which is not in place).
The other one is to pass the list together with a start index and an end index, and do the sort in place.
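A rough sketch of that in-place variant (my own illustration, not code from the course; quicksort_inplace is a name introduced here):

def quicksort_inplace(x, lo=0, hi=None):
    # Partitions x[lo..hi] around x[lo] in place, then recurses on both halves.
    if hi is None:
        hi = len(x) - 1
    if lo >= hi:
        return
    pivot = x[lo]
    j = lo
    for i in range(lo + 1, hi + 1):
        if x[i] < pivot:
            j += 1
            x[i], x[j] = x[j], x[i]
    x[lo], x[j] = x[j], x[lo]      # pivot lands at index j
    quicksort_inplace(x, lo, j - 1)
    quicksort_inplace(x, j + 1, hi)

arr = [3, 5, 6, 10, 2, 4]
quicksort_inplace(arr)
print(arr)   # [2, 3, 4, 5, 6, 10]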
Well done, fella!
Edit:
And actually, if you print xLeft and xRight, you'll see that the partitioning performed perfectly. :)
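For completeness, a quick check of the corrected function above (a sketch; since it builds and returns a new list, the caller must use the return value, and partition is the function from the question):

arr = [3, 5, 6, 10, 2, 4]
result = quicksort(arr)
print(result)   # [2, 3, 4, 5, 6, 10]
print(arr)      # the original list is only partially rearranged, not sorted in place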