Multi hot encoding

Multi hot encoding - encoding

I am new+ to Python. I am working on multi label classification and need to prepare target data for multi hot encoding. It has taken way more time than I had initially thought. This is not real data (I can't post it). The data has ID, Category that makes a row unique. So there are multiple rows for each ID for 10 categories in reality but I am mocking 4. There are bunch of other columns (predictors). Below is what I got best out of my couple of hours after trying many approaches:
test_data = pd.DataFrame()
test_data["Category"] = ['A','B','C','D','A','C']
test_data["ID"] = [1,1,3,4,5,6]
test_data =test_data.pivot(index='ID', columns="Category",
values='Category').reset_index()
test_data =test_data.fillna('0')
test_data = test_data.reset_index(drop=True).rename_axis(None, axis=1)
data = test_data.drop(['ID'], axis=1)
print(data)
ignore '=' below I don't know how to indent with space.
= A B C D
0 A B 0 0
1 0 0 C 0
2 0 0 0 D
3 A 0 0 0
4 0 0 C 0
As you can see I am filling the categories that are not present with dummy '0'.
data = data.astype(str).values
data
array(
[['A', 'B', '0', '0'],
['0', '0', 'C', '0'],
['0', '0', '0', 'D'],
['A', '0', '0', '0'],
['0', '0', 'C', '0']], dtype=object)
from sklearn.preprocessing import MultiLabelBinarizer
cat =['A','B','C','D','0']
mlb = MultiLabelBinarizer(cat)
mlb.fit_transform(data)
array(
[[1, 1, 0, 0, 1],
[0, 0, 1, 0, 1],
[0, 0, 0, 1, 1],
[1, 0, 0, 0, 1],
[0, 0, 1, 0, 1]])
There are two things I am looking for help:
How do I get rid of my dummy category encoding ('0')
It appears I am hacking my way through it. Is there a better way of doing it ?
Just for curious minds, I am using a fully connected neural network for this classification.
Thanks for your help.

Related

How to pick an element from matrix (list of list in python) based on decision variables (one for row, and one for column) | OR-Tools, Python

I am new to constraint programming and OR-Tools. A brief about the problem. There are 8 positions, for each position I need to decide which move of type A (move_A) and which move of type B (move_B) should be selected such that the value achieved from the combination of the 2 moves (at each position) is maximized. (This is only a part of the bigger problem though). And I want to use AddElement approach to do the sub setting.
Please see the below attempt
from ortools.sat.python import cp_model
model = cp_model.CpModel()
# value achieved from combination of different moves of type A
# (moves_A (rows)) and different moves of type B (moves_B (columns))
# for e.g. 2nd move of type A and 3rd move of type B will give value = 2
value = [
[ -1, 5, 3, 2, 2],
[ 2, 4, 2, -1, 1],
[ 4, 4, 0, -1, 2],
[ 5, 1, -1, 2, 2],
[ 0, 0, 0, 0, 0],
[ 2, 1, 1, 2, 0]
]
# 6 moves of type A
num_moves_A = len(value)
# 5 moves of type B
num_moves_B = len(value[0])
num_positions = 8
type_move_A_position = [model.NewIntVar(0, num_moves_A - 1, f"move_A[{i}]") for i in range(num_positions)]
type_move_B_position = [model.NewIntVar(0, num_moves_B - 1, f"move_B[{i}]") for i in range(num_positions)]
value_position = [model.NewIntVar(0, 10, f"value_position[{i}]") for i in range(num_positions)]
# I am getting an error when I run the below
objective_terms = []
for i in range(num_positions):
model.AddElement(type_move_B_position[i], value[type_move_A_position[i]], value_position[i])
objective_terms.append(value_position[i])
The error is as follows:
Traceback (most recent call last):
File "<ipython-input-65-3696379ce410>", line 3, in <module>
model.AddElement(type_move_B_position[i], value[type_move_A_position[i]], value_position[i])
TypeError: list indices must be integers or slices, not IntVar
In MiniZinc the below code would have worked
var int: obj = sum(i in 1..num_positions ) (value [type_move_A_position[i], type_move_B_position[i]])
I know in OR-Tools we will have to create some intermediary variables to store results first, so the above approach of minizinc will not work. But I am struggling to do so.
I can always create a 2 matrix of binary binary variables one for num_moves_A * num_positions and the other for num_moves_B * num_positions, add re;evant constraints and achieve the purpose
But I want to learn how to do the same thing via AddElement constraint
Any help on how to re-write the AddElement snippet is highly appreciated. Thanks.

AddElement is 1D only.
The way it is translated from minizinc to CP-SAT is to create an intermediate variable p == index1 * max(index2) + index2 and use it in an element constraint with a flattened matrix.

Following Laurent's suggestion (using AddElement constraint):
from ortools.sat.python import cp_model
model = cp_model.CpModel()
# value achieved from combination of different moves of type A
# (moves_A (rows)) and different moves of type B (moves_B (columns))
# for e.g. 2 move of type A and 3 move of type B will give value = 2
value = [
[-1, 5, 3, 2, 2],
[2, 4, 2, -1, 1],
[4, 4, 0, -1, 2],
[5, 1, -1, 2, 2],
[0, 0, 0, 0, 0],
[2, 1, 1, 2, 0],
]
min_value = min([min(i) for i in value])
max_value = max([max(i) for i in value])
# 6 moves of type A
num_moves_A = len(value)
# 5 moves of type B
num_moves_B = len(value[0])
# number of positions
num_positions = 5
# flattened matrix of values
value_flat = [value[i][j] for i in range(num_moves_A) for j in range(num_moves_B)]
# flattened indices
flatten_indices = [
index1 * len(value[0]) + index2
for index1 in range(len(value))
for index2 in range(len(value[0]))
]
type_move_A_position = [
model.NewIntVar(0, num_moves_A - 1, f"move_A[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_A_position)
type_move_B_position = [
model.NewIntVar(0, num_moves_B - 1, f"move_B[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_B_position)
# below intermediate decision variable is created which
# will store index corresponding to the selected move of type A and
# move of type B for each position
# this will act as index in the AddElement constraint
flatten_index_num = [
model.NewIntVar(0, len(flatten_indices), f"flatten_index_num[{i}]")
for i in range(num_positions)
]
# another intermediate decision variable is created which
# will store value corresponding to the selected move of type A and
# move of type B for each position
# this will act as the target in the AddElement constraint
value_position_index_num = [
model.NewIntVar(min_value, max_value, f"value_position_index_num[{i}]")
for i in range(num_positions)
]
objective_terms = []
for i in range(num_positions):
model.Add(
flatten_index_num[i]
== (type_move_A_position[i] * len(value[0])) + type_move_B_position[i]
)
model.AddElement(flatten_index_num[i], value_flat, value_position_index_num[i])
objective_terms.append(value_position_index_num[i])
model.Maximize(sum(objective_terms))
# Solve
solver = cp_model.CpSolver()
status = solver.Solve(model)
solver.ObjectiveValue()
for i in range(num_positions):
print(
str(i)
+ "--"
+ str(solver.Value(type_move_A_position[i]))
+ "--"
+ str(solver.Value(type_move_B_position[i]))
+ "--"
+ str(solver.Value(value_position_index_num[i]))
)

The below version uses AddAllowedAssignments constraint to achieve the same purpose (per Laurent's alternate approach) :
from ortools.sat.python import cp_model
model = cp_model.CpModel()
# value achieved from combination of different moves of type A
# (moves_A (rows)) and different moves of type B (moves_B (columns))
# for e.g. 2 move of type A and 3 move of type B will give value = 2
value = [
[-1, 5, 3, 2, 2],
[2, 4, 2, -1, 1],
[4, 4, 0, -1, 2],
[5, 1, -1, 2, 2],
[0, 0, 0, 0, 0],
[2, 1, 1, 2, 0],
]
min_value = min([min(i) for i in value])
max_value = max([max(i) for i in value])
# 6 moves of type A
num_moves_A = len(value)
# 5 moves of type B
num_moves_B = len(value[0])
# number of positions
num_positions = 5
type_move_A_position = [
model.NewIntVar(0, num_moves_A - 1, f"move_A[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_A_position)
type_move_B_position = [
model.NewIntVar(0, num_moves_B - 1, f"move_B[{i}]") for i in range(num_positions)
]
model.AddAllDifferent(type_move_B_position)
value_position = [
model.NewIntVar(min_value, max_value, f"value_position[{i}]")
for i in range(num_positions)
]
tuples_list = []
for i in range(num_moves_A):
for j in range(num_moves_B):
tuples_list.append((i, j, value[i][j]))
for i in range(num_positions):
model.AddAllowedAssignments(
[type_move_A_position[i], type_move_B_position[i], value_position[i]],
tuples_list,
)
model.Maximize(sum(value_position))
# Solve
solver = cp_model.CpSolver()
status = solver.Solve(model)
solver.ObjectiveValue()
for i in range(num_positions):
print(
str(i)
+ "--"
+ str(solver.Value(type_move_A_position[i]))
+ "--"
+ str(solver.Value(type_move_B_position[i]))
+ "--"
+ str(solver.Value(value_position[i]))
)

about or-tools OnlyEnforceIf,AddBoolAnd question

I have a Bool array, I want some ones to appear in certain places like this:
[1,1,1,1,0,0,0,0,0,0] or [0,0,0,0,0,0,1,1,1,1],here I want four ones appear in array's head,or tail,just two places,how can I do this?
model = cp_model.CpModel()
solver = cp_model.CpSolver()
shifts = {}
ones={}
sequence = []
for i in range(10):
shifts[(i)] = model.NewIntVar(0, 10, "shifts(%i)" % i)
ones[(i)] = model.NewBoolVar( '%i' % i)
for i in range(10):
model.Add(shifts[(i)] ==8).OnlyEnforceIf(ones[(i)])
model.Add(shifts[(i)] == 0).OnlyEnforceIf(ones[(i)].Not())
#I want the four 8s in the array to only appear in two positions at the head or tail of the array, and not in other positions.
model.AddBoolAnd([ones[(0)],ones[(1)],ones[(2)],ones[(3)]])# appear in head
#model.AddBoolAnd([ones[(6)],ones[(7)],ones[(8)],ones[(9)]]) #appear in tauk ，error！
model.Add(sum(ones[(i)] for i in range(10)) == 4)
status = solver.Solve(model)
print("status:",status)
res=[]
for i in range(10):
res.append(solver.Value(shifts[(i)]))
print(res)
bold
italic
quote

Try AddAllowedAsignments:
model = cp_model.CpModel()
ones = [model.NewBoolVar("") for _ in range(10)]
model.AddAllowedAssignments(
ones, [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]]
)
Edit: #Laurent's suggestion
model = cp_model.CpModel()
ones = [model.NewBoolVar("") for _ in range(10)]
head = model.NewBoolVar("head")
# HEAD
model.AddImplication(head, ones[0])
model.AddImplication(head, ones[1])
model.AddImplication(head, ones[2])
model.AddImplication(head, ones[3])
model.AddImplication(head, ones[4].Not())
model.AddImplication(head, ones[5].Not())
...
# TAIL
model.AddImplication(head.Not(), ones[0].Not())
model.AddImplication(head.Not(), ones[1].Not())
....
model.AddImplication(head.Not(), ones[6])
model.AddImplication(head.Not(), ones[7])
model.AddImplication(head.Not(), ones[8])
model.AddImplication(head.Not(), ones[9])

Google OR-Tools doesn't find solution on VRPtw problem

I'm tackling with VRPtw problem and struggling that the solver finds no solution with any data except for artificial small one.
The setting is as below.
There are several depots and locations to visit. Each locations have the time-window. Each vehicles have break time and work time. Also, the locations have some constraints and only the vehicles which satisfy that demand can visit there.
Based on this experiment setting, I wrote the code below.
As I wrote, it looks that it is working with small artificial data, but with real data, it never found the solution. I tried with 5 different data sets.
Although I set the 7200 sec time limit, previously I ran for longer than 10 hours and it was same.
The data's scale is 40~50 vehicles and 200~300 locations.
Does this code have a problem? If not, on what kind of order, should I change the approach(such as initialization, searching method and so on)?
(Edited to use integer for time matrix)
from dataclasses import dataclass
from typing import List, Tuple
from ortools.constraint_solver import pywrapcp
from ortools.constraint_solver import routing_enums_pb2
# TODO: Refactor
BIG_ENOUGH = 100000000
TIME_DIMENSION = 'Time'
TIME_LIMIT = 7200
#dataclass
class DataSet:
time_matrix: List[List[int]]
locations_num: int
vehicles_num: int
vehicles_break_time_window: List[Tuple[int, int, int]]
vehicles_work_time_windows: List[Tuple[int, int]]
location_time_windows: List[Tuple[int, int]]
vehicles_depots_indices: List[int]
possible_vehicles: List[List[int]]
def execute(data: DataSet):
manager = pywrapcp.RoutingIndexManager(data.locations_num,
data.vehicles_num,
data.vehicles_depots_indices,
data.vehicles_depots_indices)
routing_parameters = pywrapcp.DefaultRoutingModelParameters()
routing_parameters.solver_parameters.trace_propagation = True
routing_parameters.solver_parameters.trace_search = True
routing = pywrapcp.RoutingModel(manager, routing_parameters)
def time_callback(source_index, dest_index):
from_node = manager.IndexToNode(source_index)
to_node = manager.IndexToNode(dest_index)
return data.time_matrix[from_node][to_node]
transit_callback_index = routing.RegisterTransitCallback(time_callback)
routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
routing.AddDimension(
transit_callback_index,
BIG_ENOUGH,
BIG_ENOUGH,
False,
TIME_DIMENSION)
time_dimension = routing.GetDimensionOrDie(TIME_DIMENSION)
# set time window for locations start time
# set condition restrictions
possible_vehicles = data.possible_vehicles
for location_idx, time_window in enumerate(data.location_time_windows):
index = manager.NodeToIndex(location_idx + data.vehicles_num)
time_dimension.CumulVar(index).SetRange(time_window[0], time_window[1])
routing.SetAllowedVehiclesForIndex(possible_vehicles[location_idx], index)
solver = routing.solver()
for i in range(data.vehicles_num):
routing.AddVariableMinimizedByFinalizer(
time_dimension.CumulVar(routing.Start(i)))
routing.AddVariableMinimizedByFinalizer(
time_dimension.CumulVar(routing.End(i)))
# set work time window for vehicles
for vehicle_index, work_time_window in enumerate(data.vehicles_work_time_windows):
start_index = routing.Start(vehicle_index)
time_dimension.CumulVar(start_index).SetRange(work_time_window[0],
work_time_window[0])
end_index = routing.End(vehicle_index)
time_dimension.CumulVar(end_index).SetRange(work_time_window[1],
work_time_window[1])
# set break time for vehicles
node_visit_transit = {}
for n in range(routing.Size()):
if n >= data.locations_num:
node_visit_transit[n] = 0
else:
node_visit_transit[n] = 1
break_intervals = {}
for v in range(data.vehicles_num):
vehicle_break = data.vehicles_break_time_window[v]
break_intervals[v] = [
solver.FixedDurationIntervalVar(vehicle_break[0],
vehicle_break[1],
vehicle_break[2],
True,
'Break for vehicle {}'.format(v))
]
time_dimension.SetBreakIntervalsOfVehicle(
break_intervals[v], v, node_visit_transit
)
search_parameters = pywrapcp.DefaultRoutingSearchParameters()
search_parameters.first_solution_strategy = (
routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
search_parameters.local_search_metaheuristic = (
routing_enums_pb2.LocalSearchMetaheuristic.GREEDY_DESCENT)
search_parameters.time_limit.seconds = TIME_LIMIT
search_parameters.log_search = True
solution = routing.SolveWithParameters(search_parameters)
return solution
if __name__ == '__main__':
data = DataSet(
time_matrix=[[0, 0, 4, 5, 5, 6],
[0, 0, 6, 4, 5, 5],
[1, 3, 0, 6, 5, 4],
[2, 1, 6, 0, 5, 4],
[2, 2, 5, 5, 0, 6],
[3, 2, 4, 4, 6, 0]],
locations_num=6,
vehicles_num=2,
vehicles_depots_indices=[0, 1],
vehicles_work_time_windows=[(720, 1080), (720, 1080)],
vehicles_break_time_window=[(720, 720, 15), (720, 720, 15)],
location_time_windows=[(735, 750), (915, 930), (915, 930), (975, 990)],
possible_vehicles=[[0], [1], [0], [1]]
)
solution = execute(data)
if solution is not None:
print("solution is found")

PostGIS conditional aggregration - presence/absence matrix

I have a dataset that resembles the following:
site_id, species
1, spp1
2, spp1
2, spp2
2, spp3
3, spp2
3, spp3
4, spp1
4, spp2
I want to create a table like this:
site_id, spp1, spp2, spp3, spp4
1, 1, 0, 0, 0
2, 1, 1, 1, 0
3, 0, 1, 1, 0
4, 1, 1, 0, 0
This question was asked here, however the issue I face is that my list of species is significantly greater and so creating a massive query listing each species manually would take a significant amount of time. I would therefore like a solution that does not require this and could instead read from the existing species list.
In addition, when playing with that query, the count() function would keep adding so I would end up with values greater than 1 where multiples of the same species were present in a site_id. Ideally I want a binary 1 or 0 output.

Perl & Image::Magick, getting color values by pixel

I'm using Perl and the Image::Magick module to process some JPEGs.
I'm using the GetPixels sub to get the RGB components of each pixel.
e.g.
my #pixels = $img->GetPixels(
width => 1,
height => 1,
x => 0,
y => 0,
map => 'RGB',
#normalize => 1
)
print Dumper \#pixels;
$img->Resize(
width => 1,
height => 1,
filter => 'Lanczos'
);
#pixels = $img->GetPixels(
width => 1,
height => 1,
x => 0,
y => 0,
map => 'RGB',
#normalize => 1
);
print Dumper \#pixels;
$img->Write('verify.jpg');
I've found that getPixels is returning two bytes per channel, e.g.
$VAR1 = [
46260,
45232,
44975
];
$VAR1 = [
58271,
58949,
60330
];
Before the call to Resize: (in this example) the colour of the designated pixel is #b4b0af, and the returned values are 0xB4B4, 0xB0B0, 0xAFAF. I don't understand why this is, but I can deal with it using MOD 256;
But after the call to Resize, the returned values don't correspond in any obvious way to the actual values I find in the output file (verify.jpg).
Is Image::Magick just being super-precise (accounting for the shorts instead of bytes)?
And does the JPEG compression account for the discrepancy between the second Dumper output and the contents of 'verify.jpg'?

Read all about colors in ImageMagick, including its quantum depth:
ImageMagick may be compiled to support 32 or 64 bit pixels of type PixelPacket. This is controlled by the value of the QuantumDepth define. The default is 64 bit pixels, which provides the best accuracy.
You might also like to read about how it does color reduction.

JPEG compression is lossy, so there's no direct correspondence between the pixel values before saving and the pixels in the compressed image. You'd have to load the new image if you want to find out how the compression modified it.

Took me some time to get the same problem solved for binary (black & white) tiffs:
for my $y (0..$height-1) {
my #pixels = $image->GetPixels(
'width' => 1,
'height' => 1,
'x' => 0,
'y' => $y,
map => 'RGB',
normalize => 'True',
);
print "$y: ",join(', ', #pixels),"\n";
}
Prints (shortened):
627: 1, 1, 1
628: 1, 1, 1
629: 0, 0, 0
630: 0, 0, 0
631: 0, 0, 0
632: 0, 0, 0
633: 1, 1, 1
634: 1, 1, 1
Found no better way.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Multi hot encoding - encoding

Related

How to pick an element from matrix (list of list in python) based on decision variables (one for row, and one for column) | OR-Tools, Python

about or-tools OnlyEnforceIf,AddBoolAnd question

Google OR-Tools doesn't find solution on VRPtw problem

PostGIS conditional aggregration - presence/absence matrix

Perl & Image::Magick, getting color values by pixel

Categories

Resources