Ordering of conflict sets in conflict-based backjumping - backtracking

If, during backtracking, there is a conflict at variable X_j, conflict-based backjumping jumps to the most recent node X_i in the conflict set of X_j, conf(X_j).
Moreover, X_i absorbs the conflict set of X_j, that is
conf(X_i) = conf(X_i) U conf(X_j) - {X_i}
What happens if there is an overlap between the two conflict sets? How does the absorption work in that case?
e.g.
conf(X_i) = {X_1, X_2, X_3}
conf(X_j) = {X_4, X_5, X_3, X_i}
What will be the ordering after the absorption?

According to Prosser, the order will be that of conf(X_i).union(conf(X_j)).
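As a minimal sketch of the absorption step (not Prosser's actual pseudocode; this assumes conflict sets are plain sets of variable indices, ordered by index when an ordering is needed, and takes X_i to be X_6 purely for the example):

def absorb(conf_i, conf_j, i):
    # conf(X_i) := conf(X_i) union conf(X_j) minus {X_i}
    # Overlapping elements (X_3 here) simply collapse in the union.
    return (conf_i | conf_j) - {i}

conf_i = {1, 2, 3}        # conf(X_i) = {X_1, X_2, X_3}
conf_j = {4, 5, 3, 6}     # conf(X_j) = {X_4, X_5, X_3, X_i}, with X_i = X_6 (hypothetical index)
print(sorted(absorb(conf_i, conf_j, 6)))   # [1, 2, 3, 4, 5], i.e. ordered by variable index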

Related

A few MiniZinc questions on constraints

A little bit of background: I'm trying to make a model for clustering a Design Structure Matrix (DSM). I made a draft model and have a couple of questions. Most of them are not directly related to DSM per se.
include "globals.mzn";
int: dsmSize = 7;
int: maxClusterSize = 7;
int: maxClusters = 4;
int: powcc = 2;
enum dsmElements = {A, B, C, D, E, F,G};
array[dsmElements, dsmElements] of int: dsm =
[|1,1,0,0,1,1,0
|0,1,0,1,0,0,1
|0,1,1,1,0,0,1
|0,1,1,1,1,0,1
|0,0,0,1,1,1,0
|1,0,0,0,1,1,0
|0,1,1,1,0,0,1|];
array[1..maxClusters] of var set of dsmElements: clusters;
array[1..maxClusters] of var int: clusterCard;
constraint forall(i in 1..maxClusters)(
clusterCard[i] = pow(card(clusters[i]), powcc)
);
% #1
% constraint forall(i, j in clusters where i != j)(card(i intersect j) == 0);
% #2
constraint forall(i, j in 1..maxClusters where i != j)(
card(clusters[i] intersect clusters[j]) == 0
);
% #3
% constraint all_different([i | i in clusters]);
constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;
var int: intraCost = sum(i in 1..maxClusters, j, k in clusters[i] where k != j)(
(dsm[j,k] + dsm[k,j]) * clusterCard[i]
) ;
var int: extraCost = sum(el in dsmElements,
c in clusters where card(c intersect {el}) = 0,
k,j in c)(
(dsm[j,k] + dsm[k,j]) * pow(card(dsmElements), powcc)
);
var int: TCC = trace("\(intraCost), \(extraCost)\n", intraCost+extraCost);
solve maximize TCC;
Question 1
I was under the impression that constraints #1 and #2 are the same. However, it seems they are not. The question here is: why? What is the difference?
Question 2
How can I replace constraint #2 with all_different? Does it make sense?
Question 3
Why does trace("\(intraCost), \(extraCost)\n", intraCost+extraCost); show nothing in the output? The output I see using Gecode is:
Running dsm.mzn
intraCost, extraCost
clusters = array1d(1..4, [{A, B, C, D, E, F, G}, {}, {}, {}]);
clusterCard = array1d(1..4, [49, 0, 0, 0]);
----------
<snipped to save space>
----------
clusters = array1d(1..4, [{B, C, D, G}, {A, E, F}, {}, {}]);
clusterCard = array1d(1..4, [16, 9, 0, 0]);
----------
==========
Finished in 5s 419msec
Question 4
With the expression constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements; I wanted to say that the union of all clusters should match the set of all nodes. Unfortunately, I did not find a way to make this big union more dynamic, so for now I just list all the clusters manually. Is there a way to make this expression return the union of all sets in the array of sets?
Question 5
If I understand it correctly (for example from here), the intra-cluster cost is the sum of all interactions within a cluster, multiplied by the size of the cluster raised to some power, the size being the cardinality of the set of nodes that represents the cluster.
The extra-cluster cost is the sum of interactions between an element that does not belong to a cluster and all elements of that cluster, multiplied by the cardinality of the whole space of nodes raised to some power.
The main question here is: are the intraCost and extraCost in the model correct (they seem to be, but still), and is there a better way to express these sums?
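To pin down what I mean by the two costs, here is a plain-Python restatement of the formulas as described above (a sketch only: dsm is a square list of lists and clusters a list of sets of element indices, hypothetical stand-ins for the MiniZinc data):

def dsm_costs(dsm, clusters, powcc=2):
    n = len(dsm)
    # Intra-cluster cost: interactions inside a cluster, weighted by |cluster|^powcc.
    intra = sum((dsm[j][k] + dsm[k][j]) * len(c) ** powcc
                for c in clusters for j in c for k in c if j != k)
    # Extra-cluster cost: interactions between an element outside a cluster and the
    # elements of that cluster, weighted by n^powcc.
    extra = sum((dsm[el][j] + dsm[j][el]) * n ** powcc
                for c in clusters for el in range(n) if el not in c for j in c)
    return intra, extra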
Thanks!
(Perhaps you would get more answers if you separated this into multiple questions.)
Question 3:
Here's an answer on the trace question:
When running the model, the trace actually shows this:
intraCost, extraCost
which is not what you expect, of course. trace is in effect when the model is created, but at that stage these two decision variables have no values yet, so MiniZinc shows only the variable names. They get values only after the (first) solution is reached, and can then be shown in the output section.
trace is mostly used to see what's happening in loops where one can trace the (fixed) loop variables etc.
If you trace an array of decision variables, they will be represented in a different fashion: an array x will be shown as X_INTRODUCED_0_ etc.
And you can also use trace for domain reflection, e.g. using lb and ub to get the lower/upper value of the domain of a variable ("safe approximation of the bounds" as the documentation states it: https://www.minizinc.org/doc-2.5.5/en/predicates.html?highlight=ub_array). Here's an example which shows the domain of the intraCost variable:
constraint
trace("intraCost: \(lb(intraCost))..\(ub(intraCost))\n")
;
which shows
intraCost: -infinity..infinity
You can read a little more about trace here https://www.minizinc.org/doc-2.5.5/en/efficient.html?highlight=trace .
Update: Answers to questions 1, 2 and 4.
Constraints #1 and #2 mean the same thing, i.e. that the elements in clusters should be disjoint. Constraint #1 is a little different in that it loops over decision variables, while constraint #2 uses plain indices. One can guess that #2 is faster, since #1 uses the where i != j over decision variables, which must be translated to some extra constraints. (And using i < j instead should be a little faster.)
The all_different constraint states about the same thing, and depending on the underlying solver it might be faster if it is translated to an efficient algorithm in the solver.
In the model there is also the following constraint which states that all elements must be used:
constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;
Apart from efficiency, all the constraints above can be replaced with one single constraint, partition_set, which ensures that all elements in dsmElements are used in the clusters (and that the clusters are disjoint).
constraint partition_set(clusters,dsmElements);
It might be faster to also combine it with the all_different constraint, but that has to be tested.

ortools: best practice for partial assignments

I would like to constrain my VRP with partial assignments. Given a list of stops assigned_stops and a vehicle id, I would like to find solutions such that all stops in the list are serviced by the given vehicle and in the given order.
For example, if I want to assign all stops in a list assigned_stops to the vehicle with index 5, I use the following code (Python):
vehicle_ix = 5
for stop1, stop2 in zip(assigned_stops, assigned_stops[1:]):  # consecutive pairs
ix1 = stops.index(stop1)
ix2 = stops.index(stop2)
cpsolver.Add(routing_model.VehicleVar(ix1) == vehicle_ix)
cpsolver.Add(routing_model.VehicleVar(ix2) == vehicle_ix)
cpsolver.Add(stop_sequence_dimension.CumulVar(ix1) < stop_sequence_dimension.CumulVar(ix2))
It works. But I am only enforcing pair-wise inequalities for successive ix1, ix2. Would it help the solver if I added all possible inequality constraints?
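For reference, the all-pairs variant I am asking about would look roughly like this (same assumptions as above: stops, assigned_stops, routing_model, stop_sequence_dimension and cpsolver are defined as in the snippet):

from itertools import combinations

vehicle_ix = 5
indices = [stops.index(stop) for stop in assigned_stops]

# Pin every assigned stop to the chosen vehicle.
for ix in indices:
    cpsolver.Add(routing_model.VehicleVar(ix) == vehicle_ix)

# Order every pair of assigned stops, not just consecutive ones.
for ix1, ix2 in combinations(indices, 2):
    cpsolver.Add(stop_sequence_dimension.CumulVar(ix1) < stop_sequence_dimension.CumulVar(ix2))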

Why does suppressing weights improve TensorFlow neural net performance?

I have a 2-layer non-convolutional network in TensorFlow, using tanh as the activation function. I understand that weights should be initialized with a truncated normal distribution divided by sqrt(nInputs), e.g.:
weightsLayer1 = tf.Variable(tf.div(tf.truncated_normal([nInputUnits, nUnitsHiddenLayer1]), math.sqrt(nInputUnits)))
Being a bit of a bumbling newbie in NNs and TensorFlow, I mistakenly implemented this as 2 lines, only to make it more readable:
weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nUnitsHiddenLayer1]))
weightsLayer1 = tf.div(weightsLayer1, math.sqrt(nInputUnits))
I now know that this is wrong and that the 2nd line causes the weights to be recomputed at each learning step. However, to my surprise, the "incorrect" implementation consistently yields better performance, both on the train and test/evaluation datasets. I thought that the incorrect 2-line implementation should be a train wreck, since it is recomputing (suppressing) weights to values other than those chosen by the optimizer, which I would expect to wreak havoc in the optimization process, but it actually improves it. Does anyone have any explanation for this? I am using the TensorFlow Adam optimizer.
Update 2016.6.22 - updated the 2nd code block above.
You are right that weightsLayer1 = tf.div(weightsLayer1, math.sqrt(nInputUnits)) is executed at each step. But that does NOT mean that the values in the weight variable are scaled down by sqrt(nInputUnits) in each step. This line is not an in-place operation that affects the values stored in the variable. It computes a new tensor, holding the values in the variable divided by sqrt(nInputUnits), and that tensor, I assume, then goes into the rest of your computation graph. This does not interfere with the optimizer. You are still defining a valid computation graph, just with a somewhat arbitrary scaling of the weights. The optimizer can still compute the gradients with respect to this variable (it will back-propagate through your division operation) and create the corresponding update operations.
In terms of the model that you are defining, the two versions are totally equivalent. For any set of values of weightsLayer1 in the original model (where you don't do the division), you can simply scale them up by sqrt(nInputUnits) and you will get the identical results with your second model. The two represent exactly the same model class, if you will.
Why does one work better than the other? Your guess is as good as mine. If you have done the same division for all your variables, you have effectively divided your learning rate by sqrt(nInputUnits). This smaller learning rate might have been beneficial to the problem at hand.
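To make the "back-propagate through your division operation" point concrete, here is a small NumPy check, independent of TensorFlow (a toy sketch: the constant c stands in for sqrt(nInputUnits), and the quadratic loss is arbitrary):

import numpy as np

c = 4.0                                   # stands in for math.sqrt(nInputUnits)
W = np.array([2.0, -1.0, 0.5])            # the "variable"
x = np.array([1.0, 3.0, -2.0])            # a dummy input

def loss(W):
    V = W / c                             # the extra division node in the graph
    return 0.5 * (V @ x) ** 2             # some scalar loss downstream of V

# Analytic gradient, back-propagating through the division: dL/dW = (dL/dV) / c
y = (W / c) @ x
grad = (y * x) / c

# Finite-difference check: the optimizer still sees a valid gradient, just scaled by 1/c.
eps = 1e-6
numeric = np.array([(loss(W + eps * e) - loss(W)) / eps for e in np.eye(3)])
print(np.allclose(grad, numeric, atol=1e-4))   # True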
Edit: I think the fact that you give the same name to the variable and the newly created tensor causes confusion. When you do
A = tf.Variable(1.0)
A = tf.mul(A, 2.0)
# Do something with A
then the second line creates a new tensor (as discussed above) and you re-bind the name (and it is only a name) A to that new tensor. For the graph being defined, the naming is absolutely irrelevant. The following code defines the same graph:
A = tf.Variable(1.0)
B = tf.mul(A, 2.0)
# Do something with B
Maybe this becomes clear if you execute the following code:
A = tf.Variable(1.0)
print A
B = A
A = tf.mul(A, 2.0)
print A
print B
The output is
<tensorflow.python.ops.variables.Variable object at 0x7ff025c02bd0>
Tensor("Mul:0", shape=(), dtype=float32)
<tensorflow.python.ops.variables.Variable object at 0x7ff025c02bd0>
The first time you print A it tells you that A is a variable object. After executing A = tf.mul(A, 2.0) and printing A again, you can see that the name A is now bound to a tf.Tensor object. However, the variable still exists, as can be seen by looking at the object behind the name B.
This is what the single line of code does:
t = tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] )
Creates a tensor with shape [ nInputUnits, nUnitsHiddenLayer1 ], filled with samples from a truncated normal distribution with standard deviation 1.0 (1.0 is the default stddev value).
t1 = tf.div( t, math.sqrt( nInputUnits ) )
divides all values in t by math.sqrt( nInputUnits ).
Your two lines of code do exactly the same thing. On the first line and the second line all values are divided by math.sqrt( nInputUnits ).
As for your statement:
I now know that this is wrong and that the 2nd line causes the weights to be recomputed at each learning step.
EDIT: my mistake
Indeed you are right: they are divided by math.sqrt( nInputUnits ) at every execution, but not reinitialized! The important point is where you put tf.Variable().
Here the division is performed only once, when the variable is initialized:
weightsLayer1 = tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] )
weightsLayer1 = tf.Variable( tf.div( weightsLayer1, math.sqrt( nInputUnits ) ) )
and here the second line is performed at every step:
weightsLayer1 = tf.Variable( tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] ) )
weightsLayer1 = tf.div( weightsLayer1, math.sqrt( nInputUnits ) )
Why does the second yield better results? It looks like some kind of normalization to me, but somebody more knowledgeable should verify that.
P.S. You can write it more readably like this:
weightsLayer1 = tf.Variable( tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ], stddev = 1. / math.sqrt( nInputUnits ) ) )

Is there a range data structure in Scala?

I'm looking for a way to handle ranges in Scala.
What I need to do is:
given a set of ranges and a range A, return the range B from the set such that the intersection of A and B is not empty;
given a set of ranges and a range A, remove/add range A from/to the set of ranges;
given a range A and a range B, create a range C = [min(A,B), max(A,B)].
I saw something similar in Java: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/RangeSet.html
However, subRangeSet returns only the intersected values and not the range in the set (or list of ranges) that it intersects with.
RangeSet rangeSet = TreeRangeSet.create();
rangeSet.add(Range.closed(0, 10));
rangeSet.add(Range.closed(30, 40));
Range range = Range.closed(12, 32);
System.out.println(rangeSet.subRangeSet(range)); //[30,32] (I need [30,40])
System.out.println(range.span(Range.closed(30, 40))); //[12,40]
There is an Interval[A] type in the spire math library. This allows working with ranges of arbitrary types that define an Order. Boundaries can be inclusive, exclusive or omitted. So e.g. (-∞, 0.0] or [0.0, 1.0) would be possible intervals of doubles.
Here is a library intervalset for working with sets of non-overlapping intervals (IntervalSeq or IntervalTrie) as well as maps of intervals to arbitrary values (IntervalMap).
Here is a related question that describes how to use IntervalSeq with DateTime.
Note that if the type you want to use is 64 bits or less (basically any primitive), IntervalTrie is extremely fast. See the benchmarks.
As Tzach Zohar has mentioned in the comment, if all you need is ranges of Int, go for scala.collection.immutable.Range:
val rangeSet = Set(0 to 10, 30 to 40)
val r = 12 to 32
rangeSet.filter(range => range.contains(r.start) || range.contains(r.end))
If you need it for another underlying type, implement it yourself; it's easy for your use case.
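As a rough sketch of how little code the do-it-yourself route needs (written in Python here just to show the logic; the function names and the tuple representation of a closed range are arbitrary choices, not an existing library):

def overlapping(ranges, a):
    # Ranges from the collection whose intersection with a is non-empty.
    lo, hi = a
    return [(x, y) for (x, y) in ranges if x <= hi and lo <= y]

def span(a, b):
    # [min(A, B), max(A, B)]
    return (min(a[0], b[0]), max(a[1], b[1]))

ranges = {(0, 10), (30, 40)}
print(overlapping(ranges, (12, 32)))   # [(30, 40)]
print(span((12, 32), (30, 40)))        # (12, 40)

Adding or removing a range is then just adding or removing a tuple from the set.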

Why are products called minterms and sums called maxterms?

Do they have a reason for doing so? I mean, in the sum of minterms, you look for the terms with output 1; I don't get why they call them "minterms." Why not maxterms, since 1 is bigger than 0?
Is there a reason behind this that I don't know? Or should I just accept it without asking why?
The convention for calling these terms "minterms" and "maxterms" does not correspond to 1 being greater than 0. I think the best way to answer is with an example:
Say that you have a circuit and it is described by X̄YZ̄ + XȲZ.
"This form is composed of two groups of three. Each group of three is a 'minterm'. What the expression minterm is intended to imply it that each of the groups of three in the expression takes on a value of 1 only for one of the eight possible combinations of X, Y and Z and their inverses." http://www.facstaff.bucknell.edu/mastascu/elessonshtml/Logic/Logic2.html
So what the "min" refers to is the fact that these terms are the "minimal" terms you need in order to build a certain function. If you would like more information, the example above is explained in more context in the link provided.
Edit: The "reason they used MIN for ANDs, and MAX for ORs" is that:
In a Sum of Products (what you call ANDs), only one of the minterms needs to be true for the expression to be true.
In a Product of Sums (what you call ORs), all the maxterms must be true for the expression to be true.
min(0,0) = 0
min(0,1) = 0
min(1,0) = 0
min(1,1) = 1
So minimum is pretty much like logical AND.
max(0,0) = 0
max(0,1) = 1
max(1,0) = 1
max(1,1) = 1
So maximum is pretty much like logical OR.
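A quick brute-force way to convince yourself of the correspondence (a throwaway Python check, nothing more):

for a in (0, 1):
    for b in (0, 1):
        assert min(a, b) == (a and b)   # min on {0, 1} behaves like logical AND
        assert max(a, b) == (a or b)    # max on {0, 1} behaves like logical OR
print("min matches AND, max matches OR on {0, 1}")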
In Sum Of Products (SOP), each term of the SOP expression is called a "minterm" because,
say, an SOP expression is given as:
F(X,Y,Z) = X'.Y'.Z + X.Y'.Z' + X.Y'.Z + X.Y.Z
for this SOP expression to be "1" or true (it being positive logic),
ANY of the terms of the expression needs to be 1,
thus the word "minterm".
i.e., any of the terms (X'Y'Z), (XY'Z'), (XY'Z) or (XYZ) being 1 results in F(X,Y,Z) being 1.
Thus they are called "minterms".
On the other hand,
in Product of Sums (POS), each term of the POS expression is called a "maxterm" because,
say, a POS expression is given as: F(X,Y,Z) = (X+Y+Z).(X+Y'+Z).(X+Y'+Z').(X'+Y'+Z)
for this POS expression to be "1" (POS is built from the rows where the function is 0, a negative-logic view), ALL of the terms of the expression must be 1; a single term being 0 already makes F equal to 0. Thus the word "maxterm".
i.e., for F(X,Y,Z) to be 1,
each of the terms (X+Y+Z), (X+Y'+Z), (X+Y'+Z') and (X'+Y'+Z) must equal "1", otherwise F will be zero.
Thus each of the terms in a POS expression is called a MAXTERM (the maximum number of terms, all of them, must be 1 for F to be 1), whereas in an SOP any single term being 1 already makes F equal to 1, which is why those are called MINTERMs (a minimum of one term is enough).
I believe that AB is called a minterm because it occupies the minimum area on a Venn diagram, while A+B is called a MAXTERM because it occupies the maximum area in a Venn diagram. Draw the two diagrams and the meanings will become obvious.
Ed Brumgnach
Here is another way to think about it.
A product is called a minterm because it has minimum satisfiability, whereas a sum is called a maxterm because it has maximum satisfiability, among all practically interesting boolean functions.
They are called terms because they are used as the building-blocks of various canonical representations of arbitrary boolean functions.
Details:
Note that '0' and '1' are the trivial boolean functions.
Assume a set of boolean variables x1,x2,...,xk and a non-trivial boolean function f(x1,x2,...,xk).
Conventionally, an input is said to satisfy the boolean function f, whenever f holds a value of 1 for that input.
Note that there are exactly 2^k possible inputs, and any non-trivial boolean function is satisfied by a minimum of 1 input and a maximum of 2^k - 1 inputs.
Now consider the two simple boolean functions of interest: the sum of all variables, S, and the product of all variables, P (variables may or may not appear as complements). S is the boolean function with maximum satisfiability, hence called a maxterm, whereas P is the one with minimum satisfiability, hence called a minterm.
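A quick brute-force check of those satisfiability counts for k = 3 (a throwaway Python sketch; the particular minterm and maxterm are just arbitrary examples):

from itertools import product

minterm = lambda x, y, z: x and (not y) and z     # the product X.Y'.Z
maxterm = lambda x, y, z: x or (not y) or z       # the sum X + Y' + Z

inputs = list(product([False, True], repeat=3))
print(sum(minterm(*i) for i in inputs))   # 1            -> minimum satisfiability
print(sum(maxterm(*i) for i in inputs))   # 7 = 2^3 - 1  -> maximum satisfiability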