Say the decision tree has $k$ classes $(c_1, c_2, \dots, c_k)$ to classify and the dataset at the parent node is $D$. $p_i$ denotes the proportion of elements of $D$ labelled with class $c_i$, and the Gini impurity is

$\text{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^2$
If one partitions the node into two subnodes with subsets $D_1$ and $D_2$ that are disjoint and whose union is $D$, how can one prove that

$\text{Gini}(D) \ge \frac{|D_1|}{|D|}\,\text{Gini}(D_1) + \frac{|D_2|}{|D|}\,\text{Gini}(D_2)$ ?
I understand that the information gain should not be negative, so this inequality should hold. Could anyone help?
There is a phase in a genetic algorithm where we cross over the parents' chromosomes to produce offspring.
This is easy to do with a binary encoding.
But what do we do if the chromosomes use value encoding?
Let's say one gene in my chromosome is a DOUBLE value, say 0.99, whose range is (0, 1) since it represents a probability.
How do I cross over this DOUBLE value?
Convert it to binary, cross over, then convert back...?
You could use the blend crossover operator (the variant with α = 0):
p1 first parent
p2 second parent
u random number in [0, 1]
offspring = (1 - u) * p1 + u * p2
Assuming p1 < p2, this crossover operator creates a random solution in the range [p1, p2].
The blend crossover operator has the interesting property that if the difference between the parents is small, the difference between the child and the parents is also small. So the spread of the current population dictates the spread of solutions in the resulting population (this is a form of adaptation).
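For a real-valued gene this takes only a couple of lines. Here is a minimal MATLAB sketch of the operator above; the function name is my own, and I assume the gene is a single scalar in (0, 1) as in the question:

function child = blendCrossover(p1, p2)
% Blend crossover with alpha = 0: the child gene is a random convex
% combination of the two parent genes, so it always lies between them.
    u = rand();                        % uniform random number in [0, 1]
    child = (1 - u) * p1 + u * p2;     % somewhere in [min(p1,p2), max(p1,p2)]
end

For example, blendCrossover(0.2, 0.8) returns a value in [0.2, 0.8], so a gene that represents a probability stays inside (0, 1) automatically.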
A more advanced version of the blend crossover operator (BLX-α) and another well known operator (Simulated Binary Crossover) are described in Self-Adaptive Genetic Algorithms with Simulated Binary Crossover by Kalyanmoy Deb and Hans-Georg Beyer (a short summary here).
Differential Evolution is another possibility.
I am trying to write a loop in MATLAB that "fills up" the elements of an empty column vector of size l x 1, called m. As I don't have much experience in MATLAB, I am unsure whether this is the correct way to go about it.
Note: since i denotes the imaginary unit in MATLAB, I refer to the i-th element of an array as the ii-th element.
l=length(A); %The number of rows in the empty vector we seek as our output;
%so as to preallocate space for this vector.
q=eigencentrality(A);%An lx1 column vector whose ii-th elements are used in the loop.
l1=max(eig(A)); %A scalar used in the loop.
CS=sg_centrality(A); %%An lx1 column vector whose ii-th elements are used in the loop.
%Now for the actual loop that will "fill up" each ii-th entry
%of our empty vector, m.
m=NaN(l,1); %create the empty vector to be "filled up".
for ii=1:l
m(ii,:)=log(q(ii)^2)*sinh(l1)/CS(ii)^1/2;%this is the form that I want each entry
%of m to have. Note how the ii-th element
%of m depends on the corresponding ii-th
%element of CS and q!
end
Is this the right way to go about "filling up" such an empty column vector, m, whose entries depend on the corresponding elements of two other vectors as above?
Cheers!
You can do this completely vectorized. Vectorization is the act of processing whole arrays at a time rather than individual elements, as you're doing in your loop. In fact, this is one of MATLAB's main advantages. You can replace that for loop with:
m = log(q.^2).*(sinh(l1)./CS).^1/2;
The .* and .^ operators are known as element-wise operators. This means that each value in q and CS contributes to the corresponding position in the output (l1 is a scalar, so it is applied to every element). There's no need to use a loop.
Check out this MathWorks note on vectorization here: http://www.mathworks.com/help/matlab/matlab_prog/vectorization.html
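If you want to convince yourself that the two versions agree (and see the speed difference), a quick check on synthetic data could look like the following; the random q and CS and the scalar l1 are stand-ins, not the asker's actual eigencentrality/sg_centrality outputs:

l  = 1e6;                        % problem size for the test
q  = rand(l,1);                  % stand-in for eigencentrality(A)
CS = rand(l,1) + 1;              % stand-in for sg_centrality(A)
l1 = 0.5;                        % stand-in for max(eig(A))

tic                              % loop version
m1 = NaN(l,1);
for ii = 1:l
    m1(ii) = log(q(ii)^2)*sinh(l1)/CS(ii)^1/2;
end
toc

tic                              % vectorized version
m2 = log(q.^2).*(sinh(l1)./CS).^1/2;
toc

max(abs(m1 - m2))                % agrees up to floating-point rounding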
You can vectorize all the operations, without using a for loop. This should work:
m=log(q.^2).*(sinh(l1)./CS).^1/2;
Note that the dots denote element-wise operations. Usually this is much faster than using a loop. Also, as a side note, you can then drop the preallocation.
I would be really glad if you could suggest a MATLAB library containing functions that allow me to:
1) list all paths from a source node to a destination node in a network identified by its adjacency matrix;
2) when applying Dijkstra's algorithm, get the list of nodes on the path, not only the distance in terms of edges.
I already looked at this, but it only provides the shortest distance.
Thank you for your support.
I don't know of a library, but 1) should be quite simple to write yourself.
If you want to analyse whether you can reach one node from another, just calculate (with the number of nodes N and the adjacency matrix G)

E = G^1 + G^2 + ... + G^N

The k-th power G^k gives you information about reachability in exactly k steps. If you use syms to name the edges in the matrix G, you will be able to identify all possible paths from the entries of the resulting matrix E.
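To make the symbolic idea concrete, here is a small sketch with an assumed 3-node example graph (it needs the Symbolic Math Toolbox); expanding an entry of E lists every walk from that source to that destination as a product of edge symbols:

N = 3;
syms e12 e13 e23                 % one symbol per edge of the example graph
G = sym(zeros(N));
G(1,2) = e12;  G(1,3) = e13;  G(2,3) = e23;
E = sym(zeros(N));
for k = 1:N
    E = E + G^k;                 % accumulate walks of length 1..N
end
expand(E(1,3))                   % returns e13 + e12*e23: the two paths 1 -> 3

Note that for a graph with cycles the entries enumerate walks, not just simple paths, so some filtering would still be needed in that case.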
Let A be the adjacency matrix for the graph G = (V,E). A(i,j) = 1 if the nodes i and j are connected with an edge, A(i,j) = 0 otherwise.
My objective is to understand whether G is acyclic or not. An example of a cycle is the following:
i and j are connected: A(i,j) = 1
j and k are connected: A(j,k) = 1
k and i are connected: A(k,i) = 1
I have implemented a solution which navigates the matrix as follows:
Start from an edge (i,j)
Select the set O of edges which are outgoing from j, i.e., all the 1s in the j-th row of A
Navigate O in a DFS fashion
If one of the paths generated from this navigation leads to the node i, then a cycle is detected
Obviously this solution is very slow, since I have to evaluate all the paths in the matrix. If A is very big, the overhead is huge. I was wondering whether there is a way of navigating the adjacency matrix to find cycles without using an expensive algorithm such as DFS.
I would like to implement my solution in MATLAB.
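For reference, a minimal MATLAB sketch of the DFS-style check described above could look like the following (this is an assumed illustration that treats A as directed; for an undirected A you would additionally skip the edge leading straight back to the node you just came from):

function hasCycle = hasDirectedCycle(A)
% Returns true if the directed graph with 0/1 adjacency matrix A contains a cycle.
    n = size(A,1);
    color = zeros(n,1);            % 0 = unvisited, 1 = on current DFS path, 2 = done
    hasCycle = false;
    for s = 1:n
        if color(s) == 0 && dfs(s)
            hasCycle = true;
            return;
        end
    end

    function found = dfs(v)
        color(v) = 1;              % v is on the current path
        found = false;
        for w = find(A(v,:))       % successors of v
            if color(w) == 1       % edge back to the current path -> cycle
                found = true;
                return;
            elseif color(w) == 0 && dfs(w)
                found = true;
                return;
            end
        end
        color(v) = 2;              % v fully explored
    end
end

Each node is visited at most once here, so the cost is one pass over the rows of A rather than an enumeration of all paths.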
Thanks in advance,
Eleanore.
I came across this question when answering this math.stackexchange question. For future readers, I feel like I need to point out (as others have already) that Danil Asotsky's answer is incorrect, and provide an alternative approach. The theorem Danil is referring to is that the (i,j) entry of A^k counts the number of walks of length k from i to j in G. The key thing here is that a walk is allowed to repeat vertices. So even if a diagonal entry of A^k is positive, each walk that the entry counts may contain repeated vertices, and so wouldn't count as a cycle.
Counterexample: A path of length 4 would contain a 4-cycle according to Danil's answer (not to mention that the answer would imply P=NP because it would solve the Hamilton cycle problem).
Anyways, here is another approach. A graph is acyclic if and only if it is a forest, i.e., it has c components and exactly n-c edges, where n is the number of vertices. Fortunately, there is a way to calculate the number of components using the Laplacian matrix L, which is obtained by replacing the (i,i) entry of -A with the sum of entries in row i of A (i.e., the degree of vertex labeled i). Then it is known that the number of components of G is n-rank(L) (i.e., the multiplicity of 0 as an eigenvalue of L).
So G has a cycle if and only if the number of edges is at least n-(n-rank(L))+1. On the other hand, by the handshaking lemma, the number of edges is exactly half of trace(L). So:
G is acyclic if and only if 0.5*trace(L)=rank(L). Equivalently, G has a cycle if and only if 0.5*trace(L) >= rank(L)+1.
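In MATLAB this test takes only a few lines; a minimal sketch, assuming A is the symmetric 0/1 adjacency matrix of an undirected graph without self-loops:

L = diag(sum(A,2)) - A;              % graph Laplacian: degrees on the diagonal, -A off it
numEdges = trace(L)/2;               % handshaking lemma: trace(L) = sum of degrees = 2*|E|
isAcyclic = (numEdges == rank(L));   % acyclic  <=>  0.5*trace(L) = rank(L)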
Based on Danil's observation, you need to compute the powers A^2, ..., A^n; a slightly more efficient way of doing so is
n = size(A,1);
An = A;
for ii = 2:n
    An = An * A;                 % reuse the previous power instead of recomputing A^ii from scratch
    if trace(An) ~= 0
        fprintf(1, 'got cycles\n');
    end
end
If A is the adjacency matrix of the directed or undirected graph G, then the matrix A^n (i.e., the matrix product of n copies of A) has the following property: the entry in row i and column j gives the number of (directed or undirected) walks of length n from vertex i to vertex j.
E.g. if for some integer n the matrix A^n contains at least one non-zero diagonal entry, then the graph has a cycle of size n.
The easiest way to check for non-zero diagonal elements of a matrix is to calculate its trace, trace(A) = sum(diag(A)); in our case the elements of the matrix powers are always non-negative, so a non-zero trace implies a non-zero diagonal entry.
A MATLAB solution can be the following:
for n = 2:size(A,1)
    if trace(A^n) ~= 0
        fprintf('Graph contains a cycle of size %d\n', n)
        break;
    end
end
This approach uses DFS, but is very efficient, because we don't repeat nodes in subsequent DFS's.
High-level approach:
Initialize the values of all the nodes to -1.
Do a DFS from each unexplored node, setting that node's value to an auto-incremented value starting from 0.
During these DFS's, update each node's value to the previous node's value + i/n^k, where that node is the i-th child of the previous node and k is the current depth, skipping already-explored nodes (except to check for a bigger value).
So, an example for n = 10:
i = 0
  j = 0.1        (first child of i)
    k = 0.11
      p = 0.111
    l = 0.12
  m = 0.2        (second child of i)

q = 1            (next DFS root)
  o = 1.1
...
You can also use i/(branching factor + 1) for each node to reduce the number of significant digits, but that requires additional calculation to determine.
So above we did a DFS from i, which had 2 children j and m. m had no children, j had 2 children, .... Then we finished with i and started another DFS from the next unexplored node q.
Whenever you encounter a bigger value, you know that a cycle occurred.
Complexity:
You check every node at most once, and at every node you do n checks, so complexity is O(n^2), which is the same as looking at every entry in the matrix once (which you can't do much better than).
Note:
I'll also just note that an adjacency list will probably be faster than an adjacency matrix unless it's a very dense graph.
That is the problem I also found. The explanation, I think, is the following:
When we talk about a cycle, we implicitly mean a directed cycle. The adjacency matrix you have takes on a different meaning when you consider it as a directed graph: a symmetric pair of 1s is indeed a directed cycle of length 2. So the $A^n$ solution is actually for directed graphs. For undirected graphs, I guess a fix would be to consider only the upper triangular part of the matrix (with the rest filled with zeros) and repeat the procedure. Let me know if this is the right answer.
If a digraph G is represented by its adjacency matrix M, then M' = (I - M) will be singular if there is a cycle in it.
I: the identity matrix of the same order as M
Some more thoughts on the matrix approach... The example cited is the adjacency matrix for a disconnected graph (nodes 1 & 2 are connected, and nodes 3 & 4 are connected, but neither pair is connected to the other pair). When you calculate A^2, the answer (as stated) is the identity matrix. However, since Trace(A^2) = 4, this indicates that there are 2 loops, each of length 2 (which is correct).

Calculating A^3 is not permitted until these loops are properly identified and removed from the matrix. This is an involved procedure requiring several steps and is detailed nicely by R.L. Norman, "A Matrix Method for Location of Cycles of a Directed Graph," AIChE J, 11-3 (1965), pp. 450-452.

Please note: it is unclear from the author whether this approach is guaranteed to find ALL cycles, UNIQUE cycles, and/or ELEMENTARY cycles. My experience suggests that it definitely does not identify ONLY unique cycles.
I cannot add a comment directly, but this comment by Casteels (@Casteels) is incorrect:
"@Pushpendre My point is that if Danil's answer was correct for directed graphs, then it would be correct for undirected graphs as well, which it is not. The counterexample in my previous comment does not have the adjacency matrix you wrote down; I said to replace each edge with a directed edge in each direction. This gives the same adjacency matrix as the undirected case. Are you sure you are not confusing cycle with closed walk?" – Casteels, Apr 24 '15 at 9:20
As soon as a directed graph has two vertices with arcs in both directions, it has a cycle of length 2, and the square of its adjacency matrix (which, in the 'construction' proposed above, is indeed equal to that of the underlying undirected graph) will have a non-zero diagonal coefficient, just as the square of the adjacency matrix of every non-empty undirected graph does, since an edge immediately gives a (non-elementary) closed walk of length 2 from a vertex to itself. So in that case, Danil's answer correctly detects a cycle. The reasoning above is not correct.
Danil's answer is indeed correct for directed graphs. In a digraph, a single arc cannot be traversed both ways, so every closed directed walk must contain a directed cycle, which will create a non-zero coefficient on the diagonal of some power of the original adjacency matrix of the directed graph. So one can keep computing the powers of the matrix increasingly from 1 to the number of vertices, stopping as soon as a diagonal coefficient is non-zero.
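As a small (assumed) illustration of the 2-cycle case discussed above:

M = [0 1; 1 0];        % arcs 1 -> 2 and 2 -> 1: a directed cycle of length 2
trace(M^2)             % returns 2 (both diagonal entries of M^2 are 1), so the cycle is detected at the second power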