Optimal solution for: All possible acyclic paths in a graph problem - matlab

I am dealing with an undirected graph. I need to find all possible acyclic paths within the graph:
with G(V,E)
find all subsets of V that are acyclic paths
I am using either Python (SciPy) or MATLAB, whichever is more appropriate.
Is there any clever solution for this?
I'm trying to achieve this with a breadth-first search (see wiki).
I also have this toolbox in matlab: http://www.mathworks.com/matlabcentral/fileexchange/4266-grtheory-graph-theory-toolbox but it seems there's no straightforward solution for my problem.
PS. In practice, the problem is the Transit Network Design Problem: find a transport network that minimizes the cost for passengers and operators (i.e. an optimal subway network for an urban area).
Thanks in advance
Rafal

I think the problem as stated in your PS is likely NP-hard. If so, exact solutions are only practical for graphs with a very limited number of nodes (roughly N <= 20); other solutions will be approximate and may only reach local optima. The solution to your problem as stated in the question is simply to enumerate all permutations of node orders. Again, this becomes computationally infeasible at comparatively low numbers of nodes (possibly somewhat more than 20, but not much).
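If the graph is small enough that full enumeration is acceptable, a depth-first search that only extends a path with unvisited vertices will enumerate every simple (acyclic) path. Here is a minimal, self-contained Python sketch (the adjacency list and vertex labels are made up for illustration); the same idea translates directly to MATLAB:

```python
from collections import defaultdict

def all_simple_paths(adj):
    """Enumerate every simple (acyclic) path with at least two vertices.

    adj: dict mapping each vertex to an iterable of its neighbours
         (undirected graph: each edge should appear in both directions).
    Note: each undirected path is reported once per direction.
    """
    paths = []

    def extend(path, visited):
        if len(path) >= 2:
            paths.append(list(path))
        for nxt in adj[path[-1]]:
            if nxt not in visited:          # keep the path acyclic
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in adj:
        extend([start], {start})
    return paths

# Toy example graph (illustrative only): a square with one diagonal.
adj = defaultdict(list)
for u, v in [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]:
    adj[u].append(v)
    adj[v].append(u)

print(len(all_simple_paths(adj)))  # the count grows very fast with |V|
```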

Do you need only the shortest paths between all pairs of vertices, or really all paths?

Related

Multiple drivers with Mapbox Optimization API?

Using the Mapbox Optimization API is it possible to optimize the routes between multiple drivers?
Example: 6 locations are added, 2 drivers are added, the routes get split / optimized between the two drivers
I'm still in the planning stage, so I haven't poked around too much myself yet, but the code and all the examples I've seen are directed towards single driver optimization only... Has anybody done something like this before? Anything you can recommend to point me in the right direction?
Mapbox's Optimization API returns a duration-optimized route between the input coordinates, which amounts to solving the well-known "Travelling Salesman Problem". This is an NP-hard graph theory problem, meaning no general polynomial-time solution is known.
The underlying data used for computing the aforementioned duration-optimized route are the cost functions of the edges connecting the coordinates input to the API request. You could retrieve the cost values (including traffic) between a set of these coordinate positions using Mapbox's Matrix API.
Adding a second driver/salesman to the problem makes the problem exponentially harder to solve, as discussed in the answer to this Stack Overflow post.
Here is a link to a scientific paper discussing a possible approach to this problem.
As evidenced by the research community, a solution for the Multiple Travelling Salesman Problem is not straightforward to implement. If you do not want to engage in this non-trivial task of implementing an algorithm that would solve it for you, you could implement a function that will make an educated guess on how to split up the destination coordinates between the two drivers. This "educated guess" could be based on values obtained from the Matrix API. You could make a one-to-many request for each driver, then take the lesser of the two durations for each coordinate and assign the coordinate to the appropriate driver. Then, you can use Mapbox's Optimization API to solve the two separate travelling salesman problems individually.
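As a rough sketch of that "educated guess": the helper matrix_durations below is a hypothetical stand-in for a one-to-many Matrix API request (it is not part of any Mapbox SDK), and the assignment rule is simply the one described above, not an official Mapbox recipe:

```python
def split_between_drivers(driver_starts, stops, matrix_durations):
    """Assign each stop to the driver that can reach it faster.

    driver_starts:    list of two (lon, lat) tuples, one per driver.
    stops:            list of (lon, lat) tuples to be visited.
    matrix_durations: callable(source, targets) -> list of durations in
                      seconds, e.g. a thin wrapper around a one-to-many
                      Matrix API request (hypothetical helper).
    """
    durations = [matrix_durations(start, stops) for start in driver_starts]

    assignments = {0: [], 1: []}
    for i, stop in enumerate(stops):
        # Give the stop to whichever driver is closer in travel time.
        driver = 0 if durations[0][i] <= durations[1][i] else 1
        assignments[driver].append(stop)

    # Each driver's list can now be sent to the Optimization API
    # as an independent (single-driver) travelling salesman request.
    return assignments
```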
Even if you did implement an algorithm that would solve the Multiple Travelling Salesman Problem, the problem's complexity grows exponentially with the number of drivers and the number of waypoints. Therefore, you could end up with a solution that works, but would not necessarily compute in a reliable amount of time. These performance limitations are something to keep in mind when going about implementing a solution.

Why use Crossover in Neural Network training?

Why specifically is it used?
I know it increases variation which may help explore the problem space, but how much does it increase the probability of finding the optimal solution/configuration in time? And does it do anything else advantageous?
And does it necessarily always help, or are there instances in which it would increase the time taken to find the optimal solution?
As Patrick Trentin said, crossover improves the speed of convergence because it allows combining good genes that have already been found in the population.
But for neuro-evolution, crossover faces the "permutation problem", also known as "the competing conventions problem". When two parents are permutations of the same network, then, except in rare cases, their offspring will have a lower fitness: the same part of the network is copied into two different locations, so the offspring loses the viable genes for one of those locations.
For example, consider the networks A,B,C,D and D,C,B,A, which are permutations of the same network. The possible offspring are:
A,B,C,D (copy of parent 1)
D,C,B,A (copy of parent 2)
A,C,B,D OK
A,B,C,A
A,B,B,A
A,B,B,D
A,C,B,A
A,C,C,A
A,C,C,D
D,B,C,A OK
D,C,B,D
D,B,B,A
D,B,B,D
D,B,C,D
D,C,C,A
D,C,C,D
So, for this example, 2/16 of the offspring are copies of the parents, 2/16 are recombinations without duplicates, and 12/16 have duplicated genes.
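You can reproduce this count with a few lines of Python (uniform, per-locus crossover is assumed here, which matches the 16 offspring listed above):

```python
from itertools import product

parent1 = ("A", "B", "C", "D")
parent2 = ("D", "C", "B", "A")   # a permutation of parent1

copies = valid = duplicated = 0
for mask in product((0, 1), repeat=len(parent1)):
    # At each locus, take the gene from parent1 (0) or parent2 (1).
    child = tuple(parent2[i] if bit else parent1[i]
                  for i, bit in enumerate(mask))
    if child in (parent1, parent2):
        copies += 1
    elif len(set(child)) == len(child):   # all genes distinct
        valid += 1
    else:                                 # some gene appears twice
        duplicated += 1

print(copies, valid, duplicated)   # -> 2 2 12
```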
The permutation problem occurs because networks that are permutations of one another have the same fitness. So, even with an elitist GA, if one is selected as a parent, the other will often be selected as a parent too.
The permutations may also be only partial. In this case, the result is better than for complete permutations, but the offspring will, in many cases, still have a lower fitness than the parents.
To avoid the permutation problem, I have heard about similarity-based crossover, which computes the similarity of neurons and their connected synapses and performs the crossover between the most similar neurons instead of a locus-based crossover.
When evolving the topology of the networks, some NEAT specialists consider the permutation problem part of a broader problem: "the variable length genome problem". NEAT seems to avoid this problem by speciation of the networks: when two networks are too different in topology and weights, they aren't allowed to mate. So the NEAT algorithm appears to treat permuted networks as too different and does not allow them to mate.
A website about NEAT also says:
However, in another sense, one could say that the variable length genome problem can never be "solved" because it is inherent in any system that generates different constructions that solve the same problem. For example, both a bird and a bat represent solutions to the problem of flight, yet they are not compatible since they are different conventions of doing the same thing. The same situation can happen in NEAT, where very different structures might arise that do the same thing. Of course, such structures will not mate, avoiding the serious consequence of damaged offspring. Still, it can be said that since disparate representations can exist simultaneously, incompatible genomes are still present and therefore the problem is not "solved." Ultimately, it is subjective whether or not the problem has been solved. It depends on what you would consider a solution. However, it is at least correct to say, "the problem of variable length genomes is avoided."
Edit: To answer your comment.
You may be right about similarity-based crossover; I'm not sure it totally avoids the permutation problem.
About the ultimate goal of crossover, setting the permutation problem aside, I'm not sure it is useful for the evolution of neural networks, but my thinking is: if we divide a neural network into several parts, each part contributes to the fitness, so two networks with high fitness may have different good parts, and combining these parts should create an even better network. Some offspring will of course inherit the bad parts, but other offspring will inherit the good parts.
Like Ray suggested, it could be useful to experiment with the evolution of neural networks with and without crossover. As there is randomness in the evolution, the difficulty is that you have to run a large number of tests to compute the average evolution speed.
About evolving something other than a neural network, I found a paper showing that an algorithm using crossover outperforms a mutation-only algorithm for solving the all-pairs shortest path problem (APSP).
Edit 2:
Even if the permutation problem seems to apply only to particular problems like neuro-evolution, I don't think we can say the same about crossover in general; maybe we are missing something about the problems that don't seem suitable for crossover.
I found a free version of the paper about similarity-based crossover for neuro-evolution, and it shows that:
an algorithm using a naive crossover performs worse than a mutation-only algorithm;
using similarity-based crossover, it performs better than a mutation-only algorithm in all tested cases;
the NEAT algorithm sometimes performs better than a mutation-only algorithm.
Crossover is complex and I think there is a lack of studies comparing it with mutation-only algorithms, maybe because its usefulness depends heavily:
on its engineering for particular problems like the permutation problem, and therefore on the type of crossover used (similarity-based, single point, uniform, edge recombination, etc.);
and on the mating algorithm. For example, this paper shows that a gendered genetic algorithm strongly outperforms a non-gendered genetic algorithm for solving the TSP. For two other problems the algorithm does not strongly outperform, but it is still better than the non-gendered GA. In this experiment, males are selected on their fitness and females are selected on their ability to produce good offspring. Unfortunately, this study does not compare the results with a mutation-only algorithm.

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition, no a priori linking lengths are known (similar to this question).
I've tried k-means, which works well if you know how many clusters you want. I've tried DBSCAN, which does poorly unless you tell it a characteristic length scale on which to stop (or start) looking for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time telling the kmeans/dbscan algorithms what they should go off of.
Here is an example of what DBSCAN finds:
You can see that there really are two separate populations here, but by adjusting the epsilon factor (the maximum distance between neighboring points), I simply cannot get it to see those two populations of particles.
Are there any other algorithms that would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in scikit-learn. It's also relatively quick (compared to other algorithms, like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that the documentation states that it may not scale well.
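For reference, a minimal scikit-learn sketch of this approach (the quantile value and the toy data are illustrative, not tuned for any particular dataset):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Toy particle positions: two well-separated groups.
X, _ = make_blobs(n_samples=500, centers=[(0, 0), (8, 8)],
                  cluster_std=1.0, random_state=0)

# estimate_bandwidth removes the need to hand-pick a length scale;
# the quantile controls how local the estimate is.
bandwidth = estimate_bandwidth(X, quantile=0.2)

labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(X)
print("clusters found:", len(np.unique(labels)))
```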
When using DBSCAN it can be helpful to scale/normalize the data or distances beforehand, so that the estimation of epsilon is relative.
There is an implementation of DBSCAN - I think it's the one Anony-Mousse somewhere described as "floating around" - which comes with an epsilon estimator function. It works as long as it isn't fed large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I'm still trying to figure out myself what effect minPts has when using one and the same extraction method.
You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edge, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example clusterfck.
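A rough SciPy sketch of that minimum-spanning-tree idea (the number of edges to cut and the toy data are arbitrary choices for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(points, n_clusters=2):
    """Cluster points by building an MST and cutting its longest edges."""
    dist = squareform(pdist(points))               # dense pairwise distances
    mst = minimum_spanning_tree(dist).tocoo()      # the n-1 MST edges

    # Keep all but the (n_clusters - 1) longest edges.
    order = np.argsort(mst.data)
    keep = order[: len(mst.data) - (n_clusters - 1)]
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)

    # Each connected component of the pruned MST is one cluster.
    _, labels = connected_components(pruned, directed=False)
    return labels

# Toy example: two separated blobs.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
print(np.bincount(mst_clusters(pts, n_clusters=2)))   # roughly [50 50]
```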
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.
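Newer versions of scikit-learn ship an OPTICS implementation, so a minimal usage sketch (min_samples plays the role of minPts; the value and toy data are illustrative) looks like:

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# No epsilon needed up front; min_samples corresponds to DBSCAN's minPts.
labels = OPTICS(min_samples=10).fit_predict(X)
print(set(labels))   # -1 marks points treated as noise
```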

Choosing Clustering Method based on results

I'm using WEKA for my thesis and have over 1000 rows of data. The database includes demographic information (age, location, status, etc.) followed by names of products (valued 1 or 0). The end result is a recommender system.
I used two methods of clustering, K-Means and DBSCAN.
When using K-means I tried 3 different numbers of clusters, while for DBSCAN I chose 3 different epsilons (epsilon 3 gave 48 clusters with 17% of the data ignored; epsilon 2.5 gave 19 clusters, with cluster 0 holding 229 items and 6% ignored). This means I have 6 different clustering results for the same data.
How do I choose what best suits my data?
What is "best"?
As some smart people noticed:
the validity of a clustering is often in the eye of the beholder
There is no objectively "better" for clustering, or you are not doing cluster analysis.
Even when a result actually is "better" on some mathematical measure such as separation, silhouette, or even a supervised evaluation using labels, it is still only better at optimizing towards some mathematical goal, not for your use case.
K-means finds a locally optimal sum-of-squares assignment for a given k (and if you increase k, a better assignment exists!). DBSCAN (it's actually correctly spelled all uppercase) always finds the optimal density-connected components for the given MinPts/Epsilon combination. Yet both just optimize with respect to some mathematical criterion. Unless this criterion aligns with your requirements, it is worthless. So there is no "best" until you know what you need. But if you knew exactly what you need, you would not need to do cluster analysis.
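If you do want to compare the runs on one of those purely mathematical criteria anyway, here is a sketch using the silhouette score in scikit-learn (WEKA has its own evaluation tools; this just expresses the same idea in Python on toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

# Higher silhouette = mathematically "better" separation, which may or
# may not correspond to clusters that are useful for a recommender.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```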
So what to do?
Try different algorithms and different parameters, and analyze the output with your domain knowledge to see whether they help with the problem you are trying to solve. If they help you solve your problem, they are good. If they do not help, try again.
Over time, you will collect some experience. For example, if the sum-of-squares is meaningless for your domain, don't use k-means. If your data does not have meaningful density, don't use density-based clustering such as DBSCAN. It's not that these algorithms fail; they just don't solve your problem - they solve a different problem that you are not interested in. And they might be really good at solving that other problem...

ELKI implementation of OPTICS clustering algorithm detects only one cluster

I'm having an issue with the OPTICS implementation in the ELKI environment. I have used the same data for the DBSCAN implementation and it worked like a charm. I'm probably missing something with the parameters, but I can't figure it out; everything seems to be right.
The data is a simple 300x2 matrix, consisting of 3 clusters with 100 points in each.
DBSCAN result (plot omitted): MinPts = 10, Eps = 1
OPTICS result (plot omitted): MinPts = 10
You apparently already found the solution yourself, but here is the long story:
The OPTICS class in ELKI only computes the cluster order / reachability diagram.
In order to extract clusters, you have different choices, one of which (the one from the original OPTICS publication) is available in ELKI.
So in order to extract clusters in ELKI, you need to use the OPTICSXi algorithm, which will in turn use either OPTICS or the index based DeLiClu to compute the cluster order.
The reason why this is split into two parts in ELKI is probably so that you can, on the one hand, implement a different logic for extracting the clusters and, on the other hand, implement different methods like DeLiClu for computing the cluster order. That aligns well with the modular architecture of ELKI.
IIRC there is at least one more method (apparently not yet in ELKI) that extracts clusters by looking for local maxima and then extending them horizontally until they hit the end of the valley. And there was a different one that used "inflexion points" of the plot.
@Anony-Mousse pretty much put it right. I just can't upvote or comment yet.
We hope to have some students contribute the other cluster extraction methods as small student projects over time. They are not essential for our research, but they are good tasks for students that want to learn about ELKI to get started.
ELKI is a fast-moving project, and it lives from community contributions. We would be happy to see you contribute some code to it. We know that the codebase is not easy to get started with - it is fairly large, and the generality of the implementation and the support for index structures make it a bit hard to get started. We try to add tutorials to help you get started. And once you are used to it, you will actually benefit from the architecture: your algorithms get the benefits of indexing and arbitrary distance functions, whereas if you implemented from scratch, you would likely only support Euclidean distance and no index acceleration.
Seeing that you struggled with OPTICS, I will try to write an OPTICS tutorial in the new year. In particular, OPTICS can benefit a lot from using an appropriate index structure.