Why OSRM implemented Contraction Hierarchies and MLD instead of A*? - osrm

I'm going through the OSRM implementation; they implemented routing algorithms CH and MLD. I wanted to know the motivation behind it to use these algorithms. More importantly, we can't dynamically change the edge weights in these two algorithms.

The CH and MLD algorithms implemented in OSRM are "speedup algorithms" - they make shortest-path finding faster on augmented graph.
The typical tradeoff with these types of algorithms is that you lose flexibility - the shape of the augmented graph depends on the weights, so if you change them, you need to regenerate the augmented graph in order for it to continue to be valid.

Related

Multiple drivers with Mapbox Optimization API?

Using the Mapbox Optimization API is it possible to optimize the routes between multiple drivers?
Example: 6 locations are added, 2 drivers are added, the routes get split / optimized between the two drivers
I'm still in the planning stage, so I haven't poked around too much myself yet, but the code and all the examples I've seen are directed towards single driver optimization only... Has anybody done something like this before? Anything you can recommend to point me in the right direction?
Mapbox's Optimization API returns a duration-optimized route between the input coordinates, which is also known as solving the so-called "Travelling Salesman Problem". This is a well-known, NP-hard graph theory problem, meaning there is no general polynomial-time solution known for the problem.
The underlying data used for computing the aforementioned duration-optimized route are the cost functions of the edges connecting the coordinates input to the API request. You could retrieve the cost values (including traffic) between a set of these coordinate positions using Mapbox's Matrix API.
Adding a second driver/salesman to the problem makes the problem exponentially harder to solve, as discussed in the answer to this Stack Overflow post.
Here is a link to a scientific paper discussing a possible approach to this problem.
As evidenced by the research community, a solution for the Multiple Travelling Salesman Problem is not straightforward to implement. If you do not want to engage in this non-trivial task of implementing an algorithm that would solve it for you, you could implement a function that will make an educated guess on how to split up the destination coordinates between the two drivers. This "educated guess" could be based on values obtained from the Matrix API. You could make a one-to-many request for each driver, then take the lesser of the two durations for each coordinate and assign the coordinate to the appropriate driver. Then, you can use Mapbox's Optimization API to solve the two separate travelling salesman problems individually.
Even if you did implement an algorithm that would solve the Multiple Travelling Salesman Problem, the problem's complexity grows exponentially with the number of drivers and the number of waypoints. Therefore, you could end up with a solution that works, but would not necessarily compute in a reliable amount of time. These performance limitations are something to keep in mind when going about implementing a solution.

kmean clustering: variable selection

I'm applying a kmean algorithm for clustering my customer base. I'm struggling conceptually on the selection process of the dimensions (variables) to include in the model. I was wondering if there are methods established to compare among models with different variables. In particular, I was thinking to use the common SSwithin / SSbetween ratio, but I'm not sure if that can be applied to compare models with a different number of dimensions...
Any suggestions>?
Thanks a lot.
Classic approaches are sequential selection algorithms like "sequential floating forward selection" (SFFS) or "sequential floating backward elimination (SFBS). Those are heuristic methods where you eliminate (or add) one feature at the time based on your performance metric, e.g,. mean squared error (MSE). Also, you could use a genetic algorithm for that if you like.
Here is an easy-going paper that summarizes the ideas:
Feature Selection from Huge Feature Sets
And a more advanced one that could be useful: Unsupervised Feature Selection for the k-means Clustering Problem
EDIT:
When I think about it again, I initially had the question in mind "how do I select the k (a fixed number) best features (where k < d)," e.g., for computational efficiency or visualization purposes. Now, I think what you where asking is more like "What is the feature subset that performs best overall?" The silhouette index (similarity of points within a cluster) could be useful, but I really don't think you can really improve the performance via feature selection unless you have the ground truth labels.
I have to admit that I have more experience with supervised rather than unsupervised methods. Thus, I typically prefer regularization over feature selection/dimensionality reduction when it comes to tackling the "curse of dimensionality." I use dimensionality reduction frequently for data compression though.

results of two feature selection algo do not match

I am working on two feature selection algorithms for a real world problem where the sample size is 30 and feature size is 80. The first algorithm is wrapper forward feature selection using SVM classifier, the second is filter feature selection algorithm using Pearson product-moment correlation coefficient and Spearman's rank correlation coefficient. It turns out that the selected features by these two algorithms are not overlap at all. Is it reasonable? Does it mean I made mistakes in my implementation? Thank you.
FYI, I am using Libsvm + matlab.
It can definitely happen as both strategies do not have the same expression power.
Trust the wrapper if you want the best feature subset for prediction, trust the correlation if you want all features that are linked to the output/predicted variable. Those subsets can be quite different, especially if you have many redundant features.
Using top correlated features is a strategy which assumes that the relationships between the features and the output/predicted variable are linear, (or at least monotonous in case of Spearman's rank correlation), and that features are statistically independent one from another, and do not 'interact' with one another. Those assumptions are most often violated in real world problems.
Correlations, or other 'filters' such as mutual information, are better used either to filter out features, to decide which features not to consider, rather than to decide which features to consider. Filters are necessary when the initial feature count is very large (hundreds, thousands) to reduce the workload for a subsequent wrapper algorithm.
Depending on the distribution of the data you can either use spearman or pearson.The latter is used for normal distribution while former for non-normal.Find the distribution and use appropriate one.

How to find bridges (community connecting nodes) in large networks represented using the adjacency matrix

I have networks of roughly 10K to 100K nodes which are all connected. These nodes are typically grouped into clusters of communities which are strongly connected with many edges between them and there are hubs etc. Between the communities there are nodes with a few edges bridging / connecting the communities together. These datasets are in adjacency matrices
I have tried spectral clustering (Ding et al 2001) but it is really slow on large data sets and seems to stop working when there is a lot of ambiguity (bridges which are not the only bridge route to another cluster- other communities can act as alternative proxy routes).
I have tried some of the methods from martelot such as the Newman algorithm for modularity optimisation but have not incorporated the stability optimisation functions in that effort (could that be crucial?). On synthetic data sets where the clusters are created by random graphs (ER graphs) the methods work but on real ones where there is nested hierarchy the results are scattered. Using a standalone visualization application/tool the bridges are evident though.
What methods would you recommend/advise to try? I am using MATLAB.
What do you want to do, exactly? Detect communities, or bridges between them? Those are two different problems. Once you have the communities, it's straightforward enough identifying the edges connecting nodes from two distinct communities. So, I guess you want to detect communities.
There are actually thousands methods for this purpose, some of them implemented in Matlab, such as the one you cite, or the generalized Louvain algorithm (also based on modularity optimization). However, most of them are rather available as C or C++ programs, such as InfoMap (based on a data compression paradigm), WalkTrap (clustering using a random walk-based distance), Markov Cluster (simulates some propagation mechanism), and the list goes on...
Those tools formalize the notion of community structure more or less differently, potentially leading to different (estimated) community structures, when applied on the same network. And of course, different communities means different bridges, too. So the question is rather to know how to pick the appropriate method for your data. You seem to have a priori knowledge regarding the networks you are studying, so you should use that to make your choice (rather than the programming language). For instance, even if you don't state it explicitly, you seem to be looking for a hierarchical community structure: not all tools are able to detect this kind of structure. Similarly, if you think one node can belong to several communities at the same time, then you should consider looking for overlapping communities, for instance using CFinder (based on clique percolation).
I'd advise you to have a look at this excellent review of community detection, you might find some interesting information allowing you to pick a method: Community Detection in Graphs. Also, from a programming point of view, I'd advise you to play with the igraph library (available for C, R and Python): it contains several standard community detection tools. You can try them on your data and see what you get.

Feasibility of Machine Learning techniques for Network Intrusion Detection

Is there a machine learning concept (algorithm or multi-classifier system) that can detect the variance of network attacks(or try to).
One of the biggest problems for signature based intrusion detection systems is the inability to detect new or variant attacks.
Reading up, anomaly detection seems to still be a statistical based en-devour it refers to detecting patterns in a given data set which isn't the same as detecting variation in packet payloads. Anomaly based NIDS monitors network traffic and compares it against an established baseline of a normal traffic profile. The baseline characterizes what is "normal" for the network - such as the normal bandwidth usage, the common protocols used, correct combinations of ports numbers and devices etc
Say some one uses Virus A to propagate through a network then some one writes a rule to stop Virus A but another person writes a "variation" of Virus A called Virus B purely for the purposes of evading that initial rule but still using most if not all of the same tactics/code. Is there not a way to detect variance?
If there is whats the umbrella term it would come under, as ive been under the illusion that anomaly detection was it.
Could machine learning be used for pattern recognition(rather than pattern matching) at the packet payload level?
i think your intution to look at machine learning techniques is correct, or will turn out to be correct (One of the biggest problems for signature based intrusion detection systems is the inability to detect new or variant attacks.) The superior performance of ML techiques is in general due to the ability of these algorithms to generalize (a multiplicity of soft constraints rather than a few hard constraints). and to adapt (updates based on new training instances to frustrate simple countermeasures)--two attributes that i would imagine are crucial for identifying network attacks.
The theoretical promise aside, there are practical difficulties with applying ML techniques to problems like the one recited in the OP. By far the most significant is the difficultly in gathering data to train the classifier. In particular, reliably labeling data points as "intrusion" is probably not easy; likewise, my guess is that these instances are sparsely distributed in the raw data."
I suppose it's this limitation that has led to the increased interest (as evidenced at least by the published literature) in applying unsupervised ML techniques to problems like network intrusion detection.
Unsupervised techniques differ from supervised techniques in that the data is fed to the algorithms without a response variable (i.e., without the class labels). In these cases you are relying on the algorithm to discern structure in the data--i.e., some inherent ordering in the data into reasonably stable groups or clusters (possibly what you the OP had in mind by "variance." So with an unsupervised technique, there is no need to explicitly show the algorithm instances of each class, nor is it necessary to establish baseline measurements, etc.
The most frequently used unsupervised ML technique applied to problems of this type is probably the Kohonen Map (also sometimes called self-organizing map or SOM.)
i use Kohonen Maps frequently, but so far not for this purpose. There are however, numerous published reports of their successful application in your domain of interest, e.g.,
Dynamic Intrusion Detection Using Self-Organizing Maps
Multiple Self-Organizing Maps for Intrusion Detection
I know MATLAB has at least one available implementation of Kohonen Map--the SOM Toolbox. The homepage for this Toolbox also contains a brief introduction to Kohonen Maps.