What is the algorithm of OrientDB partitioning? - orientdb

I can't find out which partitioning algorithm OrientDB supports.
I need a graph database that supports a clever partitioning or rebalancing algorithm to reduce the number of cut edges (edges that point to a vertex on another server), because I have a lot of reads but few writes.
Also, does the Titan database support a clever algorithm of this kind?

Related

How to cluster data using self-organising maps?

Suppose that we train a self-organising map (SOM) with a given dataset. Would it make sense to cluster the neurons of the SOM instead of the original datapoints? This doubt came to me after reading this paper, in which the following is stated:
The most important benefit of this procedure is that computational load decreases considerably, making it possible to cluster large data sets and to consider several different preprocessing strategies in a limited time. Naturally, the approach is valid only if the clusters found using the SOM are similar to those of the original data.
In this answer it is clearly stated that SOMs don't include clustering, but some clustering procedure can be made on the SOM after it has been trained. I thought that this meant the clustering was done on the neurons of the SOM, which are in some sense a mapping of the original data, but I'm not sure about this. So, what I want to know is:
Is it correct to cluster data by running the clustering algorithm on the trained neuron weights, treating them as datapoints? If not, how is clustering done using a SOM then?
What characteristics should a dataset have, in general, for this approach to be useful?
Yes, the usual approach seems to be to run either hierarchical clustering or k-means on the neurons (you'll need to dig up how it was originally done; as seen in the paper you linked, many variants, including two-level approaches, have been explored later). If you consider SOMs to be a quantization and projection technique, all of these approaches are valid to use.
It's cheaper because the neurons are just 2-dimensional, Euclidean, and far fewer than the original points. So that is well in line with the source that you have.
Note that a SOM neuron may be empty if it lies in between two extremely well-separated clusters.
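A minimal sketch of this two-level idea, assuming the SOM has already been trained. The `codebook` weights and `data` points are made-up values, and the small `kmeans` helper (with a deterministic farthest-point start) is just one possible choice for the second stage, not the procedure from the paper:

```python
import math

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # add the point that is farthest from all centers chosen so far
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # recompute each center as the mean of its group (keep it if the group is empty)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

# Hypothetical weight vectors of an already trained SOM (one tuple per neuron)
codebook = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]
# Hypothetical original data points
data = [(0.0, 0.0), (0.25, 0.3), (5.1, 5.0), (4.8, 5.2), (0.4, 0.1)]

# Level 2: cluster the neurons instead of the data
centers = kmeans(codebook, k=2)
neuron_cluster = [nearest(w, centers) for w in codebook]

# Each data point inherits the cluster of its best matching unit
data_cluster = [neuron_cluster[nearest(x, codebook)] for x in data]
print(neuron_cluster, data_cluster)
```

The second stage only ever touches the neurons, which is where the saving comes from when there are many more data points than neurons.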

'ufactor' parameter in metis for imbalance clustering

I have been using METIS for clustering social media users.
By default, it outputs clusters with the same number of vertices on each side, which is not ideal in a real-world scenario. So I was trying to find a way to loosen the "same number of vertices" constraint and get a possibly imbalanced partition with a minimized cut value.
I found a parameter, ufactor, in the manual which I think is suitable for my case, but I did not grasp what it really does. I have a large graph and tried some values of ufactor. For one dataset ufactor=1000 works very well, but for another dataset it could not even partition the graph. I cannot interpret this result, as I did not understand what the parameter really does. Here is what I found in the manual about it:
Specifies the maximum allowed load imbalance among the partitions. A value of x indicates that the
allowed load imbalance is (1 + x)/1000. The load imbalance for the jth constraint is defined to be
max_i(w[j, i])/t[j, i]), where w[j, i] is the fraction of the overall weight of the jth constraint that
is assigned to the ith partition and t[j, i] is the desired target weight of the jth constraint for the
ith partition (i.e., that specified via -tpwgts). For -ptype=rb, the default value is 1 (i.e., load
imbalance of 1.001) and for -ptype=kway, the default value is 30 (i.e., load imbalance of 1.03).
Can anybody help me interpret this? What is the jth constraint here? What is -ptype=rb/kway?
First of all, let me mention that I think METIS is the wrong tool here, because it is used for graph partitioning, where the emphasis is on minimizing the number of edges between partitions while keeping the partitions balanced (more or less equal sizes).
What you probably want to do is community detection within social networks, i.e. the search for clusters which maximize internal connectivity (large number of edges between nodes from the same cluster) and minimize external connectivity (small number of edges between different clusters).
This can be achieved by maximizing the so-called Modularity of the clustering.
There are several approaches to tackle this problem, a popular heuristic being Label propagation.
If you don't want to implement the algorithm yourself, I would recommend using a framework like NetworKit (unfortunately, I don't know any other such frameworks yet), which implements Label propagation, some modularity-based algorithms and many helpful tools.
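To make the modularity objective a bit more concrete, here is a small sketch that evaluates it for a given clustering of an unweighted, undirected graph. The toy graph (two triangles joined by one edge) and the function name are my own illustration and not taken from NetworKit or any other library:

```python
def modularity(edges, community):
    """Modularity of a clustering: sum over communities c of L_c/m - (d_c/(2m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c."""
    m = len(edges)
    internal = {}
    degree = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())

# Two triangles (0,1,2) and (3,4,5) connected by the single edge (2,3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}   # one community per triangle
merged = {node: "all" for node in range(6)}                 # everything in one community

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

Putting each triangle into its own community scores higher than lumping everything together, which is the behaviour a modularity-maximizing algorithm exploits.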
But back to your original question:
What is -ptype=rb/kway?
There are multiple ways to approach the graph partitioning problem: you can either try to partition the graph into your desired number of partitions directly (k-way partitioning), or you can split the graph in half repeatedly until you have the desired number of partitions (recursive bisection, rb).
What is the jth constraint?
METIS allows you to optimize multiple balance constraints at the same time, e.g. if you have multiple types of calculations on the graph that should all be more or less balanced among the compute nodes.
See the manual:
Many important types of multi-phase and multi-physics computations require that multiple quantities be load balanced simultaneously.
[...]
METIS includes partitioning routines that can be used to partition a graph in the presence of such multiple balancing constraints. Each vertex is now assigned a vector of m weights and the objective of the partitioning routines is to minimize the edge-cut subject to the constraints that each one of the m weights is equally distributed among the domains.
EDIT: Since you clarified that you wanted to look at a fixed number of clusters, I see how graph partitioning could be helpful here. Let me illustrate what ufactor means:
The imbalance of a partitioned graph is (in this simple case) computed as the maximum of the imbalance for each partition, which is roughly the quotient partition size / average partition size. So if we allow a maximum imbalance of 2, this means that the largest partition may be up to twice as big as the average partition. Note, however, that ufactor doesn't specify the imbalance directly; it specifies how many permille away from 1 the imbalance is allowed to be.
So ufactor=143 actually means that your maximal allowed imbalance is 1.143, which makes sense since your clusters are not that far from each other. So in your case, you will probably use larger values for ufactor to allow the groups to be of quite different sizes.
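The arithmetic can be sketched in a few lines. This assumes the reading "allowed imbalance = 1 + ufactor/1000", which is what the defaults quoted from the manual imply (1 gives 1.001, 30 gives 1.03); the function names and the partition sizes are made up for illustration and are not part of METIS:

```python
def allowed_imbalance(ufactor):
    """Maximum imbalance implied by a ufactor value, read as 1 + ufactor/1000."""
    return 1 + ufactor / 1000

def imbalance(part_sizes):
    """Largest partition size divided by the average partition size."""
    average = sum(part_sizes) / len(part_sizes)
    return max(part_sizes) / average

def within_limit(part_sizes, ufactor):
    """True if the partition sizes respect the imbalance allowed by ufactor."""
    return imbalance(part_sizes) <= allowed_imbalance(ufactor)

# Hypothetical 2-way split of 7 vertices into parts of size 4 and 3:
# imbalance = 4 / 3.5, roughly 1.1429, which just fits under ufactor=143 (limit 1.143)
print(imbalance([4, 3]), allowed_imbalance(143), within_limit([4, 3], 143))

# The same split violates the k-way default of ufactor=30 (limit 1.03)
print(within_limit([4, 3], 30))
```

With ufactor=1000 the limit becomes 2, i.e. the largest part may hold up to twice the average weight.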
Consequences of large imbalance
If your imbalance is too large, it might happen that all the strongly-connected parts land in the same partition while only isolated nodes are put in the other partitions. This is due to the fact that the algorithm tries to minimize the number of cut edges between different partitions, which will be lower if we put all the high-degree nodes in the same partition.
What about spectral partitioning, ...?
The general approach of METIS works as follows:
Most input graphs are too large to partition directly, which is why so-called multilevel methods are used:
The graph is first coarsened (nodes are combined while trying to preserve the graph structure) until its size becomes feasible to partition directly
The coarsest graph is partitioned using an initial partitioning technique, where we could use a variety of approaches (combinatorial bisection, spectral bisection, exact solutions using ILPs, ...).
The graph is then uncoarsened, where in each step a small number of nodes are moved from partition to partition in a local search to improve the overall edge cut.
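To give a feel for the coarsening step, here is a rough sketch of one matching-based coarsening round. It is a simplified greedy variant (edges are visited from heaviest to lightest) and not the actual METIS implementation; the function names and the toy weights are assumptions for illustration:

```python
def heavy_edge_matching(weights):
    """Greedily pair up nodes along heavy edges.
    weights maps an edge (u, v) to its weight; each node is matched at most once."""
    matched = {}
    # heaviest edges first; ties are broken by the edge tuple to stay deterministic
    for (u, v), _ in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        if u not in matched and v not in matched:
            matched[u] = v
            matched[v] = u
    return matched

def coarsen(weights, matched):
    """Contract every matched pair into one super-node and sum parallel edge weights."""
    def representative(node):
        # the smaller id of a matched pair names the super-node
        return min(node, matched.get(node, node))

    coarse = {}
    for (u, v), w in weights.items():
        a, b = representative(u), representative(v)
        if a != b:  # edges inside a contracted pair disappear
            key = (min(a, b), max(a, b))
            coarse[key] = coarse.get(key, 0) + w
    return coarse

# Hypothetical weighted graph on four nodes
weights = {(0, 1): 5, (1, 2): 1, (2, 3): 4, (0, 3): 1, (1, 3): 2}

matched = heavy_edge_matching(weights)
coarse = coarsen(weights, matched)
print(matched, coarse)
```

Here nodes 0 and 1 as well as 2 and 3 get merged, and the three light edges between the pairs collapse into a single coarse edge of weight 4. Repeating such rounds shrinks the graph until it is small enough for the initial partitioning.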
My personal recommendation
I should however note that while graph partitioning might be a valid model for your case, METIS itself might not be the ideal implementation for you:
As you can read on the METIS homepage, it is mostly used for rather sparse graphs ('finite element methods, linear programming, VLSI, and transportation'), whereas social networks are much denser and have a different structure (degrees follow a power-law distribution).
The coarsening approach of METIS uses heavy edge matching to combine nodes which are somehow close together, which works great for the intended applications. For social networks, however, clustering-based coarsening techniques might prove more efficient.
Another library, which is a bit slower in general but implements some presets especially for social networks, is KaHIP; see the manual for details.
(I should mention however that I am biased in this regard, since I worked extensively with this library ;-) )

Partitioning dense data points using clustering

I have to cluster data consisting of power profiles of solar panel output. I tried various algorithms, from classical k-means to shape-based clustering. I have to decide how many clusters there are in the pool of data, and I always get 2 clusters, so I think the data are very dense.
Is there any way I can partition a dense cluster?

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is: which clustering algorithm would be best for forming clusters from this kind of data?
I think hierarchical clustering is a good choice. Have a look here Clustering Algorithms
The simplest way to do clustering is with the k-means algorithm. If all of your attributes are numerical, then this is the easiest way of doing the clustering. Even if they are not, you would have to find a distance measure for categorical or nominal attributes, but k-means is still a good choice. K-means is a partitional clustering algorithm; I wouldn't use hierarchical clustering for this case. But that also depends on what you want to do. You need to decide whether you want to find clusters within clusters, or whether the clusters all have to be completely separate and not contained in each other.
Take care.
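As an illustration of a distance measure that handles both numerical and categorical attributes, here is a sketch of a Gower-style distance. This particular measure is my own pick and not something prescribed above, and the attribute names, value ranges and sample records are invented. Note that such a distance plugs more directly into hierarchical or medoid-based clustering than into textbook k-means, which averages coordinates:

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style distance between two records given as dicts.
    Numerical attributes contribute |a - b| / range, categorical ones 0 or 1;
    the result is the average over all attributes, so it lies in [0, 1]
    as long as the values stay inside the given ranges."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            low, high = numeric_ranges[key]
            if high > low:
                total += abs(a[key] - b[key]) / (high - low)
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

# Hypothetical (min, max) of each numerical attribute over the whole data set
ranges = {"consumption": (0.0, 40.0), "generation": (0.0, 10.0)}

# Two made-up households
p = {"consumption": 10.0, "generation": 4.0, "source": "solar"}
q = {"consumption": 20.0, "generation": 4.0, "source": "wind"}

d_pq = gower_distance(p, q, ranges)
d_pp = gower_distance(p, p, ranges)
print(d_pq, d_pp)
```

Dividing by the range keeps attributes with large units (e.g. consumption) from dominating the ones with small units (e.g. tariff).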
1) First, try k-means. If that meets your needs, you are done. Play with different numbers of clusters (controlled by the parameter k). There are a number of implementations of k-means, and you can implement your own version if you have good programming skills.
K-means generally works well if the clusters are roughly circular/spherical, which is what you get when the data in each cluster comes from a Gaussian distribution with similar spread in every direction.
2) If k-means doesn't meet your expectations, it is time to read and think more. I suggest reading a good survey paper. The most common techniques are implemented in several programming languages and data mining frameworks, many of which are free to download and use.
3) If applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. You can work on that by yourself or team up with a machine learning expert.
Since most of your data is continuous, and it is reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions about the distributions from which those averages were calculated.
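For a rough idea of what a Gaussian mixture model does, here is a bare-bones sketch that fits two Gaussian components to a single attribute with the EM algorithm. It is one-dimensional and fixed to two components purely for brevity, the consumption values are made up, and the initialisation at the minimum and maximum is just a simple deterministic choice:

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    means = [min(xs), max(xs)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: probability that each point belongs to each component
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsing component
            weights[k] = nk / n
    return weights, means, variances, resp

# Hypothetical average daily consumption values forming two groups
consumption = [0.8, 0.9, 1.0, 1.1, 1.2, 8.8, 8.9, 9.0, 9.1, 9.2]

weights, means, variances, resp = fit_gmm_1d(consumption)
labels = [0 if r[0] > r[1] else 1 for r in resp]
print(means, labels)
```

Unlike k-means, the fit also yields a variance per cluster and a soft membership (`resp`) per data point, which is where the distributional assumptions come in.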

What is the relation between topic modeling and document clustering?

Topic modeling identifies distribution of topics in a document collection, which effectively identifies the clusters in the collection. So is it right to say that topic modeling is a technique to do document clustering?
A topic is quite different from a cluster of docs; after all, a topic is not composed of docs.
However, these two techniques are indeed related. I believe Topic Modeling is a viable way of deciding how similar documents are, hence a viable way for document clustering.
By representing each document as a topic distribution (actually a vector), topic modeling techniques reduce the feature dimensionality from the number of distinct words appearing in the corpus to the number of topics. The similarity between the docs' topic distributions can be calculated using cosine similarity or many other metrics, and it reflects the similarity of the docs themselves in terms of the topics/themes they cover. Based on this quantified similarity measure, many clustering algorithms can be applied to group the documents.
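As a small illustration, cosine similarity between topic distributions can be computed like this. The three documents and their three-topic distributions are invented; in practice the vectors would come from whatever topic model you trained:

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Hypothetical per-document topic distributions over 3 topics
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Documents a and b, which are dominated by the same topic, come out far more similar than a and c; a clustering algorithm fed with these similarities would group a and b together.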
And in this sense, I think it is right to say that topic modeling is a technique to do document clustering.
The relation between clustering and classification is very similar to the relation between topic modeling and multi-label classification.
In single-label multi-class classification we assign just one label to each document, and in clustering we put each document in just one group. The difference is that we can't define the clusters in advance the way we define labels. If we ignore this fact, grouping and labeling are essentially the same thing.
However, in real-world problems flat classification is not sufficient. Often documents are related to multiple categories/classes, so we use multi-label classification. Now, we can see topic modeling as the unsupervised version of multi-label classification, as we can put each document under multiple groups/topics. Here again, I'm ignoring the fact that we can't decide in advance which topics to use as labels.