How to modify vertex data when calling the mapTriplets method in Spark GraphX - Scala

The mapTriplets operation in Spark GraphX can transform the triplets of a graph into some other form, as its definition shows:
def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
My data is a sparse bipartite graph, and the vertex data of each edge must be updated during every iteration. For example, given an edge (srcAttr, dstAttr, attr), the srcAttr and dstAttr vertices should be modified according to attr. Therefore, what I need is to visit all (srcAttr, dstAttr, attr) combinations and use attr to update the vertices.
GraphX provides the mapTriplets method, which can transform all (srcAttr, dstAttr, attr) combinations, but I cannot figure out how to modify the vertices when executing this method.
So, is there any strategy that can modify the vertices when traversing all edges?

I cannot figure out how to modify the vertices when executing this method
Because it is simply not possible. First of all, GraphX data structures, like other distributed data structures in Spark, are immutable. Moreover, mapTriplets is designed to transform edges, not vertices.
is there any strategy that can modify the vertices when traversing all edges?
If you want to transform vertices using edge data, then aggregateMessages should give you what you want. It takes two functions:
one from EdgeContext to Unit, which can be used to send messages to the source and/or destination vertices,
a second one which reduces the messages arriving at each vertex,
and it returns a VertexRDD which can then be joined back to the graph (for example with outerJoinVertices) to construct a new graph with updated vertex data, as sketched below.
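Here is a minimal sketch, assuming Double vertex and edge attributes and a made-up update rule that simply adds each incident edge attribute to its endpoints; substitute your own message type and logic:

import org.apache.spark.graphx.{Graph, VertexRDD}

def updateVertices(graph: Graph[Double, Double]): Graph[Double, Double] = {
  // 1. Send the edge attribute to both endpoints of every edge.
  val messages: VertexRDD[Double] = graph.aggregateMessages[Double](
    ctx => {
      ctx.sendToSrc(ctx.attr)
      ctx.sendToDst(ctx.attr)
    },
    (a, b) => a + b // 2. Reduce all messages arriving at one vertex.
  )

  // 3. Join the aggregated messages back to produce a graph with new vertex data.
  graph.outerJoinVertices(messages) { (_, oldAttr, msgOpt) =>
    oldAttr + msgOpt.getOrElse(0.0) // vertices that receive no message keep their old value
  }
}

outerJoinVertices (rather than joinVertices) is used so that vertices which receive no message still end up with a well-defined attribute.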

Related

Networkx create graph from adjacency matrix without edge weights

When I call G = nx.convert_matrix.from_numpy_array(A, create_using=nx.DiGraph), where A is a 0-1 adjacency matrix, the resulting graph automatically contains edge weights of 1.0 for each edge. How can I prevent this attribute from being added?
I realize I can write
for _, _, d in G.edges(data=True):
    d.clear()
but I would prefer if the attributes were not added in the first place.
There is no way to do that with native networkx functions. This is how you can do it:
G = nx.empty_graph(0, nx.DiGraph)
G.add_nodes_from(range(A.shape[0]))
G.add_edges_from(((int(e[0]), int(e[1])) for e in zip(*A.nonzero())))
This is exactly how the nx.convert_matrix.from_numpy_array function is implemented internally; however, I removed all of its input checks, so be careful with this. Additional details can be found here.

Arangodb- Differences between filter and expandFilter in Foxx traversal

In the traversal config options there are two settings that seem to do the same thing: filter and expandFilter. Is there any difference between them?
While filter is used to limit which vertices a traversal returns, expandFilter can exclude certain edges from the traversal.
filter: vertex filter function. The function signature is function (config, vertex, path). It may return one of the following values:
undefined: the vertex will be included in the result and its connected edges will be traversed
"exclude": the vertex will not be included in the result, but its connected edges will still be traversed
"prune": the vertex will be included in the result, but its connected edges will not be traversed
[ "prune", "exclude" ]: the vertex will not be included in the result and its connected edges will not be traversed
expandFilter: filter function applied on each edge/vertex combination determined by the expander. The function signature is function (config, vertex, edge, path). The function should return true if the edge/vertex combination should be processed, and false if it should be ignored.
This is documented in the ArangoDB Manual.

Fitting Spark ML Kmeans for subsets or groups of data

I've got a Dataset where each Row is a (class: String, vectors: Array[Array[Float]]), and I'd like to fit a kmeans model in Spark MLLib per class. I can explode the vectors to normalize the data, loop through the classes, filter the entire dataset by class, and fit a model per iteration of the loop, but that's horribly inefficient (although it's how Spark does it in the fit method of the OneVsRest classifier here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala).
Here's a snippet that accomplishes this with a ParArray, inspired by OneVsRest's approach:
val classes = normalized_data.select("class").distinct.map(_.getString(0)).collect
val kmeans = new KMeans().setK(5)
val models = classes.par.map { c => // `class` is a reserved word in Scala, so use another name
  val training_data = unpacked_data.filter($"label" === c)
  val model = kmeans.fit(training_data)
  (c, model)
}
It seems that the KMeans fit method needs the data to be a Dataset with one-row-per-vector, which suggests normalizing / exploding the data, but what's the best way to go about this? Can't I somehow leverage the fact that I start with all of my data points in each row and/or group on the label to use just these points without explicitly filtering the entire dataset for every class I want to build a model for?
PS- I know KMeans.fit actually needs org.apache.spark.ml.linalg.Vector; presume I've transformed my Array[Float] accordingly.
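For reference, a minimal sketch of that conversion together with the per-vector explosion might look like the following (assuming the (class, vectors) schema above; the helper and column names are made up, not from any library):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, udf}

// Explode to one row per vector and convert each Array[Float] into an
// ml Vector so that KMeans.fit can consume it.
def toPerVectorRows(df: DataFrame): DataFrame = {
  val toVec = udf((xs: Seq[Float]) => Vectors.dense(xs.map(_.toDouble).toArray))
  df.withColumn("vector", explode(col("vectors"))) // one row per inner Array[Float]
    .withColumn("features", toVec(col("vector")))  // Array[Float] -> ml Vector
    .select(col("class"), col("features"))
}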

Cannot get clustering output Mahout

I am running kmeans in Mahout and as an output I get folders clusters-x, clusters-x-final and clusteredPoints.
If I understood well, clusters-x are the centroid locations at each iteration, clusters-x-final are the final centroid locations, and clusteredPoints should be the points being clustered, with a cluster id and a weight representing the probability of belonging to that cluster (depending on the distance between the point and its centroid). On the other hand, clusters-x and clusters-x-final contain the cluster centroids, the number of elements, the centroid feature values, and the radius of the cluster (the distance between the centroid and its farthest point).
How do I examine these outputs?
I used the cluster dumper successfully for clusters-x and clusters-x-final from the terminal, but when I used it on clusteredPoints I got an empty file. What seems to be the problem?
And how can I get these values from code? I mean the centroid values and the points belonging to each cluster.
For clusteredPoints I used IntWritable as the key and WeightedPropertyVectorWritable as the value in a while loop, but it skips the loop as if there were no elements in clusteredPoints.
This is even more strange because the file that I get with clusterDumper is empty as well.
What could be the problem?
Any help would be greatly appreciated!
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As far as linking points back to the input that created them I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints) you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types which you can use, but it also gives you access to a Vector wrapper class called NamedVector which allows you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
new SequentialAccessSparseVector(vectorDimensions),
vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), writable);
You can now use this file as input to one of the clustering algorithms.
After you have run one of the clustering algorithms with your points file, it will have generated yet another points file, but this one will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector)) {
    NamedVector nVec = (NamedVector) vector.getVector();
    // you now have access to the original name using nVec.getName()
}
Check the parameter named "clusterClassificationThreshold"; it should be 0.
You can check this thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700#windwardsolutions.com%3E

Delaunay Triangulation - Removing Triangles

I made a Delaunay triangulation using MATLAB version 2013. I want to remove some of the triangles, i.e. cancel their connectivity, for example triangle number 760. How can I make this change? When I tried to edit the connectivity list:
dt.ConnectivityList(760 , :) = [];
I got the message:
Cannot assign values to the triangulation.
I thought about maybe copying specific fields to a different structure, but:
a. I'm not familiar with structures so I don't know how to do it right.
b. After I copy the structure, how can I get my triangles?
dt contains 3 fields: Points, ConnectivityList and Constraints (empty field).
A brief note on MATLAB objects. When you access a field for reading, you are basically doing get(obj, fieldname);. When you try to set a field as you are doing, you are actually calling set(obj, fieldname, new_value). Objects do not necessarily allow you to do these operations.
The triangulation object is read-only, so you will have to make copies of all the fields. If, as you mentioned, you would like to make a structure with similar fields, you can do as follows:
dts = struct('Points', dt.Points, 'ConnectivityList', dt.ConnectivityList);
Now you can edit the fields.
dts.ConnectivityList(760, :) = [];
You may be able to plot the new structure, but the methods of the delaunayTriangulation class will not be available to you.
To plot the result, pass the connectivity list and the point coordinates separately, e.g. with triplot for a 2-D triangulation (trisurf, as originally suggested, expects separate x, y and z inputs):
triplot(dts.ConnectivityList, dts.Points(:,1), dts.Points(:,2));
I was facing the same problem and found another solution. Instead of creating a new struct, just create an object of its superclass, i.e. the triangulation class, with the edited connectivity list.
Here is my code:
% P - list of points
% C - constraints (optional)
dt = delaunayTriangulation(P, C); % creates the triangulation, but dt won't let you change its connectivity list
list = dt.ConnectivityList;
% make your changes here, e.g. list(760, :) = [];
x = triangulation(list, dt.Points);
Now you can use x as a triangulation object:
triplot(x)