How to access the data points within a cluster? - scala

So I am trying to create Google Maps-like software that evaluates the trip time from point A to point B based on taxi data.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val taxiFile = sc.textFile("/resources/LabData/nyctaxisub.csv")
val header = taxiFile.first()
val taxiNoHead = taxiFile.filter(x => x != header)

// Columns 3, 4 and 9, 10 hold the dropoff and pickup latitude/longitude
val taxiData = taxiNoHead.
  filter(_.split(",")(3) != "").
  filter(_.split(",")(4) != "")

val taxiFence = taxiData.
  filter(_.split(",")(3).toDouble > 40.70).
  filter(_.split(",")(3).toDouble < 40.77).
  filter(_.split(",")(4).toDouble > -74.02).
  filter(_.split(",")(4).toDouble < -73.7).
  filter(_.split(",")(9).toDouble > 40.70).
  filter(_.split(",")(9).toDouble < 40.77).
  filter(_.split(",")(10).toDouble > -74.02).
  filter(_.split(",")(10).toDouble < -73.7)

val taxi = taxiFence.
  map { line =>
    Vectors.dense(line.split(',').slice(3, 5).map(_.toDouble))
  }.cache()

val iterationCount = 10
val clusterCount = 40

val model = KMeans.train(taxi, clusterCount, iterationCount)
val clusterCenters = model.clusterCenters.map(_.toArray)

clusterCenters.foreach(center => println(center(0), center(1)))
(Here is a link to a map with all of my cluster centers: http://www.darrinward.com/lat-long/?id=1922446)
Now my problem is that I am not sure how to calculate the average time from one cluster to another. This would be easy if there were a way to access the data points assigned to each cluster: I could then average trip_time_in_secs over trips that go from one cluster to the cluster nearest to the dropoff point. In addition, if there were clusters between points A and B, say a point C, it would also calculate the average time from A to C and from C to B and sum these together, similarly to Dijkstra's algorithm.
Is this possible to do? And if yes, how?
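One possible direction (a minimal sketch, assuming the trained model from above and a hypothetical column index TRIP_TIME_IDX for trip_time_in_secs; the real index depends on the CSV layout): predict a cluster for both endpoints of every trip, then average the trip durations per (pickup cluster, dropoff cluster) pair. Those averages could then serve as edge weights for a Dijkstra-style search over the cluster graph.

val TRIP_TIME_IDX = 8 // hypothetical; replace with the real trip_time_in_secs column

val pairTimes = taxiFence.map { line =>
  val cols = line.split(",")
  val dropoff = Vectors.dense(cols(3).toDouble, cols(4).toDouble)
  val pickup  = Vectors.dense(cols(9).toDouble, cols(10).toDouble)
  // reusing the model for pickup points only makes sense because both
  // pickups and dropoffs are lat/long points in the same area
  ((model.predict(pickup), model.predict(dropoff)), (cols(TRIP_TIME_IDX).toDouble, 1L))
}

// average trip time per cluster pair
val avgTimes = pairTimes
  .reduceByKey { case ((t1, n1), (t2, n2)) => (t1 + t2, n1 + n2) }
  .mapValues { case (total, n) => total / n }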

Related

What kind of variable to select for incrementing node labels in a community detection algorithm

I am working on a community detection algorithm that uses the concept of propagating labels to nodes. I have a problem selecting the right type for the Label_counter variable.
There is an algorithm called LPA (label propagation algorithm) which propagates labels to nodes through iterations; think of a label as a node property. The initial label of each node is its node ID, and in each iteration nodes update their label to the most frequent label among their neighbors. The algorithm I am working on is similar to LPA. At first every node has an initial label equal to 0, and then nodes get new labels. As nodes update and get new labels, the Label_counter should, based on some conditions, be incremented by one so its value can be used as the label for other nodes, for example label = 1 or label = 2 and so on. For example, the Zachary karate club dataset has 34 nodes and 2 communities.
The initial state is like this:
(1,0)
(2,0)
.
.
.
(34,0)
The first number is the node ID and the second one is the label.
As nodes get new labels, the Label_counter increments, and in the next iterations other nodes get new labels and the Label_counter increments again.
(1,1)
(2,1)
(3,1)
.
.
.
(33,3)
(34,3)
Nodes with the same label belong to the same community.
The problem I have is this: because nodes in an RDD and variables are distributed across the machines (each machine has a copy of the variables) and executors don't communicate with each other, if one executor updates the Label_counter the other executors won't be informed of its new value, and nodes may get wrong labels. Is it correct to use an Accumulator as the label counter in this case, since Accumulators are shared variables across machines, or is there another way to handle this problem?
In Spark it is always complicated to compute index-like values, because they depend on things that are not in all the partitions. I can propose the following idea:
1. Count the number of times the condition is met per partition.
2. Compute the cumulative increment per partition, so that we know the initial increment of each partition.
3. Increment the values of each partition based on that initial increment.
Here is what the code could look like. Let me start by setting up a few things.
// Let's define some condition
def condition(node: Long) = node % 10 == 1

// step 0, generate the data
val rdd = spark.range(34)
  .select('id + 1).repartition(10).rdd
  .map(r => (r.getAs[Long](0), 0))
  .sortBy(_._1).cache()

rdd.collect
Array[(Long, Int)] = Array((1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0),
(9,0), (10,0), (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0),
(19,0), (20,0), (21,0), (22,0), (23,0), (24,0), (25,0), (26,0), (27,0), (28,0),
(29,0), (30,0), (31,0), (32,0), (33,0), (34,0))
Then the core of the solution:
// steps 1 and 2
val partIncrInit = rdd
  // to each partition, we associate the number of times we need to increment
  .mapPartitionsWithIndex { case (i, p) =>
    Iterator(i -> p.map(_._1).count(condition))
  }
  .collect.sorted     // sort by partition index
  .map(_._2)          // we don't need the index anymore
  .scanLeft(0)(_ + _) // cumulative sum

// step 3, we increment each partition based on this initial increment
val result = rdd
  .mapPartitionsWithIndex { case (i, p) =>
    var incr = 0
    p.map { case (node, value) =>
      if (condition(node))
        incr += 1
      (node, partIncrInit(i) + value + incr)
    }
  }
result.collect
Array[(Long, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1),
(9,1), (10,1), (11,2), (12,2), (13,2), (14,2), (15,2), (16,2), (17,2), (18,2),
(19,2), (20,2), (21,3), (22,3), (23,3), (24,3), (25,3), (26,3), (27,3), (28,3),
(29,3), (30,3), (31,4), (32,4), (33,4), (34,4))
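A note on the Accumulator idea from the question: accumulators are write-only from the executors' perspective (tasks can only add to them; a consistent value is only readable on the driver after an action), so they cannot hand out sequential labels inside a transformation. That is why the per-partition counting approach above is used instead.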

How to use kmedoids from pyclustering with a set number of clusters

I am trying to use k-medoids to cluster some trajectory data I am working with (multiple points along the trajectory of an aircraft). I want to cluster these into a set number of clusters (as I know how many types of paths there should be).
I have found that k-medoids is implemented inside the pyclustering package, and am trying to use that. I am technically able to get it to cluster, but I do not know how to control the number of clusters. I originally thought it was directly tied to the number of elements inside what I called initial_medoids, but experimentation shows that it is more complicated than this. My relevant code snippet is below.
Note that traj_lst is a list of lists, where each inner list corresponds to a single trajectory; D below is the pairwise Hausdorff distance matrix.
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from pyclustering.cluster.kmedoids import kmedoids

def hausdorff(u, v):
    d = max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
    return d

traj_count = len(traj_lst)
D = np.zeros((traj_count, traj_count))

for i in range(traj_count):
    for j in range(i + 1, traj_count):
        distance = hausdorff(traj_lst[i], traj_lst[j])
        D[i, j] = distance
        D[j, i] = distance

initial_medoids = [104, 345, 123, 1]

kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()[0]

num_clusters = len(np.unique(cluster_lst))
print('There were %i clusters found' % num_clusters)
I have a total of 1900 trajectories, and the above code finds 1424 clusters. I had expected to control the number of clusters through the length of initial_medoids, as I did not see any option to input the number of clusters directly, but the two seem unrelated. Could anyone point out the mistake I am making? How do I choose the number of clusters?
If you want to obtain the clusters, you need to call get_clusters():
cluster_lst = kmedoids_instance.get_clusters()
not get_clusters()[0], which is just the list of object indexes in the first cluster:
cluster_lst = kmedoids_instance.get_clusters()[0]
And that is correct: you can control the amount of clusters by initial_medoids.
It is true that you can control the number of clusters, which corresponds to the length of initial_medoids.
The documentation is not clear about this. The get_clusters function "Returns list of medoids of allocated clusters represented by indexes from the input data", so this function does not return cluster labels; it returns the indexes of rows in your original (input) data.
Please check the shape of cluster_lst in your example, using .get_clusters() and not .get_clusters()[0] as annoviko suggested. In your case, this shape should be (4,), so you have a list of four elements (clusters), each containing the indexes of rows in your original data.
To get, for example, data from the first cluster, use:
kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()

# note: this fancy indexing requires traj_lst to be a NumPy array, not a plain list
traj_lst_first_cluster = traj_lst[cluster_lst[0]]

How do I view the datapoints that are added to a cluster after applying the K-Means algorithm?

I have implemented the k-means algorithm in Scala as follows.
def clustering(clustnum: Int, iternum: Int, parsedData: RDD[org.apache.spark.mllib.linalg.Vector]): Unit = {
  val clusters = KMeans.train(parsedData, clustnum, iternum)
  println("The cluster centers of each column for " + clustnum + " clusters and " + iternum + " iterations are:- ")
  clusters.clusterCenters.foreach(println)
  val predictions = clusters.predict(parsedData)
  predictions.collect()
}
I know how to print the cluster centers of each cluster, but is there a function in Scala that prints which rows have been assigned to which cluster?
The data I am working with contains rows of float values, with each row having an ID. It has around 34 columns and around 200 rows. I am working with Spark in Scala.
I need to be able to see the result, as in: Id_1 is in cluster 1, and so on.
Edit: I was able to do this much:
println(clustnum + " clusters and " + iternum + " iterations ")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
It prints each row that was assigned to a cluster along with its cluster ID.
The row is shown as a string and the cluster ID is printed after it.
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
But is there some way to print just the row ID and the cluster ID?
Would using dataframes help me here?
You can use the predict() function of KMeansModel.
Have a look at the documentation: http://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel
In your code:
KMeans.train(parsedData, clustnum, iternum)
returns a KMeansModel object.
So, you can do this:
val predictions = clusters.predict(parsedData)
and get an RDD of cluster indices (a MapPartitionsRDD) as the result.
predictions.collect()
gives you an Array with the cluster index assignments.
println(clustnum + " clusters and " + iternum + " iterations ")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
This seems to solve my problem. It prints each row that was assigned to a cluster along with its cluster ID.
The row is shown as a string and the cluster ID is printed after it.
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
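For the follow-up question (printing just the row ID and the cluster ID), a minimal sketch, assuming the ID is stored in the first element of each vector:

val idAndClusterIdx = parsedData.map { point =>
  val id = point(0).toLong // hypothetical: column 0 of each row holds the ID
  (id, clusters.predict(point))
}
idAndClusterIdx.collect().foreach(println)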

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train on an existing data set (Seq[Words] with corresponding categories) and use that trained dataset to filter another dataset using category similarity.
I am trying to train on a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
So, I am now trying to use TFIDF, passing that output into a RowMatrix and computing the similarities. But I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus: RDD[Vector]
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right up to here). I tried filtering the similarities' i and j entries down to the ones corresponding to my query's TFIDF and ended up with an empty collection.
The gist is that I want to train on a corpus of data and find what category a query falls in. The above code is at least trying to narrow it down to one category and check whether I can get a prediction from that at all....
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a set of terms, you'll get a matrix of similarities between tokens, not documents. You could transpose the matrix and then use columnSimilarities, but as far as I understand, what you want is the similarity between the query and the corpus. You can express that using matrix multiplication as follows:
For starters you'll need an IDFModel you've trained on the corpus. Let's assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
and a small helper:
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(
  rdd.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
).toCoordinateMatrix.toBlockMatrix
First let's convert the query to an RDD and compute TF:
val query: Seq[String] = ???
val queryTf = new HashingTF().transform(query)
Next we can apply the IDF model and convert the result to a matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiply the two, you get a matrix with the number of rows equal to the number of docs in the query and the number of columns equal to the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you have to divide by the product of the magnitudes, but you can handle that.
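For completeness, a sketch of that normalization, assuming the variables defined above and a corpus small enough to collect the norms (Vectors.norm is available from Spark 1.4):

import org.apache.spark.mllib.linalg.Vectors

// one L2 norm per query document and per corpus document
val queryNorms  = queryTfidf.map(Vectors.norm(_, 2.0)).collect
val corpusNorms = rddOfTfidfFromCorpus.map(Vectors.norm(_, 2.0)).collect

val cosine = dotProducts.toCoordinateMatrix.entries.map { e =>
  // entry (i, j) holds the dot product of query doc i and corpus doc j
  (e.i, e.j, e.value / (queryNorms(e.i.toInt) * corpusNorms(e.j.toInt)))
}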
There are two problems here. First of all, it is rather expensive. Moreover, I am not sure if it is really useful. To reduce the cost you could apply some dimensionality reduction algorithm first, but let's leave that for now.
Judging from the following statement
NaiveBayes (...) seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)
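Note that predict() here also always returns a cluster, just like NaiveBayes. A hedged sketch of one way around that is to threshold the distance to the assigned center (the cut-off below is a made-up placeholder):

import org.apache.spark.mllib.linalg.Vectors

val threshold = 1.0 // hypothetical cut-off; tune on your data
val matched = queryTfidf.map { v =>
  val cluster = model.predict(v)
  // squared Euclidean distance to the assigned cluster center
  val dist = Vectors.sqdist(v, model.clusterCenters(cluster))
  (v, if (dist <= threshold) Some(cluster) else None)
}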

Finding maximum edge weight in Spark GraphX

Let's say I have a graph with double values for edge attributes and I want to find the maximum edge weight of my graph. If I do this:
val max = sc.accumulator(0.0) // max holds the maximum edge weight
g.edges.distinct.collect.foreach { e =>
  if (e.attr > max.value) max.value = e.attr
}
I want to ask how much work is done on the master and how much on the executors, because I know that the collect() method brings the entire RDD to the master. Does any parallelism happen? Is there a better way to find the maximum edge weight?
NOTE:
g.edges.distinct.foreach { e =>
  if (e.attr > max.value) max.value = e.attr
} // does not work without the collect() method
// I use an accumulator because I want to use the max edge weight later
And if I want to apply some averaging function to the attributes of edges that have the same srcId and dstId between two graphs, what is the best way to do it?
You can either aggregate:
graph.edges.aggregate(Double.NegativeInfinity)(
  (m, e) => e.attr.max(m),
  (m1, m2) => m1.max(m2)
)
or map and take max:
graph.edges.map(_.attr).max
Regarding your attempts:
If you collect, all the data is processed sequentially on the driver, so there is no reason to use an accumulator.
Without collect() it doesn't work because accumulators are write-only from a worker's perspective.
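As for the second question (averaging the attributes of edges with the same srcId and dstId across two graphs), a minimal sketch, assuming two graphs g1 and g2 with Double edge attributes and at most one edge per (srcId, dstId) pair:

import org.apache.spark.graphx.Edge

val avgEdges = g1.edges.map(e => ((e.srcId, e.dstId), e.attr))
  .join(g2.edges.map(e => ((e.srcId, e.dstId), e.attr)))
  .map { case ((src, dst), (a1, a2)) => Edge(src, dst, (a1 + a2) / 2.0) }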