How to parallelize KMeans? - pyspark

I have a DataFrame with millions of rows and columns A, B and C. I need to cluster C separately for each combination of A and B.
Right now I run a for loop over every (A, B) combination. This is obviously slow, but I can't really parallelize it. I tried using Window and pandas_udf, but that doesn't work with scalar functions.
Edit: some code to help understand my problem
# suppose we have a variable my_df which is a dataframe with millions of rows
# its schema is type, code, quantity
# I have to cluster quantity by types
# hundreds of types in real case
types = [1, 2, 3]
for index, t in enumerate(types):
    filtered_my_df = my_df.filter(my_df.type == t)
    # then I apply kmeans on filtered_my_df...
When I have hundreds or even thousands of types, this code obviously is not scalable. But I can't figure out how to do this in a scalable way.

In the case of k-means, the algorithm randomly selects k objects from the data set to serve as initial cluster centers. Each remaining object is then assigned to the cluster it is most similar to, based on the distance between the object and the cluster center, and a new mean is computed for each cluster. This process iterates until the criterion function converges, which is why it becomes so time-costly.
You can use Parallel K-Means Based on MapReduce.
The distance computation between one object and the centers is independent of the distance computations between the other objects and those centers, so the distance computations for different objects can be executed in parallel.
The MLlib implementation of k-means corresponds to the algorithm called k-means||, which is a parallel version of the original one. The method header is defined as follows (let's say parsedData is your Spark RDD):
KMeans.train(rdd, k, maxIterations, runs, initializationMode)
from pyspark.mllib.clustering import KMeans
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode='random')
# Evaluation
from math import sqrt
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print('Within Set Sum of Squared Error = ' + str(WSSSE))

Related

Pseudo randomization in MATLAB with minimum intervals between stimulus categories

For an experiment I need to pseudo randomize a vector of 100 trials of stimulus categories, 80% of which are category A, 10% B, and 10% C. The B trials have at least two non-B trials between each other, and the C trials must come after two A trials and have two A trials following them.
At first I tried building a script that randomized a vector and sort of "popped" out the trials that were not where they should be, and put them in a spot in the vector where there was a long series of A trials. I'm worried, though, that this is overcomplicated, will create an endless series of unforeseen errors that need debugging, and may not be random enough.
After that I tried building a script which simply shuffles the vector until it reaches the criteria, which seems to require less code. However now that I have spent several hours on it, I am wondering if these criteria aren't too strict for this to make sense, meaning that it would take forever for the vector to shuffle before it actually met the criteria.
What do you think is the simplest way to handle this problem? Additionally, which would be the best shuffle function to use, since Shuffle in psychtoolbox seems to not be working correctly?
The scope of this question goes well beyond language-specific constructs and requires a good understanding of probability and permutations/combinations.
An approach to solving this question is:
Create blocks of vectors, such that each block can be placed anywhere independently.
Randomly allocate these blocks to get a final random vector satisfying all constraints.
Part 0: Category A
Since category A has no constraints imposed on it, we will go to the next category.
Part 1: Make category C independent
The only constraint on category C is that it must have two A's before and after. Hence, we first create random groups of 5 vectors, of the pattern A A C A A.
At this point, we have an array of A vectors (excluding blocks), blocks of A A C A A vectors, and B vectors.
Part 2: Resolving placement of B
The constraint on B is that two consecutive Bs must have at least 2 non-B vectors between them.
Visualize as follows: Let's pool A and A A C A A in one array, X. Let's place all Bs in a row (suppose there are 3 Bs):
s0 B s1 B s2 B s3
Here each s is the number of vectors in that gap. Hence we require that s1 and s2 be at least 2, and that s0 + s1 + s2 + s3 equals the number of vectors in X.
The task is then to choose random vectors from X and assign them to each s. At the end, we finally have a random vector with all categories shuffled, satisfying the constraints.
P.S. This can be mapped to the classic problem of finding a set of random numbers that add up to a certain sum, with constraints.
It is easier to reduce the constrained sum problem to one with no constraints. This can be done as:
s0 B s1 t1 B s2 t2 B s3
Here t1 and t2 are chosen from X with just enough vectors to satisfy the constraint on B, and s0 + s1 + s2 + s3 equals the number of vectors in X not used for the t's.
Implementation
Implementing this in MATLAB could benefit from using cell arrays, and from this algorithm for generating random numbers with a constant sum.
You would also need to maintain separate pools for each category, and keep building blocks and piece them together.
Really, this is not trivial, but it is also not impossible. This is the approach you could try if you want to step away from the brute-force search you tried before.
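For what it's worth, here is a rough sketch of the block-and-gap idea (in Scala rather than MATLAB, but the structure maps directly onto cell arrays). The function name buildSequence and the greedy way the t separators are drawn are my own illustration, not part of the answer above:

import scala.util.Random

// Defaults match the question: 80 A, 10 B and 10 C trials. Every C is wrapped
// as A A C A A, and B's are separated by at least two non-B trials.
def buildSequence(nA: Int = 80, nB: Int = 10, nC: Int = 10, rng: Random = new Random): Vector[String] = {
  // Part 1: one "A A C A A" block per C, plus the A's that are left over.
  val blocks  = Vector.fill(nC)(Vector("A", "A", "C", "A", "A"))
  val singles = Vector.fill(nA - 4 * nC)(Vector("A"))
  var pool = rng.shuffle(blocks ++ singles)  // the array X from above

  // Part 2: for each of the nB - 1 interior gaps, draw items from the pool
  // until they contain at least two non-B trials (the t separators).
  val separators = Vector.fill(nB - 1) {
    var sep = Vector.empty[Vector[String]]
    while (sep.map(_.size).sum < 2) { sep :+= pool.head; pool = pool.tail }
    sep
  }

  // Spread whatever is left over the nB + 1 gaps s0 .. s_nB at random.
  val gaps = Array.fill(nB + 1)(Vector.empty[Vector[String]])
  pool.foreach { item =>
    val g = rng.nextInt(nB + 1)
    gaps(g) = gaps(g) :+ item
  }

  // Stitch gaps, B's and separators back into one flat trial sequence.
  val out = Vector.newBuilder[String]
  for (i <- 0 until nB) {
    gaps(i).foreach(out ++= _)
    out += "B"
    if (i < nB - 1) separators(i).foreach(out ++= _)
  }
  gaps(nB).foreach(out ++= _)
  out.result()
}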

Kmeans - group by

I want to compute k-means labels with numClusters = 6 so that I can group by those labels later. How do I select the columns to run k-means on?
val clusterThis = scaledDF.select($"id",$"setting1",$"setting2",$"setting3")
// dataset description lists six operation modes
val operatingModes = 6
// Cluster the data into six classes (the operating modes) using KMeans
val numClusters = operatingModes
val numIterations = 20
import sqlContext.implicits._
val clusters = KMeans.train(clusterThis.rdd, numClusters, numIterations)
clusters.predict(clusterThis)
//... join back on id
As you can see in KMeans's example, the model takes just one column as features; in that example the column happens, by coincidence, to have the same name. The name is up to you, but the important thing is that this column must contain a Vector (dense or sparse).
Thus, you need to combine your feature columns into a single vector column; for this task you can use a VectorAssembler, as sketched below.
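A minimal sketch of that assembly step, using the column names from the question's clusterThis (the output column name "features" is just a choice):

import org.apache.spark.ml.feature.VectorAssembler

// Combine the individual setting columns into the single Vector column
// that the clustering algorithm expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("setting1", "setting2", "setting3"))
  .setOutputCol("features")

val withFeatures = assembler.transform(scaledDF)  // "id" is kept, so grouping/joining later stays easy

From there you can either use the DataFrame-based org.apache.spark.ml.clustering.KMeans directly on the features column, or extract that column as an RDD[Vector] for the KMeans.train call above.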
By the way, k-means doesn't work with categorical features. You can read the post "K-means clustering for mixed numeric and categorical data" for the reasons.

Transforming a dataset from SQL to RDD[vector]

I know several questions have been asked on similar topics, but I couldn't apply any of the answers to my problem; I am also wondering about best practices.
I have loaded a dataset for ML into a SQL database and want to apply MLlib's clustering functions to it. I loaded the SQL table into a DataFrame using sqlContext and dropped the irrelevant columns. Then comes the problematic part: I create a Vector by parsing each row of the DataFrame.
The Vector is then transformed into an RDD using the toJavaRDD function.
Here is the code (works):
val usersDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/database")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "table")
  .option("user", "woot")
  .option("password", "woot-password")
  .load()
val cleanDF = usersDF.drop("id").drop("username")
cleanDF.show()
val parsedData = cleanDF.map(s =>
  Vectors.dense(s.toString().replaceAll("[\\[\\]]", "").trim.split(',').map(_.toDouble))
).cache()
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val train_set = splits(0).cache()
val gmm = new GaussianMixture().setK(2).run(train_set)
My main question concerns what I read in the Spark documentation about local vectors: in my understanding, the DataFrame mapping is performed on the workers, and the result is then sent to the driver when the Vector is created (is that what local vector means?), only to be sent back to the workers again later. Isn't there a better way to achieve this?
Another thing: it seems a little odd to load SQL into a DataFrame only to turn each row into a string and parse it again. Are there any other best-practice suggestions?
From the link you suggested:
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse.
A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs.
Local vectors behave like any other object you would put in an RDD (String, Integer, Array): they are created and stored on a single machine, the worker node, and they are only sent to the driver node if you collect them.
If you consider a vector x of size 2n and store it distributively, you would split it into two halves of length n, x1 and x2 (x = x1::x2). To perform the dot product with another vector y, the workers compute r1 = x1 * y1 (on machine 1) and r2 = x2 * y2 (on machine 2), and you then combine the partial results, giving r = r1 + r2. Here x is distributed, while x1 and x2 are again local vectors. If x is a local vector, you can compute r = x * y in a single step on one worker node.
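To make that concrete, a small illustrative sketch (it assumes a SparkContext sc is in scope; the variable names are made up for the example):

import org.apache.spark.mllib.linalg.Vectors

// A local vector lives on a single JVM; its dot product is a plain loop.
val x = Vectors.dense(1.0, 2.0, 3.0, 4.0)
val y = Vectors.dense(4.0, 3.0, 2.0, 1.0)
val localDot = x.toArray.zip(y.toArray).map { case (a, b) => a * b }.sum

// A distributed version of the same product: each worker computes the partial
// products for its slice of the indices, and the partials are summed at the end.
val xRdd = sc.parallelize(x.toArray.zipWithIndex.map(_.swap))  // (index, value) pairs
val yRdd = sc.parallelize(y.toArray.zipWithIndex.map(_.swap))
val distributedDot = xRdd.join(yRdd).map { case (_, (a, b)) => a * b }.sum()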
For your second question, I do not see why you would store the vectors in SQL format. Having a CSV file like this would be sufficient:
label, feature1, feature2, ...
1, 0.5, 1.2, ...
0, 0.2, 0.5, ...
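Reading such a file also avoids the toString/split round-trip in the code above; a rough sketch (the file path is illustrative, and the label column is dropped here because the GaussianMixture step is unsupervised):

import org.apache.spark.mllib.linalg.Vectors

// Parse the numeric CSV straight into dense vectors on the workers.
val parsedData = sc.textFile("data/users.csv")
  .filter(!_.startsWith("label"))                // skip the header row if there is one
  .map(_.split(',').map(_.trim.toDouble))
  .map(fields => Vectors.dense(fields.drop(1)))  // drop the label column, keep the features
  .cache()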

Linking the Machine Learning Prediction back to the original data set

I am in the process of doing a POC on retail transaction data, using a few machine learning algorithms to come up with a prediction model for out-of-stock analysis. My questions might sound stupid, but I would really appreciate it if you or anyone else could answer them.
So far I have been able to: get a data set ==> convert the features into (LabeledPoint, feature vector) form ==> train an ML model ==> run the model on a test data set ==> get the predictions.
Problem 1:
Since I have no experience with any of the Java/Python/Scala languages, I am building my features in the database and saving that data as a CSV file for my machine learning algorithm.
How do we create features from raw data using Scala?
Problem 2:
The source data set consists of many features for each (Store, Product, Date) combination and its recorded OOS events (the target):
StoreID (text column), ProductID (text column), TranDate, (Label/Target), Feature1, Feature2, ..., FeatureN
Since the features can only contain numeric values, I create features out of the numeric columns only, not the text ones (which form the natural key for me). When I run the model on a validation set I get a (Prediction, Label) array back.
Now how do I link this resultant set back to the original data set and see which specific (Store, Product, Date) might have a possible Out Of Stock event ?
I hope the problem statement was clear enough.
MJ
Spark's Linear Regression Example
Here's a snippet from the Spark Docs Linear Regression example that is fairly instructive and easy to follow.
It addresses both your "Problem 1" and "Problem 2", and it doesn't need a JOIN or even rely on RDD order.
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
Here data is an RDD of text lines.
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
Problem 1: Parsing the Features
This is data dependent. Here we see that lines are being split on , into fields. It appears this data was a CSV of entirely numeric data.
The first field is treated as the label of a labelled point (dependent variable), and the rest of the fields are converted from text to double (floating point) and stuck in a vector. This vector holds the features or independent variables.
In your own projects, the part you need to remember is the goal of parsing into an RDD of LabeledPoints, where the first parameter of LabeledPoint (the label) is the true dependent numeric value and the second parameter (the features) is a Vector of numbers.
Getting the data into this condition requires knowing how to code. Python may be easiest for data parsing. You can always use other tools to create a purely numeric CSV, with the dependent variable in the first column, and the numeric features in the other columns, and no header line -- and then duplicate the example parsing function.
// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
At this point we have a trained model object. The model object has a predict method that operates on feature vectors and returns estimates of the dependent variable.
Encoding Text features
The ML routines typically want numeric feature vectors, but you can often translate free text or categorical features (color, size, brand name) into numeric vectors in some space. There are a variety of ways to do this, such as Bag-of-Words for text, or One-Hot Encoding for categorical data, where you code a 1.0 or 0.0 for membership in each possible category (watch out for multicollinearity, though). These methodologies can create large feature vectors, which is why there are iterative methods available in Spark for training models. Spark also has a SparseVector() class, where you can easily create vectors with all but certain feature dimensions set to 0.0; a small hand-rolled sketch follows.
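This is only an illustration of the idea (the category map and the helper name oneHot are made up for the example), not a specific Spark API:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One-hot encoding by hand: map each category to a fixed index,
// then set exactly one dimension of a sparse vector to 1.0.
val colorIndex = Map("red" -> 0, "green" -> 1, "blue" -> 2)

def oneHot(color: String): Vector =
  Vectors.sparse(colorIndex.size, Array(colorIndex(color)), Array(1.0))

oneHot("green")  // (3,[1],[1.0]): a 3-dimensional vector with a single non-zero entry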
Problem 2: Comparing model Predictions to the True values
Next they test this model with the training data, but the calls would be the same with external test data, provided that the test data is an RDD of LabeledPoint(dependent value, Vector(features)). The input could be changed by changing the variable parsedData to some other RDD.
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Notice that this returns tuples of the true dependent variable previously stored in point.label, and the model's prediction from the point.features for each row or LabeledPoint.
Now we are ready to do Mean Squared Error, since the valuesAndPreds RDD contains tuples (v,p) of true value v and the prediction p both of type Double.
The MSE is a single number: the tuples are first mapped to an RDD of squared distances ||v-p||**2, which is then averaged, yielding that single number.
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
Spark's Logistic Example
This is similar, but here you can see data is already parsed and split into training and test sets.
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
Here the model is trained against the training set.
// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)
And tested (compared) against the test set. Notice that even though this is a different model (Logistic instead of Linear) there is still a model.predict method that takes a point's features vector as a parameter and returns the prediction for that point.
Once again the prediction is paired with the true value, from the label, in a tuple for comparison in a performance metric.
// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}
// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)
What about JOIN? RDD.join comes in when you have two RDDs of (key, value) pairs and need an RDD corresponding to the intersection of keys, with both values. But we didn't need that here.
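If you do want to tie the predictions back to a natural key such as (Store, Product, Date), one rough sketch is to carry the key alongside each LabeledPoint; keyedTest and originalRowsByKey below are assumed names, not part of the example above:

import org.apache.spark.mllib.regression.LabeledPoint

// keyedTest: RDD[((String, String, String), LabeledPoint)], key carried next to the point.
val keyedPredictions = keyedTest.map { case (key, LabeledPoint(label, features)) =>
  (key, (model.predict(features), label))
}

// originalRowsByKey: the raw rows keyed by the same (StoreID, ProductID, TranDate) triple.
// join gives one record per key present in both RDDs: (key, ((prediction, label), originalRow)).
val linked = keyedPredictions.join(originalRowsByKey)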

Apache Spark flatMap time complexity

I've been trying to find a way to count the number of times sets of Strings occur in a transaction database (implementing the Apriori algorithm in a distributed fashion). The code I have currently is as follows:
import scala.collection.mutable.ArrayBuffer

val cand_br = sc.broadcast(cand)
transactions.flatMap(trans => freq(trans, cand_br.value))
  .reduceByKey(_ + _)

def freq(trans: Set[String], cand: Array[Set[String]]): Array[(Set[String], Int)] = {
  val res = ArrayBuffer[(Set[String], Int)]()
  for (c <- cand) {
    if (c.subsetOf(trans)) {
      res += ((c, 1))
    }
  }
  res.toArray
}
transactions starts out as an RDD[Set[String]], and I'm trying to convert it to an RDD[(K, V)], where K is each element of cand and V is the number of occurrences of that element across the transaction list.
Watching performance in the UI, the flatMap stage takes about 3 minutes to finish, whereas the rest takes < 1 ms.
transactions.count() ~= 88000 and cand.length ~= 24000 for an idea of the data I'm dealing with. I've tried different ways of persisting the data, but I'm pretty positive that it's an algorithmic problem I am faced with.
Is there a more optimal solution to solve this subproblem?
PS: I'm fairly new to Scala / Spark framework, so there might be some strange constructions in this code
Probably, the right question to ask in this case would be: "what is the time complexity of this algorithm". I think it is very much unrelated to Spark's flatMap operation.
Rough O-complexity analysis
Given two collections of sets, of sizes m and n, this algorithm counts how many elements of one collection are subsets of elements of the other collection, so it looks like complexity m x n. Looking one level deeper, we also see that subsetOf is linear in the number of elements of the subset (x.subsetOf(y) is essentially a forall over x), so the complexity is actually m x n x s, where s is the cardinality of the subsets being checked.
In other words, this flatMap operation has a lot of work to do.
Going Parallel
Now, going back to Spark, we can also observe that this algorithm is embarrassingly parallel, so we can use Spark's capabilities to our advantage.
To compare some approaches, I loaded the 'retail' dataset [1] and ran the algorithm on val cand = transactions.filter(_.size<4).collect. The data size is close to that of the question:
Transactions.count = 88162
cand.size = 15451
Some comparative runs in local mode:
Vanilla: 1.2 minutes
Increasing the transactions partitions to the number of cores (8): 33 secs
I also tried an alternative implementation, using cartesian instead of flatmap:
transactions
  .cartesian(candRDD)
  .map { case (tx, cd) => (cd, if (cd.subsetOf(tx)) 1 else 0) }
  .reduceByKey(_ + _)
  .collect
But that resulted in much longer runs as seen in the top 2 lines of the Spark UI (cartesian and cartesian with a higher number of partitions): 2.5 min
Given I only have 8 logical cores available, going above that does not help.
Sanity checks:
Is there any added 'Spark flatMap time complexity'? Probably some, as it involves serializing closures and unpacking collections, but negligible in comparison with the function being executed.
Let's see if we can do a better job: I implemented the same algorithm in plain Scala:
val resLocal = reduceByKey(transLocal.flatMap(trans => freq(trans, cand)))
Where the reduceByKey operation is a naive implementation taken from [2]
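(For completeness, such a naive local reduceByKey might look like the sketch below; this is my own stand-in matching the call above, not necessarily the implementation from [2].)

// A purely local, sequential stand-in for Spark's reduceByKey.
def reduceByKey(pairs: Seq[(Set[String], Int)]): Map[Set[String], Int] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }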
Execution time: 3.67 seconds.
Spark gives you parallelism out of the box. This implementation is totally sequential and therefore takes longer to complete.
Last sanity check: A trivial flatmap operation:
transactions
.flatMap(trans => Seq((trans, 1)))
.reduceByKey( _ + _)
.collect
Execution time: 0.88 secs
Conclusions:
Spark is buying you parallelism and clustering, and this algorithm can take advantage of it. Use more cores and partition the input data accordingly (a minimal sketch follows below).
There's nothing wrong with flatmap. The time complexity prize goes to the function inside it.
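For reference, a minimal sketch of the "partition to match your cores" suggestion, reusing the names from the question (the partition count here is just the default parallelism, an assumption):

// Spread the transactions across as many partitions as there are cores before
// the heavy flatMap, so the subset counting runs on all of them in parallel.
val numPartitions = sc.defaultParallelism  // e.g. 8 on an 8-core local run
val counts = transactions
  .repartition(numPartitions)
  .flatMap(trans => freq(trans, cand_br.value))
  .reduceByKey(_ + _)
  .collect()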