I am in the process of doing a proof of concept (POC) on retail transaction data, using a few machine learning algorithms to come up with a prediction model for out-of-stock analysis. My questions might sound stupid, but I would really appreciate it if anyone could answer them.
So far I have been able to get a data set ==> convert the features into (LabeledPoint, feature vectors) ==> train an ML model ==> run the model on a test data set ==> get the predictions.
Problem 1:
Since I have no experience with Java, Python, or Scala, I am building my features in the database and saving that data as a CSV file for my machine learning algorithm.
How do we create features from raw data using Scala?
Problem 2:
The Source Data set consists of many features for a set of (Store, Product , date) and their recorded OOS events (Target)
StoreID(Text column), ProductID(Text Column), TranDate , (Label/Target), Feature1, Feature2........................FeatureN
Since the features can only contain numeric values, I create features only from the numeric columns and not from the text ones (which are the natural key for me). When I run the model on a validation set I get a (Prediction, Label) array back.
Now how do I link this result set back to the original data set and see which specific (Store, Product, Date) might have a possible out-of-stock event?
I hope the problem statement was clear enough.
MJ
Spark's Linear Regression Example
Here's a snippet from the Spark Docs Linear Regression example that is fairly instructive and easy to follow.
It solves both your "Problem 1" and "Problem 2"
It doesn't need a JOIN and doesn't even rely on RDD order.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
Here data is a RDD of text lines
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
Problem 1: Parsing the Features
This is data dependent. Here each line is split on , into two parts: the label, and a string of feature values that is then split on spaces. The file is entirely numeric.
The first part is treated as the label of a labelled point (dependent variable), and the feature values are converted from text to Double (floating point) and stored in a vector. This vector holds the features, or independent variables.
In your own projects, the part of this you need to remember is the goal of parsing into an RDD of LabeledPoints where the 1st parameter of LabeledPoint, the label, is the true dependent numeric value and the features, or 2nd parameter, is a Vector of numbers.
Getting the data into this condition requires knowing how to code. Python may be easiest for data parsing. You can always use other tools to create a purely numeric CSV, with the dependent variable in the first column, the numeric features in the other columns, and no header line, and then adapt the example parsing function so that it splits every field on commas.
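As a minimal sketch of that parsing step for a purely numeric CSV (label first, features after, no header): the helper name parseLine and the sample line are illustrative assumptions, not part of the Spark example. It is plain Scala so it can be tried outside Spark; inside Spark the result would be wrapped as LabeledPoint(label, Vectors.dense(features)) within data.map.

```scala
// Hypothetical helper: split one CSV line into (label, features).
// Assumes every field is numeric and the label is in the first column.
def parseLine(line: String): (Double, Array[Double]) = {
  val fields = line.split(',').map(_.trim.toDouble)
  (fields.head, fields.tail)
}

// Made-up sample line, for illustration only.
val (label, features) = parseLine("1.0, 0.5, 1.2, 3.0")
println(s"label=$label features=${features.mkString("[", ", ", "]")}")
```

A line with a non-numeric field would throw a NumberFormatException here, which is one reason to keep the text key columns out of the feature part of the file.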
// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
At this point we have a trained model object. The model object has a predict method that operates on feature vectors and returns estimates of the dependent variable.
Encoding Text features
The ML routines typically want numeric feature vectors, but you can often translate free text or categorical features (color, size, brand name) into numeric vectors in some space. There are a variety of ways to do this, such as Bag-Of-Words for text, or One Hot Encoding for categorical data where you code a 1.0 or 0.0 for membership in each possible category (watch out for multicollinearity though). These methodologies can create large feature vectors, which is why there are iterative methods available in Spark for training models. Spark also has a SparseVector() class, where you can easily create vectors with all but certain feature dimensions set to 0.0
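A rough illustration of One Hot Encoding in plain Scala, assuming a made-up category list (red, green, blue) and a hypothetical helper oneHot. In Spark itself the same information could be stored compactly with Vectors.sparse(size, Array(index), Array(1.0)).

```scala
// Illustrative category list; in practice, collect the distinct values from the data.
val categories = Vector("red", "green", "blue")
val indexOf: Map[String, Int] = categories.zipWithIndex.toMap

// Dense one-hot vector: 1.0 at the category's index, 0.0 everywhere else.
def oneHot(value: String): Array[Double] = {
  val v = Array.fill(categories.size)(0.0)
  indexOf.get(value).foreach(i => v(i) = 1.0) // unseen categories stay all-zero
  v
}

println(oneHot("green").mkString(", "))
```

Dropping one of the categories (so that it is represented by the all-zero vector) is a common way to avoid the multicollinearity mentioned above.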
Problem 2: Comparing model Predictions to the True values
Next they test this model with the training data, but the calls would be the same with external test data, provided that the test data is an RDD of LabeledPoint(dependent value, Vector(features)). The input could be changed by replacing the variable parsedData with some other RDD.
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Notice that this returns tuples of the true dependent variable previously stored in point.label, and the model's prediction from the point.features for each row or LabeledPoint.
Now we are ready to do Mean Squared Error, since the valuesAndPreds RDD contains tuples (v,p) of true value v and the prediction p both of type Double.
To compute the MSE, the tuples are first mapped to an RDD of squared errors (v - p)^2, which are then averaged, yielding a single number.
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
Spark's Logistic Example
This is similar, but here you can see data is already parsed and split into training and test sets.
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
Here the model is trained against the training set.
// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)
And tested (compared) against the test set. Notice that even though this is a different model (Logistic instead of Linear) there is still a model.predict method that takes a point's features vector as a parameter and returns the prediction for that point.
Once again the prediction is paired with the true value, from the label, in a tuple for comparison in a performance metric.
// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}
// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)
What about JOIN? So RDD.join comes in if you have two RDDs of (key, value) pairs, and need an RDD corresponding to the intersection of keys with both values. But we didn't need that here.
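To make the no-join point concrete for the (Store, Product, Date) key in the question, here is a small plain-Scala sketch. The Key and KeyedPoint classes, the sample rows, and the predict stand-in are all assumptions for illustration. With Spark the same shape would be a map over an RDD of (key, LabeledPoint) pairs that calls model.predict(point.features).

```scala
// Hypothetical natural key and row layout, for illustration only.
case class Key(storeId: String, productId: String, tranDate: String)
case class KeyedPoint(key: Key, label: Double, features: Array[Double])

// Stand-in for model.predict: flags out-of-stock when the first feature is below 1.0.
def predict(features: Array[Double]): Double =
  if (features.head < 1.0) 1.0 else 0.0

val rows = Seq(
  KeyedPoint(Key("S1", "P9", "2015-06-01"), 1.0, Array(0.0, 4.0)),
  KeyedPoint(Key("S2", "P9", "2015-06-01"), 0.0, Array(7.0, 2.0))
)

// The key travels through the map next to the prediction, so no join is needed afterwards.
val keyedPredictions = rows.map(p => (p.key, predict(p.features), p.label))

// Keys whose prediction says "out of stock".
val likelyOos = keyedPredictions.collect { case (key, pred, _) if pred == 1.0 => key }
likelyOos.foreach(println)
```

The text columns never enter the feature vector; they only ride along as the first element of each tuple.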
Related
I have a multiclass classification problem I'm looking to sort with logistic regression. I know this can also be tackled by decision trees and random forest, but wish to stick specifically with "LogisticRegressionWithLBFGS".
I have all the data tidying done. My data is in a dataframe with:
a label field (String), a feature vector (vector of numbers), and a third column "LabelIndex" (numbers representing the class).
When I do a train test split on the data frame and try to fit it to: LogisticRegressionWithLBFGS
val model = new LogisticRegressionWithLBFGS().setNumClasses(10).setIntercept(true).setValidateData(true).run("trainingData")
It doesn't like the "run" part.
The example I am working off, loads a data file in via:
val data = MLUtils.loadLibSVMFile(Spark.sparkContext, "data/mnist.bz2")
(I'm trying to copy the example and slot in my own data, but my data is in a different format.)
From a bit of reading, I gather that I need to convert my dataframe to an RDD[LabeledPoint].
I need to map it.
I'm having trouble finding good information on how to do this.
How do I simply convert a Dataframe with 3 fields as described above, "Label" (String), "Features" (feature vector), "IndexedLabel" (Double)
into a RDD[LabeledPoint]?
Got it working. The question "Can't convert Dataframe to Labeled Point" showed me how to make the conversion successfully.
We are using ALS for a recommender model based on user/click data via Spark/Scala.
The rating column is a score in [0, 1].
val als = new ALS()
  .setImplicitPrefs(true)
  .setRank(myrank)
  .setRegParam(mylambda)
  .setAlpha(myalpha)
  .setMaxIter(numIter)
  .setUserCol("myuseridx")
  .setItemCol("myitemidx")
  .setRatingCol("rating")
val model = als.fit(training)
My question is: must the input data for implicit models technically contain all user item combinations, i.e. also the ones which were not bought?
ALS solves the recommender problem by alternately fixing the user or the item factor matrix and solving for the other one using least squares. For an implicit dataset, this essentially means that all user/item combinations that are not observed are considered zeros. So you only need to include the positive observations.
Some more discussion here:
http://yifanhu.net/PUB/cf.pdf
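For reference, the implicit-feedback objective from the paper linked above can be written as follows, in the paper's notation: r_ui is the observed score, x_u and y_i are the user and item factors, and lambda is the regularization weight. Reading alpha as the value passed to setAlpha is an assumption about how the library maps onto the paper, so check the documentation of your Spark version for the details.

```latex
\min_{x_*,\,y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr),
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
```

The sum runs over all (u, i) pairs. A pair that is missing from the input has r_ui = 0, hence preference p_ui = 0 with the minimum confidence c_ui = 1, which is why the unobserved combinations do not have to be supplied explicitly.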
I have built a fairly basic Naive Bayes classifier on Apache Spark, using MLlib of course. But I need a few clarifications on what exactly neutrality means.
From what I understand, a given dataset contains pre-labeled sentences that cover the necessary classes; let's take the 3 below as an example.
0-> Negative sentiment
1-> Positive sentiment
2-> Neutral sentiment
This neutral is pre-labeled in the training set itself.
Is there any other form of neutrality handling? If there are no neutral sentences available in the dataset, is it possible to derive neutrality from the probability scale, like this:
0.0 - 0.4 => Negative
0.4 - 0.6 => Neutral
0.6 - 1.0 => Positive
Is this kind of mapping possible in Spark? I searched around but could not find anything. The NaiveBayesModel class in the RDD API has a predict method that just returns a Double taken from the labels in the training set, i.e. if only 0 and 1 are present it will return only 0 or 1, and not a scaled value such as 0.0 - 1.0 as above.
Any pointers/advice on this would be incredibly helpful.
Edit - 1
Sample code
//Performs tokenization,pos tagging and then lemmatization
//Returns a array of string
val tokenizedString = Util.tokenizeData(text)
val hashingTF = new HashingTF()
//Returns a double
//According to the training set 1.0 => Positive, 0.0 => Negative
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Sample dataset content
1,Awesome movie
0,This movie sucks
Of course the original dataset contains longer sentences, but this should be enough for the explanation, I guess.
Using the above code I am computing the sentiment. My question is the same:
1) Neutrality handling in dataset
In the above dataset if I am adding another category such as
2,This movie can be enjoyed by kids
For argument's sake, let's assume that it is a neutral review; then the model.predict method will return one of 1.0, 0.0, or 2.0 based on the sentence passed in.
2) model.predictProbabilities gives an array of doubles, but I am not sure in what order it gives the result, i.e. is index 0 for negative or for positive? With three classes (Negative, Positive, Neutral), in what order will that method return the probabilities?
It would have been helpful to have the code that builds the model (for your example to work, the 0 from the dataset must be converted to 0.0 as a Double in the model, either by indexing it with a StringIndexer stage or by converting it when reading the file), but assuming that this code works:
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Then yes, it means the probability at index 0 is that of negative and the one at index 1 that of positive (it's a bit strange, and there must be a reason, but everything is a Double in ML, even feature and category indexes). If you have something like this in your code:
val labelIndexer = new StringIndexer()
  .setInputCol("sentiment")
  .setOutputCol("indexedsentiment")
  .fit(trainingData)
Then you can use labelIndexer.labels to identify the labels (the probability at index 0 is for labelIndexer.labels at index 0, and so on).
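A small plain-Scala sketch of that index correspondence. The labels array and the probability values are made up for illustration; in practice they would come from labelIndexer.labels and from the model's probability output for one sentence.

```scala
// Illustrative values: labels in the order a fitted indexer might report them,
// and one probability array as returned for a single sentence.
val labels = Array("negative", "positive", "neutral")
val probabilities = Array(0.2, 0.7, 0.1)

// Index i of the probabilities belongs to labels(i).
val byLabel: Map[String, Double] = labels.zip(probabilities).toMap
val (bestLabel, bestProb) = byLabel.maxBy(_._2)
println(s"$bestLabel ($bestProb)")
```

Zipping by position only works if both arrays come from the same fitted pipeline, so it is worth printing the labels once to confirm the order.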
Now regarding your other questions.
Neutrality can mean two different things. Type 1: a review contains as many positive as negative words. Type 2: there is (almost) no sentiment expressed.
A Neutral category can be very helpful if you want to manage Type 2. If that is the case, you need neutral examples in your dataset. Naive Bayes is not a good classifier for thresholding on the probabilities in order to determine Type 2 neutrality, because its probability estimates tend to be poorly calibrated (pushed towards 0 or 1).
Option 1: Build a dataset (if you think you will have to deal with a lot of Type 2 neutral texts). The good news is that building a neutral dataset is not too difficult. For instance, you can pick random texts that are not movie reviews and assume they are neutral. It would be even better if you could pick content that is closely related to movies (but neutral), like a dataset of movie synopses. You could then create a multi-class Naive Bayes classifier (neutral, positive, and negative) or a hierarchical classifier (the first step is a binary classifier that determines whether a text is a movie review or not, the second step determines the overall sentiment).
Option 2 (can be used to deal with both Type 1 and Type 2): As I said, Naive Bayes is not great for thresholds on the probabilities, but you can try that. Without a dataset, though, it will be difficult to determine the thresholds to use. Another approach is to count the words or stems that have a significant polarity. One quick and dirty way to achieve that is to query your classifier with each individual word and count the number of times it returns "positive" with a probability significantly higher than that of the negative class (discard the word if the probabilities are too close to each other, for instance within 25%; a bit of experimentation will be needed here). At the end, you may end up with, say, 20 positive words vs 15 negative ones and decide the text is neutral because it is balanced; or, if you have 0 positive and 1 negative, return neutral because the count of polarized words is too low.
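If you do want to try thresholding, a minimal sketch could look like this. It is plain Scala; the 0.4 and 0.6 cut-offs are simply the ones from the question and are untuned, and pPositive is assumed to be the positive-class probability from a binary (negative/positive) model.

```scala
// Cut-offs taken from the question; they are untuned guesses.
val lower = 0.4
val upper = 0.6

// pPositive is assumed to be the probability of the positive class
// coming from a binary (negative/positive) model.
def sentiment(pPositive: Double): String =
  if (pPositive < lower) "Negative"
  else if (pPositive <= upper) "Neutral"
  else "Positive"

Seq(0.1, 0.5, 0.9).foreach(p => println(s"$p -> ${sentiment(p)}"))
```

Given the caveat about Naive Bayes probabilities, the cut-offs would need to be checked against at least a small hand-labeled sample before being trusted.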
Good luck and hope this helped.
I am not sure if I understand the problem but:
prior in Naive Bayes is computed from the data and cannot be set manually.
in MLLib you can use predictProbabilities to obtain class probabilities.
in ML you can use setThresholds to set prediction threshold for each class.
I want to do kmeans labels for numClusters = 6 so that I can group by the labels later.
How do I select the columns to do kmeans on?
val clusterThis = scaledDF.select($"id",$"setting1",$"setting2",$"setting3")
// dataset description lists six operation modes
val operatingModes = 6
// Cluster the data into numClusters classes using KMeans
val numClusters = operatingModes
val numIterations = 20
import sqlContext.implicits._
val clusters = KMeans.train(clusterThis.rdd, numClusters, numIterations)
clusters.predict(clusterThis)
//... join back on id
As you can see in the KMeans example, the estimator uses just one column as features. In that example the column happens to be named "features", but the name is up to you; the important thing is that this column must be a Vector (dense or sparse).
Thus, you need to combine your features (different columns) into one column; for this task you can use a VectorAssembler.
By the way, K-means doesn't work with categorical features. See the post "K-means clustering for mixed numeric and categorical data" for the reasons.
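The idea of packing the setting columns into one vector while keeping the id next to it can be sketched in plain Scala as follows. The rows, the two cluster centers, and the nearest helper are assumptions for illustration; nearest only stands in for the predict step of a trained model, and no training happens here.

```scala
// Illustrative rows: an id plus the numeric columns packed into one vector.
val rows = Seq(
  (1L, Array(0.0, 0.1, 0.0)),
  (2L, Array(9.8, 10.1, 10.0)),
  (3L, Array(0.2, 0.0, 0.1))
)

// Assumed cluster centers, as a trained model would provide them.
val centers = Array(Array(0.0, 0.0, 0.0), Array(10.0, 10.0, 10.0))

def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Stand-in for the model's predict: index of the nearest center.
def nearest(v: Array[Double]): Int =
  centers.indices.minBy(i => squaredDistance(v, centers(i)))

// The id is carried next to the cluster label, so grouping by label is direct.
val labelled: Seq[(Long, Int)] = rows.map { case (id, v) => (id, nearest(v)) }
val idsByCluster: Map[Int, Seq[Long]] =
  labelled.groupBy(_._2).map { case (c, xs) => (c, xs.map(_._1)) }
println(idsByCluster)
```

Because the id never leaves the row, the "join back on id" step from the question becomes unnecessary.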
I know several questions have been asked on similar topics, but I couldn't apply any of the answers to my problem; I am also wondering about best practices.
I have loaded a dataset for ML into a SQL database and want to apply MLlib's clustering functions to it. I loaded the SQL table into a DataFrame using sqlContext and dropped the irrelevant columns. Then came the problematic part: I create a vector by parsing each row of the DataFrame.
The vectors are then transformed to an RDD using the toJavaRDD function.
Here is the code (works):
val usersDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/database")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "table")
  .option("user", "woot")
  .option("password", "woot-password")
  .load()
val cleanDF = usersDF.drop("id").drop("username")
cleanDF.show()
val parsedData = cleanDF.map(s => Vectors.dense(s.toString().replaceAll("[\\[\\]]", "").trim.split(',').map(_.toDouble))).cache()
val splits = parsedData.randomSplit(Array(0.6,0.4), seed = 11L)
val train_set = splits(0).cache()
val gmm = new GaussianMixture().setK(2).run(train_set)
My main question concerns what I read in the Spark documentation about local vectors. In my understanding, the DataFrame mapping will be performed on the workers, and the result will then be sent to the driver when the Vector is created (is that the meaning of "local vector"?), only to be sent to the workers again later. Isn't there a better way to achieve this?
Another thing is that it seems a little odd to load SQL data into a DataFrame only to turn each row into a string and parse it again. Are there any other best-practice suggestions?
From the link you suggested:
A local vector has integer-typed and 0-based indices and double-typed
values, stored on a single machine. MLlib supports two types of local
vectors: dense and sparse.
A distributed matrix has long-typed row and column indices and
double-typed values, stored distributively in one or more RDDs.
Local vectors behave like any other object you would use in your RDD (String, Integer, Array): they are created and stored on a single machine, the worker node, and they are sent to the driver node only if you collect them.
Consider a vector x of size 2n. Storing it distributively, you would separate it into two halves of length n, x1 and x2 (x = x1::x2). To perform the dot product with another vector y, the workers will compute r1=x1*y1 (on machine 1) and r2=x2*y2 (on machine 2), and then you need to combine the partial results, giving r=r1+r2. Your vector x is distributed, while the vectors x1 and x2 are again local vectors. If you have x as a local vector, then a worker node can compute r=x*y in a single step.
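The two-halves argument can be checked with a few lines of plain Scala. The values of x and y are made up, and the "machines" are only indicated in comments, since everything runs in one process here.

```scala
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (p, q) => p * q }.sum

val x = Array(1.0, 2.0, 3.0, 4.0)
val y = Array(5.0, 6.0, 7.0, 8.0)

// "Distributed" case: each half would live on a different machine.
val (x1, x2) = x.splitAt(x.length / 2)
val (y1, y2) = y.splitAt(y.length / 2)
val r1 = dot(x1, y1) // would be computed on machine 1
val r2 = dot(x2, y2) // would be computed on machine 2
val distributed = r1 + r2 // partial results combined

// Local case: the whole vector is on one machine.
val local = dot(x, y)
println(s"distributed=$distributed local=$local")
```

Both paths give the same number; the difference is only where the pieces live and whether a combine step is needed.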
For your second question, I do not see why you would store the vectors in SQL format. Having a CSV file like this would be sufficient:
label, feature1, feature2, ...
1, 0.5, 1.2, ...
0, 0.2, 0.5, ...