Spark: How to run logistic regression using only some features from LabeledPoint? - scala

I have a LabeledPoint RDD on which I want to run logistic regression:
Data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] =
MapPartitionsRDD[3335] at map at <console>:44
using this code:
val splits = Data.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)
My problem is that I don't want to use all of the features from the LabeledPoint, but only some of them. I've got a list of features that I want to use, for example:
LoF=List(223244,334453...
How can I extract only the features that I want to use from the LabeledPoint, or select them in the logistic regression?

Feature selection allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
One way to do what you are seeking is to use ElementwiseProduct.
ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v and transforming vector, w, to yield a result vector.
So if we set the weights of the features we want to keep to 1.0 and the others to 0.0, the vector produced by the ElementwiseProduct of the original vector and the 0-1 weight vector will retain exactly the features we need:
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Creating a dummy LabeledPoint RDD
val data = sc.parallelize(Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0, 5.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 5.0, 6.0, 1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(4.0, 2.0, 3.0, 0.0, 2.0))))
data.toDF.show
// +-----+--------------------+
// |label| features|
// +-----+--------------------+
// | 1.0|[1.0,0.0,3.0,5.0,...|
// | 1.0|[4.0,5.0,6.0,1.0,...|
// | 0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+
// You'll need to know how many features you have, I have used 5 for the example
val numFeatures = 5
// The indices represent the features we want to keep
// Note: indices start at 0, so here you are actually keeping the 4th and 5th features.
val indices = List(3, 4).toArray
// Now we can create our weights vectors
val weights = Array.fill[Double](indices.size)(1)
// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)
// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)
// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label,transformer.transform(x.features).toSparse))
transformedData.toDF.show
// +-----+-------------------+
// |label| features|
// +-----+-------------------+
// | 1.0|(5,[3,4],[5.0,1.0])|
// | 1.0|(5,[3,4],[1.0,2.0])|
// | 0.0| (5,[4],[2.0])|
// +-----+-------------------+
Note: I used the sparse vector representation for space optimization, so the resulting features are sparse vectors.
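With the features filtered, the training code from the question works unchanged on transformedData. A minimal sketch, reusing the asker's split and iteration settings:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
val splits = transformedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)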

Related

Computing distance of a vector from the center of K-means cluster

I have a training data set and I ran K-means on it with K=4 and got four cluster centers. For new data point(s), I would like to know not only the predicted cluster but also the distance from that cluster's center. Is there an API to compute the euclidean distance from the center? I can make 2 API calls if that is needed. I am using Scala and I couldn't find any example anywhere.
Since Spark 2.0, Vectors.sqdist can be used to calculate the squared distance between two Vectors.
You could use a UDF to calculate for every point the distance from its center, like so:
import org.apache.spark.ml.linalg.{Vectors, Vector}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for toDF on the local Seq (auto-imported in spark-shell)
// Sample points
val points = Seq(Vectors.dense(1,0), Vectors.dense(2,-3), Vectors.dense(0.5, -1), Vectors.dense(1.5, -1.5))
val df = points.map(Tuple1.apply).toDF("features")
// K-means
val kmeans = new KMeans()
.setFeaturesCol("features")
.setK(2)
val kmeansModel = kmeans.fit(df)
val predictedDF = kmeansModel.transform(df)
// predictedDF.schema = (features: Vector, prediction: Int)
// Cluster Centers
kmeansModel.clusterCenters foreach println
/*
[1.75,-2.25]
[0.75,-0.5]
*/
// UDF that calculates for each point distance from each cluster center
val distFromCenter = udf((features: Vector, c: Int) => Vectors.sqdist(features, kmeansModel.clusterCenters(c)))
val distancesDF = predictedDF.withColumn("distanceFromCenter", distFromCenter($"features", $"prediction"))
distancesDF.show(false)
/*
+----------+----------+------------------+
|features |prediction|distanceFromCenter|
+----------+----------+------------------+
|[1.0,0.0] |1 |0.3125 |
|[2.0,-3.0]|0 |0.625 |
|[0.5,-1.0]|1 |0.3125 |
|[1.5,-1.5]|0 |0.625 |
+----------+----------+------------------+
*/
NOTE: Vectors.sqdist calculates the squared distance between two Vectors (no square root). If you really need the Euclidean distance, you can wrap it in Math.sqrt(Vectors.sqdist(...)).
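For instance, a small sketch of that, reusing kmeansModel and the imports from the snippet above:
// Same UDF as before, but wrapped in Math.sqrt to return the true Euclidean distance
val euclideanFromCenter = udf((features: Vector, c: Int) =>
  Math.sqrt(Vectors.sqdist(features, kmeansModel.clusterCenters(c))))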
The following worked for me ...
def EuclideanDistance(xs: Array[Double], ys: Array[Double]) = {
  scala.math.sqrt((xs zip ys).map { case (x, y) => scala.math.pow(y - x, 2.0) }.sum)
}
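For example, applied to the cluster-center example above (a sketch; kmeansModel and Vectors come from the earlier snippet):
val d = EuclideanDistance(Vectors.dense(1.0, 0.0).toArray, kmeansModel.clusterCenters(1).toArray)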

Pyspark euclidean distance between entry and column

I am working with pyspark, and I am wondering if there is any smart way to get the euclidean distance between one row's array entry and the whole column. For instance, there is a dataset like this.
+--------------------+---+
| features| id|
+--------------------+---+
|[0,1,2,3,4,5 ...| 0|
|[0,1,2,3,4,5 ...| 1|
|[1,2,3,6,7,8 ...| 2|
Choose one of the rows, e.g. id==1, and calculate the euclidean distance to every row. In this case, the result should be [0,0,sqrt(1+1+1+9+9+9)].
Can anybody figure out how to do this efficiently?
Thanks!
If you want the euclidean distance between a fixed entry and a whole column, simply do this:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from scipy.spatial import distance
fixed_entry = [0,3,2,7...] #for example, the entry against which you want distances
distance_udf = F.udf(lambda x: float(distance.euclidean(x, fixed_entry)), FloatType())
df = df.withColumn('distances', distance_udf(F.col('features')))
Your df will have a column of distances.
You can use BucketedRandomProjectionLSH [1] to get a cartesian product of distances within your data frame:
from pyspark.ml.feature import BucketedRandomProjectionLSH
brp = BucketedRandomProjectionLSH(
inputCol="features", outputCol="hashes", seed=12345, bucketLength=1.0
)
model = brp.fit(df)
model.approxSimilarityJoin(df, df, 3.0, distCol="EuclideanDistance")
You can also get the distances from one row to the whole column with approxNearestNeighbors [2], but the results are limited by numNearestNeighbors, so you could give it the count of the entire data frame:
one_row = df.where(df.id == 1).first().features
model.approxNearestNeighbors(df, one_row, df.count()).collect()
Also, make sure to convert your data to Vectors!
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
to_dense_vector = F.udf(Vectors.dense, VectorUDT())
df = df.withColumn('features', to_dense_vector('features'))
[1] https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html?highlight=approx#pyspark.ml.feature.BucketedRandomProjectionLSH
[2] https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html?highlight=approx#pyspark.ml.feature.BucketedRandomProjectionLSHModel.approxNearestNeighbors
Here is an implementation using the SQL function power() to compute the Euclidean distance between matching rows in two dataframes:
cols2Join = ['Key1','Key2']
colsFeature =['Feature1','Feature2','Feature3','Feature4']
columns = cols2Join + colsFeature
valuesA = [('key1value1','key2value1',111,22,33,.334),('key1value3','key2value3', 333,444,12,.445),('key1value5','key2value5',555,666,101,.99),('key1value7','key2value7',777,888,10,.019)]
table1 = spark.createDataFrame(valuesA,columns)
valuesB = [('key1value1','key2value1',22,33,3,.1),('key1value3','key2value3', 88,99,4,1.23),('key1value5','key2value5',4,44,1,.998),('key1value7','key2value7',9,99,1,.3)]
table2= spark.createDataFrame(valuesB,columns)
#Create the sql expression using list comprehension, we use sql function power to compute euclidean distance inline
beginExpr='power(('
InnerExpr = ['power((a.{}-b.{}),2)'.format(x,x) for x in colsFeature]
InnerExpr = '+'.join(str(e) for e in InnerExpr)
endExpr ='),0.5) AS EuclideanDistance'
distanceExpr = beginExpr + InnerExpr + endExpr
Expr = cols2Join+ [distanceExpr]
#now just join the tables and use Select Expr to get Euclidean distance
outDF = table1.alias('a').join(table2.alias('b'),cols2Join,how="inner").selectExpr(Expr)
display(outDF)
If you need to find the euclidean distance between only one particular row and every other row in the dataframe, then you can filter & collect that row and pass it to a udf.
But if you need to calculate the distance between all pairs, you need to use a join.
Repartition the dataframe by id; it will speed up the join operation. There is no need to calculate the full pairwise matrix: just calculate the upper or lower half and replicate it. I wrote a function for myself based on this logic.
df = df.repartition("id")
df.cache()
df.show()
import pyspark.sql.functions as f
from pyspark.sql import types as t

# metric = any callable function to calculate distance b/w two vectors
def pairwise_metric(Y, metric, col_name="metric"):
    Y2 = Y.select(f.col("id").alias("id2"),
                  f.col("features").alias("features2"))
    # join to create the lower or upper half of the pairwise matrix
    Y = Y.join(Y2, Y.id < Y2.id2, "inner")

    def sort_list(x):
        x = sorted(x, key=lambda y: y[0])
        x = list(map(lambda y: y[1], x))
        return x

    udf_diff = f.udf(lambda x, y: metric(x, y), t.FloatType())
    udf_sort = f.udf(sort_list, t.ArrayType(t.FloatType()))

    # diagonal elements of the distance matrix (distance of a point to itself is 0)
    Yid = Y2.select("id2").distinct().select(
        "id2", f.col("id2").alias("id")).withColumn("dist", f.lit(0.0))

    Y = Y.withColumn("dist", udf_diff(
        "features", "features2")).drop("features", "features2")

    # just swap the column names and take the union to get the other half
    Y = Y.union(Y.select(f.col("id2").alias("id"),
                         f.col("id").alias("id2"), "dist"))
    # union for the diagonal elements of the distance matrix
    Y = Y.union(Yid)

    st1 = f.struct(["id2", "dist"]).alias("vals")
    # groupby, aggregate and sort
    Y = (Y.select("id", st1).groupBy("id")
          .agg(f.collect_list("vals").alias("vals"))
          .withColumn("dist", udf_sort("vals")).drop("vals"))

    return Y.select(f.col("id").alias("id1"), f.col("dist").alias(col_name))

Kmeans Spark ML

I would like to perform KMeans using Spark ML. The input is a libsvm dataset:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
// Start time
//val intial_Data=spark.read.option("header",true).csv("C://sample_lda_data.txt")
val dataset = spark.read.format("libsvm").load("C:\\spark\\data\\mllib\\sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
So I would like to use a csv file instead and apply KMeans with Spark ML.
I did this:
val inputData = spark.read.option("header", true).csv("C://sample_lda_data.txt")
val arrayCol= array(inputData.columns.drop(1).map(col).map(_.cast(DoubleType)): _*)
import spark.implicits._
// select array column and first column, and map into LabeledPoints
val result = inputData.select(col("col1").cast(DoubleType), arrayCol).map(r => LabeledPoint(r.getAs[Double](0),Vectors.dense(r.getAs[WrappedArray[Double]](1).toArray)))
// Trains a k-means model
val kmeans = new KMeans().setK(2)
val model = kmeans.fit(result)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(result)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
I tried to turn the csv file into a Dataset[LabeledPoint].
Is my transformation correct?
In Spark 2, instead of MLlib, we use the ML package, which works on Datasets/DataFrames and supports pipeline models. What you need to do is build a dataset with two columns: features and label. The features column is a vector of the features you want to feed to the algorithm; the label column is the target. To build the features column, just use a VectorAssembler to assemble all the features you want to use. If you have a target column, rename it to label. After fitting this dataset to the algorithm, you will get your model, as sketched below.
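A minimal sketch of that approach, assuming the CSV has a header and numeric feature columns (the names "f1", "f2", "f3" are hypothetical):
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
// Read the CSV and let Spark infer numeric column types
val inputData = spark.read.option("header", true).option("inferSchema", true).csv("C://sample_lda_data.txt")
// Assemble the chosen feature columns into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3")) // hypothetical column names
  .setOutputCol("features")
val assembled = assembler.transform(inputData)
// KMeans only needs the "features" column; no label is required for clustering
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(assembled)
model.clusterCenters.foreach(println)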

How to classify new training example after model training in apache spark?

Reading the src of https://spark.apache.org/docs/1.5.2/ml-ann.html :
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
Once the model has been trained, how can a new training example be classified?
Can a new example be added where the label is not set, so that the classifier will try to classify it based on the training data?
Why is it required that the dataframe labels are of type Double?
Firstly, the only way to add another observation to the model is by incorporating that data point into the training data, in this case to your train variable. In order to achieve this, you can convert that point into a DataFrame (obviously of only one record) and then use the unionAll method. Nevertheless, you will have to retrain the model using this new dataset.
However, to classify observations using your model, you will have to convert your unclassified data into a DataFrame with the same structure as your training data, and then use the transform method of your model. By the way, notice that models have that method because they are subclasses of Transformer.
Finally, you have to use Double because that is how the LabeledPoint class was defined: it takes a Double as the label and a SparseVector or DenseVector as the features. I don't know the exact motivation, but in my own (admittedly limited) experience, all classification and regression algorithms work with floating point numbers. Furthermore, gradient descent, which is widely used to fit most models, computes the error in each iteration from numbers, not letters or classes.
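For the classification part, a minimal sketch reusing the model from the question's snippet (the feature values are made up, but the vectors must have the same size as the input layer, i.e. 4):
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.Vectors
// Unlabeled observations with the same "features" structure as the training data
val newData = Seq(
  Vectors.dense(5.1, 3.5, 1.4, 0.2),
  Vectors.dense(6.2, 2.9, 4.3, 1.3)
).map(Tuple1.apply).toDF("features")
// transform reads only the "features" column; the predicted class index
// is written (as a Double) into the "prediction" column
model.transform(newData).select("features", "prediction").show()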

How to define a function and pass training and test datasets in Scala?

I want to define a function in Scala to which I can pass my training and test datasets, and which then performs a simple machine learning algorithm and returns some statistics. How should I do that? What should the parameters' data types be?
Imagine you need to define a function which takes training and test datasets, performs a simple classification algorithm, and then returns the accuracy.
What I expect to have is like as follow:
val data = MLUtils.loadLibSVMFile(sc, datadir + "/example.txt");
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L);
val training = splits(0).cache();
val test = splits(1);
val results1 = SVMFunction(training, test)
val results2 = RegressionFunction(training, test)
val results3 = ClassificationFunction(training, test)
I need just the declarations of the functions, not the code that produces results1, results2, and results3.
def SVMFunction ("I need help here"){
//I know how to work with the training and test datasets to generate the results.
//So no need to discuss what should be here
}
Thanks.
In case you're using supervised learning, you should opt for LabeledPoint. Excerpt from the MLlib docs:
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
An example is:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
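As for the declarations the question asks for: since MLUtils.loadLibSVMFile returns an RDD[LabeledPoint], both splits have that type, so the parameters can be typed as RDD[LabeledPoint]. A sketch (the Double return type for an accuracy is just an assumption for illustration):
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
def SVMFunction(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Double = {
  // train an SVM on `training`, evaluate on `test`, and return a metric such as accuracy
  ???
}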