Can't convert DataFrame to LabeledPoint - Scala

My program uses Spark ML; I use logistic regression on DataFrames. However, I would also like to use LogisticRegressionWithLBFGS, so I want to convert my DataFrame into LabeledPoint records.
The following code gives me an error:
val model = new LogisticRegressionWithLBFGS().run(
  dff3.rdd.map(row => LabeledPoint(
    row.getAs[Double]("label"),
    org.apache.spark.mllib.linalg.SparseVector.fromML(
      row.getAs[org.apache.spark.ml.linalg.SparseVector]("features")))))
Error:
org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector
So I changed SparseVector to DenseVector, but that doesn't work either:
org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector

Have you tried to use org.apache.spark.mllib.linalg.Vectors.fromML instead?
Note: This answer is a copy paste from the comments to allow it to be closed.
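For reference, here is a minimal sketch of that approach (assuming the dff3 DataFrame from the question, with a Double "label" column and an ml.linalg vector "features" column):
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Vectors.fromML accepts any ml.linalg.Vector, dense or sparse,
// so no cast to a concrete subclass is needed.
val labeledPoints = dff3.rdd.map { row =>
  LabeledPoint(
    row.getAs[Double]("label"),
    Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector]("features")))
}

val model = new LogisticRegressionWithLBFGS().run(labeledPoints)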

Related

Use of exponential on columns within Scala Spark: how to make it work

This is the code I wanted to implement. I am getting an overload error. Is there a way around it?
import scala.math._
dF = dF.withColumn("col2",(8.333*exp($"col1")))
error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Double
How would one perform exponential operations like this one?
You can use the same function from Spark:
import org.apache.spark.sql.functions.exp
dF = dF.withColumn("col2",exp($"col1"))
You are trying to use the exp function from scala.math, which requires a Double, but you are passing a Column, so it doesn't work. Spark has the same function for Columns, so you can use that instead.
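For completeness, a minimal sketch that also keeps the constant factor from the question (assuming the dF DataFrame with a numeric col1; the result name is arbitrary):
import org.apache.spark.sql.functions.{col, exp, lit}

// exp from sql.functions works on Columns; wrap the constant with lit()
// (or multiply on the Column side) so the whole expression stays a Column.
val result = dF.withColumn("col2", lit(8.333) * exp(col("col1")))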
Hope this helps!

java.lang.NullPointerException after df.na.fill("Missing") in Scala?

I've been trying to learn/use Scala for machine learning, and to do that I need to convert string variables to an index of dummies.
The way I've done it is with the StringIndexer in Scala. Before running it, I used df.na.fill("missing") to replace missing values. Even after I run that, I still get a NullPointerException.
Is there something else I should be doing or checking? I used printSchema to filter only the string columns to get the list of columns I needed to run StringIndexer on.
val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you train the model with one part of the data and then test it with another part, so there are two different methods that reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a DataFrame; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the one used for training and returns a DataFrame. In other words, you should add .transform(newDf1) after fit(newDf1).
Another possible issue is that your code uses fit(newDf) instead of fit(newDf1). Make sure the correct DataFrame is used for both the fit() and transform() calls, otherwise you will get a NullPointerException.
It works for me when running locally. However, if you still get an error, you could try to cache() after replacing the nulls and then perform an action to make sure all transformations are done.
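A minimal sketch of the full flow under those assumptions (reusing newDf1 and cat_cols from the question):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))

val categorical = new Pipeline().setStages(transformers)
val model = categorical.fit(newDf1)          // train the indexers on the null-filled DataFrame
val cat_reweight = model.transform(newDf1)   // adds the *_index columns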
Hope it helps!

How to replace multiple values in specific columns of a Spark DataFrame?

I am trying to replace or update some specific column values in a DataFrame. Since DataFrames are immutable, I am trying to transform into a new DataFrame instead of updating or replacing in place.
I tried dataframe.replace as explained in the Spark docs, but it gives me error: value replace is not a member of org.apache.spark.sql.DataFrame
I tried the option below. To pass multiple values I am passing them in an array:
val new_df= df.replace("Stringcolumn", Map(array("11","17","18","10"->"12")))
but I am getting error as
error: overloaded method value array with alternatives
Help is really appreciated!!
To access org.apache.spark.sql.DataFrameNaFunctions such as replace, you have to go through .na. So your code should look something like this:
df.na.replace("Stringcolumn", Map("11" -> "12", "17" -> "12", "18" -> "12", "10" -> "12"))
See here for the full list of DataFrameNaFunctions and how to use them.
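A small self-contained sketch of na.replace (the column name and values follow the question; the sample data and session setup are made up):
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("replace-example").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("10", "11", "15", "17").toDF("Stringcolumn")

// na.replace maps the listed values to new ones; anything not listed is left untouched
val new_df = df.na.replace("Stringcolumn",
  Map("11" -> "12", "17" -> "12", "18" -> "12", "10" -> "12"))

new_df.show()  // 10, 11 and 17 become 12; 15 stays as it is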

pyspark FPGrowth doesn't work with RDD

I am trying to use the FPGrowth function on some data in Spark. I tested the example here with no problems:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
However, my dataset comes from Hive:
data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)
This failed with Method does not exist:
py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist
So, I converted it to an RDD:
data=data.rdd
Now I start getting some strange pickle serializer errors.
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
Then I started looking at the types. In the example, the data is run through a flatMap, which returns a different type than the RDD I get from the hiveContext.
RDD type returned by flatMap: pyspark.rdd.PipelinedRDD
RDD type returned by hiveContext: pyspark.rdd.RDD
FPGrowth only seems to work with the PipelinedRDD. Is there some way I can convert a regular RDD to a PipelinedRDD?
Thanks!
Well, my query was wrong, but I changed it to use collect_set, and then
I managed to get around the type error by doing:
data = data.map(lambda row: row[0])

How to convert from linalg.Vector to regression.LabeledPoint format?

So I was trying to implement a simple machine learning example in the spark-shell. When I gave it a csv file, it demanded a libsvm format, so I used the phraug library to convert my dataset into the required format. While that worked, I also needed to normalize my data, so I used StandardScaler to transform it. That also worked fine. The next step was to train the machine, and for that I used the SVMWithSGD model. But when I tried to train I kept getting the error
error: type mismatch;
found: org.apache.spark.rdd.RDD[(Double,org.apache.spark.mllib.linalg.Vector)]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I understand that it is a compatibility issue and that the Vectors.dense function can be used, but I don't want to split the data again. What I don't understand is: isn't there a direct method I can use to prepare it for the train method?
P.S. To help you understand, the data currently looks like this:
(0.0,[0.03376345160534202,-0.6339809012492886,-6.719697792783955,-6.719697792783965,-6.30231507117855,-8.72828614492483,0.03884804438718658,0.3041969425433718])
(0.0,[0.2535328275090413,-0.8780294632355746,-6.719697792783955,-6.719697792783965,-6.30231507117855,-8.72828614492483,0.26407233411369857,0.3041969425433718])
Assuming your RDD[(Double, Vector)] is called vectorRDD:
val labeledPointRDD = vectorRDD map {
  case (label, vector) => LabeledPoint(label, vector)
}
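And a sketch of how that plugs into training (SVMWithSGD as in the question; the iteration count is just a placeholder):
import org.apache.spark.mllib.classification.SVMWithSGD

// labeledPointRDD is the converted RDD from above; 100 iterations is arbitrary.
val numIterations = 100
val svmModel = SVMWithSGD.train(labeledPointRDD, numIterations)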