So I was trying to implement a simple machine learning program in the spark-shell. When I gave it a CSV file, it demanded libsvm format, so I used the phraug library to convert my dataset into the required format. That worked, but I also needed to normalize my data, so I used StandardScaler to transform it. That also worked fine. The next step was to train the model, and for that I used the SVMWithSGD model. But when I tried to train, I kept getting the error
error: type mismatch;
found: org.apache.spark.rdd.RDD[(Double,org.apache.spark.mllib.linalg.Vector)]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I understand that it is a compatibility issue and that Vectors.dense could be used, but I don't want to split the data again. What I don't understand is: isn't there a direct method so that I can use it with the train method?
P.S. To help you understand, the data currently looks like this:
(0.0,[0.03376345160534202,-0.6339809012492886,-6.719697792783955,-6.719697792783965,-6.30231507117855,-8.72828614492483,0.03884804438718658,0.3041969425433718])
(0.0,[0.2535328275090413,-0.8780294632355746,-6.719697792783955,-6.719697792783965,-6.30231507117855,-8.72828614492483,0.26407233411369857,0.3041969425433718])
Assuming your RDD[(Double, Vector)] is called vectorRDD:
import org.apache.spark.mllib.regression.LabeledPoint

val labeledPointRDD = vectorRDD map {
  case (label, vector) => LabeledPoint(label, vector)
}
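From there, a minimal training sketch (assuming the mllib classes are on the spark-shell classpath; numIterations is just an illustrative value):

import org.apache.spark.mllib.classification.SVMWithSGD

val numIterations = 100                                   // illustrative value, tune for your data
val svmModel = SVMWithSGD.train(labeledPointRDD, numIterations)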
Hi, I am very new to Spark/Scala and trying to implement some functionality. My requirement is very simple: I have to perform all the operations using the Dataset API.
Question 1:
I converted the CSV into a case class. Is that the correct way of converting a DataFrame to a Dataset? Am I doing it correctly?
Also, when I do transformations on orderItemFile1, for filter/map operations I am able to access fields with _.order_id, but the same does not work with groupBy.
case class orderItemDetails(order_id_order_item: Int, item_desc: String, qty: Int, sale_value: Int)

val orderItemFile1 = ss.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("src/main/resources/Order_ItemData.csv").as[orderItemDetails]

orderItemFile1.filter(_.order_id_order_item > 100) // Works fine
orderItemFile1.map(_.order_id_order_item.toInt)    // Works fine

// Error. Inside groupBy I am unable to access it as _.order_id_order_item. Why so?
orderItemFile1.groupBy(_.order_id_order_item)

// Below works, but how does this provide the compile-time safety promised
// by the Dataset API? I can pass any wrong column name here and it will be
// caught only at run time.
orderItemFile1.groupBy(orderItemFile1("order_id_order_item")).agg(sum(orderItemFile1("item_desc")))
Perhaps the functionality you're looking for is #groupByKey. See example here.
As for your first question, basically yes, you're reading a CSV into a Dataset[A] where A is a case class you've declared.
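A rough sketch of what groupByKey could look like on the Dataset above (assuming spark.implicits._ is in scope; the qty aggregation is just an illustration):

// Typed grouping keeps the compile-time checks of the Dataset API
val byOrder = orderItemFile1.groupByKey(_.order_id_order_item)
// KeyValueGroupedDataset exposes mapGroups, reduceGroups, count, agg, ...
val qtyPerOrder = byOrder.mapGroups { (id, items) =>
  (id, items.map(_.qty).sum)   // total qty per order id
}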
I am trying to build kd-trees from points in a pair RDD called "RDDofPoints" with type RDD[(BoundingBox[Double], (Double, Double))]. All the points are assigned to a particular bounding box, and my goal is to build a kd-tree for each of the bounding boxes.
I am trying to use reduceByKey for this purpose. However, I am stuck at how to call the buildtree function in this case.
The function declaration of buildtree is:
def buildtree(points: RDD[(Double, Double)], depth: Int = 0): Option[KdNodeforRDD]
And, I am trying to call it as:
val treefromPairRDD = RDDofPoints.reduceByKey((k,v) => buildtree(v))
This obviously does not work. I am fairly new to Scala and Spark; please suggest the appropriate way to go about this situation. I am not sure about using reduceByKey; if some other pair RDD function can be applied here, which one would it be?
Thank you.
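One possible direction, sketched under assumptions: inside reduceByKey the values are plain (Double, Double) pairs, not RDDs, so buildtree as declared cannot be called there. A common workaround is to group the points per bounding box and build each tree from a local collection, using a hypothetical local variant of buildtree:

// Hypothetical local variant: builds a kd-tree from an in-memory list of points
def buildTreeLocal(points: Seq[(Double, Double)], depth: Int = 0): Option[KdNodeforRDD] = ???

// Group points per bounding box, then build one tree per box on the executors
val treesPerBox = RDDofPoints
  .groupByKey()                                   // RDD[(BoundingBox[Double], Iterable[(Double, Double)])]
  .mapValues(points => buildTreeLocal(points.toSeq))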
I've been trying to learn/use Scala for machine learning and to do that I need to convert string variables to an index of dummies.
The way I've done it is with the StringIndexer in Scala. Before running I've used df.na.fill("missing") to replace missing values. Even after I run that I still get a NullPointerException.
Is there something else I should be doing or something else I should be checking? I used printSchema to filter only on the string columns to get the list of columns I needed to run StringIndexer on.
val newDf1 = reweight.na.fill("Missing")

val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")

val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))

val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)

val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you would train the model with one part of the data and then test it with another part. Hence, there are two different methods to reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a DataFrame; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the data used for training and gives a DataFrame as output. In other words, you should add .transform(newDf1) after fit(newDf1).
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct DataFrame is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
It works for me when running locally; however, if you still get an error you could try to cache() after replacing the nulls and then perform an action to make sure all transformations are done.
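A minimal sketch of the corrected flow, assuming the newDf1 and pipeline definitions above:

// Fit on the null-filled frame, then transform the same frame to get the indexed columns
val catModel = categorical.fit(newDf1)          // PipelineModel
val catReweight = catModel.transform(newDf1)    // DataFrame with the *_index columns added
catReweight.printSchema()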
Hope it helps!
My program uses Spark ML; I use logistic regression on DataFrames. However, I would also like to use LogisticRegressionWithLBFGS, so I want to convert my DataFrame to LabeledPoint.
The following code gives me an error
val model = new LogisticRegressionWithLBFGS().run(
  dff3.rdd.map(row => LabeledPoint(
    row.getAs[Double]("label"),
    org.apache.spark.mllib.linalg.SparseVector.fromML(
      row.getAs[org.apache.spark.ml.linalg.SparseVector]("features")))))
Error :
org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector
So I changed SparseVector to DenseVector, but that doesn't work either:
org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector
Have you tried to use org.apache.spark.mllib.linalg.Vectors.fromML instead?
Note: This answer is a copy paste from the comments to allow it to be closed.
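A sketch of that suggestion (assuming dff3 has a Double "label" column and an ml-vector "features" column as above):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Vectors.fromML accepts the generic ml Vector, so it works for dense and sparse rows alike
val labeled = dff3.rdd.map { row =>
  LabeledPoint(
    row.getAs[Double]("label"),
    Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector]("features")))
}
val model = new LogisticRegressionWithLBFGS().run(labeled)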
I'm new to Spark and Scala, so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to the documentation, one way to do this is
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
The text8 dataset contains one line of many words, meaning that input will become an RDD[Seq[String]].
After massaging my own dataset, which has one word per line, using different maps etc., I'm left with an RDD[String], but I can't seem to train the word2vec model on it. I tried input.map(v => Seq(v)), which does give an RDD[Seq[String]], but that produces one sequence per word, which I guess is totally wrong.
How can I wrap a sequence around my strings, or is there something else I have missed?
EDIT
So I kind of figured it out. Starting from clean being an RDD[String], I do val input = sc.parallelize(Seq(clean.collect().toSeq)). This gives me the correct data structure (RDD[Seq[String]]) to fit the word2vec model. However, running collect on a large dataset gives me an out-of-memory error. I'm not quite sure how they intend the fitting to be done? Maybe it is not really parallelizable. Or maybe I'm supposed to have several semi-long sequences of strings inside an RDD, instead of one long sequence like I have now?
It seems that the documentation has been updated in another location (even though I was looking at the "latest" docs). The new docs are at: https://spark.apache.org/docs/latest/ml-features.html
The new example drops the text8 example file altogether. I doubt whether the original example ever worked as intended. The RDD input to word2vec should be a set of lists of strings, typically sentences or otherwise constructed n-grams.
Example included for other lost souls:
val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)
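A short usage follow-up (same session as the example above): the fitted model adds a vector column named "result".

val result = model.transform(documentDF)
result.select("result").show(false)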
Why not
input.map(v => v.split(" "))
or whatever would be an appropriate delimiter to split your words on. This will give you the desired sequence of strings - but with valid words.
As far as I can recall, Word2Vec in ml takes a DataFrame as argument, while Word2Vec in mllib takes an RDD as argument. The example you posted is for Word2Vec in ml. Here is the official guide: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
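For completeness, a minimal sketch of the RDD-based mllib variant (assuming an RDD[Seq[String]] of sentences, as in the question; "spark" is just an illustrative query word):

import org.apache.spark.mllib.feature.Word2Vec

val sentences = sc.textFile("text8").map(_.split(" ").toSeq)   // RDD[Seq[String]]
val model = new Word2Vec().fit(sentences)
val synonyms = model.findSynonyms("spark", 5)                  // top 5 nearest words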