I am trying to implement ALS on Spark. I have used ml class instead of mllib because the CSV file contains String in one column. Rating class in mllib do not accept String as a parameter.
I want to use predict function from org.apache.spark.mllib.recommendation.MatrixFactorizationModel class but while running it is searching in org.apache.spark.rdd.RDD.
This is the code I am using.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
object LoadCsv{
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Load CSV")
val sc = new SparkContext(conf)
println("READING FILE...............................");
val data = sc.textFile("file.csv")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating[String](user, item, rate.toFloat)
})
//val (userFactors, itemFactors) = ALS.train(ratings)
//Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations)
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
(user, product)
}
// GETTING ERROR OVER HERE.
val predictions =
model.predict(usersProducts).map { case Rating(user, product, rate) =>
((user, product), rate)
}
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
println("Mean Squared Error = " + MSE)
// Save and load model
//model.save(sc, "/home/shishir/spark-Projects/op")
//val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
// $example off$
}
}
On running the code, I am getting this error:
LoadCsv.scala:34: value predict is not a member of (org.apache.spark.rdd.RDD[(String, Array[Float])], org.apache.spark.rdd.RDD[(String, Array[Float])])
[error] model.predict(usersProducts).map { case Rating(user, product, rate) =>
Your imports are "incorrect", you are using this:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating
When you should be using this:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
You can use the other package, it's just that the result won't be a model but (as the error says) and RDD.
You can read up online why there are two ML packages (from what I remember the mllib package is the older one and contains some design flaws so they reimplement in ml so they can use pipelines).
It looks like you're mixing MLLib and ML style approach. If your data uses supported ID types (it doesn't look like it is the case here) you can use MLLib implementation:
import org.apache.spark.mllib.recommendation.{ALS => OldALS}
import org.apache.spark.mllib.recommendation.{
MatrixFactorizationModel => OldModel}
import org.apache.spark.mllib.recommendation.{Rating => OldRating}
val ratings: RDD[OldRating] = ???
val model: OldModel = OldALS()
.setAlpha(0.01)
.setIterations(numIterations)
.setRank(rank)
.run(ratings)
If your data uses non-standard ID and you want an access to user friendly API it is better to use DataFrames:
val ratings: RDD[org.apache.spark.ml.recommendation.ALS.Rating[String]] = ???
val df = ratings.toDF
val als: org.apache.spark.ml.recommendation.ALS = new ALS()
.setAlpha(0.01)
.setMaxIter(numIterations)
.setRank(rank)
val model: org.apache.spark.ml.recommendation.ALSModel = als.fit(df)
Finally you can use your current approach but you'll have to operate on user factors and item factors directly without helpers like predict.
Related
Because I need alter train data after StringIndexer. (append unseen future to handle error when future model prediction) . So I need build a pipeline by trained Transfomers.
But I haven't find a way to do this thing ,
sample code:
// fit by original df
val catIndexer = catFeatures.map(cname => {
new StringIndexer()
.setHandleInvalid("keep") // would get error when future prediction if training data doesn't contain unseen feature
.setInputCol(cname)
.setOutputCol(cname + KeyColumns.stringIndexerSuffix)
})
val indexedCatFeatures = catIndexer.map(idx => idx.getOutputCol)
val stringIndexerPipeline = new Pipeline().setStages(catIndexer)
val stringIndexerPipelineFitted = stringIndexerPipeline.fit(trainDataset) // note: trainDataset
// transform original df with one new row(unseen feature), to avoid unseen feature when future prediction
val rdd = mdContext.get.spark.sparkContext.makeRDD(List(Row(newRow:_*)))
val newDF = mdContext.get.spark.createDataFrame(rdd, trainDataset.schema).na.fill(0)
val patchedTrainDataset = trainDataset.unionByName(newDF)
val strIndexTrainDataset = stringIndexerPipelineFitted.transform(patchedTrainDataset) // note: patchedTrainDataset
// onehot and assemble
val oneHotEncoder = new OneHotEncoderEstimator().setInputCols(indexedCatFeatures).setOutputCols(indexedCatFeatures.map(_+KeyColumns.oneHotEncoderSuffix))
.setDropLast(false)
val predictors = numFeatures ++ oneHotEncoder.getOutputCols
val assembler = new VectorAssembler().setInputCols(predictors).setOutputCol(KeyColumns.features)
val leftPipeline = new Pipeline().setStages(Array(oneHotEncoder, assembler))
// feature transfomers
val transfomers = stringIndexerPipeline.asInstanceOf[PipelineModel].stages ++ leftPipeline.asInstanceOf[PipelineModel].stages
// train model
...
//
val cv = new CrossValidator()
.setEstimator(modelPipeline)
.setEvaluator(new BinaryClassificationEvaluator().setLabelCol(KeyColumns.y))
.setEstimatorParamMaps(paramGrid)
.setNumFolds(cvConfig.folders)
.setParallelism(cvConfig.parallelism)
val transformedTrainDataset = leftPipeline.fit(strIndexTrainDataset).transform(strIndexTrainDataset)
val cvModel = cv.fit(transformedTrainDataset)
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val newStages = transfomers ++ Array[SparkTransformer](bestModel.stages.last)
// !!!error can't new here
val newBestModel = new PipelineModel(bestModel.uid, newStages)
// !!!error can't new here
val newCvModel = new CrossValidatorModel(cvModel.uid, newBestModel, cvModel.avgMetrics)
Thanks for raising a great question.
According to this Q&A, we've known that you wouldn't get an instance of PipelineModel from new method (which is not legal for ). There are mainly two ways:
PipelineModel.load(file: String)
val pipelineModel = pipeline.fit(dataFrame)
Now here is the thing: you can skip the fit() implicitly by only adding trained Model into pipeline to get pipelineModel.
e.g.
// Still, add your trained models into a array
val trainedModels = cols.map(col => {
new ValueIndexerModel().setInputCol(col).setOutputCol(col + "_indexed").setLevels(level)
})
// just set your models as stages of a pipeline as usual
val pipeline = new PipeLine().setStages(trainedModels)
// fit, which will skip for models
val pipelineModel = pipeline.fit(dataFrame)
// then you get your pipelineModel, you can transform now
val transDF = pipelineModel.transform(dataFrame)
The reason we are able to handle it like this is according to the source code of spark:
val transformers = ListBuffer.empty[Transformer]
theStages.view.zipWithIndex.foreach { case (stage, index) =>
if (index <= indexOfLastEstimator) {
val transformer = stage match {
case estimator: Estimator[_] =>
estimator.fit(curDataset)
case t: Transformer =>
t
case _ =>
throw new IllegalArgumentException(
s"Does not support stage $stage of type ${stage.getClass}")
}
if (index < indexOfLastEstimator) {
curDataset = transformer.transform(curDataset)
}
transformers += transformer
} else {
transformers += stage.asInstanceOf[Transformer]
}
}
Your trained model is subclass of Transformer, so when fit, your pipeline of trained models will skip all the process of fit, and give a pipelineModel with your trained models. Thanks to zero323 and user1269298 in Q&A again.
Say I have a few features/columns in a dataframe on which I apply the regular OneHotEncoder, and one (let, n-th) column on which I need to apply my custom OneHotEncoder. Then I need to use VectorAssembler to assemble those features, and put into a Pipeline, finally fitting my trainData and getting predictions from my testData, such as:
val sIndexer1 = new StringIndexer().setInputCol("my_feature1").setOutputCol("indexed_feature1")
// ... let, n-1 such sIndexers for n-1 features
val featureEncoder = new OneHotEncoderEstimator().setInputCols(Array(sIndexer1.getOutputCol), ...).
setOutputCols(Array("encoded_feature1", ... ))
// **need to insert output from my custom OneHotEncoder function (please see below)**
// (which takes the n-th feature as input) in a way that matches the VectorAssembler below
val vectorAssembler = new VectorAssembler().setInputCols(featureEncoder.getOutputCols + ???).
setOutputCol("assembled_features")
...
val pipeline = new Pipeline().setStages(Array(sIndexer1, ...,featureEncoder, vectorAssembler, myClassifier))
val model = pipeline.fit(trainData)
val predictions = model.transform(testData)
How can I modify the building of the vectorAssembler so that it can ingest the output from the custom OneHotEncoder?
The problem is my desired oheEncodingTopN() cannot/should not refer to the "actual" dataframe, since it would be a part of the pipeline (to apply on trainData/testData).
Note:
I tested that the custom OneHotEncoder (see link) works just as expected separately on e.g. trainData. Basically, oheEncodingTopN applies OneHotEncoding on the input column, but for the top N frequent values only (e.g. N = 50), and put all the rest infrequent values in a dummy column (say, "default"), e.g.:
val oheEncoded = oheEncodingTopN(df, "my_featureN", 50)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.Column
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def oheEncodingTopN(df: DataFrame, colName: String, n: Int): DataFrame = {
df.createOrReplaceTempView("data")
val topNDF = spark.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
I think the cleanest way would be to create your own class that extends spark ML Transformer so that you can play with as you would do with any other transformer (like OneHotEncoder). Your class would look like this :
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Dataset, Column}
class OHEncodingTopN(n :Int, override val uid: String) extends Transformer {
final val inputCol= new Param[String](this, "inputCol", "The input column")
final val outputCol = new Param[String](this, "outputCol", "The output column")
; def setInputCol(value: String): this.type = set(inputCol, value)
def setOutputCol(value: String): this.type = set(outputCol, value)
def this(n :Int) = this(n, Identifiable.randomUID("OHEncodingTopN"))
def copy(extra: ParamMap): OHEncodingTopN = {
defaultCopy(extra)
}
override def transformSchema(schema: StructType): StructType = {
// Check that the input type is what you want if needed
// val idx = schema.fieldIndex($(inputCol))
// val field = schema.fields(idx)
// if (field.dataType != StringType) {
// throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
// }
// Add the return field
schema.add(StructField($(outputCol), IntegerType, false))
}
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def transform(df: Dataset[_]): DataFrame = {
df.createOrReplaceTempView("data")
val colName = $(inputCol)
val topNDF = df.sparkSession.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
}
Now on a OHEncodingTopN object you should be able to call .getOuputCol to perform what you want. Good luck.
EDIT: your method that I just copy pasted in the transform method should be slightly modified in order to output a column of type Vector having the name given in the setOutputCol.
I am using Spark and I would like to train a machine learning model.
Because of bad results, I would like to display the error made by the model at each epoch of the training (on train and test dataset).
I will then use this information to determined if my model is underfitting or overfitting the data.
Question: How can I draw the learning curve of a model with spark ?
In the following example, I have implement my own evaluator and override the evaluate method to print the metrics I was needed, but only two values have been display (maxIter = 1000).
MinimalRunnableCode.scala:
import org.apache.spark.SparkConf
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SparkSession
object Min extends App {
// Open spark session.
val conf = new SparkConf()
.setMaster("local")
.set("spark.network.timeout", "800")
val ss = SparkSession.builder
.config(conf)
.getOrCreate
// Load data.
val data = ss.createDataFrame(ss.sparkContext.parallelize(
List(
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 2), 1),
(Vectors.dense(1, 3), 2),
(Vectors.dense(1, 4), 3)
)
))
.withColumnRenamed("_1", "features")
.withColumnRenamed("_2", "label")
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
// Create model of linear regression.
val lr = new LinearRegression().setMaxIter(1000)
// Create parameters grid that will be used to train different version of the linear model.
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.001))
.addGrid(lr.fitIntercept)
.addGrid(lr.elasticNetParam, Array(0.5))
.build()
// Create trainer using validation split to evaluate which set of parameters performs the best.
val trainValidationSplit = new TrainValidationSplit()
.setEstimator(lr)
.setEvaluator(new CustomRegressionEvaluator)
.setEstimatorParamMaps(paramGrid)
.setTrainRatio(0.8) // 80% of the data will be used for training and the remaining 20% for validation.
// Run train validation split, and choose the best set of parameters.
var model = trainValidationSplit.fit(training)
// Close spark session.
ss.stop()
}
CustomRegressionEvaluator.scala:
import org.apache.spark.ml.evaluation.{Evaluator, RegressionEvaluator}
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
final class CustomRegressionEvaluator (override val uid: String) extends Evaluator with HasPredictionCol with HasLabelCol with DefaultParamsWritable {
def this() = this(Identifiable.randomUID("regEval"))
def checkNumericType(
schema: StructType,
colName: String,
msg: String = ""): Unit = {
val actualDataType = schema(colName).dataType
val message = if (msg != null && msg.trim.length > 0) " " + msg else ""
require(actualDataType.isInstanceOf[NumericType], s"Column $colName must be of type " +
s"NumericType but was actually of type $actualDataType.$message")
}
def checkColumnTypes(
schema: StructType,
colName: String,
dataTypes: Seq[DataType],
msg: String = ""): Unit = {
val actualDataType = schema(colName).dataType
val message = if (msg != null && msg.trim.length > 0) " " + msg else ""
require(dataTypes.exists(actualDataType.equals),
s"Column $colName must be of type equal to one of the following types: " +
s"${dataTypes.mkString("[", ", ", "]")} but was actually of type $actualDataType.$message")
}
var i = 0 // count the number of time the evaluate method is called
override def evaluate(dataset: Dataset[_]): Double = {
val schema = dataset.schema
checkColumnTypes(schema, $(predictionCol), Seq(DoubleType, FloatType))
checkNumericType(schema, $(labelCol))
val predictionAndLabels = dataset
.select(col($(predictionCol)).cast(DoubleType), col($(labelCol)).cast(DoubleType))
.rdd
.map { case Row(prediction: Double, label: Double) => (prediction, label) }
val metrics = new RegressionMetrics(predictionAndLabels)
val metric = "mae" match {
case "rmse" => metrics.rootMeanSquaredError
case "mse" => metrics.meanSquaredError
case "r2" => metrics.r2
case "mae" => metrics.meanAbsoluteError
}
println(s"$i $metric") // Print the metrics
i = i + 1 // Update counter
metric
}
override def copy(extra: ParamMap): RegressionEvaluator = defaultCopy(extra)
}
object RegressionEvaluator extends DefaultParamsReadable[RegressionEvaluator] {
override def load(path: String): RegressionEvaluator = super.load(path)
}
private[ml] trait HasPredictionCol extends Params {
/**
* Param for prediction column name.
* #group param
*/
final val predictionCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name")
setDefault(predictionCol, "prediction")
/** #group getParam */
final def getPredictionCol: String = $(predictionCol)
}
private[ml] trait HasLabelCol extends Params {
/**
* Param for label column name.
* #group param
*/
final val labelCol: Param[String] = new Param[String](this, "labelCol", "label column name")
setDefault(labelCol, "label")
/** #group getParam */
final def getLabelCol: String = $(labelCol)
}
Here is a possible solution for the specific case of LinearRegression and any other algorithm that support objective history (in this case, And LinearRegressionTrainingSummary does the job).
Let's first create a minimal verifiable and complete example :
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.mllib.util.{LinearDataGenerator, MLUtils}
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder().getOrCreate()
import org.apache.spark.ml.evaluation.RegressionEvaluator
import spark.implicits._
val data = {
val tmp = LinearDataGenerator.generateLinearRDD(
spark.sparkContext,
nexamples = 10000,
nfeatures = 4,
eps = 0.05
).toDF
MLUtils.convertVectorColumnsToML(tmp, "features")
}
As you've noticed, when you want to generate data for testing purposes for spark-mllib or spark-ml, it's advised to use data generators.
Now, let's train a linear regressor :
// Create model of linear regression.
val lr = new LinearRegression().setMaxIter(1000)
// The following line will create two sets of parameters
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.001)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.5)).build()
// Create trainer using validation split to evaluate which set of parameters performs the best.
// I'm using the regular RegressionEvaluator here
val trainValidationSplit = new TrainValidationSplit()
.setEstimator(lr)
.setEvaluator(new RegressionEvaluator)
.setEstimatorParamMaps(paramGrid)
.setTrainRatio(0.8) // 80% of the data will be used for training and the remaining 20% for validation.
// To retrieve subModels, make sure to set collectSubModels to true before fitting.
trainValidationSplit.setCollectSubModels(true)
// Run train validation split, and choose the best set of parameters.
var model = trainValidationSplit.fit(data)
Now since our model is trained, all we need is to get the objective history.
The following part needs a bit of gymnastics between the model and sub-models object parameters.
In case you have a Pipeline or so, this code needs to be modified, so use it carefully. It's just an example :
val objectiveHist = spark.sparkContext.parallelize(
model.subModels.zip(model.getEstimatorParamMaps).map {
case (m: LinearRegressionModel, pm: ParamMap) =>
val history: Array[Double] = m.summary.objectiveHistory
val idx: Seq[Int] = 1 until history.length
// regParam, elasticNetParam, fitIntercept
val parameters = pm.toSeq.map(pair => (pair.param.name, pair.value.toString)) match {
case Seq(x, y, z) => (x._2, y._2, z._2)
}
(parameters._1, parameters._2, parameters._3, idx.zip(history).toMap)
}).toDF("regParam", "elasticNetParam", "fitIntercept", "objectiveHistory")
We can now examine those metrics :
objectiveHist.show(false)
// +--------+---------------+------------+-------------------------------------------------------------------------------------------------------+
// |regParam|elasticNetParam|fitIntercept|objectiveHistory |
// +--------+---------------+------------+-------------------------------------------------------------------------------------------------------+
// |0.001 |0.5 |true |[1 -> 0.4999999999999999, 2 -> 0.4038796441909531, 3 -> 0.02659222058006269, 4 -> 0.026592220340980147]|
// |0.001 |0.5 |false |[1 -> 0.5000637621421942, 2 -> 0.4039303922115196, 3 -> 0.026592220673025396, 4 -> 0.02659222039347222]|
// +--------+---------------+------------+-------------------------------------------------------------------------------------------------------+
You can notice that the training process actually stops after 4 iterations.
If you want just the number of iterations, you can do the following instead :
val objectiveHist2 = spark.sparkContext.parallelize(
model.subModels.zip(model.getEstimatorParamMaps).map {
case (m: LinearRegressionModel, pm: ParamMap) =>
val history: Array[Double] = m.summary.objectiveHistory
// regParam, elasticNetParam, fitIntercept
val parameters = pm.toSeq.map(pair => (pair.param.name, pair.value.toString)) match {
case Seq(x, y, z) => (x._2, y._2, z._2)
}
(parameters._1, parameters._2, parameters._3, history.size)
}).toDF("regParam", "elasticNetParam", "fitIntercept", "iterations")
I've changed the number of features in the generator (nfeatures = 100) for the sake of demonstrations :
objectiveHist2.show
// +--------+---------------+------------+----------+
// |regParam|elasticNetParam|fitIntercept|iterations|
// +--------+---------------+------------+----------+
// | 0.001| 0.5| true| 11|
// | 0.001| 0.5| false| 11|
// +--------+---------------+------------+----------+
I am trying to figure out the most efficient way to accomplish putting this online csv file into a data frame in Scala.
To save a download, the csv file in the code looks like this:
"Symbol","Name","LastSale","MarketCap","ADR
TSO","IPOyear","Sector","Industry","Summary Quote"
"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
....
From my research, I start by downloading the csv, and placing it into a list buffer (since you can't do this with a list because it's immutable):
import scala.collection.mutable.ListBuffer
val sc = new SparkContext(conf)
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource =
Source.fromURL("http://www.nasdaq.com/screening/companies-by-
industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
val stockInfoNYSE_List = stockInfoNYSE_ListBuffer.toList
So we have a list. You can basically get each value like this:
// SYMBOL : stockInfoNYSE_List(1).split(",")(0)
// COMPANY NAME : stockInfoNYSE_List(1).split(",")(1)
// IPOYear : stockInfoNYSE_List(1).split(",")(5)
// Sector : stockInfoNYSE_List(1).split(",")(6)
// Industry : stockInfoNYSE_List(1).split(",")(7)
Here is where I get stuck- how do I get this to a dataframe? The wrong approaches I have taken. I didn't put all the values in just yet- was a simple test.
case class StockMap(Symbol: String, Name: String)
val caseClassDS = Seq(StockMap(stockInfoNYSE_List(1).split(",")(0),
StockMap(stockInfoNYSE_List(1).split(",")(1))).toDS()
caseClassDS.show()
The problem with the approach above: I can only figure out how to add one sequence (row) by hard coding it. I want every Row in the list.
My second failed attempt:
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val test = stockInfoNYSE_List.toDF
This will just give you the array, and I want to divide up the values.
Array(["Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"], ["DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"], ["MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"],.......
case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String
| )
defined class TestClass
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
scala> demoDS.toDS.show
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|Symbol| Name|LastSale| MarketCap| ADR_TSO|IPOyear| Sector| Industry| Summary_Quote|
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
| DDD|3D Systems Corpor...| 18.09| 2058834640.41| n/a| n/a| Technology|Computer Software...|http://www.nasdaq...|
| MMM| 3M Company| 211.68|126423673447.68| n/a| n/a| Health Care|Medical/Dental In...|http://www.nasdaq...|
In case anyone is trying to get this example working, here is the code using the above solution:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import scala.collection.mutable.ListBuffer
import sqlContext.implicits._
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource =
Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String )
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
demoDS.toDF().show
I have followed the guide given in the link
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
But this is outdated as it uses spark Mlib RDD approach. The New Spark 2.0 has DataFrame approach.
Now My problem is I have got the updated code
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
.map(parseRating)
.toDF()
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training)
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
Now Here is the problem, In the old code the model that was obtained was a MatrixFactorizationModel, Now it has its own model(ALSModel)
In MatrixFactorizationModel you could directly do
val recommendations = bestModel.get
.predict(userID)
Which will give the list of products with highest probability of user liking them.
But Now there is no .predict method. Any Idea how to recommend a list of products given a user Id
Use transform method on model:
import spark.implicits._
val dataFrameToPredict = sparkContext.parallelize(Seq((111, 222)))
.toDF("userId", "productId")
val predictionsOfProducts = model.transform (dataFrameToPredict)
There's a jira ticket to implement recommend(User|Product) method, but it's not yet on default branch
Now you have DataFrame with score for user
You can simply use orderBy and limit to show N recommended products:
// where is for case when we have big DataFrame with many users
model.transform (dataFrameToPredict.where('userId === givenUserId))
.select ('productId, 'prediction)
.orderBy('prediction.desc)
.limit(N)
.map { case Row (productId: Int, prediction: Double) => (productId, prediction) }
.collect()
DataFrame dataFrameToPredict can be some large user-product DataFrame, for example all users x all products
The ALS Model in Spark contains the following helpful methods:
recommendForAllItems(int numUsers)
Returns top numUsers users recommended for each item, for all items.
recommendForAllUsers(int numItems)
Returns top numItems items recommended for each user, for all users.
recommendForItemSubset(Dataset<?> dataset, int numUsers)
Returns top numUsers users recommended for each item id in the input data set.
recommendForUserSubset(Dataset<?> dataset, int numItems)
Returns top numItems items recommended for each user id in the input data set.
e.g. Python
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import explode
alsEstimator = ALS()
(alsEstimator.setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop"))
alsModel = alsEstimator.fit(productRatings)
recommendForSubsetDF = alsModel.recommendForUserSubset(TargetUsers, 40)
recommendationsDF = (recommendForSubsetDF
.select("user_id", explode("recommendations")
.alias("recommendation"))
.select("user_id", "recommendation.*")
)
display(recommendationsDF)
e.g. Scala:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.functions.explode
val alsEstimator = new ALS().setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop")
val alsModel = alsEstimator.fit(productRatings)
val recommendForSubsetDF = alsModel.recommendForUserSubset(sampleTargetUsers, 40)
val recommendationsDF = recommendForSubsetDF
.select($"user_id", explode($"recommendations").alias("recommendation"))
.select($"user_id", $"recommendation.*")
display(recommendationsDF)
Here is what I did to get recommendations for a specific user with spark.ml:
import com.github.fommil.netlib.BLAS.{getInstance => blas}
userFactors.lookup(userId).headOption.fold(Map.empty[String, Float]) { user =>
val ratings = itemFactors.map { case (id, features) =>
val rating = blas.sdot(features.length, user, 1, features, 1)
(id, rating)
}
ratings.sortBy(_._2).take(numResults).toMap
}
Both userFactors and itemFactors in my case are RDD[(String, Array[Float])] but you should be able to do something similar with DataFrames.