Unresolved imports in 'Smile' machine learning library - Scala

The following import does not resolve.
import smile.data.{Attribute, AttributeDataset, NumericAttribute}
Any idea whether these classes were removed in the new Smile version 3.0.0?
I am using Scala 2.13.10.
How can the following code be translated to Smile v3.0.0?
import smile.clustering.KMeans
import smile.data.{AttributeDataset, NumericAttribute}
import smile.read
import smile.write
// Load the data into a Smile AttributeDataset
val data: AttributeDataset = read.csv("data.csv", header = true)
// Split the data into two equal parts
val (data1, data2) = data.split(0.5)
// Train a k-means clustering model on the first part of the data
val model1 = KMeans.fit(data1.x(), 3)
// Train another k-means clustering model on the second part of the data
val model2 = KMeans.fit(data2.x(), 3)

The classes smile.data.{Attribute, AttributeDataset, NumericAttribute} existed till Smile 1.5.3 (Scala 2.10.x-2.12.x)
https://mvnrepository.com/artifact/com.github.haifengl/smile-scala
They were removed in Smile 2.0.0 (Nov 22, 2019)
https://github.com/haifengl/smile/commit/07c5d2507d6ee8bd1dd68202fdbb323dada91a2f (StructType seems to replace Attribute[])
https://github.com/haifengl/smile/releases/tag/v2.0.0
Fully redesigned API. It is leaner, simpler and even more friendly.
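Roughly, the original snippet maps onto the 3.x API along these lines. This is only a sketch: it assumes data.csv is all numeric (so DataFrame.toArray() applies) and replaces AttributeDataset.split(0.5) with a plain Scala splitAt, since the old split/x() helpers have no direct equivalent:
import smile.clustering.KMeans
import smile.read
// read.csv now returns a smile.data.DataFrame instead of an AttributeDataset
val df = read.csv("data.csv")
// Convert the numeric frame to double[][] and split it in half
val rows = df.toArray()
val (part1, part2) = rows.splitAt(rows.length / 2)
// Train a k-means clustering model on each half
val model1 = KMeans.fit(part1, 3)
val model2 = KMeans.fit(part2, 3)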

Okay, done. ChatGPT did it :)
import smile.clustering.KMeans
// Sample data points, two features each
val data2 = Array(
Array(5.0, 7.0),
Array(3.5, 5.0),
Array(4.5, 5.0),
Array(3.5, 4.5)
)
// Train a KMeans model with 2 clusters
val model1 = KMeans.fit(data2, 2)
// Use the trained model to predict the cluster for a new instance
val prediction = model1.predict(Array(4.0, 5.0))
println(s"Instance belongs to cluster: $prediction")

Related

Scala, Spark, GeoTrellis RDD CRS reprojection

I load a set of points from a CSV file into an RDD:
case class yieldrow(Elevation:Double,DryYield:Double)
val points :RDD[PointFeature[yieldrow]] = lines.map { line =>
val fields = line.split(",")
val point = Point(fields(1).toDouble,fields(0).toDouble)
Feature(point, yieldrow(fields(4).toDouble, fields(20).toDouble))
}
Then get:
points: org.apache.spark.rdd.RDD[geotrellis.vector.PointFeature[yieldrow]]
Now I need to reproject from EPSG:4326 to EPSG:32720.
So I create the CRS from and to:
val crsFrom : geotrellis.proj4.CRS = geotrellis.proj4.CRS.fromName("EPSG:4326")
val crsTo : geotrellis.proj4.CRS = geotrellis.proj4.CRS.fromEpsgCode(32720)
But I cannot create the transform, and I also do not know:
How to apply a transform to a single point:
val pt = Point( -64.9772376007928, -33.6408083223936)
How to use the mapGeom method of Feature to make a CRS transformation ?
points.map(_.mapGeom(?????))
points.map(feature => feature.mapGeom(????))
How to use ReprojectPointFeature(pointfeature) ?
The documentation does not have basic code samples.
Any help will be appreciated.
I'll start from the last question:
Indeed, to reproject a PointFeature you can use the ReprojectPointFeature implicit case class. To use it, just make sure you have import geotrellis.vector._ in scope at the reproject call site.
import geotrellis.vector._
points.map(_.reproject(crsFrom, crsTo))
The same import works for a Point too:
import geotrellis.vector._
pt.reproject(crsFrom, crsTo)
And the same approach works through mapGeom, to answer that question:
points.map(_.mapGeom(_.reproject(crsFrom, crsTo)))

Get best parameters for TrainValidationSplit (Scala)

I am using the Spark Scala ML API and I am trying to pass a pipeline ALS model to the TrainValidationSplit. The code executes but I am unable to retrieve the best parameters...thoughts?
val alsPipeline = new Pipeline().setStages(Array(idIndexer , modelIndexer, als))
val paramGrid = new ParamGridBuilder().
addGrid(als.maxIter, Array(5, 10)).
addGrid(als.regParam, Array(0.01, 0.05, 0.1)).
addGrid(als.implicitPrefs).
build()
val tvs = new TrainValidationSplit().
setEstimator(alsPipeline).
setEvaluator(new RegressionEvaluator().
setMetricName("rmse").
setLabelCol("purchases").
setPredictionCol("prediction")).
setEstimatorParamMaps(paramGrid).
setTrainRatio(0.75)
val alsModel = tvs.fit(trainALS)
You can get the RMSE for each parameter combination in your grid using:
alsModel.getEstimatorParamMaps.zip(alsModel.validationMetrics)
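A small follow-up sketch to pull out the single best combination (assuming the RMSE metric, where lower is better):
val (bestParams, bestRmse) =
alsModel.getEstimatorParamMaps.zip(alsModel.validationMetrics).minBy(_._2)
println(s"Best RMSE: $bestRmse")
println(bestParams)
The fitted pipeline for those settings is also available directly as alsModel.bestModel.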

Spark 2.0 ALS recommendation: how to recommend to a user

I have followed the guide given in this link:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
But it is outdated, as it uses the Spark MLlib RDD approach. Spark 2.0 has a DataFrame-based approach. My problem now is that I have the updated code:
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
.map(parseRating)
.toDF()
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training)
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
Here is the problem: in the old code the model obtained was a MatrixFactorizationModel; now ALS has its own model (ALSModel).
With a MatrixFactorizationModel you could directly do
val recommendations = bestModel.get
.predict(userID)
which gives the list of products with the highest probability of the user liking them.
But now there is no .predict method. Any idea how to recommend a list of products given a user ID?
Use the transform method on the model:
import spark.implicits._
val dataFrameToPredict = sparkContext.parallelize(Seq((111, 222)))
.toDF("userId", "productId")
val predictionsOfProducts = model.transform (dataFrameToPredict)
There's a JIRA ticket to implement a recommend(User|Product) method, but it's not yet on the default branch.
Now you have a DataFrame with the prediction score for the user.
You can simply use orderBy and limit to show N recommended products:
// where is for case when we have big DataFrame with many users
model.transform(dataFrameToPredict.where('userId === givenUserId))
.select('productId, 'prediction)
.orderBy('prediction.desc)
.limit(N)
.map { case Row(productId: Int, prediction: Float) => (productId, prediction) }
.collect()
The DataFrame dataFrameToPredict can be a large user-product DataFrame, for example all users x all products.
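A sketch of that, assuming Spark 2.1+ (where DataFrame.crossJoin is available) and reusing the ratings DataFrame from the question; note the item column name must match the model's setItemCol ("movieId" in the question, "productId" in the snippet above):
val allUsers = ratings.select("userId").distinct()
val allItems = ratings.select("movieId").distinct()
val dataFrameToPredict = allUsers.crossJoin(allItems)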
The ALS Model in Spark contains the following helpful methods:
recommendForAllItems(int numUsers)
Returns top numUsers users recommended for each item, for all items.
recommendForAllUsers(int numItems)
Returns top numItems items recommended for each user, for all users.
recommendForItemSubset(Dataset<?> dataset, int numUsers)
Returns top numUsers users recommended for each item id in the input data set.
recommendForUserSubset(Dataset<?> dataset, int numItems)
Returns top numItems items recommended for each user id in the input data set.
e.g. Python
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import explode
alsEstimator = ALS()
(alsEstimator.setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop"))
alsModel = alsEstimator.fit(productRatings)
recommendForSubsetDF = alsModel.recommendForUserSubset(TargetUsers, 40)
recommendationsDF = (recommendForSubsetDF
.select("user_id", explode("recommendations")
.alias("recommendation"))
.select("user_id", "recommendation.*")
)
display(recommendationsDF)
e.g. Scala:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.functions.explode
val alsEstimator = new ALS().setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop")
val alsModel = alsEstimator.fit(productRatings)
val recommendForSubsetDF = alsModel.recommendForUserSubset(sampleTargetUsers, 40)
val recommendationsDF = recommendForSubsetDF
.select($"user_id", explode($"recommendations").alias("recommendation"))
.select($"user_id", $"recommendation.*")
display(recommendationsDF)
Here is what I did to get recommendations for a specific user with spark.ml:
import com.github.fommil.netlib.BLAS.{getInstance => blas}
userFactors.lookup(userId).headOption.fold(Map.empty[String, Float]) { user =>
val ratings = itemFactors.map { case (id, features) =>
val rating = blas.sdot(features.length, user, 1, features, 1)
(id, rating)
}
ratings.sortBy(_._2, ascending = false).take(numResults).toMap
}
Both userFactors and itemFactors in my case are RDD[(String, Array[Float])] but you should be able to do something similar with DataFrames.
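For reference, a sketch of how such RDDs might be derived from an ALSModel, whose userFactors and itemFactors DataFrames have an "id" (Int) column and a "features" (Array[Float]) column:
val userFactors = model.userFactors.rdd
.map(row => (row.getInt(0).toString, row.getSeq[Float](1).toArray))
val itemFactors = model.itemFactors.rdd
.map(row => (row.getInt(0).toString, row.getSeq[Float](1).toArray))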

How to parse a CSV file with DataVec using a schema?

I am trying to load a CSV data set with Canova/DataVec, and cannot find the "idiomatic" way of doing it. I struggle a bit because the framework has evolved, which makes it difficult for me to determine what is relevant and what is not.
object S extends App{
val recordReader:RecordReader = new CSVRecordReader(0, ",")
recordReader.initialize(new FileSplit(new File("./src/main/resources/CSVdataSet.csv")))
val iter:DataSetIterator = new RecordReaderDataSetIterator(recordReader, 100)
while(iter.hasNext){
println(iter.next())
}
}
I have a CSV file that starts with a header description, and thus my output is an exception:
(java.lang.NumberFormatException: For input string: "iid":)
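A side note on the exception itself: it comes from the header row reaching the numeric parser. Since the first constructor argument of CSVRecordReader is the number of lines to skip, the header row can simply be skipped at read time:
// Skip the first line (the header) so only data rows reach the parser
val recordReader: RecordReader = new CSVRecordReader(1, ",")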
I started looking into the schema builder, since I get an exception because of the schema/header. So I was thinking of adding a schema like this:
val schema = new Schema.Builder()
.addColumnInteger("iid")
.build()
From my point of view (the noob view), the BasicDataVec examples are not completely clear because they tie everything to Spark etc. From the IrisAnalysisExample (https://github.com/deeplearning4j/dl4j-examples/blob/master/datavec-examples/src/main/java/org/datavec/transform/analysis/IrisAnalysis.java),
I assume that the file content is first read into a JavaRDD (potentially a Stream) and then processed afterwards. The schema is not used except for the DataAnalysis.
So, could someone help me understand how to parse (as a stream or iterator) a CSV file with a header description as the first line?
I understand from their book (Deep Learning: A Practitioner's Approach) that Spark is needed for data transformation (which a schema is used for). I thus rewrote my code to:
object S extends App{
val schema: Schema = new Schema.Builder()
.addColumnInteger("iid")
.build
val recordReader = new CSVRecordReader(0, ",")
val f = new File("./src/main/resources/CSVdataSet.csv")
recordReader.initialize(new FileSplit(f))
val sparkConf:SparkConf = new SparkConf()
sparkConf.setMaster("local[*]");
sparkConf.setAppName("DataVec Example");
val sc:JavaSparkContext = new JavaSparkContext(sparkConf)
val lines = sc.textFile(f.getAbsolutePath);
val examples = lines.map(new StringToWritablesFunction(new CSVRecordReader()))
val process = new TransformProcess.Builder(schema).build()
val executor = new SparkTransformExecutor()
val processed = executor.execute(examples, process)
println(processed.first())
}
I thought the schema would dictate that I would only have the iid column, but the output is:
[iid, id, gender, idg, .....]
It might be considered bad practice to answer my own question, but I will keep my question (and now answer) for a while to see if it was informative and useful for others.
I understand how to use a schema on data where I can create a corresponding schema attribute for each of the features. I originally wanted to work on a dataset with more than 200 feature values in each vector. Having to declare a static schema containing a column attribute for all 200 features made it impractical to use. However, there is probably a more dynamic way of creating schemas, and I just have not found it yet (see the sketch after the Iris schema below). I decided to test my code on the Iris.csv data set. Here the file contains row attributes for:
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
Which would be implemented as a schema:
val schema: Schema = new Schema.Builder()
.addColumnInteger("Id")
.addColumnDouble("SepalLengthCm")
.addColumnDouble("SepalWidthCm")
.addColumnDouble("PetalLengthCm")
.addColumnDouble("PetalWidthCm")
.addColumnString("Species")
.build
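And here is a sketch of the "more dynamic" schema construction mentioned above: build the column list programmatically instead of declaring every column by hand (the feature names here are made up for illustration):
val featureNames = (1 to 200).map(i => s"feature_$i")
val dynamicSchema = featureNames
.foldLeft(new Schema.Builder()) { (builder, name) => builder.addColumnDouble(name) }
.build()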
I feel that one of the motives behind using a schema is to be able to transform the data. Thus, I would like to perform a transform operation. A TransformProcess defines a sequence of operations to perform on our data (Appendix F, "Using DataVec", page 405 of Deep Learning: A Practitioner's Approach).
A TransformProcess is constructed by specifying two things:
• The Schema of the initial input data
• The set of operations we wish to execute
I decided to see if I could remove a column from the read data:
val process = new TransformProcess.Builder(schema)
.removeColumns("Id")
.build()
Thus, my code became:
import org.datavec.api.records.reader.impl.csv.CSVRecordReader
import org.datavec.api.transform.{DataAction, TransformProcess}
import org.datavec.api.transform.schema.Schema
import java.io.File
import org.apache.spark.api.java.JavaSparkContext
import org.datavec.spark.transform.misc.StringToWritablesFunction
import org.apache.spark.SparkConf
import org.datavec.api.split.FileSplit
import org.datavec.spark.transform.SparkTransformExecutor
object S extends App{
val schema: Schema = new Schema.Builder()
.addColumnInteger("Id")
.addColumnDouble("SepalLengthCm")
.addColumnDouble("SepalWidthCm")
.addColumnDouble("PetalLengthCm")
.addColumnDouble("PetalWidthCm")
.addColumnString("Species")
.build
val recordReader = new CSVRecordReader(0, ",")
val f = new File("./src/main/resources/Iris.csv")
recordReader.initialize(new FileSplit(f))
println(recordReader.next())
val sparkConf:SparkConf = new SparkConf()
sparkConf.setMaster("local[*]");
sparkConf.setAppName("DataVec Example");
val sc:JavaSparkContext = new JavaSparkContext(sparkConf)
val lines = sc.textFile(f.getAbsolutePath);
val examples = lines.map(new StringToWritablesFunction(new CSVRecordReader()))
val process = new TransformProcess.Builder(schema)
.removeColumns("Id")
.build()
val executor = new SparkTransformExecutor()
val processed = executor.execute(examples, process)
println(processed.first())
}
The first prints:
[Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
The second prints:
[SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
Edit: I see that I get a crash with
"org.deeplearning4j" % "deeplearning4j-core" % "0.6.0"
as my library dependency, while it works with the old dependency:
"org.deeplearning4j" % "deeplearning4j-core" % "0.0.3.2.7"
libraryDependencies ++= Seq(
"org.datavec" % "datavec-spark_2.11" % "0.5.0",
"org.datavec" % "datavec-api" % "0.5.0",
"org.deeplearning4j" % "deeplearning4j-core" % "0.0.3.2.7"
//"org.deeplearning4j" % "deeplearning4j-core" % "0.6.0"
)

MLlib: Saving and loading a model

I'm using LinearRegressionWithSGD and then I save the model weights and intercept.
The file that contains the weights has this format:
1.20455
0.1356
0.000456
The intercept is 0, since I am training without setting the intercept, so it can be ignored for the moment. I would now like to initialize a new model object using these saved weights from the file above. We are using CDH 5.1.
Something along these lines:
// Load the saved weights and construct a model from them (pseudocode)
val weights = sc.textFile("linear-weights");
val model = new LinearRegressionWithSGD(weights);
then use it as:
// Here is where I want to use the trained model to predict on new data.
val valuesAndPreds = testData.map { point =>
// Predicting on new data.
val prediction = model.predict(point.features)
(point.label, prediction)
}
Any pointers to how do I do that?
It appears you are duplicating the training portion of the LinearRegressionWithSGD - which takes a LibSVM file as input.
Are you certain that you want to provide your own weights - instead of allowing the library to do its job in the training phase?
If so, then you can create your own LinearRegressionWithSGD and override createModel.
Here would be your steps given you already have calculated your desired weights / performed the training your own way:
// Stick in your weights below ..
var model = algorithm.createModel(weights, 0.0)
// Now you can run the last steps of the 'normal' process
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
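Alternatively, here is a minimal sketch that skips subclassing and builds the model directly through the public LinearRegressionModel constructor (it reuses sc and testData from the question, and assumes one weight per line in the saved file):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel
// Parse the saved weights (one value per line) and rebuild the model
val savedWeights = sc.textFile("linear-weights").map(_.trim.toDouble).collect()
val restoredModel = new LinearRegressionModel(Vectors.dense(savedWeights), 0.0)
// Predict on new data exactly as in the question
val valuesAndPreds = testData.map { point =>
(point.label, restoredModel.predict(point.features))
}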
BTW, for reference, here is the more 'standard' approach that includes the training steps:
val data = MLUtils.loadLibSVMFile(sc, inputFile).cache()
val splits = data.randomSplit(Array(0.8, 0.2))
val training = splits(0).cache()
val test = splits(1).cache()
val updater = params.regType match {
case NONE => new SimpleUpdater()
case L1 => new L1Updater()
case L2 => new SquaredL2Updater()
}
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
.setNumIterations(params.numIterations)
.setStepSize(params.stepSize)
.setUpdater(updater)
.setRegParam(params.regParam)
val model = algorithm.run(training)
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))