Spark scala running - scala

Hi I am new to spark and scala. I am running scala code in spark scala prompt. The program is fine, it's showing "defined module MLlib" but its not printing anything on screen. What have I done wrong? Is there any other way to run this program spark in scala shell and get the output?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
object MLlib {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName(s"Book example: Scala")
val sc = new SparkContext(conf)
// Load 2 types of emails from text files: spam and ham (non-spam).
// Each line has text from one email.
val spam = sc.textFile("/home/training/Spam.txt")
val ham = sc.textFile("/home/training/Ham.txt")
// Create a HashingTF instance to map email text to vectors of 100 features.
val tf = new HashingTF(numFeatures = 100)
// Each email is split into words, and each word is mapped to one feature.
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
// Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples ++ negativeExamples
trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
// Create a Logistic Regression learner which uses the LBFGS optimizer.
val lrLearner = new LogisticRegressionWithSGD()
// Run the actual learning algorithm on the training data.
val model = lrLearner.run(trainingData)
// Test on a positive example (spam) and a negative one (ham).
// First apply the same HashingTF feature transformation used on the training data.
val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
// Now use the learned model to predict spam/ham for new emails.
println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
println(s"Prediction for negative test example: ${model.predict(negTestExample)}")
sc.stop()
}
}

A couple of things:
You defined your object in the the Spark shell, so the main class won't get called immediately. You'll have to call it explicitly after you define the object:
MLlib.main(Array())
In fact, if you continue to work on the shell/REPL you can do away with the object altogether; you can define the function directly. For example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
def MLlib {
//the rest of your code
}
However, you shouldn't initialize SparkContext it within the shell. From the documentation:
In the Spark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work
So, you have to either remove that bit from your code, or compile it into a jar and run it using spark-submit

Related

Spark DataFrame to RDD and back

I am writing an Apache Spark application using Scala. To handle and store data I use DataFrames. I have a nice pipeline with feature extraction and a MultiLayerPerceptron classifier, using the ML API.
I also want to use SVM (for comparison purposes). The thing is (and correct me if I am mistaken) only the MLLib provides SVM. And MLLib is not ready to handle DataFrames, only RDDs.
So I figured I can maintain the core of my application using DataFrames and to use SVM 1) I just convert the DataFrame's columns I need to an RDD[LabeledPoint] and 2) after the classification add the SVMs prediction to the DataFrame as a new column.
The first part I handled with a small function:
private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.SparseVector](1)))
rddMl.map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
}
I have to specify and convert the type of vector since the feature extraction method uses ML API and not MLLib.
Then, this RDD[LabeledPoint] is fed to the SVM and classification goes smoothly, no issues. At the end and following spark's example I get an RDD[Double]:
val predictions = rdd.map(point => model.predict(point.features))
Now, I want to add the prediction score as column to the original DataFrame and return it. This is where I got stuck. I can convert the RDD[Double] to a DataFrame using
(sql context ommited)
import sqlContext.implicits._
val plDF = predictions.toDF("prediction")
But how do I join the two DataFrames where the second DataFrame becomes a column of the original one? I tried to use methods join and union but got SQL exceptions as the DataFrames have no equal columns to join or unite on.
EDIT
I tried
data.withColumn("prediction", plDF.col("prediction"))
But I get an Analysis Exception :(
I haven't figured out how to do it without recurring to RDDs, but anyway here's how I solved it with RDD. Added the rest of the code so that anyone can understand the complete logic. Any suggestions are appreciated.
package stuff
import java.util.logging.{Level, Logger}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
/**
* Created by camandros on 10-03-2017.
*/
class LinearSVMClassifier extends Classifier with Serializable{
#transient lazy val log: Logger = Logger.getLogger(getClass.getName)
private var model : SVMModel = _
override def train(data : DataFrame): Unit = {
val rdd = dataFrameToRDD(data)
// Run training algorithm to build the model
val numIter : Int = 100
val step = Osint.properties(Osint.SVM_STEPSIZE).toDouble
val c = Osint.properties(Osint.SVM_C).toDouble
log.log(Level.INFO, "Initiating SVM training with parameters: C="+c+", step="+step)
model = SVMWithSGD.train(rdd, numIterations = numIter, stepSize = step, regParam = c)
log.log(Level.INFO, "Model training finished")
// Clear the default threshold.
model.clearThreshold()
}
override def classify(data : DataFrame): DataFrame = {
log.log(Level.INFO, "Converting DataFrame to RDD")
val rdd = dataFrameToRDD(data)
log.log(Level.INFO, "Conversion finished; beginning classification")
// Compute raw scores on the test set.
val predictions = rdd.map(point => model.predict(point.features))
log.log(Level.INFO, "Classification finished; Transforming RDD to DataFrame")
val sqlContext : SQLContext = Osint.spark.sqlContext
val tupleRDD = data.rdd.zip(predictions).map(t => Row.fromSeq(t._1.toSeq ++ Seq(t._2)))
sqlContext.createDataFrame(tupleRDD, data.schema.add("predictions", "Double"))
//TODO this should work it doesn't since this "withColumn" method seems to be applicable only to add
// new columns using information from the same dataframe; therefore I am using the horrible rdd conversion
//val sqlContext : SQLContext = Osint.spark.sqlContext
//import sqlContext.implicits._
//val plDF = predictions.toDF("predictions")
//data.withColumn("prediction", plDF.col("predictions"))
}
private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.SparseVector](1)))
rddMl.map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
}
}

Running Mlib via Spark Job Server

I was practising developing sample model using online resources provided in spark website. I managed to create the model and run it for sample data using Spark-Shell , But how to do actually run the model in production environment ? Is it via Spark Job server ?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfect when i run it in spark-shell , But i have no idea how do we actually run model in production environment. I tried to run it via spark jobserver but i get error ,
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure its because am passing a String value whereas the program expects it be vector elements , Can someone guide me on how to achieve this . And also is this how the data being passed to Model in production environment ? Or is it some other way.
Spark Job-server is used in production use-cases, where you want to design pipelines of Spark jobs, and also (optionally) use the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is you do not need a third-party library to do so. You only need to construct a SparkContext object, and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is;
package runner
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
*
*/
object SparkRunner {
def main (args: Array[String]){
val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
val sc: SparkContext = constructSparkContext(config)
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
}
def constructSparkContext(config: Config): SparkContext = {
val conf = new SparkConf()
conf
.setMaster(config.getString("spark.master"))
.setAppName(config.getString("app.name"))
/*Set more configuration values here*/
new SparkContext(conf)
}
}
Optionally, you can also use the wrapper for spark-submit script, SparkSubmit, provided in the Spark library itself.

How can I call a UDF in a UDF?

Hopefully, my title is the correct description of what I am trying to accomplish. I have weather data that is aggregated by week, with each row being one weak and this data is sorted by time. I then have a mathematical expression that I evaluate using this weather data in a Spark UDF. The expressions are evaluated using dynamically generated code that is then injected back into the jvm, I wanted to eventually replace this with a Scala macro, but for now this uses Janino and SimpleCompiler to cook the code and reload the class back in.
Sometimes in these model strings there are variables and functions. The variables are easy to put in since they can be string replaced in the generated code, and the functions for the most part are easy too, because if their names map to an existing static function than it will just execute that when the model is evaluated. For instance an exponent maps to Math.pow in scala.Math.
So my issue is specifically is implementing a lag and lead function for this analysis. Spark has these 2 functions built in, but they are in the above Dataframe layer while this function would be called inside of a UDF, so I am having trouble trying to be able to reference this data back from the top.
So I have this code
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{lag => slag, udf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.{SparkConf, SparkContext}
object Functions {
val conf: SparkConf = new SparkConf().setAppName("Blah").setMaster("local[*]")
val ctx: SparkContext = new SparkContext(conf)
val hctx: HiveContext = new HiveContext(ctx)
import hctx.implicits._
def lag(x: Double, window: Int): Double = {
x
}
def lag(c: Column, window: Int = 1)(implicit windowSpec: WindowSpec): Column = {
slag(c, window).over(windowSpec).as(c.toString() + "_lag")
}
def main(args: Array[String]): Unit = {
val funcUdf = udf((f: Column) => lag(f))
val data: DataFrame = ctx.parallelize(Seq(0, 1, 2, 3, 4, 5)).toDF("value")
implicit val spec: WindowSpec = Window.orderBy($"value")
data.select(funcUdf($"value")).show()
}
}
Is there a way to accomplish this? This code doesn't work because of a forward reference. Is there some way or do I have to compute lag windows ahead of time and pass them all around?

How to extract variable weight from spark pipeline logistic model?

I am currently trying to learn Spark Pipeline (Spark 1.6.0). I imported datasets (train and test) as oas.sql.DataFrame objects. After executing the following codes, the produced model is a oas.ml.tuning.CrossValidatorModel.
You can use model.transform (test) to predict based on the test data in Spark. However, I would like to compare the weights that model used to predict with that from R. How to extract the weights of the predictors and intercept (if any) of model? The Scala codes are:
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
val conTrain = sc.textFile("AbsolutePath2Train.txt")
val conTest = sc.textFile("AbsolutePath2Test.txt")
// parse text and convert to sql.DataFrame
val train = conTrain.map { line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
val test =conTest.map{ line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// fit logistic model
val model = cv.fit(train)
// If you want to predict with test
val pred = model.transform(test)
My spark environment is not accessible. Thus, these codes are retyped and rechecked. I hope they are correct. So far, I have tried searching on webs, asking others. About my coding, welcome suggestions, and criticisms.
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// you can print lr model coefficients as below
val model = cv.bestModel.asInstanceOf[PipelineModel]
val lrModel = model.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"LR Model coefficients:\n${lrModel.coefficients.toArray.mkString("\n")}")
Two steps:
Get the best pipeline from cross validation result.
Get the LR Model from the best pipeline. It's the first stage in your code example.
I was looking for exactly the same thing. You might already have the answer, but anyway, here it is.
import org.apache.spark.ml.classification.LogisticRegressionModel
val lrmodel = model.bestModel.asInstanceOf[LogisticRegressionModel]
print(model.weight, model.intercept)
I am still not sure about how to extract weights from "model" above. But by restructuring the process towards the official tutorial, the following works on spark-1.6.0:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
val lr = new LogisticRegression().setMaxIter(400)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
val restructuredModel = trainValidationSplit.fit(train)
val lrmodel = restructuredModel.bestModel.asInstanceOf[LogisticRegressionModel]
lrmodel.weigths
lrmodel.intercept
I noticed the difference between "lrmodel" here and "model" generated above:
model.bestModel --> gives oas.ml.Model[_] = pipeline_****
restructuredModel.bestModel --> gives oas.ml.Model[_] = logreg_****
That's why we can cast resturcturedModel.bestModel as LogisticRegressionModel but not that of model.bestModel. I'll add more when I understand the reason of the differences.

Can I convert an incoming stream of data into an array?

I'm trying to learn streaming data and manipulating it with the telecom churn dataset provided here. I've written a method to calculate this in batch:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object batchChurn{
def main(args: Array[String]): Unit = {
//setting spark context
val conf = new SparkConf().setAppName("churn")
val sc = new SparkContext(conf)
//loading and mapping data into RDD
val csv = sc.textFile("file://filename.csv")
val data = csv.map {line =>
val parts = line.split(",").map(_.trim)
val stringvec = Array(parts(1)) ++ parts.slice(4,20)
val label = parts(20).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))
}
val splits = data.randomSplit(Array(0.7,0.3))
val (training, testing) = (splits(0),splits(1))
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 6
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 7
val maxBins = 32
val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val labelAndPreds = testing.map {point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
}
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the spark website, and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream to an array which I can input into my analysis code? Currently when I try to convert to Array using val data = lines.flatMap(_.split(",")), it clearly says that:error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs you can get access to the RDDs using foreachRDD function.
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
then each RDD can be converted to array using collect function.
this has already been shown here
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval of
course, you could collect in an array
val arr = new ArrayBuffer[String]();
data.foreachRDD {
arr ++= _.collect()
}
Also keep in mind you could end up having way more data than you want in your driver since a DStream can be huge.
To limit the data for your analysis , I would do this way
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
You cannot put all the elements of a DStream in an array because those elements will keep being read over the wire, and your array would have to be indefinitely extensible.
The adaptation of this decision tree model to a streaming mode, where training and testing data arrives continuously, is not trivial for algorithmical reasons — while the answers mentioning collect are technically correct, they're not the appropriate solution to what you're trying to do.
If you want to run decision trees on a Stream in Spark, you may want to look at Hoeffding trees.