Running MLlib via Spark Job Server (Scala)

I was practising building a sample model using the online resources on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do I actually run the model in a production environment? Is it via Spark Job Server?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how to actually run the model in a production environment. I tried to run it via Spark Job Server, but I get an error:
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it's because I am passing a String value whereas the program expects vector elements. Can someone guide me on how to achieve this? Also, is this how data is passed to a model in a production environment, or is there some other way?

Spark Job Server is used in production use cases where you want to design pipelines of Spark jobs and (optionally) share a SparkContext across jobs, all over a REST API. Sparkplug is an alternative to Spark Job Server that provides similar constructs.
However, to answer your question on how to run a single Spark job in a production environment: you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:
package runner
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Standalone entry point: trains the SVM model and prints one prediction.
 */
object SparkRunner {
def main(args: Array[String]): Unit = {
val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
val sc: SparkContext = constructSparkContext(config)
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
val svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
val predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
sc.stop() // Release the application's resources once the job is done
}
def constructSparkContext(config: Config): SparkContext = {
val conf = new SparkConf()
conf
.setMaster(config.getString("spark.master"))
.setAppName(config.getString("app.name"))
/*Set more configuration values here*/
new SparkContext(conf)
}
}
Optionally, you can also use the wrapper for spark-submit script, SparkSubmit, provided in the Spark library itself.
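On the String-versus-vector part of the question: whichever way the input reaches the job, a comma-separated string such as the one in the curl call has to be turned into numbers before it can be handed to Vectors.dense. A minimal sketch in plain Scala follows; the object name FeatureParsing and the helper parseFeatures are made up for illustration, and how the raw string is read from the job's configuration is not shown.

```scala
object FeatureParsing {
  /** Splits a comma-separated string and converts each non-empty token to a Double. */
  def parseFeatures(input: String): Array[Double] =
    input.split(',').map(_.trim).filter(_.nonEmpty).map(_.toDouble)

  def main(args: Array[String]): Unit = {
    // Same shape as the value passed as input.string in the question.
    val features = parseFeatures("1, 2, 3, 4, 5, 6, 7, 8, 9")
    println(features.mkString("[", ", ", "]"))
  }
}
```

The resulting Array[Double] can be passed straight to Vectors.dense, which accepts an array of doubles, so model.predict(Vectors.dense(features)) would replace the hard-coded vector. A token that is not a number makes toDouble throw a NumberFormatException, so real input should be validated first.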

Related

How to Run Apache Tika on Apache Spark

I am trying to run Apache Tika on Apache Spark on AWS EMR to perform distributed text extraction on a large collection of documents. I have built the Tika JAR with shaded dependencies as explained in https://forums.databricks.com/questions/28378/trying-to-use-apache-tika-on-databricks.html and the job works correctly in local mode. However, when running the job in cluster mode, the extracted text always comes out as an empty string. This problem is outlined in Tika's documentation (https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-NoContentExtracted), but I haven't been able to debug the issue. Since the code works for me in local mode, it has to be something to do with the classpath or JARs, and I can't figure it out.
Here is sample Scala code for my Spark job:
/* TikaTest.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.tika.parser._
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata
import java.io.DataInputStream
// The first argument must be an S3 path to a directory with documents for text extraction.
// The second argument must be an S3 path to a directory where extracted text will be written.
object TikaTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Tika Test")
val sc = new SparkContext(conf)
val binRDD = sc.binaryFiles(args(0))
val textRDD = binRDD.map(file => {parseFile(file._2.open( ))})
textRDD.saveAsTextFile(args(1))
sc.stop()
}
def parseFile(stream: DataInputStream): String = {
val parser = new AutoDetectParser()
val handler = new BodyContentHandler()
val metadata = new Metadata()
val context = new ParseContext()
parser.parse(stream, handler, metadata, context)
return handler.toString()
}
}

Setting up IntelliJ to run Apache Spark with a remote master

I have a project set up with H2O. I am able to run the code with Apache Toree, with the Spark master set to spark://xxxx.yyyy.zzzz:port.
It works fine and I can see the output in the Spark UI.
I am trying to run the same code as an application in IntelliJ, but I get the error java.lang.ClassNotFoundException: org.apache.spark.h2o.utils.NodeDesc, although I do see the application in the Spark UI for a short time.
I tried running a simple hello-world application and that worked; I was able to see the application in the Spark UI.
import java.io.File
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters
import org.apache.spark.h2o.{StringHolder, H2OContext}
import org.apache.spark.{SparkFiles, SparkContext, SparkConf}
import water.fvec.H2OFrame
/**
* Example of Sparkling Water based application.
*/
object SparklingWaterDroplet {
def main(args: Array[String]) {
// Create Spark Context
val conf = configure("Sparkling Water Droplet")
val sc = new SparkContext(conf)
// Create H2O Context
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext.implicits._
// Register file to be available on all nodes
sc.addFile(this.getClass.getClassLoader.getResource("iris.csv").getPath)
// Load data and parse it via h2o parser
val irisTable = new H2OFrame(new File(SparkFiles.get("iris.csv")))
// Build GBM model
val gbmParams = new GBMParameters()
gbmParams._train = irisTable
gbmParams._response_column = 'class
gbmParams._ntrees = 5
val gbm = new GBM(gbmParams)
val gbmModel = gbm.trainModel.get
// Make prediction on train data
val predict = gbmModel.score(irisTable)('predict)
// Compute number of mispredictions with help of Spark API
val trainRDD = h2oContext.asRDD[StringHolder](irisTable('class))
val predictRDD = h2oContext.asRDD[StringHolder](predict)
// Make sure that both RDDs have the same number of elements
assert(trainRDD.count() == predictRDD.count)
val numMispredictions = trainRDD.zip(predictRDD).filter( i => {
val act = i._1
val pred = i._2
act.result != pred.result
}).collect()
println(
s"""
|Number of mispredictions: ${numMispredictions.length}
|
|Mispredictions:
|
|actual X predicted
|------------------
|${numMispredictions.map(i => i._1.result.get + " X " + i._2.result.get).mkString("\n")}
""".stripMargin)
// Shutdown application
sc.stop()
}
def configure(appName:String = "Sparkling Water Demo"):SparkConf = {
val conf = new SparkConf().setAppName(appName)
.setMaster("spark://xxx.yyy.zz.aaaa:oooo")
conf
}
}
I also tried exporting the jars with compile scope from the dependencies menu.
Is there anything I am missing in the IntelliJ setup?
It looks like the external libraries are not getting pushed to the master.

Apache Ignite Scala Program brings up Ignite Shell and does not progress

This very simple Apache Ignite Scala program brings up the Ignite shell and does not progress beyond the IgniteContext line; it just waits, as a REPL shell would. What change do I need to make so that the Ignite shell does not come up? All I want to do is store data in an Ignite cache and then read it back from within a Scala/Spark program.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.ignite.spark._
import org.apache.ignite.configuration._
object IgniteIt {
def main(args: Array[String]) {
println("\n==========\nIgnite!\n==========\n")
val cf = new SparkConf().setAppName("Ignite")
val sc = new SparkContext(cf)
val igniteContext = new IgniteContext(sc, "cfg/example-cache.xml")
val cacheRdd: org.apache.ignite.spark.IgniteRDD[Int,String] = igniteContext.fromCache("partitioned")
val data = Array((1,"One"),(2,"two"),(3,"three"),(4,"four"),(5,"five"))
val distData = sc.parallelize(data)
cacheRdd.savePairs(distData)
val result = cacheRdd.filter(_._2.contains("three")).collect()
result.foreach(println)
igniteContext.close(false)
println("\n==========\nDone!\n==========\n")
}
}
I think you did not start ignite.sh before IgniteContext was invoked.
You need to run:
cd $IGNITE_HOME
bin/ignite.sh

Will a standalone Scala program take advantage of distributed/parallel processing, or does Spark require separate code?

First of all, sorry for asking such a basic question here, but an explanation of the following would be appreciated.
I am very new to Scala and Spark. If I write a standalone Scala program and execute it on Spark (1 master, 3 workers), will the program take advantage of distributed/parallel processing, or do I need to write a separate program to benefit from distributed processing?
For example, we have Scala code that converts files in a particular format into comma-separated output. It takes a directory as input, parses all the files, and writes the output to a single file (each input file is usually 100-200 MB). Here is the code:
import scala.io.Source
import java.io.File
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer
import java.util.Calendar
//import scala.io.Source
//import org.apache.spark.SparkContext
//import org.apache.spark.SparkContext._
//import org.apache.spark.SparkConf
object Parser {
def main(args:Array[String]) {
//val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
//val sc = new SparkContext(conf)
var inp = new File(args(0))
var ext: String = ""
if(args.length == 1)
{ ext = "log" } else { ext = args(1) }
var files: List[String] = List("")
if (inp.exists && inp.isDirectory) {
files = getListOfFiles(inp,ext)
}
else if(inp.exists ) {
files = List(inp.toString)
}
else
{
println("Enter the correct Directory/File name");
System.exit(0);
}
if(files.length <=0 )
{
println(s"No file found with extention '.$ext'")
}
else{
var out_file_name = "output_"+Calendar.getInstance().getTime.toString.replace(" ","-").replace(":","-")+".log"
var data = getHeader(files(0))
var writer=new PrintWriter(new File(out_file_name))
var record_count = 0
//var allrecords = data.mkString(",")+("\n")
//writer.write(allrecords)
for(eachFile <- files)
{
record_count += parseFile(writer,data,eachFile)
}
writer.close()
println(record_count +s" processed into $out_file_name")
}
//all func are defined here.
}
Files from the specified directory are read using scala.io:
Source.fromFile(file).getLines
So my question is: can the above code (a standalone program) be executed on a distributed Spark system, and will I get the advantage of parallel processing?
OK, how about using sc to read the files? Will it then use distributed processing?
val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
val sc = new SparkContext(conf)
...
...
for(eachFile <- files)
{
record_count += parseFile(sc,writer,data,eachFile)
}
------------------------------------
def parseFile(......)
sc.textFile(file).getLines
So if I edit the code at the top to make use of sc, will it then be processed on the distributed Spark system?
No, it won't. To make use of distributed computing with Spark, you need to use SparkContext.
If you run the application you have provided using spark-submit, you will not be using the Spark cluster at all: everything runs in a single JVM (the driver) and the workers stay idle. You have to rewrite it to use the SparkContext. Please read through the Spark Programming Guide.
It is extremely helpful to watch some introductory videos on YouTube to get to know how Apache Spark works in general.
For example, these:
https://www.youtube.com/watch?v=7k4yDKBYOcw
https://www.youtube.com/watch?v=rvDpBTV89AM&list=PLF6snu5Jy-v-WRAcCfWNHks7lcNO-zrTI&index=4
It is very important to understand this in order to use Spark.
"advantage of distributed processing"
Using Spark can give you the advantages of distributing processing across a multi-server cluster. So if you are going to move your application to a cluster later, it makes sense to develop the application using the Spark model and the corresponding API.
You can also run a Spark application on your local machine, but in that case you won't get all the advantages that Spark can provide.
Anyway, as was said before, Spark is a framework with its own libraries for development. So you have to rewrite your application using the SparkContext and the Spark API, i.e. special objects like RDDs or DataFrames and their corresponding methods.
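As a rough illustration of what rewriting with the Spark API could look like for the file parser in the question, here is a minimal sketch. The object name DistributedParser, the helper toCsv, and the use of args(0) and args(1) as input and output paths are assumptions made for this example; the question does not show the real parsing rules, so toCsv simply joins whitespace-separated fields with commas as a stand-in.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistributedParser {
  // Hypothetical per-line transformation; a stand-in for the real parsing logic.
  def toCsv(line: String): String = line.trim.split("\\s+").mkString(",")

  def main(args: Array[String]): Unit = {
    // The master is left unset here so that spark-submit --master can supply it.
    val conf = new SparkConf().setAppName("fileParsing")
    val sc = new SparkContext(conf)
    try {
      // textFile accepts a file, a directory, or a glob and returns an RDD[String]
      // whose partitions are processed in parallel by the executors.
      val lines = sc.textFile(args(0))
      val parsed = lines.filter(_.trim.nonEmpty).map(toCsv)
      // The result is written as a directory of part files, one per partition.
      parsed.saveAsTextFile(args(1))
    } finally {
      sc.stop()
    }
  }
}
```

Two differences from the standalone version are worth keeping in mind: the input path has to be readable from every worker (HDFS or another shared store, not a directory that exists only on one machine), and the output is a directory of part files rather than a single file written through a PrintWriter.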

Spark scala running

Hi, I am new to Spark and Scala. I am running Scala code at the Spark Scala prompt. The program compiles fine and the shell shows "defined module MLlib", but it does not print anything on screen. What have I done wrong? Is there another way to run this program in the Spark Scala shell and get the output?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
object MLlib {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName(s"Book example: Scala")
val sc = new SparkContext(conf)
// Load 2 types of emails from text files: spam and ham (non-spam).
// Each line has text from one email.
val spam = sc.textFile("/home/training/Spam.txt")
val ham = sc.textFile("/home/training/Ham.txt")
// Create a HashingTF instance to map email text to vectors of 100 features.
val tf = new HashingTF(numFeatures = 100)
// Each email is split into words, and each word is mapped to one feature.
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
// Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples ++ negativeExamples
trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
// Create a Logistic Regression learner which uses the SGD optimizer.
val lrLearner = new LogisticRegressionWithSGD()
// Run the actual learning algorithm on the training data.
val model = lrLearner.run(trainingData)
// Test on a positive example (spam) and a negative one (ham).
// First apply the same HashingTF feature transformation used on the training data.
val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
// Now use the learned model to predict spam/ham for new emails.
println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
println(s"Prediction for negative test example: ${model.predict(negTestExample)}")
sc.stop()
}
}
A couple of things:
You defined your object in the Spark shell, so its main method won't get called automatically. You'll have to call it explicitly after you define the object:
MLlib.main(Array())
In fact, if you continue to work on the shell/REPL you can do away with the object altogether; you can define the function directly. For example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
def MLlib {
//the rest of your code
}
However, you shouldn't initialize a SparkContext within the shell. From the documentation:
In the Spark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work
So you have to either remove that bit from your code, or compile it into a jar and run it using spark-submit.
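One way to keep a single code base that works in both settings is to let the job logic take the SparkContext as a parameter, and create a context only in main. The sketch below assumes this layout; the object ShellFriendlyJob and its run method are invented for illustration, and the tiny job inside run is a placeholder for the MLlib code from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShellFriendlyJob {
  // The job logic only uses the context it is given; it never creates one.
  def run(sc: SparkContext): Long =
    sc.parallelize(Seq("spam", "ham")).filter(_.length > 3).count()

  // Used only when the code is packaged into a jar and launched with spark-submit.
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Book example: Scala"))
    try println(run(sc)) finally sc.stop()
  }
}
```

In the Spark shell you would then call ShellFriendlyJob.run(sc) with the sc that the shell already provides, and never invoke main.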