Compilation error saving model written in Scala, Apache Spark

I am running the example source code provided by Apache Spark to create an FPGrowth model. I want to save the model for future use, so I added the final line of this code (model.save):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.mllib.util._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import java.io._
import scala.collection.mutable.Set

object App {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("prediction").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.textFile("FPFeatureSeries.txt")
    val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

    val fpg = new FPGrowth()
      .setMinSupport(0.1)
      .setNumPartitions(10)
    val model = fpg.run(transactions)

    val minConfidence = 0.8
    model.generateAssociationRules(minConfidence).collect().foreach { rule =>
      if (rule.confidence > minConfidence) {
        println(
          rule.antecedent.mkString("[", ",", "]")
            + " => " + rule.consequent.mkString("[", ",", "]")
            + ", " + rule.confidence)
      }
    }

    model.save(sc, "FPGrowthModel")
  }
}
The problem is that I get a compilation error: value save is not a member of org.apache.spark.mllib.fpm.FPGrowth
I have tried including the libraries and copying the exact examples from the documentation, but I am still getting the same error.
I am using Spark 2.0.0 and Scala 2.10.

I had the same issue. I used this to save the model:
sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
and this to load the model:
val linRegModel = sc.objectFile[LinearRegressionModel]("path").first()
This might also help:
what-is-the-right-way-to-save-load-models-in-spark-pyspark
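Applied to the FPGrowthModel from the question, a minimal sketch of that workaround could look like this (the "FPGrowthModelObj" path is just a placeholder):
import org.apache.spark.mllib.fpm.FPGrowthModel

// save: wrap the trained model in a one-element RDD and write it as an object file
sc.parallelize(Seq(model), 1).saveAsObjectFile("FPGrowthModelObj")

// load: read the object file back and take its single element
val loadedModel = sc.objectFile[FPGrowthModel[String]]("FPGrowthModelObj").first()
Note that FPGrowthModel is parameterized by the item type, which is String here because the transactions are an RDD[Array[String]].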

Related

Why is an error given while reading the text file in Spark?

The path given to the text file is correct, but I am still getting the error "Input path does not exist: file:/C:/Users/cmpil/Downloads/hunger_games.txt". Why is this happening?
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WordCountDataSet {
  case class Book(value: String)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Another way of doing it
    val bookRDD = spark.sparkContext.textFile("C:/Users/cmpil/Downloads/hunger_games.txt")
    val wordsRDD = bookRDD.flatMap(x => x.split("\\W+"))
    val wordsDS = wordsRDD.toDS()

    val lowercaseWordsDS = wordsDS.select(lower($"value").alias("word"))
    val wordCountsDS = lowercaseWordsDS.groupBy("word").count()
    val wordCountsSortedDS = wordCountsDS.sort("count")
    wordCountsSortedDS.show(wordCountsSortedDS.count().toInt)
  }
}
On Windows you have to use '\\' in place of '/'.
Try using "C:\\Users\\cmpil\\Downloads\\hunger_games.txt"
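A minimal sketch of that change against the code above (only the path literal differs):
// escaped backslashes in the Windows path
val bookRDD = spark.sparkContext.textFile("C:\\Users\\cmpil\\Downloads\\hunger_games.txt")
println(bookRDD.count())   // count() forces the read, so a wrong path fails right here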

How to check if the batches are empty in Spark streaming (wordcount with socketTextStream)

I am working on a simple Spark Streaming word count example, counting the number of words in text data received from a data server listening on a TCP socket. I would like to check whether the batch from the streaming source is empty or not before I save the content of every transformation to text files. Currently, I am using the Spark shell.
I have tried this code, and it works fine without checking whether the batch is empty:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.saveAsTextFiles("/stream_test/testLine.txt")
val words = lines.flatMap(_.split(" "))
words.saveAsTextFiles("/stream_test/testWords.txt")
val pairs = words.map((_, 1))
pairs.saveAsTextFiles("/stream_test/testPairs.txt")
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
wordCounts.print()
ssc.start()
I have tried to use foreachRDD, but it gives me the error: value saveAsTextFiles is not a member of org.apache.spark.rdd.RDD[String]
This is my code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}

Logger.getRootLogger.setLevel(Level.WARN)

val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)

lines.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    lines.saveAsTextFiles("/stream_test/testLine.txt")
    val words = lines.flatMap(_.split(" "))
    words.saveAsTextFiles("/stream_test/testWords.txt")
    val pairs = words.map((_, 1))
    pairs.saveAsTextFiles("/stream_test/testPairs.txt")
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
    wordCounts.print()
  }
})
ssc.start()
I need to check whether the batch from the streaming source is empty or not before I save the content to text files. I appreciate your help.
I used to do it with the following code: loop over each RDD in the stream, then use rdd.count() to judge whether the RDD is empty. If all the RDDs are empty, nothing happens. Hope it can help you.
kafkaStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    // do something
  }
}
You can try the code snippet below to check whether your streaming batches are empty or not:
if (!rdd.partitions.isEmpty)
  rdd.saveAsTextFile(outputDir)
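Combining both answers, a rough sketch of the word count from the question might look like the following. It works on the rdd handed to foreachRDD and uses RDD.saveAsTextFile (the plural saveAsTextFiles only exists on DStreams, which is what the reported error points at), and it appends the batch time so each batch writes to its own directory. The output paths are the same placeholders as in the question:
lines.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {                        // rdd.count() > 0 or rdd.partitions.isEmpty, as above, work too
    val suffix = time.milliseconds             // one output directory per batch
    rdd.saveAsTextFile(s"/stream_test/testLine-$suffix")
    val words = rdd.flatMap(_.split(" "))
    words.saveAsTextFile(s"/stream_test/testWords-$suffix")
    val wordCounts = words.map((_, 1)).reduceByKey(_ + _)
    wordCounts.saveAsTextFile(s"/stream_test/testWordsCounts-$suffix")
    wordCounts.collect().foreach(println)
  }
}
ssc.start()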

Read the data from HDFS using Scala

I am new to Scala. How can I read a file from HDFS using Scala (not using Spark)?
When I googled it, I only found examples of writing to HDFS.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

/**
 * @author ${user.name}
 */
object App {
  //def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)

  def main(args: Array[String]) {
    println("Trying to write to HDFS...")
    val conf = new Configuration()
    //conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
    conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
    val fs = FileSystem.get(conf)

    val output = fs.create(new Path("/tmp/mySample.txt"))
    val writer = new PrintWriter(output)
    try {
      writer.write("this is a test")
      writer.write("\n")
    }
    finally {
      writer.close()
      println("Closed!")
    }
    println("Done!")
  }
}
Please help me. How can I read or load a file from HDFS using Scala?
One of the ways (somewhat functional in style) could be like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.collection.immutable.Stream
val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine))
// This example checks each line for null and prints every existing line sequentially
readLines.takeWhile(_ != null).foreach(line => println(line))
Also, you could take a look at this article, or here and here; these questions look related to yours and contain working (but more Java-like) code examples, if you're interested.
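If you want to avoid the deprecated readLine call, a variant of the same idea (the URL and path are placeholders, as above) is to wrap the HDFS input stream in scala.io.Source:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.io.Source

val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val in = hdfs.open(new Path("/path/to/file"))   // FSDataInputStream is a plain InputStream
try {
  Source.fromInputStream(in).getLines().foreach(println)
} finally {
  in.close()
}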

Display the content of clusters after clustering in the streaming-k-means.scala example in Spark

I want to run the streaming k-means example Scala source code (MLlib) on Spark. Can someone tell me how I can display the content of the clusters after clustering? For example, if I cluster the data into 3 clusters, how can I write the content of the 3 clusters into 3 files and the content of the centers into a file.txt?
package org.apache.spark.examples.mllib

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansExample {
  def main(args: Array[String]) {
    if (args.length != 5) {
      System.err.println(
        "Usage: StreamingKMeansExample " +
          "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
      System.exit(1)
    }

    val conf = new SparkConf().setMaster("localhost").setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

    val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(args(3).toInt)
      .setDecayFactor(1.0)
      .setRandomCenters(args(4).toInt, 0.0)

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
You would have to use the predict method on your RDD (look here for reference).
Then you could zip the RDD containing your values with the RDD of the predicted clusters they fall into.
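As a rough sketch of that idea, reusing model, testData and ssc from the question and assuming 3 clusters: keep each point's feature vector as the key, so that predictOnValues returns (vector, clusterId) pairs, then filter and save each batch per cluster (this would go before ssc.start(); the "clusters/" output prefix is just a placeholder):
val assignments = model.predictOnValues(testData.map(lp => (lp.features, lp.features)))

assignments.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    val k = 3                                  // assumed number of clusters (args(3) in the example)
    (0 until k).foreach { clusterId =>
      rdd.filter { case (_, id) => id == clusterId }
        .keys                                  // the feature vectors assigned to this cluster
        .saveAsTextFile(s"clusters/cluster-$clusterId-${time.milliseconds}")
    }
    // the current centers live on the model itself and could be written out the same way
    println(model.latestModel().clusterCenters.mkString("\n"))
  }
}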

Unable to convert Spark RDD to Schema RDD

I am trying to execute the example provided in the Spark programming guide:
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html
But I am facing the compilation error.
(I am a Scala newbie)
Below is my code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql._
import org.apache.spark.sql

object Temp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("SPARK SQL example")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    import sqlContext.createSchemaRDD

    case class Person(name: String, age: Int)

    val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.registerTempTable("people")

    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  }
}
I am facing the compilation error No TypeTag available for Person at the line people.registerTempTable("people").
How to resolve this error?
It is failing because the Person class is defined inside the function, and as such the Scala compiler will not create a TypeTag for the class. As Paul suggested, you can move it out of the function to the top level.
I'll add that there is a JIRA to relax this restriction: https://issues.apache.org/jira/browse/SPARK-4842
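For completeness, a minimal sketch of that fix against the code in the question: the only change is that Person is defined at the top level instead of inside main, so the compiler can generate a TypeTag for it.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql._

// defined at the top level, outside of main
case class Person(name: String, age: Int)

object Temp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("SPARK SQL example")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerTempTable("people")

    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  }
}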