How do you deserialize an immutable collection using Kryo? - scala

How do you deserialize an immutable collection using Kryo? Do I need to register something in addition to what I've done?
Here is my sample code:
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import com.esotericsoftware.kryo.io.Output
import com.romix.scala.serialization.kryo._
val kryo = new Kryo
// Serialization of Scala maps like Trees, etc
kryo.addDefaultSerializer(classOf[scala.collection.Map[_,_]], classOf[ScalaMapSerializer])
kryo.addDefaultSerializer(classOf[scala.collection.generic.MapFactory[scala.collection.Map]], classOf[ScalaMapSerializer])
// Serialization of Scala sets
kryo.addDefaultSerializer(classOf[scala.collection.Set[_]], classOf[ScalaSetSerializer])
kryo.addDefaultSerializer(classOf[scala.collection.generic.SetFactory[scala.collection.Set]], classOf[ScalaSetSerializer])
// Serialization of all Traversable Scala collections like Lists, Vectors, etc
kryo.addDefaultSerializer(classOf[scala.collection.Traversable[_]], classOf[ScalaCollectionSerializer])
val filename = "c:\\aaa.bin"
val ofile = new FileOutputStream(filename)
val output2 = new BufferedOutputStream(ofile)
val output = new Output(output2)
kryo.writeClassAndObject(output, List("Test1", "Test2"))
output.close()
val ifile = new FileInputStream(filename)
val input = new Input(new BufferedInputStream(ifile))
val deserialized = kryo.readClassAndObject(input)
input.close()
It throws an exception:
Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon

Try this out as it worked for me:
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import com.esotericsoftware.kryo.io.Output
import com.romix.scala.serialization.kryo._
import org.objenesis.strategy.StdInstantiatorStrategy
val kryo = new Kryo
kryo.setRegistrationRequired(false)
kryo.setInstantiatorStrategy(new StdInstantiatorStrategy())
kryo.register(classOf[scala.collection.immutable.$colon$colon[_]],60)
kryo.register(classOf[scala.collection.immutable.Nil$],61)
kryo.addDefaultSerializer(classOf[scala.Enumeration#Value], classOf[EnumerationSerializer])
val filename = "c:\\aaa.bin"
val ofile = new FileOutputStream(filename)
val output2 = new BufferedOutputStream(ofile)
val output = new Output(output2)
kryo.writeClassAndObject(output, List("Test1", "Test2"))
output.close()
val ifile = new FileInputStream(filename)
val input = new Input(new BufferedInputStream(ifile))
val deserialized = kryo.readClassAndObject(input)
input.close()
As an FYI, I got this working by looking at the unit tests for the romix library and then doing what they were doing.
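A small usage note (a sketch, not part of the original answer): readClassAndObject returns an AnyRef, so the deserialized value still needs a cast back to the expected collection type.
// readClassAndObject returns AnyRef; cast to recover the original list.
val list = deserialized.asInstanceOf[List[String]]
println(list) // expected: List(Test1, Test2)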

Related

Getting a Task not Serializable error when trying to create a Decision Tree using Spark with Scala while executing on my local machine

I am trying to create a Fraud Transaction Detector using Spark with Scala. My code works fine with normal Spark logic. However, when I try the solution using the decision tree approach, I get a Task not Serializable error. From my understanding, I get this error when I try fitting my training data into the pipeline. I tried a few solutions, like extending Serializable and converting my indexed data back to strings, but none of them worked.
Can someone please help me understand what I am doing wrong?
Below is my code:
package com.vinspark.frauddetection
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}
object FraudDet_ML extends Serializable{
def main(args:Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val spark=SparkSession
.builder()
.appName("FraudDetection")
.master("local[*]")
.config("spark.sql.warehouse.dir","file:///C:/temp")
.getOrCreate()
import spark.sqlContext.implicits._
var df=spark.read.format("csv").option("header", "true").option("mode",
"DROPMALFORMED").option("inferSchema", "true").load("../PS_20174392719_1491204439457_log.csv")
df= df.withColumn("orgDiff", col("newbalanceOrig") -
col("oldbalanceOrg")).withColumn("destDiff",
col("newbalanceDest") - col("oldbalanceDest"))
df= df.withColumn("label",
when((col("oldbalanceOrg") <=56900 && col("type")=="TRANSFER" && col("newbalanceDest") <= 105)
||(col("oldbalanceOrg") >56900 && col("newbalanceOrig")<=12)
||(col("oldbalanceOrg") >56900 && col("newbalanceOrig")>12 && col("amount")>1160000),
1)
.otherwise(0)
)
df.createOrReplaceTempView("transaction")
val indexer = new StringIndexer().setInputCol("type").setOutputCol("typeIndexed")
println("indexer created")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel").fit(df)
val splitDataSet: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
=df.randomSplit(Array(0.8, 0.2), 12345L)
val trainDataFrame = splitDataSet(0)
val testDataFrame = splitDataSet(1)
val train = splitDataSet(0)
val test = splitDataSet(1)
println("train: "+train.count()+" test: "+test.count())
val va = new VectorAssembler().setInputCols(Array("typeIndexed", "amount", "oldbalanceOrg",
"newbalanceOrig", "oldbalanceDest", "newbalanceDest", "orgDiff",
"destDiff")).setOutputCol("features")
println("vector assembler created")
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features").setSeed(54321).setMaxDepth(5)
println("decision tree created")
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(indexer,labelIndexer, va, dt))
println("pipeline created")
val pipeConst=pipeline
val model = pipeConst.fit(train)
val model1 = model
println("model created")
val prediction=model1.transform(test)
println("indexer created")
prediction.collect().foreach(println)
}
}

How to use feature extraction with DStream in Apache Spark

I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for the arrival of all the data (it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks - it doesn't matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
val spark: SparkSession = SparkSession.builder.getOrCreate
val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
streamWithFeatures.print()
}
def extractFeatures(spark: SparkSession)
(rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
val df = spark.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
val rawFeatures = hashingTF.transform(df)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(rawFeatures)
val rescaledData = idfModel.transform(rawFeatures)
import spark.implicits._
rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I get java.lang.IllegalStateException: Haven't seen any document yet. - I am not surprised, as I am just trying to scrape things together, and I understand that since I am not waiting for the arrival of some data, the generated model might be empty when I try to use it on the data.
What would be the right approach to this problem?
I used the advice from the comments and split the procedure into two runs:
one that calculates the IDF model and saves it to a file
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
val session: SparkSession = SparkSession.builder.getOrCreate
val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedDf)
idfModel.write.save(idfModelFile.getAbsolutePath)
}
and one that reads the IDF model from the file and simply runs it on all incoming data
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")

How to write Iterator[String] result from mapPartitions into one file?

I am new to Spark and Scala, which is why I am having quite a hard time getting through this.
What I intend to do is pre-process my data with Stanford CoreNLP using Spark. I understand that I have to use mapPartitions in order to have one StanfordCoreNLP instance per partition, as suggested in this thread. However, I lack the knowledge/understanding of how to proceed from here.
In the end I want to train word vectors on this data, but for now I would be happy to find out how I can get my processed data from here and write it into another file.
This is what I got so far:
import java.util.Properties
import com.google.gson.Gson
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.util.CoreMap
import masterthesis.code.wordvectors.Review
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._
object ReviewPreprocessing {
def main(args: Array[String]) {
// The original snippet uses sc without showing its creation; assuming a plain SparkContext here.
val conf = new SparkConf().setAppName("ReviewPreprocessing")
val sc = new SparkContext(conf)
val resourceUrl = getClass.getResource("amazon-reviews/reviews_Electronics.json")
val file = sc.textFile(resourceUrl.getPath)
val linesPerPartition = file.mapPartitions( lineIterator => {
val props = new Properties()
props.put("annotators", "tokenize, ssplit, pos, lemma")
// use a mutable buffer; an immutable List has no add method
val sentencesAsTextList = scala.collection.mutable.ListBuffer[String]()
val pipeline = new StanfordCoreNLP(props)
val gson = new Gson()
while(lineIterator.hasNext) {
val line = lineIterator.next
val review = gson.fromJson(line, classOf[Review])
val doc = new Annotation(review.getReviewText)
pipeline.annotate(doc)
val sentences : java.util.List[CoreMap] = doc.get(classOf[SentencesAnnotation])
val sb = new StringBuilder();
sentences.foreach( sentence => {
val tokens = sentence.get(classOf[TokensAnnotation])
tokens.foreach( token => {
sb.append(token.get(classOf[LemmaAnnotation]))
sb.append(" ")
})
})
sb.setLength(sb.length - 1)
sentencesAsTextList += sb.toString
}
sentencesAsTextList.iterator
})
System.exit(0)
}
}
How would I, for example, write this result into one single file? The ordering does not matter here - I guess it is lost at this point anyway.
If you use saveAsTextFile right on your RDD, you will end up with as many output files as you have partitions. In order to have just one, you can either coalesce everything into one partition, like
sc.textFile("/path/to/file")
.mapPartitions(someFunc())
.coalesce(1)
.saveAsTextFile("/path/to/another/file")
Or (just for fun) you could fetch all partitions to the driver one by one and save all the data yourself:
val it = sc.textFile("/path/to/file")
.mapPartitions(someFunc())
.toLocalIterator
while(it.hasNext) {
writeToFile(it.next())
}
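writeToFile is left undefined above; one possible sketch (hypothetical, writing each record as a line through a single java.io.PrintWriter kept open on the driver) could be:
import java.io.{FileWriter, PrintWriter}

// A single writer on the driver, reused for every record.
val writer = new PrintWriter(new FileWriter("/path/to/another/file"))
def writeToFile(line: String): Unit = writer.println(line)

val it = sc.textFile("/path/to/file")
  .mapPartitions(someFunc())
  .toLocalIterator

while (it.hasNext) {
  writeToFile(it.next())
}
writer.close()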

Can I convert an incoming stream of data into an array?

I'm trying to learn about streaming data and how to manipulate it, using the telecom churn dataset provided here. I've written a method to calculate this in batch:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object batchChurn{
def main(args: Array[String]): Unit = {
//setting spark context
val conf = new SparkConf().setAppName("churn")
val sc = new SparkContext(conf)
//loading and mapping data into RDD
val csv = sc.textFile("file://filename.csv")
val data = csv.map {line =>
val parts = line.split(",").map(_.trim)
val stringvec = Array(parts(1)) ++ parts.slice(4,20)
val label = parts(20).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))
}
val splits = data.randomSplit(Array(0.7,0.3))
val (training, testing) = (splits(0),splits(1))
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 6
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 7
val maxBins = 32
val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val labelAndPreds = testing.map {point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
}
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the spark website, and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream into an array which I can input into my analysis code? Currently, when I try to convert it to an Array using val data = lines.flatMap(_.split(",")), it clearly says: error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs. You can get access to the RDDs using the foreachRDD function:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
Then each RDD can be converted to an array using the collect function.
This has already been shown here:
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval. Of course, you could collect them in an array:
import scala.collection.mutable.ArrayBuffer
val arr = new ArrayBuffer[String]()
data.foreachRDD {
arr ++= _.collect()
}
Also keep in mind that you could end up having way more data than you want in your driver, since a DStream can be huge.
To limit the data for your analysis, I would do it this way:
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
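To tie this back to the batch code in the question, a rough sketch (not from the original answers; parseLine is a hypothetical helper that mirrors the CSV parsing in batchChurn, and model is the RandomForest model trained in batch) of scoring each micro-batch inside foreachRDD instead of collecting raw strings to the driver:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper: same CSV-to-LabeledPoint parsing as in batchChurn.
def parseLine(line: String): LabeledPoint = {
  val parts = line.split(",").map(_.trim)
  val stringvec = Array(parts(1)) ++ parts.slice(4, 20)
  LabeledPoint(parts(20).toDouble, Vectors.dense(stringvec.map(_.toDouble)))
}

// Score every micro-batch with the already trained model.
lines.foreachRDD { rdd =>
  val labelAndPreds = rdd.map(parseLine).map(p => (p.label, model.predict(p.features)))
  labelAndPreds.take(10).foreach(println)
}
ssc.start()
ssc.awaitTermination()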
You cannot put all the elements of a DStream in an array because those elements will keep being read over the wire, and your array would have to be indefinitely extensible.
The adaptation of this decision tree model to a streaming mode, where training and testing data arrive continuously, is not trivial for algorithmic reasons: while the answers mentioning collect are technically correct, they are not the appropriate solution to what you're trying to do.
If you want to run decision trees on a Stream in Spark, you may want to look at Hoeffding trees.

Wiki xml parser - org.apache.spark.SparkException: Task not serializable

I am a newbie to both Scala and Spark, and am trying some of the tutorials; this one is from Advanced Analytics with Spark. The following code is supposed to work:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/petr/Downloads/wiki/wiki"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
val page = new EnglishWikipediaPage()
WikipediaPage.readPage(page, xml)
if (page.isEmpty) None
else Some((page.getTitle, page.getContent))
}
val plainText = rawXmls.flatMap(wikiXmlToPlainText)
But it gives
scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:295)
...
I am running Spark v1.3.0 locally (and I have loaded only about 21MB of the wiki articles, just to test it).
None of https://stackoverflow.com/search?q=org.apache.spark.SparkException%3A+Task+not+serializable gave me any clue...
Thanks.
try
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/terrapin/Downloads/enwiki-20150304-pages-articles1.xml-p000000010p000010000"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
val plainText = rawXmls.flatMap{line =>
val page = new EnglishWikipediaPage()
WikipediaPage.readPage(page, line)
if (page.isEmpty) None
else Some((page.getTitle, page.getContent))
}
The first guess that comes to mind is that all your code is wrapped in the object where the SparkContext is defined. Spark tries to serialize this object in order to transfer the wikiXmlToPlainText function to the nodes. Try to create a different object containing only the wikiXmlToPlainText function.
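A minimal sketch of that suggestion (the object name is made up): move the function into its own small object, so that serializing the closure does not drag in the object holding the SparkContext:
import edu.umd.cloud9.collection.wikipedia.WikipediaPage
import edu.umd.cloud9.collection.wikipedia.language.EnglishWikipediaPage

// A standalone helper object: nothing here references the SparkContext.
object WikiText extends Serializable {
  def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
    val page = new EnglishWikipediaPage()
    WikipediaPage.readPage(page, xml)
    if (page.isEmpty) None
    else Some((page.getTitle, page.getContent))
  }
}

val plainText = rawXmls.flatMap(WikiText.wikiXmlToPlainText)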