Eclipse Autocomplete not suggesting the method in Spark/Scala

I am a newbie in Scala and am writing a word count program to find the number of occurrences of each unique word in a file using the Spark API. Find the code below:
val sc = new SparkContext(conf)
//Load Data from File
val input = sc.textFile(args(0))
//Split into words
val words = input.flatMap { line => line.split(" ") }
//Assign unit to each word
val units = words.map { word => (word, 1) }
//Reduce each word
val counts = units.reduceByKey { case (x, y) => x + y }
...
Although the application compiles successfully, the issue I have is that when I type units. in Eclipse, autocomplete does not suggest the method reduceByKey. For other methods autocomplete works perfectly. Is there any specific reason for this?

This is probably due to reduceByKey only being available via implicits. That method is not on RDD, but on PairRDDFunctions. I had thought that implicit autocompletion was working in Eclipse, but I would guess that to be your issue. You can verify by explicitly wrapping units in a PairRDDFunctions.
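For example (a sketch; the variable name pairUnits is just for illustration):

import org.apache.spark.rdd.PairRDDFunctions

// Wrap the pair RDD explicitly; if reduceByKey now shows up in completion,
// Eclipse is simply not resolving the implicit conversion when suggesting members.
val pairUnits = new PairRDDFunctions(units)
val counts = pairUnits.reduceByKey(_ + _)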

Related

Return type to assign to val for RDDs

I am playing around with Spark code to learn more about shuffling. I wrote the following code to see how stages are formed when there is an if-else statement. I have declared val result so that the result can be assigned to it later in the if statement, but I am not sure what type to give it.
Is there an abstract class that goes with all the RDDs?
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
val result: RDD // What is the correct return type here?
if (count == 1) {
  result = d.rightOuterJoin(x)
  result.collect()
}
d is an RDD[(Int, Int)].
Doing a reduceByKey gives the same type, just with the values reduced per key.
Doing a right outer join then gives you an RDD of (Int, (Option[Int], Int)), i.e. for each key the left and right values, with the left one wrapped in Option because it may be absent.
So doing a collect gives you an array of the same element type.
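Spelled out as an explicit annotation (a sketch that simply writes down the types above), the declaration the question was asking about would be:

import org.apache.spark.rdd.RDD

// rightOuterJoin of RDD[(Int, Int)] with RDD[(Int, Int)] yields a pair RDD
// whose left value is wrapped in Option.
val result: RDD[(Int, (Option[Int], Int))] = d.rightOuterJoin(x)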
The API documentation is not easy to follow for all these functions; there are a lot of generic types and a lot of implicits. I would recommend that you either use an IDE that hints the types for you, or use a tool that gives you a console where you can try snippets.
You can also avoid the assignment entirely (note that for reassignment it would have to be a var, not a val):
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
if (count == 1) {
  d.rightOuterJoin(x).collect()
}

How to save a value in a file using Scala

I am trying to save a value in a file, but keep getting an error.
I have tried:
.saveAsTextFile("/home/amel/timer")
REDUCER Function
val startReduce = System.currentTimeMillis()
val y = sc.textFile("/home/amel/10MB").filter(!_.contains("NULL")).filter(!_.contains("Null"))
val er = x.map(row => {
val cols = row.split(",")
(cols(1).split("-")(0) + "," + cols(2) + "," + cols(3), 1)
}).reduceByKey(_ + _).map(x => x._1 + "," + x._2)
er.collect.foreach(println)
val endReduce = System.currentTimeMillis()
val durationReduce = ((endReduce-startReduce)/1000).saveAsTextFile("home/amel/timer/")
The error I'm receiving is on this line:
val durationReduce = ((endReduce-startReduce)/1000).saveAsTextFile("home/amel/timer/")
it says: saveAsTextFile is not a member of Long
The output I want is a number
Long does not have a method named saveAsTextFile. If you want to write out a Long value there are many ways; a simple one is to use Java's PrintWriter:
import java.io.PrintWriter

val duration = (endReduce - startReduce) / 1000
new PrintWriter("/home/amel/timer/time") { write(duration.toString); close() }
If you still want to use Spark's RDD saveAsTextFile, then you can use:
sc.parallelize(Seq(duration)).saveAsTextFile("path")
But this does not make sense just to write a single value.
saveAsTextFile is a method on the class org.apache.spark.rdd.RDD (docs).
The expression (endReduce - startReduce) / 1000 is of type Long, so it does not have this method, hence the error you are seeing: "saveAsTextFile is not a member of Long".
This answer is applicable here: https://stackoverflow.com/a/32105659/8261
Basically the situation is that you have an Int and you want to write it to a file. Your first thought is to create a distributed collection across a cluster of machines, that only contains this Int and let those machines write the Int to a set of files in a distributed way.
I'd argue this is not the right approach. Do not use Spark for saving an Int into a file. Instead you can use a PrintWriter:
val out = new java.io.PrintWriter("filename.txt")
out.println(finalvalue)
out.close()

How can I construct a String with the contents of a given DataFrame in Scala

Suppose I have a DataFrame. How can I retrieve the contents of that DataFrame and represent them as a string?
Consider my attempt to do that with the example code below.
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
  println("x = ", x)
  sb.append(x)
})
println("sb = ", sb)
The output of the code shows the example dataframe has contents:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))
However, the final StringBuilder contains an empty string.
Any thoughts on how to retrieve a String for a given DataFrame in Scala?
Many thanks
UPD: as mentioned by @user8371915, the solution below will only work in a single JVM, i.e. in development (local) mode. In fact we can't modify broadcast variables as if they were globals. You can use accumulators, but that would be quite inefficient. You can also read an answer about read/write global variables here. Hope it will help you.
I think you should read the topic about shared variables in Spark. Link here.
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
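For reference, the accumulator route mentioned in the UPD could look roughly like this (a sketch, assuming Spark 2.x's CollectionAccumulator; it is not part of the original answer):

import scala.collection.JavaConverters._

// Each task adds its rows to the accumulator; the driver reads the merged result.
val rowsAcc = sc.collectionAccumulator[String]("rows")
df.foreach(x => rowsAcc.add(x.toString))
val joined = rowsAcc.value.asScala.mkString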
Let's have a look at broadcast variables. I edited your code:
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
  println("x = ", x)
  broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)
Here I used broadcastVar as a container for a StringBuilder variable sb.
Here is output:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))
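Given the UPD above, a cluster-safe alternative (a sketch, not part of the broadcast approach) is to collect the rows to the driver first and build the string there:

// Collect the (small) RDD to the driver, then append locally; no shared
// mutable state is involved, so this also works outside local mode.
val sbLocal = StringBuilder.newBuilder
df.collect().foreach(x => sbLocal.append(x))
println("sb = " + sbLocal.toString)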
Hope this helps.
Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?
Thanks everybody for the feedback; it helped me understand this slightly better.
The combination of responses results in the code below. The requirements have changed slightly in that I now represent my df as a list of JSONs. The code below does this, without the use of the broadcast.
import org.apache.spark.sql.DataFrame
import scala.util.parsing.json.JSONObject

class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
  val jsons = df.limit(limit).collect.map(rowToJson(_))

  def rowToJson(r: org.apache.spark.sql.Row): JSONObject = {
    try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
    catch { case t: Throwable =>
      JSONObject.apply(Map("Row with error" -> t.toString))
    }
  }
}
I use the class like this:
val jsons = new HandleDf(df, 100).jsons
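To get back to the original goal of a single String, the collected JSON objects can then be joined on the driver (a sketch; the bracket formatting is just one choice):

// Join the per-row JSON objects into one JSON-array-like string.
val asString: String = jsons.map(_.toString()).mkString("[", ",", "]")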

Scala: convert Seq[String] to String? (TF-IDF after lemmatization)

I am trying to learn Scala and specifically text mining (lemmatization, TF-IDF matrix, and LSA).
I have some texts I want to lemmatize and classify (LSA). I use Spark on Cloudera.
So I used the Stanford CoreNLP function:
def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
After that, I try to make a TF-IDF matrix, but here is my problem:
The Stanford function produces an RDD in RDD[Seq[String]] form, and I get an error.
I need to use an RDD in RDD[String] form (not the RDD[Seq[String]] form):
val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatized-text, stopWords, numTerms, sc)
Does someone know how to convert a Seq[String] to a String?
Or do I need to change one of my calls?
Thanks for the help.
Sorry if it's a dumb question and for my English.
Bye
I am not sure what this lemmatization thingy is, but as far as making a string out of a sequence goes, you can just do seq.mkString("\n") (or replace "\n" with whatever other separator you want), or just seq.mkString if you want it merged without any separator.
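If the lemmas actually sit inside an RDD, as the question suggests, the same idea applies element by element (a sketch; docs and lemmatizedText are hypothetical names for the documents RDD and its mapping through plainTextToLemmas):

import org.apache.spark.rdd.RDD

// Turn each document's Seq[String] of lemmas back into a single String.
val lemmatizedText: RDD[Seq[String]] = docs.map(doc => plainTextToLemmas(doc, stopWords))
val asStrings: RDD[String] = lemmatizedText.map(_.mkString(" "))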
Also, don't use mutable structures; it's bad taste in Scala:
// Flatten sentences to tokens, keep the qualifying lemmas, and join them into one String.
import scala.collection.JavaConverters._

val lemmas = sentences.asScala
  .flatMap(_.get(classOf[TokensAnnotation]).asScala)
  .map(_.get(classOf[LemmaAnnotation]))
  .filter(lemma => lemma.length > 2 && !stopWords.contains(lemma))
  .map(_.toLowerCase)
  .mkString(" ")

scala.MatchError: null on spark RDDs

I am relatively new to both Spark and Scala.
I was trying to implement collaborative filtering using Scala on Spark.
Below is the code:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt,item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I going wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
(user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(keywords.map(_))
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.
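For completeness, a small usage sketch (hypothetical user id 42 and keyword id 7, assuming the predictions RDD built above):

// Look up the per-user map on the driver and read one predicted rating.
val ratingsForUser: Map[Int, Double] = predictions.lookup(42).head
val predictedRate: Double = ratingsForUser(7)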