getOrElse method not being found in Scala Spark - scala

I'm attempting to follow an example in Sandy Ryza's book Advanced Analytics with Spark, coding in IntelliJ. Below I seem to have imported all the right libraries, but why is it not recognizing getOrElse?
Error:(84, 28) value getOrElse is not a member of org.apache.spark.rdd.RDD[String]
bArtistAlias.value.getOrElse(artistID, artistID)
^
Code:
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation._
val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()

I can only make an assumption, as the code listed is missing pieces, but my guess is that bArtistAlias is supposed to be a Map that SHOULD be broadcast, but isn't.
I went and found the piece of code in Sandy's book, and it corroborates my guess. So you seem to be missing this piece:
val bArtistAlias = sc.broadcast(artistAlias)
Without the rest of the code I am not sure exactly what you did, but it looks like you broadcast an RDD[String], hence the error. That would not work anyway, as you cannot use one RDD inside another RDD.
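For completeness, here is a minimal sketch of the intended pattern, assuming artistAlias is an ordinary driver-side Map[Int, Int] (for example, collected from the alias file with collectAsMap()) rather than an RDD; the literal map below is only placeholder data:
// artistAlias must be a plain Map on the driver, not an RDD
val artistAlias: Map[Int, Int] = Map(2 -> 1) // placeholder alias data
val bArtistAlias = sc.broadcast(artistAlias)

val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  // bArtistAlias.value is now a Map, so getOrElse resolves
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()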

Related

reading second word of every line using spark scala

I want to read/print the second word of every line.
input->>people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
output->>
are
they
are
they
Please check this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val myDF = spark.read.text("<path>")
val rdd = myDF.rdd.map(_.mkString("")).map(f => Row(f.split(" ")(1)))
val schema: StructType = (new StructType).add("values", StringType)
val result = spark.createDataFrame(rdd, schema)
result.show()
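If you prefer to stay in the DataFrame API instead of dropping down to the RDD, a shorter sketch (assuming every line has at least two space-separated words) would be:
import org.apache.spark.sql.functions.{col, split}

// "value" is the default column name produced by spark.read.text
val secondWords = myDF.select(split(col("value"), " ").getItem(1).as("values"))
secondWords.show()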

Change Java List to Scala Seq?

I have the following list from my configuration:
val markets = Configuration.getStringList("markets");
To create a sequence out of it I write this code:
JavaConverters.asScalaIteratorConverter(markets.iterator()).asScala.toSeq
I wish I could do it in a less verbose way, such as:
markets.toSeq
And then from that list I get the sequence. I will have more configuration in the near future; is there a solution that provides this kind of simplicity?
I want a sequence regardless of the configuration library I am using. I don't want to have the stated verbose solution with the JavaConverters.
JavaConversions is deprecated since Scala 2.12.0. Use JavaConverters; you can import scala.collection.JavaConverters._ to make it less verbose:
import scala.collection.JavaConverters._
val javaList = java.util.Arrays.asList("one", "two")
val scalaSeq = javaList.asScala.toSeq
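If you are on Scala 2.13 or later, JavaConverters itself is deprecated in favour of scala.jdk.CollectionConverters, but the usage is the same:
import scala.jdk.CollectionConverters._

val javaList = java.util.Arrays.asList("one", "two")
val scalaSeq = javaList.asScala.toSeq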
Yes. Just import implicit conversions:
import java.util
import scala.collection.JavaConversions._
val jlist = new util.ArrayList[String]()
jlist.toSeq

Converting RDD to DataFrame scala - NoSuchMethodError

I am trying to convert an RDD to a DataFrame in Scala as follows:
val posts = spark.textFile("~/allPosts/part-02064.xml.gz")
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
import sqlContext.implicits._
posts.map(identity).toDF()
When I do this I get the following error.
java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext$implicits$.stringRddToDataFrameHolder(Lorg/apache/spark/rdd/RDD;)Lorg/apache/spark/sql/DataFrameHolder;
I can't for the life of me figure out what I'm doing wrong.
You need to define a schema to convert an RDD to a DataFrame, either via the reflection method (a case class) or programmatically.
One very important point about DataFrames: a DataFrame is an RDD with a schema. In your case, define a case class and map the values of the file to that class. Hope it helps.
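A minimal sketch of the reflection approach described above, where Post and its single field are hypothetical stand-ins for whatever you actually parse out of each XML line:
case class Post(body: String)

import sqlContext.implicits._
// wrapping each line in a case class lets Spark infer the schema
val postsDF = posts.map(line => Post(line)).toDF()
postsDF.printSchema()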

One simple spark program in scala : println out all the element in the RDD

I wrote one simple Spark program in Eclipse, and I want to println all the elements in the RDD:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local");
    val sc = new SparkContext(conf);
    val data = sc.parallelize(List(1, 2, 3, 4, 5));
    data.collect().foreach(println);
    sc.stop();
  }
}
And the result is like this:
<console>:16: error: not found: value sc
val data = sc.parallelize(List(1,2,3,4,5));
I searched and tried more than three solutions but still cannot solve this. Can anyone help me with this? Thanks a lot!
I don't know the exact cause of what is troubling you since you don't mention how you set everything up, but you said that you can run it in spark-shell on Linux, so it's not about the code. It's most likely about the config and setup.
Perhaps my short guide can help you. It's minimalistic, but it's all I had to do in order to get the Spark "hello world" to run in Eclipse.
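For reference, a minimal standalone version that runs as a plain Scala application (the <console> prefix in your error suggests the code was pasted into the REPL instead) looks roughly like this; note that SparkConf also needs an app name at runtime:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // both a master and an app name are required on the SparkConf
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(List(1, 2, 3, 4, 5))
    data.collect().foreach(println)
    sc.stop()
  }
}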

reduceByKey method not being found in IntelliJ

Here is the code I'm trying out for reduceByKey:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.math.random
import org.apache.spark._
import org.apache.spark.storage.StorageLevel
object MapReduce {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "")
    val file = sc.textFile("c:/data-files/myfile.txt")
    val counts = file.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
  }
}
It gives the compiler error: "cannot resolve symbol reduceByKey".
When I hover over the implementation of reduceByKey it shows three possible implementations, so it appears it is being found?
You need to add the following import to your file:
import org.apache.spark.SparkContext._
Spark documentation:
"In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in
tuples in the language, created by simply writing (a, b)), as long as you import org.apache.spark.SparkContext._ in your program to enable Spark’s implicit conversions. The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples if you import the conversions."
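For illustration, here is the same operation with the wrapper class named explicitly, which makes it clear where reduceByKey actually lives (assuming sc from the snippet above):
import org.apache.spark.rdd.PairRDDFunctions

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
// constructing the wrapper by hand instead of relying on the implicit conversion
val counts = new PairRDDFunctions(pairs).reduceByKey(_ + _)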
It seems as if the documented behavior has changed in Spark 1.4.x. To have IntelliJ recognize the implicit conversions you now have to add the following import:
import org.apache.spark.rdd.RDD._
I have noticed that at times IntelliJ is unable to resolve methods that are provided implicitly via PairRDDFunctions (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala).
The methods provided this way include the reduceByKey* and reduceByKeyAndWindow* methods. I do not have a general solution at this time, except that yes, you can safely ignore the IntelliSense errors.