rdd.map() does not call the specified function - scala

I have a dataset of 3 items. I call a function on each item using map() but the function is never called.
object MyProgram {
  val events = Seq("A", "B", "C")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("MyApp")
      .config("spark.master", "local")
      .getOrCreate()
    import spark.implicits._

    val eventsDS = events.toDS()
    System.out.println("Before")
    val tempDS = eventsDS.rdd.map(x => doSomething(x))
    System.out.println("After")
  }

  def doSomething(event: String): Unit = {
    System.out.println("Do Something!")
  }
}
Output:
Before
After

map is lazily evaluated; you need to call an action, such as foreach, to actually perform the computation:
eventsDS.foreach(doSomething _)
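If you want to keep the map and still trigger it, calling any action on the resulting RDD also forces evaluation. A minimal sketch (count and collect are standard RDD actions; with a local master the println output shows up in the same console):
val tempDS = eventsDS.rdd.map(x => doSomething(x))
tempDS.count()   // action: forces the map to run, so "Do Something!" is printed
tempDS.collect() // another action; materializes the (Unit) results on the driver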


Cannot resolve symbol mapValue

I am getting this error; where am I going wrong? Please help, as I am new to Spark. How do I use mapValues on an RDD?
package com.udemyexamples

import org.apache.spark.sql.SparkSession

object AverageFriendByAge {

  def parseFile(line: String): Unit = {
    val field = line.split(",")
    val age = field(2).toInt
    val friend = field(3).toInt
    (age, friend)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("AverageFriendAge")
      .getOrCreate()
    val sc = spark.sparkContext
      .textFile("C:\\SparkScala\\SparkScalaStudy\\fakefriend.csv")
    val rdd = sc.map(parseFile)
    val y = rdd.mapValues(x => (x, 1))
  }
}
You first need an instance of SparkSession; your code should look like:
val spark = SparkSession
  .builder()
  .appName("dataFrameExample")
  .master("local")
  .getOrCreate()
import spark.implicits._
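As for mapValues itself: it is only defined on RDDs of key/value pairs, so parseFile has to return a tuple rather than Unit before rdd.mapValues will resolve. A minimal sketch of what that could look like; the (age, friends) pairing and the averaging step are assumptions based on the object name:
def parseFile(line: String): (Int, Int) = {
  val field = line.split(",")
  (field(2).toInt, field(3).toInt) // (age, numFriends), column positions taken from the original code
}

val lines = spark.sparkContext.textFile("C:\\SparkScala\\SparkScalaStudy\\fakefriend.csv")
val pairs = lines.map(parseFile)         // RDD[(Int, Int)] -- a pair RDD, so mapValues is available
val y     = pairs.mapValues(x => (x, 1)) // RDD[(Int, (Int, Int))]
val avg   = y.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
             .mapValues { case (total, count) => total.toDouble / count } // average friends per age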

Map Partition Iterator return

Can anyone help with returning an Iterator from the listWords() method so that it is accepted by mapPartitions?
import java.util

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapPartitionExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val input: RDD[String] = sc.parallelize(List("ABC", "DEF", "GHU", "YHG"))
    val x = input.mapPartitions(word => listWords(word))
  }

  def listWords(words: Iterator[String]): util.Iterator[String] = {
    val arrList = new util.ArrayList[String]()
    while (words.hasNext) {
      arrList.add(words.next())
    }
    return arrList.iterator()
  }
}
The return type of the function used in mapPartitions should be scala.collection.Iterator, not java.util.Iterator. I don't see much point in your current code, but you can use Scala mutable collections:
import scala.collection.mutable.ArrayBuffer

def listWords(words: Iterator[String]): Iterator[String] = {
  val arr = ArrayBuffer[String]()
  while (words.hasNext) {
    arr += words.next()
  }
  arr.toIterator
}
Personally I'd just map:
def listWords(words: Iterator[String]): Iterator[String] = {
  // Some init code
  words.map(someFunction)
}
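Either way, once listWords returns a scala.collection.Iterator, the mapPartitions call from the question compiles as-is. A small usage sketch (collect is a standard RDD action; with the pass-through version above it simply prints the four input strings):
val x = input.mapPartitions(word => listWords(word))
x.collect().foreach(println) // ABC, DEF, GHU, YHG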
Iterable[NotInferU] is expected, but you are returning java.util.Iterator[String]. You need to convert the java.util.Iterator to a Scala Iterator by importing scala.collection.JavaConversions._, as below:
def listWords(words: Iterator[String]): Iterator[String] = {
  val arrList = new util.ArrayList[String]()
  while (words.hasNext) {
    arrList.add(words.next())
  }
  import scala.collection.JavaConversions._
  return arrList.toList.iterator
}
The rest of the code stays as it is.
I hope the answer is helpful.
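A side note: on newer Scala versions scala.collection.JavaConversions is deprecated; the explicit scala.collection.JavaConverters conversions achieve the same thing. A minimal sketch of that variant:
import java.util
import scala.collection.JavaConverters._

def listWords(words: Iterator[String]): Iterator[String] = {
  val arrList = new util.ArrayList[String]()
  while (words.hasNext) {
    arrList.add(words.next())
  }
  arrList.iterator().asScala // explicit java.util.Iterator -> scala Iterator conversion
}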

Implicits in a Spark Scala program not working

I am not able to perform an implicit conversion from an RDD to a DataFrame in a Scala program, although I am importing spark.implicits._.
Any help would be appreciated.
Main Program with the implicits:
object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = spark.sparkContext
    val data = sc.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    case class CC1(LAT: Double, LONG: Double)
    val allData = allSplit.map(p => CC1(p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}
Error is as follows:
Error:(40, 25) value toDF is not a member of org.apache.spark.rdd.RDD[CC1]
val allDF = allData.toDF()
When you define the case class CC1 inside the main method, you hit https://issues.scala-lang.org/browse/SI-6649; toDF() then fails to locate the appropriate implicit TypeTag for that class at compile time.
You can see this in this simple example:
import scala.reflect.runtime.universe.TypeTag

case class Out()

object TestImplicits {
  def main(args: Array[String]) {
    case class In()
    val typeTagOut = implicitly[TypeTag[Out]] // compiles
    val typeTagIn = implicitly[TypeTag[In]]   // does not compile: Error:(23, 31) No TypeTag available for In
  }
}
Spark's relevant implicit conversion has this type parameter: [T <: Product : TypeTag] (see newProductEncoder), which means an implicit TypeTag[CC1] is required.
To fix this, simply move the definition of CC1 out of the method, or out of the object entirely:
case class CC1(LAT: Double, LONG: Double)

object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val data = spark.sparkContext.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    val allData = allSplit.map(p => CC1(p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}
I thought toDF is in sqlContext.implicits._, so you need to import that rather than spark.implicits._. At least that is the case in Spark 1.6.
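For reference, the Spark 1.6 style this comment refers to would look roughly like the sketch below (the SparkConf/SparkContext setup is only there to make it self-contained; allData is the RDD[CC1] from above):
val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("My App"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // in Spark 1.6, toDF comes from the SQLContext implicits

val allDF = allData.toDF() // still requires CC1 to be defined outside the method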

Transform data using scala in spark

I am trying to transform the input text file into a key/value RDD, but the code below doesn't work. (The text file is a tab-separated file.) I am really new to Scala and Spark, so I would really appreciate your help.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object shortTwitter {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile(args(1).txt).getLines()) {
      val newLine = line.map(line =>
        val p = line.split("\t")
        (p(0).toString, p(1).toInt)
      )
    }
    val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text = sc.textFile(args(0))
    val counts = text.flatMap(line => line.split("\t"))
  }
}
I'm assuming you want the resulting RDD to have the type RDD[(String, Int)], so:
You should use map (which transforms each record into a single new record) and not flatMap (which transforms each record into multiple records).
You should map the result of the split into a tuple.
Altogether:
val counts = text
  .map(line => line.split("\t"))
  .map(arr => (arr(0), arr(1).toInt))
EDIT per clarification in the comment: if you're also interested in fixing the non-Spark part (which reads the file sequentially), you had some errors in the for-comprehension syntax; here's the entire thing:
def main(args: Array[String]): Unit = {
  import org.apache.spark.rdd.RDD // needed for the RDD[(String, Int)] annotation below

  // read the file without Spark (not necessary when using Spark):
  val countsWithoutSpark: Iterator[(String, Int)] = for {
    line <- Source.fromFile(args(1)).getLines()
  } yield {
    val p = line.split("\t")
    (p(0), p(1).toInt)
  }

  // equivalent code using Spark:
  val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val counts: RDD[(String, Int)] = sc.textFile(args(0))
    .map(line => line.split("\t"))
    .map(arr => (arr(0), arr(1).toInt))
}
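If the next step is to aggregate by key (an assumption here, since the variable is named counts), reduceByKey is the usual follow-up:
val totals: RDD[(String, Int)] = counts.reduceByKey(_ + _) // sum the Int values per key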

How to pass hiveContext as argument to functions spark scala

I have created a hiveContext in the main() function in Scala, and I need to pass this hiveContext as a parameter to other functions. This is the structure:
object Project {
  def main(name: String): Int = {
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    ...
  }
  def read(streamId: Int, hc: hiveContext): Array[Byte] = {
    ...
  }
  def close(): Unit = {
    ...
  }
}
but it doesn't work. The read() function is called inside main().
Any ideas?
I declare the hiveContext as implicit; this is working for me:
implicit val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)
Defined in MyJob:
override def run(config: Config)(implicit sqlContext: SQLContext): Unit = ...
But if you don't want it implicit, this should be the same
val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)(sqlContext)
override def run(config: Config)(sqlContext: SQLContext): Unit = ...
Also, your function read should receive HiveContext as the type for the parameter hc, not hiveContext:
def read (streamId: Int, hc:HiveContext): Array[Byte] =
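Put together with the structure from the question, passing the context explicitly might look roughly like the sketch below (the SparkConf/SparkContext setup and the method bodies are assumptions; the originals are elided in the question):
object Project {
  def main(name: String): Int = {
    val conf = new org.apache.spark.SparkConf().setAppName("Project")
    val sc = new org.apache.spark.SparkContext(conf)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val bytes = read(1, hiveContext) // pass the context explicitly as an argument
    // ...
    0
  }

  def read(streamId: Int, hc: org.apache.spark.sql.hive.HiveContext): Array[Byte] = {
    // ... use hc.sql(...), hc.table(...), etc.
    Array.empty[Byte]
  }
}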
I tried several options; this is what eventually worked for me.
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)
  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()
    // iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    // val sqlcc = sqlCont // Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    // ...
  }
}
Note: in the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other options for passing the sqlContext from one method to the other; they always gave me a NullPointerException or a Task not serializable exception.
The good thing is that it eventually worked this way, and I could retrieve parameters from a row of DataFrame1 and use those values in loading DataFrame2.