How can I tell whether a SparkContext still has work executing, so that I can stop it once everything has finished? Currently I wait 30 seconds before calling SparkContext.stop; otherwise my app throws an error.
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.SparkContext
object RatingsCounter extends App {
  // set the log level to print only errors
  Logger.getLogger("org").setLevel(Level.ERROR)
  // create a SparkContext using every core of the local machine, named RatingsCounter
  val sc = new SparkContext("local[*]", "RatingsCounter")
  // load up each line of the ratings data into an RDD (Resilient Distributed Dataset)
  val lines = sc.textFile("src/main/resource/", 0)
  // convert each line to a string, split it by tabs and extract the third field.
  // The file format is userID, movieID, rating, timestamp
  val ratings = lines.map(x => x.toString().split("\t")(2))
  // count up how many times each value occurs
  val results = ratings.countByValue()
  // sort the resulting map of (rating, count) tuples
  val sortedResults = results.toSeq.sortBy(_._1)
  // print each result on its own line.
  sortedResults.foreach { case (key, value) => println("movie ID: " + key + " - rating times: " + value) }
}

Spark applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
Since you are extending App, you are getting unexpected behavior.
You can read more about it in the official documentation about Self Contained Applications.
App uses DelayedInit and can cause initialization issues. With the main method you know what's going on. Excerpt from reddit.
object HelloWorld extends App {
  var a = 1
  a + 1
  override def main(args: Array[String]): Unit = {
    println(a) // guess what's the value of a ?
  }
}
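Following the advice to use an explicit main(), here is a minimal sketch of one way to make sure stop() runs only after the work has finished, without sleeping. It relies on the fact that Spark actions such as countByValue() block until the job is done, so a try/finally around them is enough. FakeContext and withResource are hypothetical names invented for this sketch; FakeContext only stands in for SparkContext so the example runs without Spark on the classpath.

```scala
object StopAfterWork {
  // Loan pattern: acquire a resource, run the body, always release afterwards.
  def withResource[R, A](acquire: => R)(release: R => Unit)(body: R => A): A = {
    val resource = acquire
    try body(resource)
    finally release(resource)
  }

  // Hypothetical stand-in for SparkContext so the sketch runs without Spark.
  final class FakeContext {
    var stopped = false
    // Blocking "action": returns only when the counting is complete.
    def countByValue(values: Seq[String]): Map[String, Long] =
      values.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
    def stop(): Unit = { stopped = true }
  }

  def main(args: Array[String]): Unit = {
    val counts = withResource(new FakeContext)(_.stop()) { ctx =>
      ctx.countByValue(Seq("3", "5", "3"))
    }
    counts.toSeq.sortBy(_._1).foreach { case (rating, n) => println(s"$rating: $n") }
  }
}
```

With real Spark the same shape would be withResource(new SparkContext("local[*]", "RatingsCounter"))(_.stop()) { sc => ... }, with the action called inside the body.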


Error: Could not find or load main class com.sundogsoftware.spark.RatingsCounter

I don't know what is wrong here. When I run it, I keep getting "Error: Could not find or load main class com.sundogsoftware.spark.RatingsCounter" in my Scala IDE.
This is my Scala code:
package com.sundogsoftware.spark
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
/** Count up how many of each star rating exists in the MovieLens 100K data set. */
object RatingsCounter {
  /** Our main function where the action happens */
  def main(args: Array[String]) {
    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)
    // Create a SparkContext using every core of the local machine, named RatingsCounter
    val sc = new SparkContext("local[*]", "RatingsCounter")
    // Load up each line of the ratings data into an RDD
    val lines = sc.textFile("../ml-100k/")
    // Convert each line to a string, split it out by tabs, and extract the third field.
    // (The file format is userID, movieID, rating, timestamp)
    val ratings = lines.map(x => x.toString().split("\t")(2))
    // Count up how many times each value (rating) occurs
    val results = ratings.countByValue()
    // Sort the resulting map of (rating, count) tuples
    val sortedResults = results.toSeq.sortBy(_._1)
    // Print each result on its own line.
    sortedResults.foreach(println)
  }
}
Here is my project structure.
Here is my run configuration.
Here is the Scala compiler option I selected.
I have been trying to debug this for a few hours now, and nothing seems to work.
Any pointers will help.
I had to change the -vm argument in my eclipse.ini file, and for the JRE option I selected 'Use default JRE (currently 'Java SE 8 [1.8.0_172]')' when I created the Scala project. That fixed this error for me.
I am using OS X, so I had to add
above -vmargs

Scala - Remove header from Pair RDD

I am new to Scala and want to remove the header from my data. I have the data below
and I am using the code below to read it:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object PAN {
  def main(args: Array[String]) {
    case class income(recordid : Int, income : Int)
    val sc = new SparkContext(new SparkConf().setAppName("income").setMaster("local[2]"))
    val income_data = sc.textFile("file:///home/user/Documents/income_info.txt").map(_.split(","))
    val income_recs = income_data.map(r => (r(0).toInt, income(r(0).toInt, r(1).toInt)))
  }
}
I want to remove the header from the pair RDD but cannot work out how.
I was playing with the code below:
val header = income_data.first()
val a = income_data.filter(row => row != header)
a.foreach { println }
but it returns the output below:
Filtering the header out is the right idea, with one caveat: Scala arrays compare by reference, so row != header is true for every row, the header included. Compare contents with !row.sameElements(header), or filter the header line out before splitting. The other problem is how you are trying to print the array.
Arrays in Scala do not override toString, so printing one uses the default string representation, which is just the type name and hash code and is rarely useful.
If you want to print an array, turn it into a string first with the array's mkString method, or print each element with foreach(println):
a.foreach { array => println(array.mkString("[", ", ", "]")) }
a.foreach { array => array.foreach(println) }
Will both print out the elements of your array so you can see what they contain.
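To see the difference without running Spark, here is a small sketch on plain Scala collections. The sample rows and the header text "recordid,income" are hypothetical (the original data file is not shown), and render and dropHeader are names made up for this example. It shows the default toString of an array, the mkString rendering, and a content-based header filter using sameElements.

```scala
object ArrayPrinting {
  // Render one split row in a readable form.
  def render(row: Array[String]): String = row.mkString("[", ", ", "]")

  // Drop every row whose contents equal the header's contents.
  def dropHeader(rows: Seq[Array[String]], header: Array[String]): Seq[Array[String]] =
    rows.filterNot(_.sameElements(header))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows standing in for the lines of income_info.txt.
    val rows = Seq("recordid,income", "1,50000", "2,62000").map(_.split(","))
    val header = "recordid,income".split(",")
    println(rows.head) // default toString: type tag plus hash code
    dropHeader(rows, header).map(render).foreach(println)
  }
}
```

The same two functions could be passed to rdd.filter and rdd.map, since they only touch ordinary arrays.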
Keep in mind that when working with Spark, printing inside transformations and actions only shows up on your console in local mode. Once you move to a cluster, the work is done on remote executors, so you won't see any console output from them.
val income_data = sc.textFile("file:///home/user/Documents/income_info.txt")
sc.textFile returns an RDD[String]. Calling collect() on it returns an Array[String], and drop(n) is an Array method that removes the first n elements, so dropping one element after collect() removes the header row. Note that collect() brings the whole data set to the driver, so this only suits small files.

Parallelizing access to S3 objects in Scala

I'm writing a Scala program to read objects matching a certain prefix on S3.
At the moment, I'm testing it on my Macbook Pro and it takes 270ms (avg. over 1000 trials) to hit S3, retrieve the 10 objects (avg. size of object 150Kb) and process it to print the output.
Here's my code:
val myBucket = "my-test-bucket"
val myPrefix = "t"
val startTime = System.currentTimeMillis()
//Can I make listObject parallel?
val listObjRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(myBucket)
val listObjResult: Seq[String] = s3.listObjects(listObjRequest).getObjectSummaries.asScala.map(_.getKey).filter(_ matches s"./.*${myPrefix}.*/*").toSeq
//Can I make forEach parallel?
listObjResult foreach println //Could be any function
println(s"Total time: ${System.currentTimeMillis() - startTime}ms")
In the big scheme of things, I've got to sift through 50Gb of data (approx. 350K nested objects) and delete objects following a certain prefix (approx. 40K objects).
Hardware considerations aside, what can I do to optimize my code?
A possible solution would be to batch the objects and send batch deletion requests to S3. You can group the objects to delete and then parallelize the deletion by running each group in its own Future:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{DeleteObjectsRequest, DeleteObjectsResult}
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion
import scala.collection.JavaConverters._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent._
import scala.util.Try

object AmazonBatchDeletion {
  def main(args: Array[String]): Unit = {
    val filesToDelete: List[String] = ???
    val numOfGroups: Int = ???

    // One Future per group; blocking marks the S3 call as blocking I/O.
    val deletionAttempts: Iterator[Future[Try[DeleteObjectsResult]]] =
      filesToDelete
        .grouped(numOfGroups)
        .map(groupToDelete => Future {
          blocking {
            deleteFilesInBatch(groupToDelete, "bucketName")
          }
        })

    val result: Future[Iterator[Try[DeleteObjectsResult]]] =
      Future.sequence(deletionAttempts)

    // TODO: make sure deletion was successful.
    // Recover if needed from faulted futures.
  }

  def deleteFilesInBatch(filesToDelete: List[String],
                         bucketName: String): Try[DeleteObjectsResult] = {
    val amazonClient = new AmazonS3Client()
    val deleteObjectsRequest = new DeleteObjectsRequest(bucketName)
    deleteObjectsRequest.setKeys(filesToDelete.map(new KeyVersion(_)).asJava)
    Try {
      amazonClient.deleteObjects(deleteObjectsRequest)
    }
  }
}
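As a rough illustration of the batching idea and of the "make sure deletion was successful" TODO, here is a sketch that runs without the AWS SDK. The real S3 call is replaced by a hypothetical stub (fakeDelete) that just reports how many keys it received, and deleteAll and the key names are invented for this example. The batch size of 1000 reflects the limit on keys per S3 multi-object delete request.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Success, Try}

object BatchDeletionSketch {
  // S3's DeleteObjects call accepts at most 1000 keys per request.
  val MaxKeysPerRequest = 1000

  // Run deleteBatch once per batch in parallel and collect every outcome.
  def deleteAll(keys: List[String], batchSize: Int)(
      deleteBatch: List[String] => Try[Int]): Future[List[Try[Int]]] = {
    val attempts = keys.grouped(batchSize).toList.map { batch =>
      Future(blocking(deleteBatch(batch)))
    }
    Future.sequence(attempts)
  }

  def main(args: Array[String]): Unit = {
    val keys = List.tabulate(2500)(i => s"t/object-$i") // hypothetical keys
    // Hypothetical stand-in for the real S3 call: reports how many keys it was given.
    val fakeDelete: List[String] => Try[Int] = batch => Try(batch.size)
    val outcomes = Await.result(deleteAll(keys, MaxKeysPerRequest)(fakeDelete), 30.seconds)
    val deleted = outcomes.collect { case Success(n) => n }.sum
    val failedBatches = outcomes.count(_.isFailure)
    println(s"deleted=$deleted failedBatches=$failedBatches")
  }
}
```

Keeping each outcome as a Try means one failed batch does not fail the combined Future, so failed batches can be counted or retried afterwards.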

String filter using Spark UDF

What I want:
Read the input file and compare each line with the set "123,200,300"; if a match is found, output the matching values:
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)

  def parseLine(invCol: String) : RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map { r => parseLine(r) }
  }
}
I get the following exception:
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt, p(1).trim())).toDF()
pDF.select("id", "pName").show()
Define UDF
val findP = udf((id: Int,
                 pName: String
                ) => {
  val ids = Array("123", "200", "300")
  var idsFound : String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  // strip the trailing comma
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use UDF in withColumn()
pDF.select("id", "pName").withColumn("Found", findP($"id", $"pName")).show()
For a simple answer, why make it so complex? In this case we don't need a UDF.
This is your input data:
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
// split takes a regex, so the pipe has to be escaped
val rawrdd = sc.textFile("D:\\input.txt").map(_.split("\\|"))
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
Your output:
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
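Since RDDs cannot be nested, the per-line work has to be done with ordinary Scala collections inside the transformation. A minimal sketch of that per-line function, runnable without Spark, is below. The input lines are hypothetical, chosen so that lines 1, 2 and 4 produce the matches listed in the question, and matching is a name made up for this example.

```scala
object MatchIds {
  val matchSet: Set[String] = "123,200,300".split(",").toSet

  // Keep only the ids on a line that also appear in matchSet, in input order.
  def matching(line: String): Seq[String] =
    line.split(",").map(_.trim).filter(matchSet.contains).toSeq

  def main(args: Array[String]): Unit = {
    // Hypothetical input lines; the third one has no match and is skipped.
    val lines = Seq("100,200,300", "300,400", "500", "123,600")
    lines.map(matching).filter(_.nonEmpty).foreach(ids => println(ids.mkString(",")))
  }
}
```

Because matching touches no RDD or SparkContext, it can be used directly as rawData.map(matching) in the original program.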

Spark Streaming - Issue with Passing parameters

Please take a look at the following spark streaming code written in scala:
object HBase {
  var hbaseTable = ""
  val hConf = new HBaseConfiguration()
  hConf.set("hbase.zookeeper.quorum", "zookeeperhost")

  def init(input: (String)) {
    hbaseTable = input
  }

  def display() {
    println(hbaseTable)
  }

  def insertHbase(row: (String)) {
    val hTable = new HTable(hConf, hbaseTable)
  }
}

object mainHbase {
  def main(args : Array[String]) {
    if (args.length < 5) {
      System.err.println("Usage: MetricAggregatorHBase <zkQuorum> <group> <topics> <numThreads> <hbaseTable>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads, hbaseTable) = args
    HBase.init(hbaseTable)
    HBase.display()
    val sparkConf = new SparkConf().setAppName("mainHbase")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
    val storeStg = lines.foreachRDD(rdd => rdd.foreach(HBase.insertHbase))
  }
}
I am trying to initialize the parameter hbaseTable in the object HBase by calling the HBase.init method. It sets the parameter properly; I confirmed that by calling the HBase.display method on the next line.
However, when the HBase.insertHbase method is called inside foreachRDD, it throws an error saying that hbaseTable is not set.
Update with exception:
java.lang.IllegalArgumentException: Table qualifier must not be empty
Can you please let me know how to make this code work?
"Where is this code running" - that's the question that we need to ask in order to understand what's going on.
HBase is a Scala object - by definition it's a singleton construct that gets initialized with 'only once' semantics in the JVM.
At the initialization point, HBase.init(hbaseTable) is executed in the driver of this Spark application, initializing this object with the given value in the VM of the driver.
But when we do rdd.foreach(HBase.insertHbase), the closure is executed as a task on each executor that hosts a partition of the given RDD. At that point, the object HBase is initialized afresh in the JVM of each executor. HBase.init was never called there, so hbaseTable still holds its default empty string, which is what the exception reports.
There are two options:
We can add an "isInitialized" check to the HBase object and make the (now conditional) call to initialize on each call to foreach.
Another option would be to use
rdd.foreachPartition { partition =>
  HBase.init(hbaseTable)
  partition.foreach(elem => HBase.insertHbase(elem))
}
This construction amortizes the initialization over the number of elements in each partition. It can also be combined with an initialization check to prevent unnecessary bootstrap work.
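The two options can be combined. Below is a sketch of the pattern in plain Scala, so it runs without Spark or HBase: Store is a hypothetical stand-in for the HBase object, ensureInit plays the role of the "isInitialized" check, and processPartition mirrors the body one would pass to rdd.foreachPartition. None of these names come from the original code.

```scala
object LazyInitSketch {
  // Hypothetical stand-in for the HBase helper object in the question.
  object Store {
    @volatile private var table: Option[String] = None
    var initCalls = 0
    val inserted = scala.collection.mutable.ArrayBuffer.empty[(String, String)]

    // Initialize only if this JVM has not done so yet.
    def ensureInit(name: String): Unit = synchronized {
      if (table.isEmpty) { table = Some(name); initCalls += 1 }
    }

    // Fails loudly if called before ensureInit, instead of using an empty name.
    def insert(row: String): Unit = synchronized {
      val t = table.getOrElse(throw new IllegalStateException("Store not initialized"))
      inserted += ((t, row))
    }
  }

  // Mirrors rdd.foreachPartition: one init check per partition, then per-element work.
  def processPartition(tableName: String, partition: Iterator[String]): Unit = {
    Store.ensureInit(tableName)
    partition.foreach(Store.insert)
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a", "b"), Seq("c")) // hypothetical data
    partitions.foreach(p => processPartition("metrics", p.iterator))
    println(s"initCalls=${Store.initCalls} inserted=${Store.inserted.size}")
  }
}
```

The table name is passed in as an argument, so in Spark it would travel to the executors inside the closure rather than relying on state that was set only on the driver.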