Creating a broadcast variable with SparkSession? Spark 2.0 - Scala

Is it possible to create broadcast variables with the sparkContext provided by SparkSession? I keep getting an error on sc.broadcast, yet in a different project that uses SparkContext from org.apache.spark.SparkContext directly I have no problems.
import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("My App")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext
      .setLogLevel("ERROR")

    val path = "C:\\Boxes\\github-archive\\2015-03-01-0.json"
    val ghLog = spark.read.json(path)

    val pushes = ghLog.filter("type = 'PushEvent'")
    pushes.printSchema()
    println("All events: " + ghLog.count)
    println("Only pushes: " + pushes.count)
    pushes.show(5)

    val grouped = pushes.groupBy("actor.login").count()
    grouped.show(5)

    val ordered = grouped.orderBy(grouped("count").desc)
    ordered.show(5)

    import scala.io.Source.fromFile
    val fileName = "ghEmployees.txt"
    val employees = Set() ++ (
      for {
        line <- fromFile(fileName).getLines()
      } yield line.trim
    )

    val bcEmployees = sc.broadcast(employees)
  }
}
Or is the problem that I am using a Set() instead of a Seq object?
Thanks for any help.
Edit:
I keep getting a "cannot resolve symbol broadcast" error message in IntelliJ, and after compiling I get the following error:
Error:(47, 28) value broadcast is not a member of Unit
val bcEmployees = sc.broadcast(employees)
                  ^

Your sc variable has type Unit because, according to the docs, setLogLevel has return type Unit. Do this instead:
val sc: SparkContext = spark.sparkContext
sc.setLogLevel("ERROR")
It is important to keep track of the types of your variables to catch errors earlier.
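Putting the answer together with the question's snippet, a minimal sketch of the corrected setup (assuming the same ghEmployees.txt file sits in the working directory) looks like this:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import scala.io.Source.fromFile

val spark = SparkSession.builder()
  .appName("My App")
  .master("local[*]")
  .getOrCreate()

// Keep the SparkContext reference and the Unit-returning setLogLevel call on separate lines
val sc: SparkContext = spark.sparkContext
sc.setLogLevel("ERROR")

// Build the employee set as before, then broadcast it
val employees: Set[String] = fromFile("ghEmployees.txt").getLines().map(_.trim).toSet
val bcEmployees = sc.broadcast(employees)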

Related

Why is this error given while reading a text file in Spark?

The path given to the text file is correct, yet I am still getting the error "Input path does not exist: file:/C:/Users/cmpil/Downloads/hunger_games.txt". Why is this happening?
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WordCountDataSet {
  case class Book(value: String)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Another way of doing it
    val bookRDD = spark.sparkContext.textFile("C:/Users/cmpil/Downloads/hunger_games.txt")
    val wordsRDD = bookRDD.flatMap(x => x.split("\\W+"))
    val wordsDS = wordsRDD.toDS()
    val lowercaseWordsDS = wordsDS.select(lower($"value").alias("word"))
    val wordCountsDS = lowercaseWordsDS.groupBy("word").count()
    val wordCountsSortedDS = wordCountsDS.sort("count")
    wordCountsSortedDS.show(wordCountsSortedDS.count().toInt)
  }
}
On Windows you have to use '\\' in place of '/'.
Try using "C:\\Users\\cmpil\\Downloads\\hunger_games.txt".
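Applied to the program above, only the read line changes; a minimal sketch, assuming the file really is in that Downloads folder:

// Escaped Windows-style path, as suggested above
val bookRDD = spark.sparkContext.textFile("C:\\Users\\cmpil\\Downloads\\hunger_games.txt")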

H2O fails on H2OContext.getOrCreate

I'm trying to write a sample program in Scala/Spark/H2O. The program compiles, but throws an exception in H2OContext.getOrCreate:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o._

object App1 extends App {
  val conf = new SparkConf()
  conf.setAppName("AppTest")
  conf.setMaster("local[1]")
  conf.set("spark.executor.memory", "1g")

  val sc = new SparkContext(conf)
  val spark = SparkSession.builder
    .master("local")
    .appName("ApplicationController")
    .getOrCreate()
  import spark.implicits._

  val h2oContext = H2OContext.getOrCreate(spark) // <--- error here
  import h2oContext.implicits._

  val rawData = sc.textFile("c:\\spark\\data.csv")
  val data = rawData.map(line => line.split(',').map(_.toDouble))
  val response: RDD[Int] = data.map(row => row(0).toInt)
  val str = "count: " + response.count()
  val h2oResponse: H2OFrame = response.toDF

  sc.stop
  spark.stop
}
This is the exception log:
Exception in thread "main" java.lang.RuntimeException: When using the Sparkling Water as Spark package via --packages option, the 'no.priv.garshol.duke:duke:1.2' dependency has to be specified explicitly due to a bug in Spark dependency resolution.
    at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:117)

Error with Spark Row.fromSeq for a text file

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._

object fixedLength {
  def main(args: Array[String]) {

    def getRow(x: String): Row = {
      val columnArray = new Array[String](4)
      columnArray(0) = x.substring(0, 3)
      columnArray(1) = x.substring(3, 13)
      columnArray(2) = x.substring(13, 18)
      columnArray(3) = x.substring(18, 22)
      Row.fromSeq(columnArray)
    }

    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()

    val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(conf)
    val fruits = sc.textFile("in/fruits.txt")

    val schemaString = "id,fruitName,isAvailable,unitPrice"
    val fields = schemaString.split(",").map(field => StructField(field, StringType, nullable = true))
    val schema = StructType(fields)

    val df = spark.createDataFrame(fruits.map { x => getRow(x) }, schema)
    df.show() // Error
    println("End of the program")
  }
}
I'm getting an error on the df.show() call.
My file content is:
56 apple TRUE 0.56
45 pear FALSE1.34
34 raspberry TRUE 2.43
34 plum TRUE 1.31
53 cherry TRUE 1.4
23 orange FALSE2.34
56 persimmon FALSE23.2
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
Can you please help?
You are creating the RDD the old way, with SparkContext(conf):
val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(conf)
val fruits = sc.textFile("in/fruits.txt")
whereas you are creating the DataFrame the new way, using SparkSession:
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val df = spark.createDataFrame(fruits.map { x => getRow(x) }, schema)
Ultimately you are mixing an RDD created with the old SparkContext with a DataFrame created through the new SparkSession. I would suggest you stick to one of them; I suspect that is the reason for the issue.
Update
Doing the following should work for you:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def getRow(x: String): Row = {
  val columnArray = new Array[String](4)
  columnArray(0) = x.substring(0, 3)
  columnArray(1) = x.substring(3, 13)
  columnArray(2) = x.substring(13, 18)
  columnArray(3) = x.substring(18, 22)
  Row.fromSeq(columnArray)
}

Logger.getLogger("org").setLevel(Level.ERROR)

val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()

// Reuse the SparkContext that belongs to the SparkSession instead of creating a second one
val fruits = spark.sparkContext.textFile("in/fruits.txt")

val schemaString = "id,fruitName,isAvailable,unitPrice"
val fields = schemaString.split(",").map(field => StructField(field, StringType, nullable = true))
val schema = StructType(fields)

val df = spark.createDataFrame(fruits.map { x => getRow(x) }, schema)
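With a single SparkSession supplying both the RDD and the DataFrame, the df.show() call from the question should now succeed; a minimal check (sketch, assuming the same in/fruits.txt layout):

df.printSchema()
df.show() // prints the four fixed-width columns instead of throwing the ClassCastException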

How to pass HiveContext as an argument from one function to another function using Spark/Scala

I have a scenario where I need to pass a HiveContext as an argument to another function. Below is my code and the issue I am stuck with:
object Sample {
  def main(args: Array[String]) {
    val fileName = "SampleFile.txt"
    val conf = new SparkConf().setMaster("local").setAppName("LoadToHivePart")
    conf.set("spark.ui.port", "4041")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.setConf("hive.metastore.uris", "thrift://127.0.0.1:9083")
    test(hc, fileName)
    sc.stop()
  }

  def test(hc: String, fileName: String) {
    //code.....
  }
}
As per the above code I am unable to pass the HiveContext variable "hc" from main to the other function. I also tried:
def test(hc: HiveContext, fileName: String){}
but it shows an error in both cases.
def test(hc: HiveContext, fileName: String) {
  //code.....
}
Note: HiveContext lives in org.apache.spark.sql.hive.HiveContext, so import it with import org.apache.spark.sql.hive.HiveContext.
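Putting the import and the corrected signature together, a minimal sketch of the question's program (same metastore URI and file name as above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Sample {
  def main(args: Array[String]) {
    val fileName = "SampleFile.txt"
    val conf = new SparkConf().setMaster("local").setAppName("LoadToHivePart")
    conf.set("spark.ui.port", "4041")
    val sc = new SparkContext(conf)

    val hc = new HiveContext(sc)
    hc.setConf("hive.metastore.uris", "thrift://127.0.0.1:9083")

    // hc is now typed as HiveContext, so it can be passed to other functions
    test(hc, fileName)
    sc.stop()
  }

  def test(hc: HiveContext, fileName: String) {
    // use hc here, e.g. hc.sql("...")
  }
}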

Spark-submit cannot access local file system

Really simple Scala code fails at the first count() method call.
def main(args: Array[String]) {
  // create Spark context with Spark configuration
  val sc = new SparkContext(new SparkConf().setAppName("Spark File Count"))

  val fileList = recursiveListFiles(new File("C:/data")).filter(_.isFile).map(file => file.getName())
  val filesRDD = sc.parallelize(fileList)

  val linesRDD = sc.textFile("file:///temp/dataset.txt")
  val lines = linesRDD.count()
  val files = filesRDD.count()
}
I don't want to set up an HDFS installation for this right now. How do I configure Spark to use the local file system? This works in spark-shell.
To read a file from the local filesystem (a Windows directory), you need to use the pattern below.
val fileRDD = sc.textFile("C:\\Users\\Sandeep\\Documents\\test\\test.txt");
Please see the sample working program below, which reads data from the local file system.
package com.scala.example

import org.apache.spark._

object Test extends Serializable {
  val conf = new SparkConf().setAppName("read local file")
  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")
  val sc = new SparkContext(conf)
  val input = "C:\\Users\\Sandeep\\Documents\\test\\test.txt"

  def main(args: Array[String]): Unit = {
    val fileRDD = sc.textFile(input)
    val counts = fileRDD.flatMap(line => line.split(","))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)

    // Stop the Spark context
    sc.stop
  }
}
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count").setMaster("local[8]"))
might help.
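Combining that with the local-path pattern from the previous answer, a minimal sketch (assuming dataset.txt actually lives under C:\temp):

import org.apache.spark.{SparkConf, SparkContext}

// setMaster belongs on the SparkConf, not on the SparkContext
val conf = new SparkConf().setAppName("Spark File Count").setMaster("local[8]")
val sc = new SparkContext(conf)

// Windows-style local path; adjust to wherever dataset.txt really is
val linesRDD = sc.textFile("C:\\temp\\dataset.txt")
println("lines: " + linesRDD.count())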