H2O fails on H2OContext.getOrCreate - scala

I'm trying to write a sample program in Scala/Spark/H2O. The program compiles, but throws an exception in H2OContext.getOrCreate:
object App1 extends App{
val conf = new SparkConf()
conf.setAppName("AppTest")
conf.setMaster("local[1]")
conf.set("spark.executor.memory","1g");
val sc = new SparkContext(conf)
val spark = SparkSession.builder
.master("local")
.appName("ApplicationController")
.getOrCreate()
import spark.implicits._
val h2oContext = H2OContext.getOrCreate(spark) // <--- error here
import h2oContext.implicits._
val rawData = sc.textFile("c:\\spark\\data.csv")
val data = rawData.map(line => line.split(',').map(_.toDouble))
val response: RDD[Int] = data.map(row => row(0).toInt)
val str = "count: " + response.count()
val h2oResponse: H2OFrame = response.toDF
sc.stop
spark.stop
}
This is the exception log:
Exception in thread "main" java.lang.RuntimeException: When using the Sparkling Water as Spark package via --packages option, the 'no.priv.garshol.duke:duke:1.2' dependency has to be specified explicitly due to a bug in Spark dependency resolution.
at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:117)

Related

Adding Mongo config to active spark session

I am trying to add configuration settings to an active Spark session. Below is my code:
val spark = SparkSession.getActiveSession.get
spark.conf.set("spark.mongodb.input.uri",
"mongodb://hello_admin:hello123#localhost:27017/testdb.products?authSource=admin")
spark.conf.set("spark.mongodb.input.partitioner" ,"MongoPaginateBySizePartitioner")
import com.mongodb.spark._
val customRdd = MongoSpark.load(sc)
println(customRdd.count())
println(customRdd.first.toJson)
println(customRdd.collect().foreach(println))
But I am getting an error:
java.lang.IllegalArgumentException: Missing database name. Set via the
'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
However, when I write the code like this:
val spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://hello_admin:hello123#localhost:27017/testdb.products?authSource=admin")
// .config("spark.mongodb.output.uri", "mongodb://hello_admin:hello123#localhost:27017/testdb.products?authSource=admin")
.config("spark.mongodb.input.partitioner" ,"MongoPaginateBySizePartitioner")
.getOrCreate()
val sc = spark.sparkContext
val customRdd = MongoSpark.load(sc)
println(customRdd.count())
println(customRdd.first.toJson)
println(customRdd.collect().foreach(println))
the code executes fine.
What changes do I need to make to the first snippet?
You can define the SparkSession with a SparkConf like this (I don't know if this helps you):
def sparkSession(conf: SparkConf): SparkSession = SparkSession
.builder()
.config(conf)
.getOrCreate()
val sparkConf = new SparkConf()
sparkConf.set("prop","value")
val ss = sparkSession(sparkConf)
Or you can try SparkEnv (I use SparkEnv for a lot of things when I need to change properties):
SparkEnv.get.conf.set("prop", "value")

scala-submit java.lang.ClassNotFoundException

Spark 2.7, Scala 2.12.7. When I use spark-submit to submit a simple WordCount project, I still get an error, even though I have checked that the package and class name are correct:
java.lang.ClassNotFoundException
My setup:
1. ./bin/spark-submit --master spark://localhost.localdomain:7077 --class sparkTes.WordCount.scala /java/spark/scala.jar
2. Spark code:
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("wordcount");
val sc = new SparkContext(conf)
val input = sc.textFile("/java/text/scala.md", 2).cache()
val lines = input.flatMap(line=>line.split(" "))
val count = lines.map(word => (word,1)).reduceByKey{case (x,y)=>x+y}
val output = count.saveAsTextFile("/java/text/WordCount")
}

Error with spark Row.fromSeq for a text file

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
object fixedLength {
def main(args:Array[String]) {
def getRow(x : String) : Row={
val columnArray = new Array[String](4)
columnArray(0)=x.substring(0,3)
columnArray(1)=x.substring(3,13)
columnArray(2)=x.substring(13,18)
columnArray(3)=x.substring(18,22)
Row.fromSeq(columnArray)
}
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val fruits = sc.textFile("in/fruits.txt")
val schemaString = "id,fruitName,isAvailable,unitPrice";
val fields = schemaString.split(",").map( field => StructField(field,StringType,nullable=true))
val schema = StructType(fields)
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
df.show() // Error
println("End of the program")
}
}
I'm getting an error at the df.show() call.
My file content is:
56 apple     TRUE 0.56
45 pear      FALSE1.34
34 raspberry TRUE 2.43
34 plum      TRUE 1.31
53 cherry    TRUE 1.4
23 orange    FALSE2.34
56 persimmon FALSE23.2
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
Can you please help?
You are creating the RDD the old way, with SparkContext(conf):
val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val fruits = sc.textFile("in/fruits.txt")
whereas you are creating the DataFrame the new way, using SparkSession:
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
Ultimately, you are mixing an RDD created with the old SparkContext functions with a DataFrame created using the new SparkSession.
I would suggest using only one approach.
I guess that's the reason for the issue.
Update
Doing the following should work for you:
def getRow(x : String) : Row={
val columnArray = new Array[String](4)
columnArray(0)=x.substring(0,3)
columnArray(1)=x.substring(3,13)
columnArray(2)=x.substring(13,18)
columnArray(3)=x.substring(18,22)
Row.fromSeq(columnArray)
}
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val fruits = spark.sparkContext.textFile("in/fruits.txt")
val schemaString = "id,fruitName,isAvailable,unitPrice";
val fields = schemaString.split(",").map( field => StructField(field,StringType,nullable=true))
val schema = StructType(fields)
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
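The slicing that getRow performs can also be checked on its own, without Spark. The following is a minimal sketch, not part of the original answer. The object name `FixedWidth` and the helper `slice` are hypothetical, the column boundaries are taken from the substring calls above, and both the padding of short lines and the trimming of each field are assumptions added for illustration.

```scala
// Pure-Scala sketch of the fixed-width slicing done by getRow (no Spark needed).
object FixedWidth {
  // Column boundaries copied from the substring calls: [0,3), [3,13), [13,18), [18,22).
  val bounds: Seq[(Int, Int)] = Seq((0, 3), (3, 13), (13, 18), (18, 22))
  val lineWidth: Int = 22

  // Assumption: lines shorter than 22 characters are padded with spaces so that
  // substring cannot throw StringIndexOutOfBoundsException; each field is then trimmed.
  def slice(line: String): Seq[String] = {
    val padded = line + (" " * math.max(0, lineWidth - line.length))
    bounds.map { case (from, until) => padded.substring(from, until).trim }
  }
}
```

If this fits your data, the result could be passed to Row.fromSeq inside getRow in place of the hand-filled array.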

Creating a broadcast variable with SparkSession ? Spark 2.0

Is it possible to create broadcast variables with the sparkContext provided by SparkSession? I keep getting an error under sc.broadcast; however, in a different project that uses the SparkContext from org.apache.spark.SparkContext, I have no problems.
import org.apache.spark.sql.SparkSession
object MyApp {
def main(args: Array[String]){
val spark = SparkSession.builder()
.appName("My App")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
.setLogLevel("ERROR")
val path = "C:\\Boxes\\github-archive\\2015-03-01-0.json"
val ghLog = spark.read.json(path)
val pushes = ghLog.filter("type = 'PushEvent'")
pushes.printSchema()
println("All events: "+ ghLog.count)
println("Only pushes: "+pushes.count)
pushes.show(5)
val grouped = pushes.groupBy("actor.login").count()
grouped.show(5)
val ordered = grouped.orderBy(grouped("count").desc)
ordered.show(5)
import scala.io.Source.fromFile
val fileName= "ghEmployees.txt"
val employees = Set() ++ (
for {
line <- fromFile(fileName).getLines()
} yield line.trim
)
val bcEmployees = sc.broadcast(employees)
}
}
Or is the problem that I am using a Set() instead of a Seq object?
Thanks for any help.
Edit:
I keep getting a "cannot resolve symbol broadcast" error message in IntelliJ.
After compiling I get this error:
Error:(47, 28) value broadcast is not a member of Unit
val bcEmployees = sc.broadcast(employees)
^
Your sc variable has type Unit because, according to the docs, setLogLevel has return type Unit. Do this instead:
val sc: SparkContext = spark.sparkContext
sc.setLogLevel("ERROR")
It is important to keep track of the types of your variables to catch errors earlier.
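The same effect can be reproduced without Spark. This is a small sketch in which `Context` is a hypothetical stand-in for SparkContext (its `broadcast` just returns its argument and is not Spark's Broadcast). It only illustrates that a val bound to a call chain takes the type of the last call in that chain.

```scala
// Hypothetical stand-in for SparkContext: setLogLevel returns Unit, as in Spark.
class Context {
  var logLevel: String = "INFO"
  def setLogLevel(level: String): Unit = { logLevel = level }
  // Placeholder only: returns the value unchanged, unlike Spark's broadcast.
  def broadcast[T](value: T): T = value
}

object UnitChaining {
  // Chained form: the val holds the result of the last call in the chain, i.e. Unit.
  val chained: Unit = new Context().setLogLevel("ERROR")
  // chained.broadcast(Set("alice")) // would not compile: broadcast is not a member of Unit

  // Separate statements keep the reference to the context itself.
  val ctx: Context = new Context()
  ctx.setLogLevel("ERROR")
  val bc: Set[String] = ctx.broadcast(Set("alice"))
}
```

Writing the expected type on the left-hand side, as in `val ctx: Context`, makes the compiler report the mismatch at the declaration instead of at the first use.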

Multiclass Classification Evaluator field does not exist error - Apache Spark

I am new to Spark and trying a basic classifier in Scala.
I'm trying to get the accuracy, but when using MulticlassClassificationEvaluator it gives the error below:
Caused by: java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:76)
at com.classifier.classifier_app.App$.<init>(App.scala:90)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
The code is as below:
val conf = new SparkConf().setMaster("local[*]").setAppName("Classifier")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("Email Classifier")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
val spamInput = "TRAIN_00000_0.eml" //files to train model
val normalInput = "TRAIN_00002_1.eml"
val spamData = spark.read.textFile(spamInput)
val normalData = spark.read.textFile(normalInput)
case class Feature(index: Int, value: String)
val indexer = new StringIndexer()
.setInputCol("value")
.setOutputCol("label")
val regexTokenizer = new RegexTokenizer()
.setInputCol("value")
.setOutputCol("cleared")
.setPattern("\\w+").setGaps(false)
val remover = new StopWordsRemover()
.setInputCol("cleared")
.setOutputCol("filtered")
val hashingTF = new HashingTF()
.setInputCol("filtered").setOutputCol("features")
.setNumFeatures(100)
val nb = new NaiveBayes()
val indexedSpam = spamData.map(x=>Feature(0, x))
val indexedNormal = normalData.map(x=>Feature(1, x))
val trainingData = indexedSpam.union(indexedNormal)
val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF, nb))
val model = pipeline.fit(trainingData)
model.write.overwrite().save("myNaiveBayesModel")
val spamTest = spark.read.textFile("TEST_00009_0.eml")
val normalTest = spark.read.textFile("TEST_00000_1.eml")
val sameModel = PipelineModel.load("myNaiveBayesModel")
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
Console.println("Spam Test")
val predictionSpam = sameModel.transform(spamTest).select("prediction")
predictionSpam.foreach(println(_))
val accuracy = evaluator.evaluate(predictionSpam)
println("Accuracy Spam: " + accuracy)
Console.println("Normal Test")
val predictionNorm = sameModel.transform(normalTest).select("prediction")
predictionNorm.foreach(println(_))
val accuracyNorm = evaluator.evaluate(predictionNorm)
println("Accuracy Normal: " + accuracyNorm)
The error occurs when initializing the MulticlassClassificationEvaluator. How should the column names be specified? Any help is appreciated.
The error is in this line:
val predictionSpam = sameModel.transform(spamTest).select("prediction")
Your DataFrame contains only the prediction column and no label column, so the evaluator cannot find the "label" field it was configured to read.
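To see why both columns are needed, here is a plain-Scala sketch of what an accuracy metric computes. It is not Spark's implementation. `AccuracySketch` is a hypothetical name, and returning 0.0 for empty input is an arbitrary choice made for the example.

```scala
object AccuracySketch {
  // Each pair is (label, prediction); accuracy is the share of rows where the two agree.
  // Without the label there is nothing to compare the prediction against.
  def accuracy(rows: Seq[(Double, Double)]): Double =
    if (rows.isEmpty) 0.0 // assumption: empty input is defined as 0.0 in this sketch
    else rows.count { case (label, prediction) => label == prediction }.toDouble / rows.size
}
```

In the question's code, selecting both columns (for example select("label", "prediction")) or passing the full transformed DataFrame to the evaluator would keep the field it looks up, assuming the transformed data actually carries a label column.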