Running h2o in Jupyter scala notebook - scala

I'm trying to get h2o running on a Jupyter notebook with scala kernel, with no success so far. Maybe someone can give me a hint on what could be wrong? The code I'm executing at the moment is
classpath.add("ai.h2o" % "sparkling-water-core_2.10" % "1.6.5")
import org.apache.spark.h2o._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setAppName("appName").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val h2oContext = new H2OContext(sc).start()
It fails on the last line with error
java.lang.NoClassDefFoundError: water/H2O
....
And prints out exception
java.lang.RuntimeException: Cannot launch H2O on executors: numOfExecutors=1, executorStatus=(driver,false) (Cannot launch H2O on executors: numOfExecutors=1, executorStatus=(driver,false))
org.apache.spark.h2o.H2OContextUtils$.startH2O(H2OContextUtils.scala:169)
org.apache.spark.h2o.H2OContext.start(H2OContext.scala:214)

If you use Toree,
In /usr/local/share/jupyter/kernels/apache_toree_scala/kernel.json
You should add --packages ai.h2o:sparkling-water-core_2.10:1.6.6 on __TOREE_SPARK_OPTS__ , something like
"__TOREE_SPARK_OPTS__": "--master local[*] --executor-memory 12g --driver-memory 12g --packages ai.h2o:sparkling-water-core_2.10:1.6.6",
Then, sc is created when you create your notebook. so you don't need to recreate sc.

Related

Pyspark in google colab

I am trying to use pyspark on google colab. Every tutorial follows a similar method
!pip install pyspark # Import SparkSession
from pyspark.sql import SparkSession # Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate() # Check Spark Session Information
spark # Import a Spark function from library
from pyspark.sql.functions import col
But I get an error in
----> 4 spark = SparkSession.builder.master("local[*]").getOrCreate() # Check Spark Session Information
RuntimeError: Java gateway process exited before sending its port number
I tried installing java using something like this
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
as suggested by the tutorials, but nothing seems to work.
This worked for me so i post in case someone needs it.
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
import pyspark.sql as pyspark_sql
import pyspark.sql.types as pyspark_types
import pyspark.sql.functions as pyspark_functions
from pyspark import SparkContext, SparkConf
# create the session
conf = SparkConf().set("spark.ui.port", "4050")
# create the context
sc = pyspark.SparkContext(conf=conf)
spark = pyspark_sql.SparkSession.builder.getOrCreate()

Error in running Scala in terminal: "object apache is not a member of package org"

I'm using sublime to write my first Scala program, and I'm using terminal to run it.
First I use scalac assignment2.scala command to compile it, but it show error message:"error: object apache is not a member of package org"
How can I do to fix it?
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object assignment2 {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("assignment2")
val sc = new SparkContext(conf)
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
}
}
Where are you trying to submit the job. To run any spark application you need to submit it from bin/spark-submit in your spark installation directory or you need to have spark-home set in your environment, which you can refer while submitting.
Actually you can't run spark-scala file directly because for compilation your scala class, you need spark library. So for executing scala file you required spark-shell. For executing your spark scala file inside spark-shell, please find the below steps:
Open your spark-shell using next command-
'spark-shell --master yarn-client'
load your file with exact location-
':load File_Name_With_Absoulte_path'
Run you main method using class name- 'ClassName.main(null)'

Databricks-connect with Python + Scala UDF in JAR file not working locally

I'm trying to use a JAR file in python (using Databricks-connect) in Vs Code.
I already checked the path to the jar file.
I have the following code as example:
import datetime
import time
from pyspark.sql import SparkSession
from pyDataHub import LoadProcessorBase, ProcessItem
from pyspark.sql.functions import col, lit, sha1, concat, udf, array
from pyspark.sql import functions
from pyspark.sql.types import TimestampType, IntegerType, DoubleType, StringType
from pyspark import SparkContext
from pyspark.sql.functions import sha1, upper
from pyspark.sql.column import Column, _to_java_column, _to_seq
spark = SparkSession \
.builder \
.config("spark.jars", "/users/Phill/source/jar/DataHub_Core_Functions.jar") \
.getOrCreate()
sc = spark.sparkContext
def PhillHash(col):
f = sc._jvm.com.narato.datahub.core.HashContentGenerator.getGenerateHashUdf()
return upper(sha1(Column(f.apply(_to_seq(sc, [col], _to_java_column)))))
sc._jsc.addJar("/users/Phill/source/jar/DataHub_Core_Functions.jar")
spark.range(100).withColumn("test", PhillHash("id")).show()
Any help would be appreciated cause I'm out of options here...
The error I get is the following:
Exception has occurred: TypeError 'JavaPackage' object is not callable
Add the jar to a dbfs location and update the path accordingly. The workers cannot connect to your local filesystem.
Also make sure you are running version 5.4 of the databricks runtime (or higher).

Scala Word count jar is not running in spark

I'm very new to both Scala and Spark. I added the Scala IDE to Eclipse Luna. I created a maven project in the eclipse. I was to run the program within the eclipse with run as configuration option and able to get the output successfully. But when i create the jar for the following program and tried to run the spark shell am getting the following error.
error: ';' expected but 'class' found.
Command use to run the jar
spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco
The word count program which i tried
package com.kirthi.spark.proj.sparkexamples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordsCount {
def main(args: Array[String]){
val conf = new SparkConf()
.setAppName("Word Count")
.setMaster("local")
val sc = new SparkContext(conf)
val textFile = sc.textFile(args(0))
val words = textFile.flatMap(line => line.split(","))
val counts = words.map(word => (word,1))
val wordcount = counts.reduceByKey(_+_)
val wordcount_sorted = wordcount.sortByKey()
wordcount_sorted.foreach(println)
wordcount_sorted.saveAsTextFile(args(1))
}
}
Kindly help me out in this as I am struck with initial program for spark.
I am using cloudera quickstart CDH 5.5
As shown in the comments, you ran the above command in the Scala REPL, and which you should run it from a regular linux shell.

WordCount on Azure using hadoop and spark

I have to run a simple wordcount on a cluster Hdinsight in Azure. I have created a cluster with hadoop and spark and i have already the jar file with the code, the problem that i don't know how to set-up the cluster and the right line of code to launch spark on Azure,I want to try different combination of nodes(workers , 2-4-8) to see how the program scale.
Every time i launch the app with spark-submit in mode yarn-client, it work but always with 2 executor and 1 core taking for 1gb input text file around 3 minute,also if i set more executor and more core he take the settings but he don't use that,so i think that the problem it's with the RDD, it don't split the input file in the right mode because it create only 2 task that start in 2 worknode and the other nodes remain inactive.
The jar file it's created with sbt package.
Command to launch Spark:
spark-submit --class "SimpleApp" --master yarn-client --num-executors 2 simpleapp_2.10-1.0.jar
WordCount Code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.io._
import org.apache.hadoop.fs
import org.apache.spark.rdd.RDD
object SimpleApp {
def main(args: Array[String]){
//settingsparkcontext
val conf = new SparkConf().setAppName("SimpleApp")
val sc = new SparkContext(conf)
//settingthewordtosearch
val word = "word"
//settingtime
val now = System.nanoTime
//settingtheinputfile
val input = sc.textFile("wasb://xxx#storage.blob.core.windows.net/dizionario1gb.txt")
//wordlookup
val splittedLines = input.map(line=>line.split(""))
val find = System.nanoTime
val tot = splittedLines.map(x => x.equals(word)).count()
val w=(System.nanoTime-find)/1000000
val rw=(System.nanoTime-now)/1000000
//reportingtheresultofexecutioninatxtfile
val writer = new FileWriter("D:\\Users\\user\\Desktop\\File\\output.txt",true)
try {
writer.write("Word found "+tot+" time total "+rw+" mstimesearch "+w+" time read "+(rw-w)+"\n")
}
finally writer.close()
//terminatingthesparkserver
sc.stop()
}}
Level of Parallelism
"Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc) You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster."
Source:
https://spark.apache.org/docs/1.3.1/tuning.html