Run spark-shell from sbt - scala

The default way of getting the Spark shell seems to be to download the distribution from the website. Yet this Spark issue mentions that it can be installed via sbt. I could not find documentation on this. In an sbt project that uses spark-sql and spark-core, no spark-shell binary was found.
How do you run spark-shell from sbt?

From the following URL:
https://bzhangusc.wordpress.com/2015/11/20/use-sbt-console-as-spark-shell/
If you are already using sbt for your project, it's very simple to set up the sbt console to replace the spark-shell command.
Let's start with the basic case. Once the project is set up with sbt, you can simply run the console with sbt console
Within the console, you just need to create a SparkContext and a SQLContext to make it behave like the Spark shell
scala> val sc = new org.apache.spark.SparkContext(new org.apache.spark.SparkConf().setMaster("local[*]").setAppName("sbt-console"))
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
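To avoid typing these lines on every launch, sbt can run them when the console starts. The following is a minimal build.sbt sketch, assuming sbt 1.x slash syntax and that spark-core and spark-sql are already declared as library dependencies. The master "local[*]" and the app name "sbt-console" are placeholder choices, not values taken from the blog post.

```scala
// build.sbt (fragment)

// Run these statements as soon as `sbt console` starts, so that
// `sc` and `sqlContext` are predefined, as they are in spark-shell.
console / initialCommands :=
  """
    |val sc = new org.apache.spark.SparkContext(
    |  new org.apache.spark.SparkConf().setMaster("local[*]").setAppName("sbt-console"))
    |val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  """.stripMargin

// Stop the context when the console exits.
console / cleanupCommands := "sc.stop()"
```

On older sbt versions the same settings are usually written as initialCommands in console := ... and cleanupCommands in console := ...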

Related

Spark Shell not working after adding support for Iceberg

We are doing a POC on Iceberg and evaluating it for the first time.
Spark Environment:
Spark Standalone Cluster Setup ( 1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, to add support for Iceberg in Spark shell, we are adding Iceberg dependency while launching the Spark shell as below,
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use it at all. Every command (even one that does not touch Iceberg) fails with the same exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
Even the simple command below throws the same exception.
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In the Spark source code, the BinaryCommand class belongs to the Spark SQL module, so we tried explicitly adding the Spark SQL dependency while launching the Spark shell as below, but we still get the same exception.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally i.e. without Iceberg dependency, then it is working properly.
Any pointer in the right direction for troubleshooting would be really helpful.
Thanks.
We were using the wrong Iceberg artifact: we chose the Iceberg runtime jar built for Spark 3.2 while running Spark 3.1. After switching to the matching artifact (i.e. the one for 3.1), we are able to launch the Spark shell with Iceberg. There is also no need to specify org.apache.spark jars via --packages, since they are already on the classpath.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1
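The mismatch is easy to miss because the Spark version is embedded in the artifact name. As a rough sketch, the coordinate can be derived from the running Spark version instead of typed by hand. The object and method names below are hypothetical, and the naming pattern is assumed only from the two commands in this thread; artifacts for other Spark lines may be named differently.

```scala
object IcebergRuntime {
  /** Builds the `--packages` coordinate for the Iceberg Spark runtime.
    * Assumes the `iceberg-spark-runtime-<major.minor>_<scalaBinary>`
    * naming seen in the commands above. */
  def coordinate(sparkVersion: String, scalaBinary: String, icebergVersion: String): String = {
    // Keep only "major.minor", e.g. "3.1.2" -> "3.1".
    val sparkLine = sparkVersion.split('.').take(2).mkString(".")
    s"org.apache.iceberg:iceberg-spark-runtime-${sparkLine}_$scalaBinary:$icebergVersion"
  }
}
```

In a plain spark-shell session (started without --packages), IcebergRuntime.coordinate(spark.version, "2.12", "0.13.1") would give the coordinate to pass on the next launch.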

Run spark application on a different version of spark remotely

I have a few Spark tests, written in Scala, that run fine remotely through Maven on Spark 1.6.0. Now I want to run these tests on Spark 2. The problem is that Cloudera uses Spark 1.6 by default. Where does Cloudera take this version from, and what do I need to do to change the default Spark version? Spark 1.6 and Spark 2 are both present on the same cluster, both on top of YARN. The Hadoop config files are present on the cluster that I use to run the tests in the test environment, and this is how I get the Spark context:
def getSparkContext(hadoopConfiguration: Configuration): SparkContext ={
val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
val sc=new SparkContext(conf)
return sc
}
Is there any way to specify the version in the conf files or in Cloudera itself?
When submitting a new Spark job, there are two places where you have to change the Spark version:
1. Set SPARK_HOME to the (local) path that contains the correct Spark installation. (Sometimes, especially for minor release changes, the version in SPARK_HOME does not have to be 100% correct, although I would recommend keeping things clean.)
2. Tell your cluster where the Spark jars are located. By default, spark-submit uploads the jars in SPARK_HOME to your cluster (this is one of the reasons why you should not mix versions). You can skip this upload by pointing the cluster manager at jars that are already in HDFS. As you are using Cloudera, I assume that your cluster manager is YARN. In this case, set either spark.yarn.jars or spark.yarn.archive to the path where the jars for the correct Spark version are located. Example: --conf spark.yarn.jars=hdfs://server:port/<path to your jars with the desired Spark version>
In any case you should make sure that the Spark version that you are using at runtime is the same as at compile time. The version you specified in your Maven, Gradle or Sbt configuration should always match the version referenced by SPARK_HOME or spark.yarn.jars.
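One way to catch such a mismatch early is to compare the version the code was built against with the version reported at runtime. The helper below is only a sketch: its name is hypothetical, and treating patch releases on the same major.minor line as compatible is an assumption, not a rule stated above.

```scala
object SparkVersionCheck {
  // "2.3.0" -> "2.3"; patch releases are treated as compatible here.
  private def line(version: String): String =
    version.split('.').take(2).mkString(".")

  /** Returns an error message when the two versions are on different
    * major.minor lines, or None when they match. */
  def mismatch(compiledAgainst: String, runtime: String): Option[String] =
    if (line(compiledAgainst) == line(runtime)) None
    else Some(s"Compiled against Spark $compiledAgainst but running on Spark $runtime")
}
```

In a test setup this could be called as SparkVersionCheck.mismatch("2.3.0", sc.version).foreach(msg => sys.error(msg)), where "2.3.0" stands for whatever version the build file declares.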
I was able to run it successfully on Spark 2.3.0. The reason it did not work earlier was that I had declared the spark-core dependency in pom.xml with version 1.6. Because of that, no matter what jar location we specified, it picked up Spark 1.6 by default (I am still figuring out why). After changing the library version, I was able to run it.

spark-submit for a .scala file

I have been running some test spark scala code using probably a bad way of doing things with spark-shell:
spark-shell --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
This would execute my code on spark and pop into the shell when done.
Now that I am trying to run this on a cluster, I think I need to use spark-submit, which I thought would be:
spark-submit --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
but it does not like the .scala file. Does it somehow have to be compiled into a class? The Scala code is a simple file with several helper classes defined in it and no real main class, so to speak. I don't see it in the help output, but maybe I am missing it: can I just spark-submit a file, or do I have to give it a class somehow, and therefore change my Scala code?
I also changed the context setup in my Scala code, from this:
val conf = new SparkConf().setMaster("local").setAppName("neo4jspark")
val sc = new SparkContext(conf)
to this:
val sc = new SparkContext(new SparkConf().setMaster("spark://192.20.0.71:7077").setAppName("neo4jspark"))
There are two quick and dirty ways of doing this:
1. Without modifying the Scala file
Simply use the Spark shell with the -i flag:
$SPARK_HOME/bin/spark-shell -i neo4jsparkCluster.scala
2. Modifying the Scala file to include a main method
a. Compile:
scalac -classpath <location of spark jars on your machine> neo4jsparkCluster.scala
b. Submit it to your cluster:
/usr/lib/spark/bin/spark-submit --class <qualified class name> --master <> .
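For the second option, the statements that currently sit at the top level of the script have to move into a main method of an object, which then serves as the class name passed to --class. A minimal sketch, assuming a hypothetical object name Neo4jSparkCluster and a placeholder job standing in for the real Neo4j code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical entry point; pass it to spark-submit as `--class Neo4jSparkCluster`.
object Neo4jSparkCluster {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master is supplied by `spark-submit --master ...`.
    val conf = new SparkConf().setAppName("neo4jspark")
    val sc = new SparkContext(conf)
    try {
      // The statements from the original script body go here.
      val n = sc.parallelize(1 to 100).count() // placeholder job
      println(s"counted $n elements")
    } finally {
      // Release cluster resources even if the job fails.
      sc.stop()
    }
  }
}
```

Leaving the master out of the code keeps the same jar usable both locally and on the cluster, since a master set in SparkConf takes precedence over the spark-submit flag.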
You will want to package your scala application with sbt and include Spark as a dependency within your build.sbt file.
See the self contained applications section of the quickstart guide for full instructions https://spark.apache.org/docs/latest/quick-start.html
You can take a look at the following Hello World example for Spark, which packages your application as @zachdb86 already mentioned.
spark-hello-world

SparkContext cannot be initialized in 'yarn-client' mode called from Scala-IDE

I have installed the Cloudera VM (single node), and inside this VM I have Spark running on top of YARN. I would like to use the Eclipse IDE (with the Scala plugin) for testing and learning Spark.
If I instantiate SparkContext as follows, everything works as expected:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster("local[2]")
However, if I change the master to 'yarn-client' in order to connect to the local server, it does not work:
val master = "yarn-client"
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster(master)
Specifically, I am getting the following errors:
Error details displayed in the Eclipse console:
Error details from the NodeManager logs:
Here are the things i have tried so far:
1. Dependencies
I added all the dependencies through Maven repository
Cloudera version is 5.5 and corresponding Hadoop version is 2.6.0 and Spark version is 1.5.0.
2. Configurations
I added 3 path variables into Eclipse classpath:
SPARK_CONF_DIR=/etc/spark/conf/
HADOOP_CONF_DIR=/usr/lib/hadoop/
YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
Can anybody explain what the problem is here and how to solve it?
I worked around it! I still don't understand what the exact problem is, but I created a folder with my username in HDFS, i.e. a /user/myusername directory, and it worked. Anyway, I have now switched to the Hortonworks distribution and found it much smoother to get started with than the Cloudera distribution.
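A likely explanation, offered here as an assumption rather than something confirmed in this thread, is that Spark on YARN stages its files under the submitting user's HDFS home directory by default, so a missing /user/<name> directory can make startup fail. The directory can be created with the Hadoop FileSystem API. The object name below is hypothetical, and the program has to run as an HDFS user that is allowed to write under /user:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper; run it as an HDFS user that may write under /user.
object EnsureHdfsHome {
  def main(args: Array[String]): Unit = {
    val userName = args(0) // e.g. "myusername"
    // Reads core-site.xml / hdfs-site.xml from the classpath.
    val fs = FileSystem.get(new Configuration())
    val home = new Path("/user", userName)
    if (!fs.exists(home)) {
      fs.mkdirs(home)
      // Hand the directory to the user; a null group leaves the group unchanged.
      fs.setOwner(home, userName, null)
    }
    println(s"$home exists: ${fs.exists(home)}")
  }
}
```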

error: not found: value sc

I am new to Scala and am trying to read a file using the following code
scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
But I keep getting the following error
error: not found: value sc
I have tried everything, but nothing seems to work. I am using Scala version 2.10.4 and Spark 1.1.0 (I have even tried Spark 1.2.0, but it doesn't work either). I have sbt installed and compiled, yet I am not able to run sbt/sbt assembly. Is the error because of this?
You should run this code using ./spark-shell. It is a Scala REPL that provides a ready-made SparkContext as sc; in a plain Scala REPL, sc is not defined. You can find spark-shell in the bin folder of your Apache Spark distribution, e.g. spark-1.4.1/bin.