How to add jar files in PySpark installed via Anaconda?

from pyspark.sql import Row
from pyspark import SparkConf, SparkContext
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g")
sc=SparkContext.getOrCreate(conf)
dfv = sc.textFile("./part-001*.gz")
I have installed pyspark through Anaconda and I can import pyspark in Anaconda's Python, but I don't know how to add jar files to the conf.
I tried
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.packages','file:///XXX,jar')
but it doesn't work.
What is the proper way to add a jar file here?

The docs say:
spark.jars.packages: Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will be resolved according to the configuration in the file, otherwise artifacts will be searched for in the local maven repo, then maven central and finally any additional remote repositories given by the command-line option --repositories. For more details, see Advanced Dependency Management.
Instead, you should simply use spark.jars:
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
So:
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.files','file:///XXX.jar')

Related

Unable to import cosmosDB packages in spark-shell

I am trying to upload some data from a dataframe to Azure Cosmos DB.
I have downloaded the jar files below and added them to my local folder along with eventHub_Jars.
azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
azure-cosmosdb-2.0.0.jar
azure-documentdb-1.16.4.jar
documentdb-bulkexecutor-2.4.1.jar
Below is the command I used to open the shell, which works:
spark-shell --master local --jars eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
When I open the shell with the eventHub jars or other jars, as in
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar, azure-eventhubs-spark_2.11-2.3.2.jar, azure-eventhubs-1.0.2.jar, proton-j-0.25.0.jar, scala-java8-compat_2.11-0.9.0.jar, slf4j-api-1.7.25.jar, azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
the shell still opens fine.
But when I try to import
import com.microsoft.azure.cosmosdb.spark.config.Config
it throws the error below:
error: object cosmosdb is not a member of package com.microsoft.azure
import com.microsoft.azure.cosmosdb.spark.config.Config
What could be the reason for the above error? Is there a syntax issue? It seems like only the first jar added is working; if I try to import a package from any of the other jars, it throws the above error.
When I tried this, I had an issue with the --jars option retrieving the jar files via a relative path, unless I added "file:///" to the start of the path where I had stored the jar files.
For example, if a jar file was located in /usr/local/spark/jars_added/ (a folder I created), the required path for the --jars option is file:///usr/local/spark/jars_added/*.jar, where "*" represents your jar name.
The paths below won't be the same on your machine, but they show the idea for specifying the jar files.
spark-shell \
  --master local \
  --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 \
  --jars file:///usr/local/spark/jars_added/eventHub_Jars/scala-library-2.11.12.jar,\
file:///usr/local/spark/jars_added/azure-eventhubs-spark_2.11-2.3.2.jar,\
file:///usr/local/spark/jars_added/azure-eventhubs-1.0.2.jar,\
file:///usr/local/spark/jars_added/proton-j-0.25.0.jar,\
file:///usr/local/spark/jars_added/scala-java8-compat_2.11-0.9.0.jar,\
file:///usr/local/spark/jars_added/slf4j-api-1.7.25.jar,\
file:///usr/local/spark/jars_added/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
Alternatively, you can copy the jar files to the default location from which jars are loaded for each Spark session (note: if you have a jars folder in $SPARK_HOME, it overrides the default location; in case readers are unsure, $SPARK_HOME is most likely /usr/local/spark). On my machine, for example, jars are loaded from /usr/local/spark/assembly/target/scala-2.11/jars by default.
It works when I specify the full path for each jar after --jars, with no spaces after the commas, so that the whole list is passed as a single argument:
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar,eventHub_Jars/azure-eventhubs-spark_2.11-2.3.2.jar,eventHub_Jars/azure-eventhubs-1.0.2.jar,eventHub_Jars/proton-j-0.25.0.jar,eventHub_Jars/scala-java8-compat_2.11-0.9.0.jar,eventHub_Jars/slf4j-api-1.7.25.jar,eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar

How to integrate the Jupyter notebook Scala kernel with Apache Spark?

I have installed Scala kernel based on this doc: https://github.com/jupyter-scala/jupyter-scala
Kernel is there:
$ jupyter kernelspec list
Available kernels:
python3 /usr/local/homebrew/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/resources
scala /Users/bobyfarell/Library/Jupyter/kernels/scala
When I try to use Spark in the notebook I get this:
val sparkHome = "/opt/spark-2.3.0-bin-hadoop2.7"
val scalaVersion = scala.util.Properties.versionNumberString
import org.apache.spark.ml.Pipeline
Compilation Failed
Main.scala:57: object apache is not a member of package org
; import org.apache.spark.ml.Pipeline
^
I tried:
Setting SPARK_HOME and CLASSPATH to the location of $SPARK_HOME/jars
Setting -cp option pointing to $SPARK_HOME/jars in kernel.json
Setting classpath.add call before imports
None of these helped. Please note I don't want to use Toree, I want to use standalone spark and Scala kernel with Jupyter. A similar issue is reported here too: https://github.com/jupyter-scala/jupyter-scala/issues/63
It doesn't look like you are following the jupyter-scala directions for using Spark. You have to load Spark into the kernel using its special imports.
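For example, a minimal sketch assuming the Ammonite-based almond kernel (the successor to jupyter-scala); the Spark and almond-spark versions below are illustrative, so match them to your installation:
// Resolve Spark and the notebook integration from Maven via Ammonite's $ivy imports
import $ivy.`org.apache.spark::spark-sql:2.3.0`
import $ivy.`sh.almond::almond-spark:0.6.0`
import org.apache.spark.sql._
// NotebookSparkSession hooks Spark's progress reporting into the notebook UI
val spark = NotebookSparkSession.builder()
  .master("local[*]")
  .getOrCreate()
import org.apache.spark.ml.Pipeline // the import from the question should now resolve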

Eclipse (set up with a Scala environment): object apache is not a member of package org

As shown in the image, it gives an error when I import the Spark packages. Please help. When I hover there, it shows "object apache is not a member of package org".
I searched on this error; it suggests the Spark jars have not been imported, so I imported "spark-assembly-1.4.1-hadoop2.2.0.jar" too, but still the same error. Below is what I actually want to run:
import org.apache.spark.{SparkConf, SparkContext}

object ABC {
  def main(args: Array[String]) {
    // Scala main method
    println("Spark Configuration")
    val conf = new SparkConf()
    conf.setAppName("My First Spark Scala Application")
    conf.setMaster("spark://ip-10-237-224-94:7077")
    println("Creating Spark Context")
  }
}
Adding the spark-core jar to your classpath should resolve your issue. Also, if you are using a build tool like Maven or Gradle (and if not, you should, because spark-core has many dependencies and you will keep hitting this kind of problem for different jars), use the Eclipse task provided by these tools to set the classpath in your project properly.
I was also receiving the same error; in my case it was a compatibility issue: Spark 2.2.1 is not compatible with Scala 2.12 (it is compatible with 2.11.8), and my IDE was using Scala 2.12.3.
I resolved my error by:
1) Importing the jar files from Spark's base folder. The Spark installation directory (on my C drive) contains a jars folder with all the basic jar files.
Go to Eclipse, right-click on the project -> Properties -> Java Build Path. Under the 'Libraries' tab, choose 'Add External JARs...', import all the jar files from the 'jars' folder, and click Apply.
2) Again go to Properties -> Scala Compiler -> Scala Installation -> Latest 2.11 bundle (dynamic)*
*Before selecting this option, check the compatibility of your Spark and Scala versions.
The problem is that Scala is NOT backward compatible, so each Spark module is compiled against a specific Scala library. But when we run from Eclipse, one Scala version was used to compile the Spark dependency jar we add to the build path, while a second Scala version is present as the Eclipse runtime environment, and the two may conflict.
This is a hard reality, although we wish Scala were backward compatible, or at least that a compiled jar file could be backward compatible.
Hence the recommendation: use Maven or a similar tool where dependency versions can be managed, for example as sketched below.
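A minimal build.sbt sketch, assuming sbt (the project name is hypothetical and the versions are illustrative; the Scala version must match the line Spark was built against, e.g. 2.11.x for Spark 2.2.x):
// build.sbt -- sbt pulls in spark-core with all of its transitive dependencies
// and keeps the Scala version consistent with the _2.11 Spark artifacts
name := "my-first-spark-app"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.1",
  "org.apache.spark" %% "spark-sql"  % "2.2.1"
)
Importing the project into Eclipse as an sbt project (or generating the Eclipse files with the sbteclipse plugin) then sets the classpath from these declarations instead of hand-added jars.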
If you are doing this in the context of Scala within a Jupyter Notebook, you'll get this error. You have to install the Apache Toree kernel:
https://github.com/apache/incubator-toree
and create your notebooks with that kernel.
You also have to start the Jupyter Notebook with:
pyspark

How to import own scala package using spark-shell?

I have written a class for the spark-ml library that uses other classes from it.
To be clear, my class is a wrapper for RandomForestClassifier.
Now I want to be able to import this class from spark-shell.
So the question is: how do I build a package containing my own class so that it can be imported from spark-shell? Many thanks!
If you want to load uncompiled files like Hello.scala, do the following in the spark shell:
scala> :load ./src/main/scala/Hello.scala
Read the docs:
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. SonaType) can be passed to the --repositories argument.
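To make the class importable as compiled code, one option is to package it into a jar and pass that jar to spark-shell with --jars. A minimal sketch of such a wrapper, assuming sbt (the package and class names are hypothetical):
package com.example.ml // hypothetical package name

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.sql.DataFrame

// Thin wrapper around spark.ml's RandomForestClassifier, as described in the question
class RandomForestWrapper(numTrees: Int = 20) {
  private val rf = new RandomForestClassifier().setNumTrees(numTrees)

  def fit(df: DataFrame,
          labelCol: String = "label",
          featuresCol: String = "features"): RandomForestClassificationModel =
    rf.setLabelCol(labelCol).setFeaturesCol(featuresCol).fit(df)
}
Build it with sbt package, start the shell with spark-shell --jars target/scala-2.11/<your-project>_2.11-0.1.jar (the exact path depends on your project name, version, and Scala version), and then import com.example.ml.RandomForestWrapper works inside spark-shell.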

How to use Phantom in Scala IDE

I want to use phantom with my Scala IDE. For this I cloned the GitHub repository and created a .jar file of phantom using sbt -> compile -> package. I added this .jar file to the build path in my Scala IDE, but still, while importing
import com.websudos.phantom.connectors._
it throws the error
object connector is not a member of com.websudos.phantom.
When using the auto-complete function of the Scala IDE, it shows only the import for
import com.websudos.phantom.example
I don't know why, if the jar was created for example, it was not created for the other packages.
I searched on the internet, but every other option given is to add the dependency to the sbt build path, and I don't want to use that.
Use sbt-assembly instead to create a fat jar.
https://github.com/sbt/sbt-assembly
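A minimal sketch of enabling it, assuming sbt (the plugin version is illustrative):
// project/plugins.sbt -- enable the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
Running sbt assembly then produces a single jar under target/scala-<version>/ that bundles phantom together with its transitive dependencies, which you can add to the Scala IDE build path in place of the thin jar produced by sbt package.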