I am struggling to make JUnit work in my Spark shell.
When trying to import Assert from JUnit, I get the following error message:
scala> import org.junit.Assert._
<console>:23: error: object junit is not a member of package org
import org.junit.Assert._
Is there any way to fix this? Any idea how I can download org.junit from the Scala shell?
EDIT:
After following the recommendation from zsxwing, I used spark-shell --packages junit:junit:4.12, with the following output:
C:\spark>spark-shell --packages junit:junit:4.12
Ivy Default Cache set to: C:\xxx\.ivy2\cache
The jars for the packages stored in: C:\xxxx\.ivy2\jars
:: loading settings :: url = jar:file:/C:/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
junit#junit added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found junit#junit;4.12 in central
found org.hamcrest#hamcrest-core;1.3 in central
:: resolution report :: resolve 365ms :: artifacts dl 7ms
:: modules in use:
junit#junit;4.12 from central in [default]
org.hamcrest#hamcrest-core;1.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/20ms)
Setting default log level to "WARN".
Spark context Web UI available at http://xxxxx
Spark context available as 'sc' (master = local[*], app id = local-xxx).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
However, I am still facing the same issue when trying to import org.junit.Assert._
JUnit is a test dependency and it's not included in the Spark shell's classpath. You can use the --packages parameter to add any dependency, for example:
bin/spark-shell --packages junit:junit:4.12
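If the jar is already available locally, it (together with its hamcrest-core dependency, visible in the resolution log above) can also be passed directly with the --jars option; the paths here are hypothetical:
spark-shell --jars C:\path\to\junit-4.12.jar,C:\path\to\hamcrest-core-1.3.jar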
I'm trying to run the PyPostal Python bindings for the C library libpostal to perform address standardisation on Spark (AWS Glue, to be precise). I have followed the guide at https://github.com/openvenues/pypostal/issues/58, which links to https://github.com/openvenues/libpostal/issues/153, but to no avail.
The steps I have taken:
Compiled libpostal into a single tar file, joint.tar.gz
Compiled the libpostal data files into a single tar file, libpostal_datadir.tar.gz
Created an egg of the PyPostal Python bindings: postal-1.1.9-py2.7-linux-x86_64.egg
Configured the PySpark shell with the following command:
gluepyspark3 \
--conf "spark.yarn.dist.archives=s3://{some_S3_location}/libpostal/joint.tar.gz,s3://{some_S3_location}/libpostal/libpostal_datadir.tar.gz" \
--conf "spark.executorEnv.LD_LIBRARY_PATH=./joint.tar.gz" \
--conf "spark.executorEnv.LIBPOSTAL_DATA_DIR=./libpostal_datadir.tar.gz" \
--conf "spark.driver.extraLibraryPath=./joint.tar.gz" \
--conf "spark.driver.LibraryPath=./joint.tar.gz" \
--conf "spark.yarn.appMasterEnv.LD_LIBRARY_PATH=./joint.tar.gz" \
--conf "spark.yarn.appMasterEnv.LIBPOSTAL_DATA_DIR=./libpostal_datadir.tar.gz" \
--py-files "s3://{some_s3_location}/libpostal/postal-1.1.9-py2.7-linux-x86_64.egg"
Inside the Spark shell, ran from postal.parser import *.
The following error is returned:
Python 3.6.12 (default, Aug 31 2020, 18:56:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/19 15:12:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/02/19 15:12:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/02/19 15:12:32 WARN Client: Same path resource s3://safari-dev-assets/glue/jobs/kw_libpostal/postal-1.1.9-py2.7-linux-x86_64.egg added multiple times to distributed cache.
21/02/19 15:12:54 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.3
/_/
Using Python version 3.6.12 (default, Aug 31 2020 18:56:18)
SparkSession available as 'spark'.
>>> from postal.parser import *
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/tmp/spark-57e88948-952a-4a7b-b104-f7832af5f9c8/userFiles-cf4d4719-3866-4d63-8ef6-fbe3b4dd08ad/postal-1.1.9-py2.7-linux-x86_64.egg/postal/parser.py", line 2, in <module>
File "/mnt/tmp/spark-57e88948-952a-4a7b-b104-f7832af5f9c8/userFiles-cf4d4719-3866-4d63-8ef6-fbe3b4dd08ad/postal-1.1.9-py2.7-linux-x86_64.egg/postal/_parser.py", line 7, in <module>
File "/mnt/tmp/spark-57e88948-952a-4a7b-b104-f7832af5f9c8/userFiles-cf4d4719-3866-4d63-8ef6-fbe3b4dd08ad/postal-1.1.9-py2.7-linux-x86_64.egg/postal/_parser.py", line 6, in __bootstrap__
File "/usr/lib64/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libpostal.so.1: cannot open shared object file: No such file or directory
I have created a Spark container using the following Dockerfile:
FROM ubuntu:16.04
RUN apt-get update -y && apt-get install -y \
default-jdk \
nano \
wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN useradd --create-home --shell /bin/bash ubuntu
ENV HOME /home/ubuntu
ENV SPARK_VERSION 2.4.3
ENV HADOOP_VERSION 2.6
ENV MONGO_SPARK_VERSION 2.2.0
ENV SCALA_VERSION 2.11
WORKDIR ${HOME}
ENV SPARK_HOME ${HOME}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
ENV PATH ${PATH}:${SPARK_HOME}/bin
COPY files/times.json /home/ubuntu/times.json
COPY files/README.md /home/ubuntu/README.md
COPY files/examples.scala /home/ubuntu/examples.scala
COPY files/initDocuments.scala /home/ubuntu/initDocuments.scala
RUN chown -R ubuntu:ubuntu /home/ubuntu/*
USER ubuntu
# get spark
RUN wget http://apache.mirror.digitalpacific.com.au/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && \
tar xvf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
RUN rm -fv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
I also have two files written in Scala, which is new to me. The problem is that the container only has Java installed and no other commands. Is there any way to run the Scala files without installing anything else in the container?
The file names are examples.scala and initDocuments.scala. Here is the initDocuments.scala file:
import com.mongodb.spark._
import com.mongodb.spark.config._
import org.bson.Document
val rdd = MongoSpark.load(sc)
if (rdd.count < 1) {
  val t = sc.textFile("times.json")
  val converted = t.map(tuple => Document.parse(tuple))
  converted.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://mongodb/spark.times")))
  println("Documents inserted.")
} else {
  println("Database 'spark' collection 'times' is not empty. Maybe you've already loaded data into the collection; skipping.")
}
System.exit(0)
I have also tried the following, but it doesn't work:
spark-shell --conf "spark.mongodb.input.uri=mongodb://mongodb:27017/spark.times" --conf "spark.mongodb.output.uri=mongodb://mongodb/spark.output" --packages org.mongodb.spark:mongo-spark-connector_${SCALA_VERSION}:${MONGO_SPARK_VERSION} -i ./initDocuments.scala
Error:
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/spark-2.4.3-bin-hadoop2.6/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d0f95242-e9b9-4d49-8dde-42afc7c55e9a;1.0
confs: [default]
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
:: resolution report :: resolve 40879ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar
Host dl.bintray.com not found. url=https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
Host dl.bintray.com not found. url=https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar
module not found: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0
==== local-m2-cache: tried
file:/home/ubuntu/.m2/repository/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
-- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
file:/home/ubuntu/.m2/repository/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar
==== local-ivy-cache: tried
/home/ubuntu/.ivy2/local/org.mongodb.spark/mongo-spark-connector_2.11/2.2.0/ivys/ivy.xml
-- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
/home/ubuntu/.ivy2/local/org.mongodb.spark/mongo-spark-connector_2.11/2.2.0/jars/mongo-spark-connector_2.11.jar
==== central: tried
https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
-- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar
==== spark-packages: tried
https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
-- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1306)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
PS: I have tried to change the proxy address using the following command, but I don't think I have a proxy that works for this. I would be grateful if anyone could help me configure a working proxy to solve this download problem.
export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username -Dhttp.proxyPassword=password"
Based on the error message that you got:
:: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0: not found
it indicates that the package could not be found. Checking the currently available MongoDB Connector for Spark packages confirms that version 2.2.0 is no longer available (it was replaced with the patched v2.2.6).
You can check an updated example of the MongoDB Spark connector with Docker at sindbach/mongodb-spark-docker.
Additional information:
spark-shell is a REPL (Read-Evaluate-Print Loop) tool: an interactive shell used by programmers to interact with a framework. You don't need to run an explicit build step. When you specify the --packages argument of spark-shell, it automatically fetches the package and includes it in the environment of your shell.
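For example, assuming network access to Maven Central (or a correctly configured proxy), the original command should resolve once it points at the patched connector version:
spark-shell --conf "spark.mongodb.input.uri=mongodb://mongodb:27017/spark.times" --conf "spark.mongodb.output.uri=mongodb://mongodb/spark.output" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.6 -i ./initDocuments.scala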
I've tried the following in Jupyter in order to read in the CSV file in table format:
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
Then I got the following error in the log (the full log is listed separately below):
:::: WARNINGS
module not found: com.databricks#spark-csv_2.10;1.5.0
"I've checked spark-csv_2.10-1.5.0.jar", and "commons-csv-1.1.jar" are already exist
if i ignored the warning, i got this error "NameError: name 'sc' is not defined" when running the following
sqlContext = SQLContext(sc)
I'm really stuck, so any suggestions are appreciated.
The goal is to read in the CSV file as below:
sqlContext = SQLContext(sc)
data = sqlContext.read.load('file:///path/file.csv', format='com.databricks.spark.csv', header='true',inferSchema='true')
Here is the Log:
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
/home/cloudera/.local/lib/python3.5/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
[I 10:32:29.300 NotebookApp] The port 8888 is already in use, trying another random port.
[I 10:32:29.311 NotebookApp] Serving notebooks from local directory: /home/cloudera/Downloads/coursera-master/big-data-4
[I 10:32:29.312 NotebookApp] 0 active kernels
[I 10:32:29.312 NotebookApp] The Jupyter Notebook is running at: http://localhost:8889/
[I 10:32:29.312 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
WARNING: content window passed to PrivateBrowsingUtils.isWindowPrivate. Use isContentWindowPrivate instead (but only for frame scripts).
pbu_isWindowPrivate#resource://gre/modules/PrivateBrowsingUtils.jsm:25:14
nsBrowserAccess.prototype.openURI#chrome://browser/content/browser.js:15192:21
NewNotebookWidget.prototype.new_notebook#http://localhost:8889/static/tree/js/main.min.js?v=cee9d5ded70fc8733bb888581c22f633:15194:17
.proxy/i#http://localhost:8889/static/tree/js/main.min.js?v=cee9d5ded70fc8733bb888581c22f633:4:5486
x.event.dispatch#http://localhost:8889/static/tree/js/main.min.js?v=cee9d5ded70fc8733bb888581c22f633:5:9954
x.event.add/y.handle#http://localhost:8889/static/tree/js/main.min.js?v=cee9d5ded70fc8733bb888581c22f633:5:6772
[I 10:32:35.674 NotebookApp] Creating new notebook in
[I 10:32:36.695 NotebookApp] Kernel started: 25ed0b47-e0f0-4191-b1bc-984679f2668c
Ivy Default Cache set to: /home/cloudera/.ivy2/cache
The jars for the packages stored in: /home/cloudera/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.16.0-hadoop2.6.0-cdh5.16.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
[W 10:32:47.059 NotebookApp] Timeout waiting for kernel_info reply from 25ed0b47-e0f0-4191-b1bc-984679f2668c
:: resolution report :: resolve 8250ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: com.databricks#spark-csv_2.10;1.5.0
==== local-m2-cache: tried
file:/home/cloudera/.m2/repository/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.pom
-- artifact com.databricks#spark-csv_2.10;1.5.0!spark-csv_2.10.jar:
file:/home/cloudera/.m2/repository/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.jar
==== local-ivy-cache: tried
/home/cloudera/.ivy2/local/com.databricks/spark-csv_2.10/1.5.0/ivys/ivy.xml
==== central: tried
https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.pom
-- artifact com.databricks#spark-csv_2.10;1.5.0!spark-csv_2.10.jar:
https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.jar
==== spark-packages: tried
http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.pom
-- artifact com.databricks#spark-csv_2.10;1.5.0!spark-csv_2.10.jar:
http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.databricks#spark-csv_2.10;1.5.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.pom (javax.net.ssl.SSLException: Received fatal alert: protocol_version)
Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.jar (javax.net.ssl.SSLException: Received fatal alert: protocol_version)
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-csv_2.10;1.5.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1067)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /usr/lib/spark/python/pyspark/shell.py:
I think you can read CSV files in PySpark another way:
spark.read.csv("yourPath", header=True)
and you do not need to import any other packages.
For Spark 2.x versions, this library has been inlined into Spark itself (https://github.com/databricks/spark-csv). If you are using a 2.x version, there is no need to import this library.
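A minimal sketch with the built-in reader, assuming a Spark 2.x PySpark environment (the file path reuses the example from the question):
from pyspark.sql import SparkSession
# Reuse the shell's session if one exists, otherwise create one.
spark = SparkSession.builder.appName("csv-example").getOrCreate()
# header and inferSchema mirror the databricks-format options used above.
data = spark.read.csv("file:///path/file.csv", header=True, inferSchema=True)
data.show(5)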
I'm upgrading some libraries within a Play! project. During the process I'm trying to resolve errors like this:
java.lang.ClassNotFoundException: akka.event.slf4j.Slf4jLoggingFilter
I am assuming this comes from incompatible transitive dependencies?
But I'm struggling to use sbt-dependency-graph effectively to help me track down the problem.
[info] +-ch.qos.logback:logback-classic:1.1.3
[info] | +-ch.qos.logback:logback-core:1.1.3
[info] | +-org.slf4j:slf4j-api:1.7.21
[info] | +-org.slf4j:slf4j-api:1.7.7 (evicted by: 1.7.21)
Why are there two versions of slf4j-api listed above?
I'm assuming the newer version (1.7.21) takes precedence over 1.7.7. But then why, in some instances, do I see as many as five different versions of the same dependency (all but one evicted):
| | | | +-org.slf4j:slf4j-ext:1.7.12
| | | | +-ch.qos.cal10n:cal10n-api:0.8.1
| | | | +-org.slf4j:slf4j-api:1.6.2 (evicted by: 1.7.21)
| | | | +-org.slf4j:slf4j-api:1.6.4 (evicted by: 1.7.21)
| | | | +-org.slf4j:slf4j-api:1.7.12 (evicted by: 1.7.21)
| | | | +-org.slf4j:slf4j-api:1.7.2 (evicted by: 1.7.21)
| | | | +-org.slf4j:slf4j-api:1.7.21
Once the conflict is found, do I need to upgrade all dependencies to use the same version?
Is there another approach I should be taking?
I discovered that "the SLF4J API is backward compatible for all versions", even though the java.lang.ClassNotFoundException refers to an slf4j-related class. So, digging a little deeper:
Akka kindly provides binary compatibility rules, which show that backwards compatibility is broken between major versions.
Looking at the dependency tree, we were seeing different major versions of the akka dependencies (2.3.x and 2.4.x): com.typesafe.akka:akka-actor_2.11:2.3.13 (evicted by: 2.4.11) and com.typesafe.akka:akka-slf4j_2.11:2.3.13.
Standardizing all akka dependencies on the same major version fixed it.
Our original dependencies only declared "com.typesafe.akka" %% "akka-actor" % "2.4.11", while Play was providing "akka-slf4j" at 2.3.13 transitively, which was breaking binary compatibility:
+-com.typesafe.akka:akka-slf4j_2.11:2.3.13 [S]
| +-com.typesafe.akka:akka-actor_2.11:2.3.13 (evicted by: 2.4.11)
Providing both akka-slf4j and akka-actor with the same major version solved this issue.
I am following the instructions laid out on Kubernetes' Spark example. I can get to the step with launching the PySpark shell. However, I need to use PySpark with JDBC to connect to my Postgres database. Before I tried Kubernetes, I got the JDBC working with Spark using the spark-defaults.conf file:
spark.driver.extraClassPath /spark/postgresql-9.4.1209.jre7.jar
spark.executor.extraClassPath /spark/postgresql-9.4.1209.jre7.jar
I also had to download the driver to that location first. How do I achieve the same thing with Kubernetes? I don't think I can do:
kubectl exec zeppelin-controller-xzlrf -it pyspark --jars /spark/postgresql-9.4.1209.jre7.jar
because the jar would have to be inside the container first. Maybe I can get it working if I can get the jar file into the container, but how do I do that? Any thoughts or help are greatly appreciated.
UPDATE: I tried following #LostInOverflow's solution but encountered the following:
kubectl exec zeppelin-controller-2p3ew -it -- pyspark --packages org.postgresql:postgresql:9.4.1209.jre7.jar
which appears to boot up and recognize the package argument, but still fails:
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 2294ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: org.postgresql#postgresql;9.4.1209.jre7.jar
==== local-m2-cache: tried
file:/root/.m2/repository/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.pom
-- artifact org.postgresql#postgresql;9.4.1209.jre7.jar!postgresql.jar:
file:/root/.m2/repository/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.jar
==== local-ivy-cache: tried
/root/.ivy2/local/org.postgresql/postgresql/9.4.1209.jre7.jar/ivys/ivy.xml
==== central: tried
https://repo1.maven.org/maven2/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.pom
-- artifact org.postgresql#postgresql;9.4.1209.jre7.jar!postgresql.jar:
https://repo1.maven.org/maven2/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.jar
==== spark-packages: tried
http://dl.bintray.com/spark-packages/maven/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.pom
-- artifact org.postgresql#postgresql;9.4.1209.jre7.jar!postgresql.jar:
http://dl.bintray.com/spark-packages/maven/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.postgresql#postgresql;9.4.1209.jre7.jar: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.postgresql#postgresql;9.4.1209.jre7.jar: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/opt/spark/python/pyspark/shell.py", line 43, in <module>
sc = SparkContext(pyFiles=add_files)
File "/opt/spark/python/pyspark/context.py", line 110, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/spark/python/pyspark/context.py", line 234, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
>>>
You can use --packages with Maven coordinates in place of --jars. Note that the coordinate is groupId:artifactId:version, without a .jar suffix; the trailing .jar in the attempt above is why Ivy reported the version as not found:
--packages org.postgresql:postgresql:9.4.1209.jre7
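Once the coordinate resolves, a minimal PySpark sketch for reading a table over JDBC might look like this; the host, database, table, and credentials are hypothetical placeholders, and inside the pyspark shell sc and sqlContext already exist:
from pyspark import SparkContext
from pyspark.sql import SQLContext
# Created here only so the snippet is self-contained; the shell provides sc and sqlContext.
sc = SparkContext(appName="postgres-jdbc-example")
sqlContext = SQLContext(sc)
# Hypothetical connection details; adjust host, database, table, and credentials.
df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://my-postgres-host:5432/mydb",
    dbtable="public.my_table",
    user="myuser",
    password="mypassword",
    driver="org.postgresql.Driver",
).load()
df.show(5)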