Jar file for SQLite JDBC connectivity through PySpark and PyCharm

I am running the code below in PyCharm. It works properly if I provide --jars through the command prompt:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pySparksqLite_test") \
    .config('spark.jars.packages', "C:/jars/DataVisualization/sqlite-jdbc-3.20.0.jar") \
    .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "5")
df_flight_info = spark.read.format("jdbc") \
    .options(url="jdbc:sqlite:C:/sqlite-tools-win32-x86-3290000/my-sqlite.db",
             driver="org.sqlite.JDBC",
             dbtable="(select DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count from flight_info)") \
    .load()
but with PyCharm I am getting the error below:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: C:/Users/jars/sqlite-jdbc-3.20.0.jar
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:1000)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:998)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.SparkSubmitUtils$.extractMavenCoordinates(SparkSubmit.scala:998)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1220)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:49)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:350)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "C:/...../proj1/pySparksqLite.py", line 4, in <module>
config('spark.jars.packages', "C:/Users/jars/sqlite-jdbc-3.20.0.jar").getOrCreate()
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 331, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 280, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\java_gateway.py", line 95, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
Process finished with exit code 1
I have also tried providing the jar file path through an environment variable, setting it via os:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars C:/Users/jars/sqlite-jdbc-3.27.2.jar'
but even this does not work.

The equivalent of the --jars submit parameter is spark.jars, which lets you specify local jars to be transferred to the cluster. You used spark.jars.packages, which downloads packages from Maven by specifying their Maven coordinates; the submit parameter equivalent of that is --packages.
Have a look at the documentation for more information: configuration and submitting parameters
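As a minimal sketch of that fix, reusing the jar and database paths from the question, the builder would pass the local jar to spark.jars instead:

from pyspark.sql import SparkSession

# spark.jars (the equivalent of --jars) takes local jar paths;
# spark.jars.packages expects Maven coordinates, which is why the original config fails.
spark = (
    SparkSession.builder
    .appName("pySparksqLite_test")
    .config("spark.jars", "C:/jars/DataVisualization/sqlite-jdbc-3.20.0.jar")
    .getOrCreate()
)

df_flight_info = (
    spark.read.format("jdbc")
    .options(url="jdbc:sqlite:C:/sqlite-tools-win32-x86-3290000/my-sqlite.db",
             driver="org.sqlite.JDBC",
             dbtable="(select DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count from flight_info)")
    .load()
)

If you preferred to pull the driver from Maven instead, spark.jars.packages (or --packages) would take coordinates of the form org.xerial:sqlite-jdbc:3.20.0. Note that some setups also require the JDBC driver on the driver classpath via spark.driver.extraClassPath.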

Related

PySpark not starting - Windows 10

I am trying to set up Spark for Python on a Windows 10 Pro machine. I have followed these steps:
Installed Anaconda with Python 3.7
Installed JDK 8
Installed pre-built Spark 2.4.6 with hadoop 2.7
Downloaded winutils.exe
Set up all environment variables, including the user PATH (typical values are sketched after this list)
Created a C:\tmp\hive folder
Used the winutils.exe chmod -R 777 C:\tmp\hive command successfully
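For reference, a rough sketch of what "all environment variables" typically covers in this kind of Windows setup; the install paths below are assumptions for illustration, not taken from the question, and the snippet only checks the environment from Python before launching pyspark:

import os

# Assumed install locations; adjust to wherever Spark, Hadoop/winutils and the JDK actually live.
expected = {
    "SPARK_HOME": r"C:\Spark",
    "HADOOP_HOME": r"C:\hadoop",   # the folder that contains bin\winutils.exe
    "JAVA_HOME": r"C:\Program Files\Java\jdk1.8.0_251",
}

for name, example in expected.items():
    value = os.environ.get(name)
    print(f"{name} = {value!r} (typical value: {example})")

# %SPARK_HOME%\bin and %HADOOP_HOME%\bin should also be on PATH,
# and winutils.exe must exist under %HADOOP_HOME%\bin.
winutils = os.path.join(os.environ.get("HADOOP_HOME", ""), "bin", "winutils.exe")
print("winutils.exe found:", os.path.isfile(winutils))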
When I try to launch pyspark via the command prompt, the following text is output and nothing happens thereafter, with no errors:
(base) C:\Spark\bin>pyspark
Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
20/08/03 07:49:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Finally, over an hour later, this error is printed:
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Spark\python\pyspark\shell.py", line 41, in <module>
spark = SparkSession._create_shell_session()
File "C:\Spark\python\pyspark\sql\session.py", line 573, in _create_shell_session
return SparkSession.builder\
File "C:\Spark\python\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Spark\python\pyspark\context.py", line 367, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Spark\python\pyspark\context.py", line 136, in __init__
conf, jsc, profiler_cls)
File "C:\Spark\python\pyspark\context.py", line 198, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\Spark\python\pyspark\context.py", line 306, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1523, in __call__
File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 985, in send_command
File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1152, in send_command
File "C:\Program Files\Python37\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)

Pyspark cluster mode exception - Java gateway process exited before sending the driver its port number

In Apache Airflow, I wrote a PythonOperator which uses PySpark to run a job in YARN cluster mode. I initialize the SparkSession object as follows:
spark = SparkSession \
    .builder \
    .appName("test python operator") \
    .master("yarn") \
    .config("spark.submit.deployMode", "cluster") \
    .getOrCreate()
However, when I run my DAG, I get an exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/catfish/dags/dags_dag_test_python_operator.py", line 39, in print_count
spark = SparkSession \
File "/usr/local/lib/python3.8/dist-packages/pyspark/sql/session.py", line 186, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 371, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 128, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 320, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/java_gateway.py", line 105, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I also set PYSPARK_SUBMIT_ARGS, but it doesn't work for me!
You need to install Spark in your Ubuntu container:
RUN apt-get -y install default-jdk scala git curl wget
RUN wget --no-verbose https://downloads.apache.org/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
RUN tar xvf spark-2.4.6-bin-hadoop2.7.tgz
RUN mv spark-2.4.6-bin-hadoop2.7 /opt/spark
ENV SPARK_HOME=/opt/spark
And unfortunately, you cannot run Spark on YARN with a PythonOperator. I suggest you use SparkSubmitOperator or BashOperator instead.
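A minimal sketch of the SparkSubmitOperator route (the DAG id, script path, and connection id are hypothetical; the import path assumes an Airflow 1.10-style install like the one in the traceback):

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG("test_spark_submit", schedule_interval=None, start_date=days_ago(1))

# spark-submit handles the YARN cluster-mode deployment itself, so no
# SparkSession is created inside the Airflow worker process.
submit_job = SparkSubmitOperator(
    task_id="submit_pyspark_job",
    application="/catfish/jobs/print_count.py",   # hypothetical PySpark script
    conn_id="spark_default",                      # Spark connection configured in Airflow
    conf={"spark.submit.deployMode": "cluster"},
    dag=dag,
)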

pyarrow through spark-submit in cluster mode fails

I have a simple piece of PySpark code:
import pyarrow
fs = pyarrow.hdfs.connect()
If I run this using spark-submit in "client" mode, it works fine, but in "cluster" mode it throws the error:
Traceback (most recent call last):
File "t3.py", line 17, in <module>
fs = pa.hdfs.connect()
File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 181, in connect
kerb_ticket=kerb_ticket, driver=driver)
File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 37, in __init__
self._connect(host, port, user, kerb_ticket, driver)
File "io-hdfs.pxi", line 99, in pyarrow.lib.HadoopFileSystem._connect
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
All the necessary Python libraries are installed on every node in my Hadoop cluster. I have verified this by testing the code under pyspark on every node individually.
But I cannot make it work through spark-submit in cluster mode.
Any ideas?

Issue running psycopg2 inside AWS Lambda Function

I'm getting the following error when trying to run psycopg2 in an AWS Lambda:
/var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size: ImportError
Traceback (most recent call last):
File "/var/task/functions/refresh_mv.py", line 64, in execute
session = SessionFactoryGraphQL.get_session(app=item['app'])
File "/var/task/lib/session_factory.py", line 22, in get_session
engine = create_engine(conn_string, poolclass=NullPool)
File "/var/task/functions/../vendored/sqlalchemy/engine/__init__.py", line 387, in create_engine
return strategy.create(*args, **kwargs)
File "/var/task/functions/../vendored/sqlalchemy/engine/strategies.py", line 80, in create
dbapi = dialect_cls.dbapi(**dbapi_args)
File "/var/task/functions/../vendored/sqlalchemy/dialects/postgresql/psycopg2.py", line 554, in dbapi
import psycopg2
File "/var/task/functions/../vendored/psycopg2/__init__.py", line 50, in <module>
from psycopg2._psycopg import ( # noqa
ImportError: /var/task/functions/../vendored/psycopg2/_psycopg.so: ELF file's phentsize not the expected size
The weird thing is: everything was working fine until yesterday (for more than 5 months), and then it suddenly stopped working. None of the libraries has been updated.
I tried to build from scratch, as in https://github.com/jkehler/awslambda-psycopg2, but I still have the same error.
Can someone help me with it?
The problem is in the latest version of the Serverless Framework. I assume that you are using Serverless to deploy your Lambda function.
serverless remove
npm install -g serverless@1.20.2
This should work.

Error invoking plotly's init_notebook_mode with Jupyter (Apache Toree PySpark)

I'm running Jupyter (v4.2.1) with Apache Toree - PySpark. When I try to invoke plotly's init_notebook_mode function, I run into the following error:
import numpy as np
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
Error:
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Traceback (most recent call last):
File "/tmp/kernel-PySpark-6415c581-01c4-4c90-b4d9-81773c2bc03f/pyspark_runner.py", line 134, in <module>
eval(compiled_code)
File "<string>", line 7, in <module>
File "/usr/local/lib/python3.4/dist-packages/plotly/offline/offline.py", line 151, in init_notebook_mode
display(HTML(script_inject))
File "/usr/local/lib/python3.4/dist-packages/IPython/core/display.py", line 158, in display
format = InteractiveShell.instance().display_formatter.format
File "/usr/local/lib/python3.4/dist-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 499, in __init__
self.init_io()
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 658, in init_io
io.stdout = io.IOStream(sys.stdout)
File "/usr/local/lib/python3.4/dist-packages/IPython/utils/io.py", line 34, in __init__
raise ValueError("fallback required, but not specified")
ValueError: fallback required, but not specified
StackTrace: org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
scala.Option.foreach(Option.scala:236)
org.apache.toree.interpreter.broker.BrokerState.markFailure(BrokerState.scala:139)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
py4j.Gateway.invoke(Gateway.java:259)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:209)
java.lang.Thread.run(Thread.java:745)
I'm unable to find any info about this on the web. When I dug into the code where this is failing (io.py in IPython utils), I see that the stream being passed must have both attributes, write as well as flush. But for some reason the stream passed in this case, sys.stdout, has only the "write" attribute and not the "flush" attribute.
I believe this happens because plotly's notebook mode assumes that it is running inside an IPython Jupyter kernel that does the notebook communication; you can see in the stack trace that it is trying to call into IPython packages.
Toree, however, is a different Jupyter kernel and has its own protocol handling for communicating with the notebook server. Even when you use Toree to run a PySpark interpreter, you get a "plain" PySpark (just like when you start it from a shell), and Toree drives the input/output of that interpreter.
So the IPython machinery is not set up, and calling init_notebook_mode() in that environment will fail, just as it would if you ran it in a PySpark session started directly from the shell, which knows nothing about notebooks.
To my knowledge, there is currently no way to get plotting output from a PySpark session run via Toree; we recently faced the same problem. Instead of running Python via Toree, you need to run an IPython kernel, import the PySpark libraries there, and connect to your Spark cluster. See https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook for a dockerized example of doing that.
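As a rough sketch of that approach from a plain IPython/Jupyter kernel (the master URL is a placeholder, and findspark is only one way to make pyspark importable if it is not already on the kernel's path):

import findspark
findspark.init()  # locate the Spark installation and add pyspark to sys.path

from pyspark.sql import SparkSession
from plotly.offline import init_notebook_mode

# The session is created inside the IPython kernel itself, so IPython's
# display machinery (which plotly's notebook mode relies on) is available.
spark = (
    SparkSession.builder
    .appName("plotly_in_ipython")
    .master("spark://my-spark-master:7077")  # placeholder master URL
    .getOrCreate()
)

init_notebook_mode()  # works here because this is a real IPython kernel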