Getting IllegalArgumentException - pyspark

I am getting an IllegalArgumentException while submitting a Spark job:
C:\spark\spark-2.2.1-bin-hadoop2.7\hadoop\bin>pyspark
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/C:/spark/spark-2.2.1-bin-hadoop2.7/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
17/12/21 15:48:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\pyspark\shell.py", line 45, in <module>
spark = SparkSession.builder\
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\session.py", line 183, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
Please advise how to resolve this error.

Check your permissions on /tmp/hive/: use sudo chmod -R 777 /tmp/hive/ (on Windows, the equivalent is winutils.exe chmod -R 777 C:\tmp\hive). Also, check whether you are using sqlContext to pass queries; if so, change it to a SparkSession.
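For reference, a minimal sketch of creating the session through the SparkSession builder instead of the legacy sqlContext (the warehouse path below is an assumption to adapt to your setup):

from pyspark.sql import SparkSession

# Build (or reuse) a session; queries then go through the session,
# not a SQLContext. The warehouse directory is an example path.
spark = SparkSession.builder \
    .appName("example") \
    .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse") \
    .getOrCreate()

spark.sql("SELECT 1 AS test").show()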

Related

Can not find suitable configuration of distributed configuration store when initializing Patroni as sudo

I followed the example described here https://www.opsdash.com/blog/postgres-getting-started-patroni.html and everything works as expected.
While testing, I tried sudo patroni pg-2.yml. The output was:
2022-01-13 11:39:22,316 INFO: Failed to import patroni.dcs.consul
2022-01-13 11:39:22,316 INFO: Failed to import patroni.dcs.etcd
2022-01-13 11:39:22,317 INFO: Failed to import patroni.dcs.etcd3
2022-01-13 11:39:22,318 INFO: Failed to import patroni.dcs.exhibitor
2022-01-13 11:39:22,319 INFO: Failed to import patroni.dcs.raft
2022-01-13 11:39:22,319 INFO: Failed to import patroni.dcs.zookeeper
Traceback (most recent call last):
File "/usr/local/bin/patroni", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/patroni/__init__.py", line 182, in main
return patroni_main()
File "/usr/local/lib/python3.8/dist-packages/patroni/__init__.py", line 140, in patroni_main
abstract_main(Patroni, schema)
File "/usr/local/lib/python3.8/dist-packages/patroni/daemon.py", line 98, in abstract_main
controller = cls(config)
File "/usr/local/lib/python3.8/dist-packages/patroni/__init__.py", line 30, in __init__
self.dcs = get_dcs(self.config)
File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/__init__.py", line 110, in get_dcs
raise PatroniFatalException("""Can not find suitable configuration of distributed configuration store
patroni.exceptions.PatroniFatalException: 'Can not find suitable configuration of distributed configuration store\nAvailable implementations: kubernetes'
Is there an explanation for that?

PySpark not starting - Windows 10

I am trying to set up Spark for Python on a Windows 10 Pro machine.
However, after following these steps:
Installed Anaconda with Python 3.7
Installed JDK 8
Installed pre-built Spark 2.4.6 with Hadoop 2.7
Downloaded winutils.exe
Setup all environment variables - also the user path setup
Created a C:\tmp\hive folder
Used the winutils.exe chmod -R 777 C:\tmp\hive command successfully
When I try to launch pyspark via the command prompt, the following text is output and nothing happens thereafter, with no errors:
(base) C:\Spark\bin>pyspark
Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
20/08/03 07:49:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Finally, over an hour later, this error is printed:
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Spark\python\pyspark\shell.py", line 41, in <module>
spark = SparkSession._create_shell_session()
File "C:\Spark\python\pyspark\sql\session.py", line 573, in _create_shell_session
return SparkSession.builder\
File "C:\Spark\python\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Spark\python\pyspark\context.py", line 367, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Spark\python\pyspark\context.py", line 136, in __init__
conf, jsc, profiler_cls)
File "C:\Spark\python\pyspark\context.py", line 198, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\Spark\python\pyspark\context.py", line 306, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1523, in __call__
File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 985, in send_command
File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1152, in send_command
File "C:\Program Files\Python37\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)

jar file for SQLite JDBC connectivity through pyspark and pycharm

I am running the code below in PyCharm. It works properly if I provide --jars through the command prompt:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pySparksqLite_test") \
    .config('spark.jars.packages', "C:/jars/DataVisualization/sqlite-jdbc-3.20.0.jar") \
    .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "5")
df_flight_info = spark.read.format("jdbc") \
    .options(url="jdbc:sqlite:C:/sqlite-tools-win32-x86-3290000/my-sqlite.db",
             driver="org.sqlite.JDBC",
             dbtable="(select DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count from flight_info)") \
    .load()
but with PyCharm I am getting the error below:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: C:/Users/jars/sqlite-jdbc-3.20.0.jar
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:1000)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:998)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.SparkSubmitUtils$.extractMavenCoordinates(SparkSubmit.scala:998)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1220)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:49)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:350)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "C:/...../proj1/pySparksqLite.py", line 4, in <module>
config('spark.jars.packages', "C:/Users/jars/sqlite-jdbc-3.20.0.jar").getOrCreate()
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 331, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 280, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\java_gateway.py", line 95, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
Process finished with exit code 1
I have also tried providing the jar file path through an environment variable, setting it through os:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars C:/Users/jars/sqlite-jdbc-3.27.2.jar'
but even this does not work.
The equivalent of the --jars submit parameter is spark.jars, which lets you specify local jars to transfer to the cluster. You used spark.jars.packages, which instead downloads packages from Maven by their Maven coordinates; its submit-parameter equivalent is --packages.
Have a look at the documentation for more information: configuration and submitting parameters.
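Applied to the snippet in the question, that means pointing spark.jars at the local file; a minimal sketch, reusing the jar path from the question:

from pyspark.sql import SparkSession

# spark.jars takes a comma-separated list of local jar paths.
# spark.jars.packages would instead expect Maven coordinates,
# e.g. "org.xerial:sqlite-jdbc:3.20.0", and download them.
spark = SparkSession.builder \
    .appName("pySparksqLite_test") \
    .config("spark.jars", "C:/jars/DataVisualization/sqlite-jdbc-3.20.0.jar") \
    .getOrCreate()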

iptest3 with IPython

I'm brand new to Python programming and trying to get myself a functional base from which I can run things like the IPython Notebook, which looks pretty exciting.
Thus far I have both Python 2.7 and 3.3 from python.org installed in OS X 10.6 (Snow Leopard) as well as ActiveTcl 8.5.13. Almost everything that I've tried thus far works as expected. I'm focused on learning 3.3, but want to have the option of using 2.7 too. I read up in several documents that I need to start gaining access to PyPI packages using a Python package manager and that distribute is the one I should use for 3k. So I installed that according to the documentation I found and it seemed to work fine.
I also installed pip as directed, and a number of others.
At this point, I have:
$ pip freeze
distribute==0.6.34
ipython==0.13.1
nose==1.2.1 (installed after IPython)
pexpect==2.4 (installed after IPython)
pyflakes3k==0.4.3
readline==6.2.4.1 (installed after IPython)
At this point, I'm working from the ipython.org guidance.
And when I did $ easy_install pexpect, I got a bunch of errors:
$ easy_install pexpect
Searching for pexpect
Reading http://pypi.python.org/simple/pexpect/
Reading http://pexpect.sourceforge.net/
Reading http://sourceforge.net/project/showfiles.php?group_id=59762
Best match: pexpect 2.4
Downloading http://pypi.python.org/packages/source/p/pexpect/pexpect-2.4.tar.gz#md5=fe82d69be19ec96d3a6650af947d5665
Processing pexpect-2.4.tar.gz
Writing /var/folders/td/td0Sh8EfGFuMCnKex1v+q++++TI/-Tmp-/easy_install-s4dtyy/pexpect-2.4/setup.cfg
Running pexpect-2.4/setup.py -q bdist_egg --dist-dir /var/folders/td/td0Sh8EfGFuMCnKex1v+q++++TI/-Tmp-/easy_install-s4dtyy/pexpect-2.4/egg-dist-tmp-5h5cg4
File "build/bdist.macosx-10.6-intel/egg/fdpexpect.py", line 36
raise ExceptionPexpect, 'The fd argument is not a valid file descriptor.'
^
SyntaxError: invalid syntax
File "build/bdist.macosx-10.6-intel/egg/FSM.py", line 77
return `self.value`
^
SyntaxError: invalid syntax
File "build/bdist.macosx-10.6-intel/egg/pexpect.py", line 82
except ImportError, e:
^
SyntaxError: invalid syntax
zip_safe flag not set; analyzing archive contents...
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pexpect-2.4-py3.3.egg/fdpexpect.py", line 36
raise ExceptionPexpect, 'The fd argument is not a valid file descriptor.'
^
SyntaxError: invalid syntax
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pexpect-2.4-py3.3.egg/FSM.py", line 77
return `self.value`
^
SyntaxError: invalid syntax
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pexpect-2.4-py3.3.egg/pexpect.py", line 82
except ImportError, e:
^
SyntaxError: invalid syntax
Adding pexpect 2.4 to easy-install.pth file
Installed /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pexpect-2.4-py3.3.egg
Processing dependencies for pexpect
Finished processing dependencies for pexpect
That looks bad to me (although I don't yet have the expertise to really interpret it), and so I'm not sure if I have a complete install of pexpect.
After installing nose (before pexpect, as per the URL above), I tried running iptest and iptest3 from the command line, and both commands were not found. But after I ran easy_install ipython again (after nose), that install added iptest3 (as well as ipcluster3 and a few other scripts) to my path. Now my bash shell can find iptest3, but when I run it, I get some more bad-looking output:
$ iptest3
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.3/bin/iptest3", line 9, in <module>
load_entry_point('ipython==0.13.1', 'console_scripts', 'iptest3')()
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/distribute-0.6.34-py3.3.egg/pkg_resources.py", line 343, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/distribute-0.6.34-py3.3.egg/pkg_resources.py", line 2308, in load_entry_point
return ep.load()
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/distribute-0.6.34-py3.3.egg/pkg_resources.py", line 2014, in load
entry = __import__(self.module_name, globals(),globals(), ['__name__'])
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/__init__.py", line 43, in <module>
from .config.loader import Config
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/config/__init__.py", line 16, in <module>
from .application import *
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/config/application.py", line 31, in <module>
from IPython.config.configurable import SingletonConfigurable
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/config/configurable.py", line 26, in <module>
from .loader import Config
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/config/loader.py", line 27, in <module>
from IPython.utils.path import filefind, get_ipython_dir
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/utils/path.py", line 25, in <module>
from IPython.utils.process import system
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/utils/process.py", line 27, in <module>
from ._process_posix import _find_cmd, system, getoutput, arg_split
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/utils/_process_posix.py", line 22, in <module>
from IPython.external import pexpect
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/IPython/external/pexpect/__init__.py", line 2, in <module>
import pexpect
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pexpect-2.4-py3.3.egg/pexpect.py", line 82
except ImportError, e:
^
SyntaxError: invalid syntax
After doing all that, I noted that my /Library/Frameworks/Python.framework/Versions/Current had been pointing to 2.7, and I guessed that that might be related to my problems and changed the symbolic link to point to 3.3, but iptest3 still fails with the error above.
Any other thoughts on what to do to fix this? It's clear that iptest is pretty important to doing anything else (like IPython Notebook) I want to do.
There is a py3-compatible fork of pexpect called pexpect-u (the u is for Unicode, the main difference between the two). You need it to run the pexpect-based parts of IPython on Python 3: the SyntaxErrors in your easy_install output (e.g. except ImportError, e:) are Python-2-only syntax that Python 3.3 rejects.
It should be a simple
pip install pexpect-u
Side note: pexpect-u is by IPython developer Thomas Kluyver, who did most of the heavy lifting bringing py3 compatibility to IPython.
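Once installed, the package still imports as pexpect; a quick smoke test (assuming a Unix-like shell where echo is available):

import pexpect

# Spawn a trivial command and wait for its output;
# if this runs without a SyntaxError, the install is usable.
child = pexpect.spawn('echo hello')
child.expect('hello')
print('pexpect is working')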

What is the correct way to load a dll library in Postgres PL/Python?

The following gives an error:
drop function testing();
CREATE FUNCTION testing()
  RETURNS text
AS $$
import ctypes
try:
    ctypes.windll.LoadLibrary("D:\\jcc.dll")
except:
    import traceback
    plpy.error(traceback.format_exc())
return ''
$$ LANGUAGE plpythonu;
select testing();
Error message:
ERROR: ('Traceback (most recent call last):\n File "<string>", line 5, in __plpython_procedure_testing_1517640\n File "D:\\Python26\\Lib\\ctypes\\__init__.py", line 431, in LoadLibrary\n return self._dlltype(name)\n File "D:\\Python26\\Lib\\ctypes\\__init__.py", line 353, in __init__\n self._handle = _dlopen(self._name, mode)\nWindowsError: [Error 126] The specified module could not be found\n',)
It works fine in a Python interpreter:
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import ctypes
>>> ctypes.windll.LoadLibrary("D:\\jcc.dll")
<WinDLL 'D:\jcc.dll', handle 410000 at 1d9cb10>
>>>
"The specified module could not be found" is one of those helpful error messages Windows emits that doesn't always mean what you think it means.
Windows will produce that message if the DLL you tried to load, or any DLL it depends on, could not be found.
Since PostgreSQL runs under its own user account, it has a different PATH from the one your interpreter sees when you're testing. If jcc.dll depends on (say) c:\jccsupportfiles\aaa.dll, and c:\jccsupportfiles is on your PATH but not the Pg server's PATH, that would explain your problem.
Try using Dependency Walker (depends.exe) to determine which DLLs your DLL requires and where they are. See if it's a PATH issue.
Rather than messing with the Pg server's PATH, consider just putting all the DLLs required by jcc.dll in the same directory as jcc.dll. IIRC, when Windows resolves a module's dependencies, it looks first in the directory of the module it's loading.
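Alternatively, from the PL/Python side you can prepend the DLL's directory to the server process's PATH before loading, so dependent DLLs are found; a hedged sketch, where D:\jcclibs is a hypothetical directory holding jcc.dll and its dependencies:

import ctypes
import os

# Hypothetical directory containing jcc.dll and the DLLs it depends on.
dll_dir = "D:\\jcclibs"

# The PostgreSQL server's PATH differs from your interactive shell's,
# so make the dependent DLLs findable before loading.
os.environ["PATH"] = dll_dir + os.pathsep + os.environ.get("PATH", "")

ctypes.windll.LoadLibrary(os.path.join(dll_dir, "jcc.dll"))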