I am trying to fetch data from PostgreSQL in my Spark application, but now I am confused about how to install the PostgreSQL JDBC driver in my Docker image. I also tried installing postgresql via an apt-get install command, as shown in the Dockerfile below.
Dockerfile:
FROM python:3
ENV SPARK_VERSION 2.3.2
ENV SPARK_HADOOP_PROFILE 2.7
ENV SPARK_SRC_URL https://www.apache.org/dist/spark/spark-$SPARK_VERSION/spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
ENV SPARK_HOME=/opt/spark
ENV PATH $PATH:$SPARK_HOME/bin
RUN wget ${SPARK_SRC_URL}
RUN tar -xzf spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
RUN mv spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE} /opt/spark
RUN rm -f spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
RUN apt-get update && \
apt-get install -y openjdk-8-jdk-headless \
postgresql && \
rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY my_script.py ./
CMD [ "python", "./my_script.py" ]
requirements.txt:
pyspark==2.3.2
numpy
my_script.py:
from pyspark import SparkContext
from pyspark import SparkConf
#spark conf
conf1 = SparkConf()
conf1.setMaster("local[*]")
conf1.setAppName('hamza')
print(conf1)
sc = SparkContext(conf = conf1)
print('hahahha')
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
print(sqlContext)
from pyspark.sql import DataFrameReader
url = 'postgresql://IP:PORT/INSTANCE'
properties = {'user': 'user', 'password': 'pass'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table=query, properties=properties
)
Getting this error:
Traceback (most recent call last):
File "./my_script.py", line 26, in <module>
, properties=properties
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 527, in jdbc
return self._df(self._jreader.jdbc(url, table, jprop))
File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o28.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
Kindly guide me on how to set up this driver.
Thanks
Adding these lines to the Dockerfile solved the issue:
ENV POST_URL https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
RUN wget ${POST_URL}
RUN mv postgresql-42.2.5.jar /opt/spark/jars
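Alternatively, the jar can be handed to Spark at startup instead of being copied into the install; a minimal sketch, assuming the jar was downloaded to /opt/jars (that path is illustrative):
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster('local[*]')
conf.setAppName('hamza')
# Put the PostgreSQL JDBC driver on the driver/executor classpath.
# The jar location below is an assumption; adjust it to where you downloaded the jar.
conf.set('spark.jars', '/opt/jars/postgresql-42.2.5.jar')
sc = SparkContext(conf=conf)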
Thanks everyone
This is not the Docker way of doing things. The Docker approach is not to put all services inside one container but to split them across several containers, each running one main process: the database, your application, and so on.
Also, when using separate containers, you don't have to worry about installing all the necessary software in your Dockerfile - you simply pick ready-to-use images for the database you want. By the way, if you are using the python:3 Docker image, how do you know the maintainers won't change the set of installed packages, or even the OS type? They can do that easily, because they only promise the Python service; everything else is left undefined.
So, what I recommend is:
Split your project into different containers (Dockerfiles)
Use the standard postgres image for your database - all services and drivers are already on board
Use docker-compose (or similar) to launch both containers and link them together in one network, as sketched below
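A minimal docker-compose.yml sketch of that layout (service names, credentials and the database name are illustrative assumptions, not values from the question):
version: '3'
services:
  db:
    image: postgres:11        # ready-made image, nothing to install by hand
    environment:
      POSTGRES_USER: user     # illustrative credentials
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: instance
  app:
    build: .                  # the Spark Dockerfile from the question
    depends_on:
      - db
With this, the application container reaches the database at host db, so the JDBC URL becomes jdbc:postgresql://db:5432/instance.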
I need to install robotframework-autoitlibrary to use in my test cases.
My problem is when I try to install AutoIt Library through command line with the following command:
pip install -U robotframework-autoitlibrary --no-cache-dir --pre
I have this error:
C:\windows\system32>pip install -U robotframework-autoitlibrary --no-cache-
dir --pre
Collecting robotframework-autoitlibrary
Downloading
https://files.pythonhosted.org/packages/4e/a4/9e51fe35b1da7a006b773c9c234f78e89bcc4f267152c4e9fa8260631fa8/robotframework-autoitlibrary-1.2.2.zip (701kB)
100% |################################| 706kB 1.6MB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\users\user-a~1\appdata\local\temp\pip-install-
oro1ov\robotframework-autoitlibrary\setup.py", line 93, in <module>
destPath = os.path.normpath(os.path.join(os.getenv("HOMEDRIVE"),
r"\RobotFramework\Extensions\AutoItLibrary"))
File "c:\python27\lib\ntpath.py", line 65, in join
result_drive, result_path = splitdrive(path)
File "c:\python27\lib\ntpath.py", line 115, in splitdrive
if len(p) > 1:
TypeError: object of type 'NoneType' has no len()
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in
c:\users\user-a~1\appdata\local\temp\pip-install-oro1ov\robotframework-
autoitlibrary\
My currently installed python packages and their versions are:
Pillow==5.3.0
Pygments==2.3.1
pypiwin32==223
Pypubsub==4.0.0
pywin32==224
robotframework==3.1
robotframework-ride==1.5.2.1
robotframework-selenium2library==3.0.0
robotframework-seleniumlibrary==3.3.0
selenium==3.141.0
six==1.12.0
typing==3.6.6
urllib3==1.24.1
wxPython==4.0.3
I already tried this command:
pip install --upgrade setuptools
When I run echo %HOMEDRIVE%, the output is:
C:\Users\cmpeixoto>echo %HOMEDRIVE%
C:
Thanks for your help,
The error looks like the environment variable HOMEDRIVE is not set, even though it has a value according to your test (the library's installer uses it to decide where to copy some files).
Can you try this: manually set it, and straight after that run pip - in the same Command Prompt (cmd) session:
set HOMEDRIVE=C:
pip install -U robotframework-autoitlibrary --no-cache-dir --pre
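For reference, here is a small sketch of why the installer crashes when the variable is missing (Python 2.7 semantics, matching the traceback above):
import os
import ntpath  # Windows path functions; importable on any OS for demonstration

drive = os.getenv("HOMEDRIVE")  # None when the variable is not set
# setup.py does essentially this; with drive=None, ntpath's splitdrive()
# calls len(None) and raises the TypeError shown in the traceback.
dest = ntpath.join(drive, r"\RobotFramework\Extensions\AutoItLibrary")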
These are the steps I am following:
mkdir spark_lib; cd spark_lib
pip install jsonpath_rw_ext==1.1.3 -t .
zip -r9 ../spark_lib.zip *
Initialize the Spark context variable sc
sc.addPyFile('spark_lib.zip')
def f(x):
    import jsonpath_rw_ext
    return jsonpath_rw_ext.match('$.a.b', x)
sc.parallelize([{"a":{"b":10}}]).map(f).collect()
This gives me a pbr versioning error:
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-11-2619fc011e24>", line 6, in f
File "./spark_lib.zip/jsonpath_rw_ext/__init__.py", line 18, in <module>
File "./spark_lib.zip/pbr/version.py", line 467, in version_string
return self.semantic_version().brief_string()
File "./spark_lib.zip/pbr/version.py", line 462, in semantic_version
self._semantic = self._get_version_from_pkg_resources()
File "./spark_lib.zip/pbr/version.py", line 449, in _get_version_from_pkg_resources
result_string = packaging.get_version(self.package)
File "./spark_lib.zip/pbr/packaging.py", line 824, in get_version
name=package_name))
Exception: Versioning for this project requires either an sdist tarball,
or access to an upstream git repository. It's also possible that
there is a mismatch between the package name in setup.cfg
and the argument given to pbr.version.VersionInfo.
Project name jsonpath_rw_ext was given, but was not able to be found.
After reading through a similar bug report, https://bugs.launchpad.net/python-swiftclient/+bug/1379579, I found out that pbr needs an up-to-date setuptools.
I included setuptools in my pip install command and everything worked fine:
pip install jsonpath_rw_ext setuptools -t .
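To confirm the fix took effect, one can check that setuptools (and pbr) actually landed in the archive shipped to the executors; a quick sketch:
import zipfile

# Top-level entries of the archive passed to sc.addPyFile().
top = sorted({name.split('/')[0] for name in zipfile.ZipFile('spark_lib.zip').namelist()})
print(top)  # expect 'jsonpath_rw_ext', 'pbr' and 'setuptools' among the entries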
I get the following error when I try to run pytest repo/tests/test_file.py:
$ pytest repo/tests/test_file.py
Traceback (most recent call last):
File "/Users/marlo/anaconda3/envs/venv/lib/python3.6/site-packages/_pytest/config.py", line 329, in _getconftestmodules
return self._path2confmods[path]
KeyError: local('/Users/marlo/repo/tests/test_file.py')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/marlo/anaconda3/envs/venv/lib/python3.6/site-packages/_pytest/config.py", line 329, in _getconftestmodules
return self._path2confmods[path]
KeyError: local('/Users/marlo/repo/tests')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/marlo/anaconda3/envs/venv/lib/python3.6/site-packages/_pytest/config.py", line 362, in _importconftest
return self._conftestpath2mod[conftestpath]
KeyError: local('/Users/marlo/repo/conftest.py')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/marlo/anaconda3/envs/venv/lib/python3.6/site-packages/_pytest/config.py", line 368, in _importconftest
mod = conftestpath.pyimport()
File "/Users/marlo/anaconda3/envs/venv/lib/python3.6/site-packages/py/_path/local.py", line 686, in pyimport
raise self.ImportMismatchError(modname, modfile, self)
py._path.local.LocalPath.ImportMismatchError: ('conftest', '/home/venvuser/venv/conftest.py', local('/Users/marlo/repo/conftest.py'))
ERROR: could not load /Users/marlo/repo/conftest.py
My repo structure is
lib/
-tests/
-test_file.py
app/
-test_settings.py
pytest.ini
conftest.py
...
Other people have run this code fine, and according to this question (and this one), my structure is good and I am not missing any files. I can only conclude that something about my computer or project set-up is not right. If you have any suggestions or insights that I may be missing, please send them my way!
-------------------------------MORE DETAILS------------------------------
test_file.py:
def func(x):
    return x + 1

def test_answer():
    assert func(3) == 5
pytest.ini:
[pytest]
DJANGO_SETTINGS_MODULE = app.test_settings
python_files = tests.py test_* *_tests.py *test.py
I run pytest both inside and outside of Docker, and for me a much lower-impact fix whenever this crops up is to delete all the compiled Python files - the stale .pyc files still record the container's paths, which is what the ImportMismatchError above is complaining about:
find . -name \*.pyc -delete
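On Python 3 the stale bytecode usually lives in __pycache__ directories, so clearing those as well may be needed; a sketch:
find . -type d -name __pycache__ -exec rm -rf {} +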
I figured it out and I'll answer in case others have the same issue:
I didn't even take into consideration that I had a docker container (of the same app) in the repo directory and, although I was not running the docker container, it was influencing the filepaths somehow.
To fix this:
I re-cloned the repo from the remote source into a new folder so that nothing from the old repo could "contaminate" it.
Updated my virtual environment with the .yml specifications of the clean repo
$ conda env update --name project --file project.yml
My project uses a postgres database, so I dropped it and created a new one
$ dropdb projectdb
$ createdb projectdb
Since my project uses mongo, I also dropped that database
$ mongo projectdb --eval "db.dropDatabase()"
Installed a clean pytest
$ pip uninstall pytest
$ pip install pytest
...and voilà! I could run pytest.
Many thanks to @hoefling and others who helped me debug.
I was running docker as well, but it seems my problem was different.
I was using an old version of pytest:
platform linux -- Python 3.9.7, pytest-3.7.2, py-1.10.0, pluggy-1.0.0
which stopped working after my Ubuntu image started pulling Python 3.10 by default.
My solution was to update (and pin) the Dockerfile image to use:
FROM python:3.10
instead of python:latest, and to update the pytest version as well.
I am trying to learn Yocto by following the video tutorials on their main website. I installed poky-rocko-18.0.0 and, after setting up the build environment, I tried to build the Linux image using the following command:
bitbake core-image-minimal
However, I am getting the following error:
I am unsure how to start the bitbake server and so far have not found any good references for it.
We also faced the same issue with our bitbake server. It worked after removing the bitbake.lock file. Use the command below:
rm -rf bitbake.lock
###/build$ bitbake core-image-sato
Loading cache: 100% |#########################################################################################################################################################################################################| Time: 0:00:01
Loaded 3867 entries from dependency cache.
Parsing recipes: 100%
My problem was some missing packages on my build system.
Fixed it by installing the following packages (Debian):
sudo apt-get install chrpath
sudo apt-get install texinfo
On my Arch system:
sudo pacman -S rpcsvc-proto chrpath texinfo cpio diffstat
Just try this in your build folder:
rm -rf bitbake.lock
This should work.
The reason is that bitbake's state was left locked by the previous bitbake execution. If that run was interrupted, the stale bitbake.lock needs to be removed.
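A slightly more careful version of the same fix - first confirm no bitbake process is still alive, then remove the stale lock (sketch):
ps aux | grep '[b]itbake'   # prints nothing if no server is running
rm -f bitbake.lock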
In my case it was solved by this answer from https://stackoverflow.com/a/45880855/5350353 (Unable to connect to bitbake server):
This is because the new function findTopdir (submitted on July 18, 2017) does not handle errors, for example the lack of a BBPATH environment variable or the inability to find conf/bblayers.conf in BBPATH. findTopdir just returns None in case of those errors.
This may be caused by the absence of host applications such as gawk, chrpath and texinfo.
Below is one example.
ERROR: Unable to start bitbake server (None)
ERROR: Server log for this session (/home/zephyr/workspace/w031/openembedded-core/build/bitbake-cookerdaemon.log):
--- Starting bitbake server pid 22675 at 2019-03-16 00:28:44.447008 ---
Traceback (most recent call last):
File "/home/zephyr/workspace/w031/bitbake/lib/bb/cookerdata.py", line 290, in parseBaseConfiguration
bb.event.fire(bb.event.ConfigParsed(), self.data)
File "/home/zephyr/workspace/w031/bitbake/lib/bb/event.py", line 225, in fire
fire_class_handlers(event, d)
File "/home/zephyr/workspace/w031/bitbake/lib/bb/event.py", line 134, in fire_class_handlers
execute_handler(name, handler, event, d)
File "/home/zephyr/workspace/w031/bitbake/lib/bb/event.py", line 106, in execute_handler
ret = handler(event)
File "/home/zephyr/workspace/w031/openembedded-core/meta/classes/base.bbclass", line 238, in base_eventhandler
setup_hosttools_dir(d.getVar('HOSTTOOLS_DIR'), 'HOSTTOOLS', d)
File "/home/zephyr/workspace/w031/openembedded-core/meta/classes/base.bbclass", line 142, in setup_hosttools_dir
bb.fatal("The following required tools (as specified by HOSTTOOLS) appear to be unavailable in PATH, please install them in order to proceed:\n %s" % " ".join(notfound))
File "/home/zephyr/workspace/w031/bitbake/lib/bb/__init__.py", line 120, in fatal
raise BBHandledException()
bb.BBHandledException
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zephyr/workspace/w031/bitbake/lib/bb/daemonize.py", line 83, in createDaemon
function()
File "/home/zephyr/workspace/w031/bitbake/lib/bb/server/process.py", line 469, in _startServer
self.cooker = bb.cooker.BBCooker(self.configuration, self.featureset)
File "/home/zephyr/workspace/w031/bitbake/lib/bb/cooker.py", line 210, in __init__
self.initConfigurationData()
File "/home/zephyr/workspace/w031/bitbake/lib/bb/cooker.py", line 375, in initConfigurationData
self.databuilder.parseBaseConfiguration()
File "/home/zephyr/workspace/w031/bitbake/lib/bb/cookerdata.py", line 317, in parseBaseConfiguration
raise bb.BBHandledException
bb.BBHandledException
ERROR: The following required tools (as specified by HOSTTOOLS) appear to be unavailable in PATH, please install them in order to proceed:
gawk
First, restore local.conf and bblayers.conf to their previous configuration.
Then run bitbake -c cleanall recipe_name.
It should be all right now!
As the OP stated, a package was missing on the build host (makeinfo in his case).
To properly prepare the build host, look into the documentation (latest) for your Yocto version and your distro.
In my case some playing with devtool caused a duplicated definition in bblayers.conf:
BBLAYERS ?= " \
${TOPDIR}/../meta \
${TOPDIR}/../meta-poky \
${TOPDIR}/../meta-yocto-bsp \
${TOPDIR}/../meta-atmel \
${TOPDIR}/../meta-libgpiod \
${TOPDIR}/../meta-libuio \
${TOPDIR}/../meta-lsuio \
${TOPDIR}/workspace \
/home/me/yocto/poky/build/workspace \
"
I had to manually remove one of the last two lines, as follows:
BBLAYERS ?= " \
${TOPDIR}/../meta \
${TOPDIR}/../meta-poky \
${TOPDIR}/../meta-yocto-bsp \
${TOPDIR}/../meta-atmel \
${TOPDIR}/../meta-libgpiod \
${TOPDIR}/../meta-libuio \
${TOPDIR}/../meta-lsuio \
${TOPDIR}/workspace
"
Then I retried and the issue was resolved.
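To double-check the result after editing, bitbake-layers can print the layer list BitBake actually sees (assuming the build environment is sourced):
bitbake-layers show-layers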
In my case (PLNX 2018.2) the problem was caused by the .Xil folder hidden in the root directory of the project; deleting it solved the problem.
I had a similar issue, with an additional UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5305: ordinal not in range(128) at the bottom of the list.
I resolved it by checking my locale settings on Ubuntu 18.04 and running the following:
export LC_ALL="en_US.UTF-8"
export LANG="en_US.UTF-8"
export LANGUAGE="en_US.UTF-8"
The bitbake-layers command worked perfectly after this.
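To make the locale stick for future shells, the exports can be appended to the shell profile (sketch, assuming bash):
echo 'export LC_ALL=en_US.UTF-8'   >> ~/.bashrc
echo 'export LANG=en_US.UTF-8'     >> ~/.bashrc
echo 'export LANGUAGE=en_US.UTF-8' >> ~/.bashrc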
Shut down and rerun the bitbake command; that should solve it.
Just typing gcloud for help takes 5 seconds.
$ gcloud
...
gcloud 0.30s user 0.13s system 7% cpu 5.508 total
$ gcloud version
Google Cloud SDK 128.0.0
alpha 2016.01.12
bq 2.0.24
bq-nix 2.0.24
core 2016.09.23
core-nix 2016.09.20
gcloud
gsutil 4.21
gsutil-nix 4.21
kubectl
kubectl-darwin-x86_64 1.3.7
$ uname -a
Darwin hiroshi-MacBook.local 16.0.0 Darwin Kernel Version 16.0.0: Mon Aug 29 17:56:20 PDT 2016; root:xnu-3789.1.32~3/RELEASE_X86_64 x86_64
EDIT 2017-03-31: Zachary said that gcloud 148.0.0 addressed this issue, so try gcloud components update. See https://stackoverflow.com/users/4922212/zachary-newman
tl;dr
It turns out that socket.gethostbyaddr(socket.gethostname()) is slow for .local hostnames on macOS.
$ python -i
>>> import socket
>>> socket.gethostname()
'hiroshi-MacBook.local'
>>> socket.gethostbyaddr(socket.gethostname()) # it takes about 5 seconds
('localhost', ['1.0.0.127.in-addr.arpa'], ['127.0.0.1'])
So, as a workaround, I just added the hostname to the localhost line of /etc/hosts:
127.0.0.1 localhost hiroshi-Macbook.local
After that the return value is different, but it comes back in an instant.
>>> socket.gethostbyaddr(socket.gethostname())
('localhost', ['1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa'], ['::1'])
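A quick way to quantify the difference before and after the /etc/hosts edit (sketch; runs under both Python 2 and 3):
import socket
import time

start = time.time()
socket.gethostbyaddr(socket.gethostname())
print("gethostbyaddr took %.2f seconds" % (time.time() - start))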
How I got there
Where the gcloud command is:
$ which gcloud
/Users/hiroshi/google-cloud-sdk/bin/gcloud
I edited the end of the shell script:
...
+ echo "$CLOUDSDK_PYTHON" $CLOUDSDK_PYTHON_ARGS "${CLOUDSDK_ROOT_DIR}/lib/gcloud.py" "$@"
"$CLOUDSDK_PYTHON" $CLOUDSDK_PYTHON_ARGS "${CLOUDSDK_ROOT_DIR}/lib/gcloud.py" "$@"
to echo how gcloud.py is invoked:
$ gcloud
python2.7 -S /Users/hiroshi/google-cloud-sdk/lib/gcloud.py
OK. Who takes the 5 seconds?
$ python2.7 -S -m cProfile -s time /Users/hiroshi/google-cloud-sdk/lib/gcloud.py
173315 function calls (168167 primitive calls) in 5.451 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 5.062 5.062 5.062 5.062 {_socket.gethostbyaddr}
...
_socket.gethostbyaddr does.
What do the argument of the function call and the backtrace look like?
I added some lines before main() in gcloud.py:
import traceback
def mygethostbyaddr(addr):
    print addr
    traceback.print_stack()
    return addr
import socket
socket.gethostbyaddr = mygethostbyaddr
I executed gcloud again, and saw that the argument is the .local name of my machine:
$ gcloud
hiroshi-MacBook.local
File "/Users/hiroshi/google-cloud-sdk/lib/gcloud.py", line 74, in <module>
main()
File "/Users/hiroshi/google-cloud-sdk/lib/gcloud.py", line 70, in main
sys.exit(googlecloudsdk.gcloud_main.main())
File "/Users/hiroshi/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 121, in main
metrics.Started(START_TIME)
File "/Users/hiroshi/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 411, in Wrapper
return func(*args, **kwds)
File "/Users/hiroshi/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 554, in Started
collector = _MetricsCollector.GetCollector()
File "/Users/hiroshi/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 139, in GetCollector
_MetricsCollector._instance = _MetricsCollector()
File "/Users/hiroshi/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 197, in __init__
hostname = socket.getfqdn()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 141, in getfqdn
hostname, aliases, ipaddrs = gethostbyaddr(name)
File "/Users/hiroshi/google-cloud-sdk/lib/gcloud.py", line 32, in mygethostbyaddr
traceback.print_stack()
@hiroshi's answer solved the issue for me, i.e. running gcloud components update. However, since I had installed gcloud through the Cloud SDK using a package manager, I was stuck with the following error*:
ERROR: (gcloud.components.update) You cannot perform this action
because the Google Cloud CLI component manager is disabled for this
installation.
Hence, I had to explicitly upgrade the relevant packages through apt-get. The following command helped me get it done in one go:
sudo apt-get update && sudo apt-get --only-upgrade install google-cloud-sdk-app-engine-go google-cloud-sdk-datastore-emulator google-cloud-sdk-datalab google-cloud-sdk-firestore-emulator google-cloud-sdk-kubectl-oidc google-cloud-sdk google-cloud-sdk-app-engine-python-extras google-cloud-sdk-cloud-build-local kubectl google-cloud-sdk-cbt google-cloud-sdk-minikube google-cloud-sdk-skaffold google-cloud-sdk-cloud-run-proxy google-cloud-sdk-pubsub-emulator google-cloud-sdk-config-connector google-cloud-sdk-gke-gcloud-auth-plugin google-cloud-sdk-kpt google-cloud-sdk-local-extract google-cloud-sdk-nomos google-cloud-sdk-app-engine-grpc google-cloud-sdk-bigtable-emulator google-cloud-sdk-app-engine-python google-cloud-sdk-terraform-validator google-cloud-sdk-anthos-auth google-cloud-sdk-spanner-emulator google-cloud-sdk-app-engine-java
*More details as to why the aforementioned error occurs and how to permanently solve it can be found here.