ImportError: cannot import name 'Pregel' from 'graphframes.lib' - pyspark

I am using pyspark and graphframes from jupyter. I am able to successfully import pyspark and graphframes, but when I try:
from graphframes.lib import Pregel
I get the following error:
ImportError: cannot import name 'Pregel' from 'graphframes.lib'
This post is how I was able to get graphframes to work, but without graphframes.lib:
https://github.com/graphframes/graphframes/issues/104
wget https://github.com/graphframes/graphframes/archive/release-0.2.0.zip
unzip release-0.2.0.zip
cd graphframes-release-0.2.0
build/sbt assembly
cd ..
# Copy necessary files to root level so we can start pyspark.
cp graphframes-release-0.2.0/target/scala-2.11/graphframes-release-0-2-0-assembly-0.2.0-spark2.0.jar .
cp -r graphframes-release-0.2.0/python/graphframes .
# Set environment to use Jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
# Launch the jupyter server.
pyspark --jars graphframes-release-0-2-0-assembly-0.2.0-spark2.0.jar
I repeated the commands above with a different version (skipping the environment lines, since pyspark already works in Jupyter for me) and got graphframes.lib, but no Pregel:
wget https://github.com/graphframes/graphframes/archive/release-0.6.0.zip
unzip release-0.6.0.zip
cd graphframes-release-0.6.0
build/sbt assembly
cd ..
# Copy necessary files to root level so we can start pyspark.
cp graphframes-release-0.6.0/target/scala-2.11/graphframes-assembly-0.6.0-spark2.3.jar .
cp -r graphframes-release-0.6.0/python/graphframes .
# Set environment to use Jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
# Launch the jupyter server.
pyspark --jars graphframes-assembly-0.6.0-spark2.3.jar
Now I can see the graphframes.lib directory, but all that's in there is aggregate_messages.py.
Finally, I tried the following but got a 404 error:
wget https://github.com/graphframes/graphframes/archive/release-0.7.0.zip
I expected that, because I was able to import graphframes, I'd be able to import Pregel from graphframes.lib. It seems that my version, now 0.6.0, has a graphframes.lib but no Pregel, and that there is no 0.7.0 release of graphframes yet.

I was able to resolve this error using the following method:
wget https://github.com/graphframes/graphframes/archive/f9e13ab4ac1a7113f8439744a1ab45710eb50a72.zip
unzip f9e13ab4ac1a7113f8439744a1ab45710eb50a72.zip
cd graphframes-f9e13ab4ac1a7113f8439744a1ab45710eb50a72
build/sbt assembly
cd ..
# Copy necessary files to root level so we can start pyspark.
cp graphframes-f9e13ab4ac1a7113f8439744a1ab45710eb50a72/target/scala-2.11/graphframes-assembly-0.7.0-spark2.4.jar .
cp -r graphframes-f9e13ab4ac1a7113f8439744a1ab45710eb50a72/python/graphframes .
# Set environment to use Jupyter (if jupyter working with pyspark, skip)
# export PYSPARK_DRIVER_PYTHON=jupyter
# export PYSPARK_DRIVER_PYTHON_OPTS=notebook
# launch pyspark
pyspark --jars graphframes-assembly-0.7.0-spark2.4.jar
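Before launching a full job, it can help to confirm that the copied package actually exposes Pregel. The following is a minimal sketch, not part of graphframes: the helper name module_exports is hypothetical, and it assumes the copied graphframes directory is on the import path of the interpreter that runs it.

```python
import importlib


def module_exports(module_name, attr_name):
    """Return True if `module_name` imports cleanly and defines `attr_name`.

    Hypothetical helper for illustration; it is not part of graphframes.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or one of its own dependencies) is not importable.
        return False
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # True only when the installed graphframes.lib defines Pregel.
    print(module_exports("graphframes.lib", "Pregel"))
```

If this prints False while a plain import of graphframes succeeds, the Python package being picked up is probably an older copy than the jar passed to --jars.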

Related

Apache Spark Multiple sources found for csv Error

I'm trying to run my Spark program with the spark-submit command (I'm working in Scala). I specified the master address, the class name, the jar file with all dependencies, the input file, and the output file, but I'm getting an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Multiple sources found for csv
(org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2,
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please
specify the fully qualified class name.;
What is this error about? How can I fix it?
Thank you
You also got some warnings here.
If you run your fat jar with the correct permissions, you should get the expected output from ./spark-submit.
Check whether the environment variables for Spark are set correctly (in ~/.bashrc). Also check the permissions on the source CSV file; that may be the problem.
If you are running on Linux, set the permissions on the source CSV folder with:
sudo chmod -R 777 /source_folder
After that, try running ./spark-submit with your fat jar again.

Azure Databricks: How to delete files of a particular extension outside of DBFS using python

I am able to delete a file of a particular extension from the directory /databricks/driver using the bash command in databricks.
%%bash
rm /databricks/driver/file*.xlsx
But I can't figure out how to access and delete a file outside of DBFS in a Python script.
I think dbutils cannot access files outside of DBFS, and the command below returns False because it looks in DBFS.
dbutils.fs.rm("/databricks/driver/file*.xlsx")
I am eager to be corrected.
Not sure how to do it using dbutils, but I am able to delete it using glob:
import os
from glob import glob
# Remove every file on the driver's local disk that matches the pattern.
for file in glob('/databricks/driver/file*.xlsx'):
    os.remove(file)
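The same idea can be wrapped in a small function so the pattern can be previewed before anything is deleted. This is only a sketch: the name remove_matching and its dry_run parameter are made up for illustration, and the demo runs against a temporary directory rather than /databricks/driver.

```python
import tempfile
from pathlib import Path


def remove_matching(directory, pattern, dry_run=False):
    """Delete files in `directory` whose names match `pattern`.

    Returns the sorted list of matched paths. With dry_run=True nothing is
    deleted, which makes it easy to preview what a pattern would hit.
    """
    matches = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of the real driver path.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("file1.xlsx", "file2.xlsx", "notes.txt"):
            (Path(tmp) / name).touch()
        removed = remove_matching(tmp, "file*.xlsx")
        print([p.name for p in removed])
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

On a cluster, the call would presumably be remove_matching('/databricks/driver', 'file*.xlsx'), run first with dry_run=True to check the matches.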

Can I include a variable in an `sh` command in zeppelin?

I'm using Zeppelin with Hadoop on a Spark cluster.
I'd like to run a command to check files on s3 and I'd like to use a variable.
This is my code
%sh
aws s3 ls s3://my-bucket/my_folder/
Can I replace my-bucket/my_folder/ with a variable?
What do you mean by "a variable"? A Python variable? If so, I'm not sure. But if you just want to pull the path out onto another line, you can use a shell variable:
%sh
export AWS_FOLDER=my-bucket/my_folder/
aws s3 ls s3://$AWS_FOLDER
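If the variable in question is a Python one, one possible route is to skip %sh and call the CLI from a Python paragraph instead. The sketch below assumes a Python interpreter is available in the notebook and that the aws CLI is on the PATH of the machine running it; the function names s3_ls_command and list_s3 are hypothetical.

```python
import shlex
import subprocess


def s3_ls_command(folder):
    """Build the argument list for `aws s3 ls` on the given bucket/prefix."""
    return ["aws", "s3", "ls", "s3://" + folder]


def list_s3(folder):
    """Run the command and return its stdout; raises if the aws CLI fails."""
    result = subprocess.run(
        s3_ls_command(folder), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    aws_folder = "my-bucket/my_folder/"  # placeholder path from the question
    # Only print the command here; call list_s3(aws_folder) where the CLI is configured.
    print(shlex.join(s3_ls_command(aws_folder)))
```

Passing the arguments as a list avoids shell quoting problems if the folder name ever contains spaces or other special characters.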

Change pytest rootdir

I am stuck with this incredibly silly error. I am trying to run pytest on a Raspberry Pi using bluepy.
pi@pi:~/bluepy/bluepy $ pytest test_asdf.py
============================= test session starts ==============================
platform linux2 -- Python 2.7.9, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /home/pi/bluepy, inifile:
collected 0 items / 1 errors
==================================== ERRORS ====================================
______________ ERROR collecting bluepy/test_bluetoothutility.py _______________
ImportError while importing test module '/home/pi/bluepy/bluepy/test_asdf.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
test_asdf:4: in <module>
from asdf import AsDf
asdf.py:2: in <module>
from bluepy.btle import *
E ImportError: No module named btle
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
=========================== 1 error in 0.65 seconds ============================
I realised that my problem could be that rootdir is showing an incorrect path. It should be
/home/pi/bluepy/bluepy
I've been reading the pytest docs, but I just don't get how to change the rootdir.
Your problem has nothing to do with Pytest's rootdir.
The rootdir in Pytest has no connection to how test package names are constructed and rootdir is not added to sys.path, as you can see from the problem you were experiencing. (Beware: the directory that is considered rootdir may be added to the path for other reasons, such as it also being the current working directory when you run python -m pytest.)
The problem here, as others have described, is that the top-level bluepy/ is not in sys.path. The easiest way to handle this if you just want to get something running interactively for yourself is as per Cecil Curry's answer: cd to the top-level bluepy and run Pytest as python -m pytest bluepy/test_asdf.py (or just python -m pytest if you want it to discover all test_* files in or under the current directory and run them). But I think you will need to use python -m pytest, not just pytest, in order to make sure that the current working directory is in the path.
If you're looking to set up a test framework that others can easily run without mysterious failures like this, you'll want to set up a test script that sets the current working directory or PYTHONPATH or whatever appropriately. Or use tox. Or just make this a Python package using standard tools that can run the tests for you. (All that goes way beyond the scope of this question.)
By the way, I concur with Cecil's opinion of Mackie Messer's answer; messing around with conftest.py like that is overly difficult and fragile; there are better solutions for almost any circumstance.
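To make the sys.path point concrete, here is a small self-contained sketch. It is a simplified Python 3 model of the situation (the question used Python 2.7), and every name in it (can_import, make_demo_project, demo_pkg_rootdir) is made up for illustration. It builds a throwaway package with a btle module, then tries the import with either the package directory or its parent on sys.path.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(module_name, search_dir):
    """Try importing `module_name` with `search_dir` prepended to sys.path."""
    saved_path = list(sys.path)
    saved_modules = set(sys.modules)
    sys.path.insert(0, str(search_dir))
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        # Restore interpreter state so each check is independent.
        sys.path[:] = saved_path
        for name in set(sys.modules) - saved_modules:
            del sys.modules[name]


def make_demo_project(root):
    """Create root/demo_pkg_rootdir/{__init__.py, btle.py}, like bluepy/bluepy."""
    package_dir = Path(root) / "demo_pkg_rootdir"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "btle.py").write_text("VALUE = 1\n")
    return package_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        package_dir = make_demo_project(root)
        # Searching from inside the package: the package name cannot be resolved.
        print(can_import("demo_pkg_rootdir.btle", package_dir))  # False
        # Searching from the project's top level: the import succeeds.
        print(can_import("demo_pkg_rootdir.btle", root))  # True
```

The import only succeeds when the directory that contains the package, not the package directory itself, is on sys.path, which is what running python -m pytest from the top-level directory arranges.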
Appendix: Use of rootdir
There are only two things, as far as I'm aware, for which rootdir is used:
The .pytest_cache/ directory is stored in the rootdir unless otherwise specified (with the cache_dir configuration option).
If rootdir contains a conftest.py, it will always be loaded, even if no test files are loaded from in or under the rootdir.
The documentation claims that the rootdir is also used to generate nodeids, but adding a conftest.py containing
def pytest_runtest_logstart(nodeid, location):
    print("logstart nodeid={} location={}".format(nodeid, location))
and running pytest --rootdir=/somewhere/way/outside/the/tree shows that to be incorrect (though node locations are relative to the rootdir).
My first guess would be that you don't have that directory in the python path. You can add it to the python path dynamically. One simple way to do this is in a test configuration file conftest.py, which I believe is always executed before test discovery and test running.
For example, you might have a project setup like:
root
+-- tests
| +-- conftest.py
| +-- tests_asdf.py
+-- bluepy (or main project dir)
| +-- miscellaneous modules
In which case, you could add the root dir to your python path in the conftest.py file like so:
# conftest.py
import sys
from os.path import abspath, dirname

# Two levels up from this file (tests/conftest.py) is the project root.
root_dir = dirname(dirname(abspath(__file__)))
sys.path.append(root_dir)
Let me know if that's helpful.
Actually, py.test is correctly discovering the rootdir for your project to be /home/pi/bluepy. That's good.
Tragically, you are erroneously attempting to run py.test within your project's package subdirectory (i.e., /home/pi/bluepy/bluepy) rather than within your project's rootdir (i.e., /home/pi/bluepy). That's bad.
Let's break this down a little. From within the:
/home/pi/bluepy directory, there is a bluepy.btle submodule. (Good.)
/home/pi/bluepy/bluepy subdirectory, there is no bluepy.btle submodule. (Bad.) Unless you awkwardly attempt to manually inject the parent directory of this subdirectory (i.e., /home/pi/bluepy) onto sys.path as Mackie Messer perhaps inadvisably advises, Python has no means of inferring that the package bluepy actually refers to the current directory coincidentally also named bluepy. To avoid ambiguity issues of this sort, Python is typically only run outside rather than inside of a project's package subdirectory.
Since you ran py.test from the latter rather than the former directory, Python is unable to find the bluepy.btle submodule on the current sys.path. For this and similar reasons, py.test should typically only ever be run from your project's top-level rootdir (i.e., /home/pi/bluepy):
pi@pi:~/ $ cd ~/bluepy
pi@pi:~/bluepy $ py.test bluepy/test_asdf.py
Lastly, note that it's typically preferable to defer test discovery to py.test. Rather than explicitly listing all test script filenames on the command line, consider instead letting py.test implicitly find and run all tests containing some substring via the -k option. For example, to run all tests whose function names are prefixed by test_asdf (regardless of the test script they reside in):
pi@pi:~/ $ cd ~/bluepy
pi@pi:~/bluepy $ py.test -k test_asdf .
The suffixing . is optional, but often useful. It instructs py.test to set its rootdir property to the current directory (i.e., /home/pi/bluepy). py.test is usually capable of finding your project's rootdir and setting this property on its own, but it can't hurt to specify it manually. (Especially as you're having... issues.)
For further details on rootdir discovery, see Initialization: determining rootdir and inifile in the official py.test documentation.

How do I create an alias for a directory in ipython?

On Windows I want to create an alias for my working directory so I can quickly cd into it.
I have tried this command
%alias $UWHPSC echo 'c:/Users/xxxx/Documents/uwhpsc'
cd $UWHPSC
which gives the following error
[Error 2] The system cannot find the file specified: u'$UWHPSC'
c:\Users\xxxx\Documents\uwhpsc
%cd has a notion of bookmarks, which persist across IPython sessions:
%bookmark UWHPSC c:/Users/xxxx/Documents/uwhpsc
%cd UWHPSC
See the output of %bookmark? for more info.
Just define a normal Python variable, and then use it with a $ in the cd command:
UWHPSC = 'c:/Users/xxxx/Documents/uwhpsc'
cd $UWHPSC