I am trying to run a test for my PySpark code on a local Windows machine. Pytest gets stuck at the line where I create the SparkSession in my test code. Do I have to install/configure Spark on my local machine for pytest to work? The tests will eventually run as part of CI/CD; do I have to configure Spark on the build machines as well? I have a related question, but it looks like the issue is not with Visual Studio Code but with pytest (I have the same issue when I run pytest from the command line).
below is my test code
# test code
from pyspark.sql import SparkSession, Row, DataFrame
import pytest

def test_poc():
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()  # this line never returns when debugging the test.
    spark_session.createDataFrame(data, schema)  # data and schema not shown here.
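For what it's worth, a common way to structure such a test is to build the SparkSession in a pytest fixture so it is created once and cleaned up afterwards. Below is a minimal sketch; the fixture name, appName, and the data/schema values are hypothetical placeholders (they are not from the question, which did not show them), and it assumes PySpark is installed in the test environment:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    # Build (or reuse) a local SparkSession once per test session.
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("pytest-local-spark")
        .getOrCreate()
    )
    yield spark
    spark.stop()

def test_poc(spark_session):
    # Hypothetical data and schema, purely for illustration.
    data = [(1, "a"), (2, "b")]
    schema = "id INT, label STRING"
    df = spark_session.createDataFrame(data, schema)
    assert df.count() == 2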
Can you add the terminal output of your PySpark script? It will be helpful to understand where to begin, and it might give us a clue about what the problem in your setup is.
At least to see whether you have installed PySpark correctly (you might still need additional checks to be fully sure), you can run something like the script below, saved in a Python file sample_test.py:
from pyspark import sql

spark = sql.SparkSession.builder \
    .appName("local-spark-session") \
    .getOrCreate()
And running it should print out something like below
C:\Users\user\Desktop>python sample_test.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
C:\Users\user\Desktop>SUCCESS: The process with PID 16368 (child process of PID 12664) has been terminated.
SUCCESS: The process with PID 12664 (child process of PID 11736) has been terminated.
SUCCESS: The process with PID 11736 (child process of PID 6800) has been terminated.
And below is a sample test for PySpark using pytest, saved in a file called sample_test.py:
from pyspark import sql

spark = sql.SparkSession.builder \
    .appName("local-spark-session") \
    .getOrCreate()

def test_create_session():
    assert isinstance(spark, sql.SparkSession)
    assert spark.sparkContext.appName == 'local-spark-session'
    assert spark.version == '3.1.2'
Which you can simply run as below
C:\Users\user\Desktop>pytest -v sample_test.py
============================================= test session starts =============================================
platform win32 -- Python 3.6.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- c:\users\user\appdata\local\programs\python\python36\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\user\Desktop
collected 1 item
sample_test.py::test_create_session PASSED [100%]
============================================== 1 passed in 4.51s ==============================================
C:\Users\user\Desktop>SUCCESS: The process with PID 4752 (child process of PID 9780) has been terminated.
SUCCESS: The process with PID 9780 (child process of PID 8988) has been terminated.
SUCCESS: The process with PID 8988 (child process of PID 20176) has been terminated.
The example above is for Windows. My account is new, so I couldn't reply to your comments... Can you update your question to share the messages/errors from the terminal, if there are any? And by the way, just wondering: what OS are you using?
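One caveat with the sample test above: the assert on spark.version == '3.1.2' will fail on any other PySpark release. A looser variation (just a sketch, assuming you only care about the major version) could be:

def test_create_session():
    assert isinstance(spark, sql.SparkSession)
    assert spark.sparkContext.appName == 'local-spark-session'
    # Avoid pinning an exact patch release; any 3.x PySpark passes.
    assert spark.version.startswith('3.')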
The following code is intended to run unit tests in Databricks notebooks, using pytest.
import pytest
import os
import sys
repo_name = "Databricks-Code-Repo"
# Get the path to this notebook, for example "/Workspace/Repos/{username}/{repo-name}".
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
# Get the repo's root directory name.
repo_root = os.path.dirname(os.path.dirname(notebook_path))
# Prepare to run pytest from the repo.
os.chdir(f"/Workspace/{repo_root}/{repo_name}")
print(os.getcwd())
# Skip writing pyc files on a readonly filesystem.
sys.dont_write_bytecode = True
# Run pytest.
retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])
# Fail the cell execution if there are any test failures.
assert retcode == 0, "The pytest invocation failed. See the log for details."
This code snippet is from the guide provided by Databricks.
However, it produces the following error:
PermissionError: [Errno 1] Operation not permitted: '/Workspace//Repos/<email_address>/Databricks-Code-Repo/Databricks-Code-Repo'
This notebook is inside Databricks Repos. I have two other notebooks:
functions (where I have defined three data transformation functions);
test_functions (where I have defined a test function for each of the data transformation functions from the previous notebook).
I get that the error has something to do with permissions, but I can't figure out what is causing it. I would appreciate any suggestions.
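Without claiming a definitive fix for the Repos question above, one way to narrow it down is to print the path pieces before calling os.chdir; the doubled repo name in the traceback suggests the target directory is not the one intended. A small diagnostic sketch, using the same dbutils call as in the question:

import os

repo_name = "Databricks-Code-Repo"
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
repo_root = os.path.dirname(os.path.dirname(notebook_path))
target = f"/Workspace/{repo_root}/{repo_name}"

print("notebook_path:", notebook_path)            # e.g. /Repos/<user>/Databricks-Code-Repo/...
print("repo_root:    ", repo_root)
print("target:       ", os.path.normpath(target))  # check whether the repo name appears twice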
I have the following structure:
A Koholo job calls a Python script; the script returns an exit code (1 = failed, 0 = passed) when it ends, and Koholo waits for that exit code before continuing to the next job step (the next scripts).
Now, instead of a plain Python script, I'm running pytest tests (with the command python -m pytest test_name), but pytest is not returning an exit code, so the Koholo job times out.
Please let me know if there is a way to make pytest return an exit code when it finishes.
For example, you can pass any pytest argument that you normally pass on the CLI; I am just using a marker as an example:
import sys
import pytest
results = pytest.main(["-m", "my_marker"])
sys.exit(results)
If you want more details:
https://docs.pytest.org/en/7.1.x/reference/exit-codes.html
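For reference, pytest.main() returns a pytest.ExitCode value (an IntEnum, available in pytest 5.0 and later), so you can both forward it and log what it means; a short sketch:

import sys
import pytest

code = pytest.main(["-m", "my_marker"])
# Examples: ExitCode.OK == 0, ExitCode.TESTS_FAILED == 1, ExitCode.NO_TESTS_COLLECTED == 5
print(f"pytest finished with exit code {int(code)} ({pytest.ExitCode(code).name})")
sys.exit(int(code))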
When pytest finishes, it calls the pytest_sessionfinish(session, exitstatus) hook.
Try adding sys.exit(exitstatus) to this hook:
import sys

def pytest_sessionfinish(session, exitstatus):
    """ whole test run finishes. """
    sys.exit(exitstatus)
You can also check the exit code by running this from a Windows command prompt:
start /wait python -m pytest test_name
echo %errorlevel%
I am working in a TravisCI, MLflow and Databricks environment, where .travis.yml sits on the git master branch and detects any change in a .py file; whenever one is updated, it runs an MLflow command to execute the .py file in the Databricks environment.
My MLproject file looks as follows:
name: mercury_cltv_lib
conda_env: conda-env.yml

entry_points:
  main:
    command: "python3 run-multiple-notebooks.py"
The workflow is as follows:
TravisCI detects a change on the master branch --> triggers a build, which runs the MLflow command and spins up a job cluster in Databricks to run the .py file from the repo.
It worked fine with one .py file, but when I tried to run multiple notebooks using dbutils, it throws:
File "run-multiple-notebooks.py", line 3, in <module>
from pyspark.dbutils import DBUtils
ModuleNotFoundError: No module named 'pyspark.dbutils'
Please find below the relevant code section from run-multiple-notebooks.py
def get_spark_session():
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()

def get_dbutils(self, spark=None):
    try:
        if spark == None:
            spark = spark
        from pyspark.dbutils import DBUtils  # error line
        dbutils = DBUtils(spark)  # error line
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils

def submitNotebook(notebook):
    print("Running notebook %s" % notebook.path)
    spark = get_spark_session()
    dbutils = get_dbutils(spark)
I tried all the options, including https://stackoverflow.com/questions/61546680/modulenotfounderror-no-module-named-pyspark-dbutils, and it is not working :(
Can someone please suggest a fix for the above-mentioned error when running the .py file on a job cluster? My code works fine inside a Databricks notebook, but running it from outside via TravisCI and MLflow isn't working, and that is a must-have for pipeline automation.
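Not a confirmed fix for the question above, but one way this helper is often structured is to branch on the environment instead of relying on the import alone; a rough sketch (the DATABRICKS_RUNTIME_VERSION check is an assumption about how the cluster environment is detected):

import os

def get_dbutils(spark):
    """Return a dbutils handle on Databricks, or None elsewhere (sketch only)."""
    if os.environ.get("DATABRICKS_RUNTIME_VERSION"):  # assumed to be set on Databricks clusters
        try:
            from pyspark.dbutils import DBUtils
            return DBUtils(spark)
        except ImportError:
            import IPython
            return IPython.get_ipython().user_ns["dbutils"]
    return None  # e.g. running on the TravisCI driver, where dbutils does not exist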
I created a test module by following all the conventions, but when I run the test, I get the following message:
collecting 0 items
Here's my directory hierarchy:
integration_tests (directory) -> tests (directory) -> test_integration_use_cases.py (Python file)
And this is the content of the file:
import pytest
from some_tests.integration_tests.backbone.SomeIntegrationTestBase import SomeIntegrationTestBase

class TestSomeIntegration(SomeIntegrationTestBase):

    @pytest.mark.p1
    def test_some_integration_use_cases(self):
        print("**** Executing integration tests ****")
        result = self.execute_test(4)
        assert (True == result)
When I run the following command:
pytest test_integration_use_cases.py
I see the following result without any errors:
collecting 0 items
FYI: I am running this on a development machine (like Vagrant).
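A side note on the @pytest.mark.p1 marker used above: custom markers are normally registered so recent pytest versions do not warn about (or, with --strict-markers, reject) them. A minimal conftest.py sketch; the description text is just illustrative:

# conftest.py
def pytest_configure(config):
    # Register the custom "p1" marker used by the integration tests.
    config.addinivalue_line("markers", "p1: priority-1 integration tests")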
I had the same problem as you, even after following all the recommended conventions. My application structure was as follows:
Application
-- API
   app.py
-- docs
-- venv
-- tests
   -- unit_test
      test_factory
      ...
   ...
I, however, resolved the issue by moving the tests directory under the API package, so that my application structure looked as below:
Application
-- API
   app.py
   -- tests
      -- unit_test
         test_factory
         ...
-- docs
-- venv
...
Although pytest is supposed to auto-discover the tests, it seems to do so only when they are placed under the application root. Check out the pytest for Flask guide.
I also found this resource helpful.
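If moving the tests directory is not an option, it can also help to ask pytest what it is (or is not) collecting. A small sketch run from the project root; the path argument assumes the hierarchy described in the question:

import sys
import pytest

# --collect-only lists the tests pytest discovers without running them; -q keeps the output short.
sys.exit(pytest.main(["--collect-only", "-q", "integration_tests/tests"]))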
I am a beginner in Spark and am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html
But when I run in cmd the following:
C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark
then I receive the following error message:
File "bin\pyspark", line 21
export SPARK_HOME="$(cd ="$(cd "`dirname "$0"`"/..; pwd)"
SyntaxError: invalid syntax
What am I doing wrong here?
P.S. When in cmd I try just C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4>bin\pyspark
then I receive "'python' is not recognized as an internal or external command, operable program or batch file".
You need to have Python available in the system path; you can add it with setx:
setx path "%path%;C:\Python27"
I'm a fairly new Spark user (as of today, really). I am using Spark 1.6.0 on Windows 10 and 7 machines. The following worked for me:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Using the code above, I was able to launch Spark in an IPython notebook and in my Enthought Canopy Python IDE. Before this, I was only able to launch pyspark through a cmd prompt. The code above will only work if you have your environment variables set correctly for Python and Spark (pyspark).
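One caveat with the snippet above: execfile only exists on Python 2. On Python 3, the last line could be replaced with something like this sketch:

# Python 3 replacement for execfile(...)
shell_py = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(shell_py).read(), shell_py, 'exec'))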
I run this set of path settings whenever I start pyspark in IPython:
import os
import sys

# Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"') for R
### restart spark using: ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
os.environ['SPARK_HOME'] = "G:/Spark/spark-1.5.1-bin-hadoop2.6"
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/bin")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext

## sc.stop()  # if you wish to stop the context
sc = SparkContext("local", "Simple App")
With the reference and help of the user "maxymoo", I was able to find a way to set a PERMANENT path in Windows 7 as well. The instructions are here:
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx
Simply set the path in System -> Environment Variables -> Path:
R path on my system: C:\Program Files\R\R-3.2.3\bin
Python path on my system: c:\python27
Spark path on my system: c:\spark-2
The paths must be separated by ";" and there must be no spaces between them.
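After editing the Windows Path as described above, a quick way to confirm that a new shell actually sees the values is a tiny check like the sketch below (it assumes SPARK_HOME was also set, as in the earlier answer):

import os

# Print the environment variables a freshly opened shell would pass to Python/Spark.
for name in ("SPARK_HOME", "PATH"):
    print(name, "=", os.environ.get(name, "<not set>"))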