Visual Studio Code using pytest for PySpark getting stuck at SparkSession creation

I am trying to run a PySpark unit test in Visual Studio Code on my local Windows machine. When I debug the test, it gets stuck at the line where I create a SparkSession. It doesn't show any error or failure; the status bar just shows "Running Tests". Once it works, I can refactor my test to create the SparkSession as part of a test fixture, but at present the test gets stuck at SparkSession creation.
Do I have to install or configure anything on my local machine for the SparkSession to work?
I tried a simple test with assert 'a' == 'b' and I can debug it and the test runs successfully, so I assume my pytest configuration is correct. The issue I am facing is with creating the SparkSession.
# test code
from pyspark.sql import SparkSession, Row, DataFrame
import pytest

def test_poc():
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()  # this line never returns when debugging the test
    spark_session.createDataFrame(data, schema)  # data and schema not shown here
Thanks

What I have done to make it work was:
Create a .env file in the root of the project
Add the following content to the created file:
SPARK_LOCAL_IP=127.0.0.1
JAVA_HOME=<java_path>/jdk/zulu#1.8.192/Contents/Home
SPARK_HOME=<spark_path>/spark-3.0.1-bin-hadoop2.7
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Go to the .vscode folder in the root, expand it, and open settings.json. Add the following line (replace <workspace_path> with your actual workspace path):
"python.envFile": "<workspace_path>/.env"
After refreshing the Testing section in Visual Studio Code, the setup should succeed.
Note: I use pyenv to setup my python version, so I had to make sure that VS Code was using the correct python version with all the expected dependencies installed.
Solution inspired by py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM and https://github.com/microsoft/vscode-python/issues/6594
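For context (not part of the original answer): the python.envFile setting just loads KEY=VALUE pairs into the test process's environment before pytest starts. Conceptually it behaves like this stdlib-only sketch (the real loader also handles quoting and variable substitution, which this deliberately skips):

```python
def load_env_file(text):
    """Minimal parser for KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        env[key.strip()] = value.strip()
    return env

env = load_env_file("SPARK_LOCAL_IP=127.0.0.1\n# comment\nJAVA_HOME=/opt/jdk8\n")
print(env['SPARK_LOCAL_IP'])  # 127.0.0.1
```

If a variable such as JAVA_HOME still appears unset inside the test, the envFile path in settings.json is the first thing to double-check.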

Related

How to debug unit tests while developing a package in Julia

Say I develop a package with a limited set of dependencies (for example, LinearAlgebra).
For the unit-testing part, I might need additional dependencies (for instance, CSV to load a file). I can configure that in the Project.toml; all good.
Now, from there and in VS Code, how can I debug the unit tests? I tried running runtests.jl in the debugger; however, it unsurprisingly complains that the CSV package is unavailable.
I could add the CSV package (as a temporary solution), but I would prefer that the debugger run with the configuration for unit testing; how can I achieve that?
As requested, here is how it can be reproduced (it is not quite minimal; instead I used a commonly used package, as that gives confidence the package itself is not the problem). We will use DataFrames and try to run the debugger on its unit tests.
Make a local version of DataFrames for the purpose of developing a feature in it. I execute dev DataFrames in a new REPL.
Select the correct environment (in .julia/dev/DataFrames) through the VS Code user interface.
Execute the "proper" unit testing by executing test DataFrames at the pkg prompt. Everything should go smoothly.
Try to run the tests directly (open runtests.jl and use the "Run" button in VS Code). I see errors of the type:
LoadError: ArgumentError: Package CategoricalArrays not found in current path:
- Run `import Pkg; Pkg.add("CategoricalArrays")` to install the CategoricalArrays package.
which is consistent with CategoricalArrays being present in the [extras] section of the Project.toml but not present in the [deps].
Finally, instead of the "Run" command, use "Run and Debug". I encounter similar errors; here is the first one:
Test Summary: | Pass Total
merge | 19 19
PASSED: index.jl
FAILED: dataframe.jl
LoadError: ArgumentError: Package DataStructures not found in current path:
- Run `import Pkg; Pkg.add("DataStructures")` to install the DataStructures package.
So I can't debug the code after the part requiring the extras packages.
After all that, I release the development version with the command free DataFrames at the pkg prompt.
I see the same behavior in my package.
I'm not certain I understand your question, but I think you might be looking for the TestEnv package. It allows you to activate a temporary environment containing the [extras] dependencies. The Discourse announcement contains a good description of the use cases.
Your runtests.jl file should contain all the imports necessary to run the tests.
Hence you are expected to have in your runtests.jl file lines such as:
using YourPackageName
using CSV
# the lines with tests now go here.
This is a standard part of the Julia package layout. For an example, have a look at any mature Julia package, such as DataFrames.jl (https://github.com/JuliaData/DataFrames.jl/blob/main/test/runtests.jl).

Cucumber test runner file not executing step definitions

I am building a REST Assured API test framework with Cucumber. (This is a new adventure for me, so apologies if this seems basic.)
Below is how I setup my test runner file.
package cucumber.Options;

import org.junit.runner.RunWith;
import io.cucumber.junit.Cucumber;
import io.cucumber.junit.CucumberOptions;

@RunWith(Cucumber.class)
@CucumberOptions(features = "src/test/java/features", glue = {"stepDefinitions"}, strict = true, stepNotifications = true)
public class TestRunner {
}
I have made sure the glue matches my step definition file; I've also tried adding the full path.
However, no matter what I set as the glue, my step definitions are not being executed.
As soon as I run the TestRunner file as a JUnit test, JUnit marks the run as complete.
I set up a simple scenario that just prints a test output to the console, and the console never shows the output.
I even tried setting my glue to a filename that doesn't even exist to see if I got an error, but I get the same result as above.
And whenever I click one of the lines in the feature file, I get the error "Test class not found in selected project".
Has anyone experienced the same behavior with Eclipse?

How to debug unit tests in VS Code using pytest

I am trying to use VS Code to debug some unit tests using pytest. One special thing is that I am importing the module I need to test from another folder:
src
    func.py
test
    test_func.py
so inside test_func.py, I imported the function I need to test:
import func
def test_func():
...
But when I tried to debug it using VS Code, it told me "module func is not found!"
Should I do some extra configuration to make it work?
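One common fix for this layout (an assumption — the question doesn't show its pytest configuration) is a conftest.py at the project root that puts src/ on sys.path before tests are collected:

```python
# conftest.py at the project root (hypothetical; matches the layout above).
# pytest imports conftest.py before collecting tests, so `import func` in
# test/test_func.py can then resolve against src/func.py.
import os
import sys

SRC_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "src")
if SRC_DIR not in sys.path:
    sys.path.insert(0, SRC_DIR)
```

An alternative with the same effect is adding src to PYTHONPATH via an envFile, as in the SparkSession answer earlier in this page.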

PySpark on Linux with pycharm - first exception error

I am trying to run my first PySpark script on a Linux VM I configured. The error message I have is KeyError: SPARK_HOME when I run the following:
from os import environ
from pyspark import SparkContext
I momentarily made this error go away by running export SPARK_HOME=~/spark-2.4.3-bin-hadoop2.7. I then ran into a new error, error=2, No such file or directory. Searching took me to this page: https://community.cloudera.com/t5/Community-Articles/Tutorial-Install-Configure-iPython-and-create-run-PySpark/ta-p/246400. I then ran export PYSPARK_PYTHON=~/python3*. This brings me back to the KeyError: SPARK_HOME error.
Honestly, I'm stumbling through this, because it's my first time configuring Spark and using PySpark. I still don't quite understand the ins and outs of PyCharm, either.
I expect to be able to run the following basic sample script on this page: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f with no issues.
There is a package called findspark (here), or you may use the code below to set the path if it is not found in the environment:
import os

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = 'full_path_to_spark_root'
[code continues]
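As a sanity check (a stdlib-only sketch, not part of the original answer), you can verify the two environment variables before creating a SparkContext. Note that a glob like ~/python3* in PYSPARK_PYTHON is never expanded by Spark; it must point at an actual interpreter binary:

```python
import os

def check_spark_env(env):
    """Return a list of problems with SPARK_HOME / PYSPARK_PYTHON settings."""
    problems = []
    spark_home = env.get('SPARK_HOME')
    if spark_home is None:
        problems.append('SPARK_HOME is not set')
    elif not os.path.isdir(os.path.expanduser(spark_home)):
        problems.append('SPARK_HOME does not point to a directory')
    python = env.get('PYSPARK_PYTHON', '')
    if '*' in python:
        problems.append('PYSPARK_PYTHON contains an unexpanded glob')
    return problems

# The question's setup hits both problems at once:
print(check_spark_env({'PYSPARK_PYTHON': '~/python3*'}))
# ['SPARK_HOME is not set', 'PYSPARK_PYTHON contains an unexpanded glob']
```

Running this against os.environ before importing pyspark makes the KeyError reproducible and explainable rather than intermittent.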

Can't find module 'graphframes' -- Jupyter

I'm trying to install graphframes package following some instructions I have already read.
My first attempt was to do this in the command line:
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
This works perfectly and the download was successfully done in the machine.
However, when I try to import the package in my Jupyter notebook, it displays the error:
can't find module 'graphframes'
My next attempt was to copy the package folder /graphframes to /site-packages, but I could not manage it with a simple cp command.
I'm quite new to using Spark, and I'm sure I'm missing some parts of the configuration...
Could you please help me?
This was what worked for me.
Extract the contents of the graphframes-xxx-xxx-xxx.jar file. You should get something like
graphframes
|-- examples
|-- ...
|-- __init__.py
|-- ...
Zip up the entire folder (not just the contents) and name it whatever you want. We'll just call it graphframes.zip.
Then, run the pyspark shell with
pyspark --py-files graphframes.zip \
--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
You may need to do
sc.addPyFile('graphframes.zip')
before
import graphframes
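The layout of the archive is what usually goes wrong here: the graphframes/ folder itself must sit at the top level of the zip, not just its contents. A small stdlib sketch (file names illustrative) that checks this:

```python
import io
import zipfile

def has_top_level_package(zip_bytes, package='graphframes'):
    """True if the archive contains package/__init__.py at its top level,
    which is what --py-files / sc.addPyFile need for `import graphframes`."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return f'{package}/__init__.py' in zf.namelist()

# Correct layout: the graphframes/ folder itself is inside the zip.
good = io.BytesIO()
with zipfile.ZipFile(good, 'w') as zf:
    zf.writestr('graphframes/__init__.py', '')

# Wrong layout: only the folder's contents were zipped.
bad = io.BytesIO()
with zipfile.ZipFile(bad, 'w') as zf:
    zf.writestr('__init__.py', '')

print(has_top_level_package(good.getvalue()), has_top_level_package(bad.getvalue()))  # True False
```

Running the same check against your real graphframes.zip (e.g. by passing open('graphframes.zip', 'rb').read()) tells you immediately whether step 2 above was done correctly.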
The simplest way to start Jupyter with pyspark and graphframes is to start Jupyter from pyspark.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this is that if you later want to run your code via spark-submit, you can use the same start command.