Dataproc: functools.partial no attribute '__module__' error for pyspark UDF - pyspark

I am using GCP/Dataproc for some Spark/GraphFrames calculations.
On my private Spark/Hadoop standalone cluster, I have no issue using functools.partial when defining a PySpark UDF.
But now, with GCP/Dataproc, I run into the issue below.
Here are some basic settings to check whether partial works or not.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
def power(base, exponent):
    return base ** exponent
In the main function, functools.partial works well in ordinary cases as we expect:
# see whether partial works as it is
square = partial(power, exponent=2)
print "*** Partial test = ", square(2)
But if I pass this partial(power, exponent=2) function to a PySpark UDF as below,
testSquareUDF = F.udf(partial(power, exponent=2),T.FloatType())
testdf = inputdf.withColumn('pxsquare',testSquareUDF('px'))
I have this error message:
Traceback (most recent call last):
File "/tmp/bf297080f57a457dba4d3b347ed53ef0/gcloudtest-partial-error.py", line 120, in <module>
testSquareUDF = F.udf(square,T.FloatType())
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1971, in udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1955, in _udf
File "/opt/conda/lib/python2.7/functools.py", line 33, in update_wrapper
setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: 'functools.partial' object has no attribute '__module__'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [bf297080f57a457dba4d3b347ed53ef0] entered state [ERROR] while waiting for [DONE].
=========
I did not have this kind of issue with my standalone cluster.
My Spark cluster version is 2.1.1; the one on GCP Dataproc is 2.2.x.
Can anyone see what prevents me from passing the partial function to the UDF?

As discussed in the comments, the issue was with Spark 2.2. Since Spark 2.3 is also supported by Dataproc, just using --image-version=1.3 when creating the cluster fixes it.
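If upgrading the image is not an option, a possible workaround (not from the original answer, just a sketch) is to wrap the partial in a plain named function, so that the functools.wraps call shown in the traceback finds the usual __module__/__name__ attributes:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

def power(base, exponent):
    return float(base ** exponent)

square_partial = partial(power, exponent=2)

def square(base):
    # plain function wrapping the partial, so udf() can copy
    # attributes like __module__ without the AttributeError
    return square_partial(base)

testSquareUDF = F.udf(square, T.FloatType())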

Related

How to pass an argument to a spark submit job in airflow

I have to trigger a PySpark module from Airflow using a SparkSubmitOperator, but the PySpark module needs to take the Spark session variable as an argument. I used application_args to pass the parameter to the PySpark module, but when I run the DAG, the SparkSubmitOperator task fails and the parameter I passed is treated as a None type variable.
I need to know how to pass an argument to a PySpark module triggered through SparkSubmitOperator.
The DAG code is below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()
spark_config = {
    'conn_id': 'spark_default',
    'driver_memory': '1g',
    'executor_cores': 1,
    'num_executors': 1,
    'executor_memory': '1g'
}
dag = DAG(
    dag_id="spark_session_prgm",
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False)
spark_submit_task1 = SparkSubmitOperator(
    task_id='spark_submit_task1',
    application='/home/airflow_home/dags/tmp_spark_1.py',
    application_args=['spark'],
    **spark_config, dag=dag)
The sample code in tmp_spark_1.py program:
After a bit of debugging, I found the solution to my problem.
argparse was the reason it was not working. Instead, I used sys.argv[1] and it does the job.
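A minimal sketch of what the PySpark side can look like when reading the value with sys.argv (the file name tmp_spark_1.py and the passed value 'spark' follow the question; everything else is illustrative):

import sys
from pyspark.sql import SparkSession

# tmp_spark_1.py: read the value passed via application_args
arg_value = sys.argv[1]  # 'spark' in the DAG above

spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()
print("Received argument: {}".format(arg_value))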

How to mock an S3AFileSystem locally for testing spark.read.csv with pytest?

What I'm trying to do
I am attempting to unit test an equivalent of the following function, using pytest:
def read_s3_csv_into_spark_df(s3_uri, spark):
    df = spark.read.csv(
        s3_uri.replace("s3://", "s3a://")
    )
    return df
The test is defined as follows:
def test_load_csv(self, test_spark_session, tmpdir):
    # here I 'upload' a fake csv file using the tmpdir fixture and moto's mock_s3 decorator
    # now that the fake csv file is uploaded to s3, I try to read it into a spark df using my function
    baseline_df = read_s3_csv_into_spark_df(
        s3_uri="s3a://bucket/key/baseline.csv",
        spark=test_spark_session
    )
In the above test, the test_spark_session fixture used is defined as follows:
#pytest.fixture(scope="session")
def test_spark_session():
test_spark_session = (
SparkSession.builder.master("local[*]").appName("test").getOrCreate()
)
return test_spark_session
The problem
I am running pytest on a SageMaker notebook instance, using python 3.7, pytest 6.2.4, and pyspark 3.1.2. I am able to run other tests by creating the DataFrame using test_spark_session.createDataFrame, and then performing aggregations. So the local spark context is indeed working on the notebook instance with pytest.
However, when I attempt to read the csv file in the test I described above, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o84.csv.
E : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
How can I, without actually uploading any csv files to S3, test this function?
I have also tried providing the S3 uri using s3:// instead of s3a://, but got a different, related error: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3".
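A sketch of one possible fixture, assuming the missing class comes from the hadoop-aws artifact and that a standalone moto server is used instead of the in-process mock_s3 decorator (the JVM's S3A client does not go through moto's in-process mock); the package version and endpoint are assumptions:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def test_spark_session():
    # hadoop-aws provides org.apache.hadoop.fs.s3a.S3AFileSystem;
    # the version should match the Hadoop build bundled with pyspark (assumed here)
    return (
        SparkSession.builder.master("local[*]")
        .appName("test")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
        # point S3A at a locally running moto server instead of real S3
        .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:5000")
        .config("spark.hadoop.fs.s3a.access.key", "testing")
        .config("spark.hadoop.fs.s3a.secret.key", "testing")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )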

Name sc is not defined

I am just trying to execute sc.version inside the pyspark shell, but I am getting an error saying sc is not defined.
>>> sc.version()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'sc' is not defined
If I run SparkContext.getOrCreate():
>>> SparkContext.getOrCreate()
<pyspark.context.SparkContext object at 0x7f206aa8cfd0>
Even then I am not getting the output of sc.version(). What is the problem?
A few things:
Inside the pyspark shell you automatically only have access to the Spark session (which can be referenced as spark).
To get the SparkContext, take it from the Spark session with sc = spark.sparkContext, or use the getOrCreate() method as mentioned by @Smurphy0000 in the comments.
version is an attribute of the SparkContext, not a method, which is why calling sc.version() fails. To get the version from the SparkContext (sc in this case), use version = sc.version. The version can also be taken from the session directly as version = spark.version.
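Putting those points together, a short sketch (assuming a plain local session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print(sc.version)     # version is an attribute, not a method
print(spark.version)  # same value, taken from the session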

Error in Pycharm when linking to pyspark: name 'spark' is not defined

When I run the example code in cmd, everything is ok.
>>> import pyspark
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
But when I execute the code in pycharm, I get an error.
spark.createDataFrame(l).collect()
NameError: name 'spark' is not defined
Maybe something is wrong with how I linked PyCharm to pyspark.
(Screenshots: Environment Variable, Project Structure, Project Interpreter)
When you start pyspark from the command line, you automatically get a SparkSession object and a SparkContext, available as spark and sc respectively.
To use them in PyCharm, you should create these variables first:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
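With those variables created, the snippet from the question runs in PyCharm, for example:

l = [('Alice', 1)]
print(spark.createDataFrame(l).collect())  # [Row(_1='Alice', _2=1)]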
EDIT:
Please have a look at: Failed to locate the winutils binary in the hadoop binary path

Reading a csv file in pyspark (1.6.0)

Maybe the question is trivial, but I am getting issues while reading a csv from a local directory in Pyspark.
I tried:
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark import SparkContext as sc
mydata = sc.textFile("/home/documents/mydata.csv")
newdata = mydata.map(lambda line: line.split(","))
But I am getting an error like:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)
Now my question is: I have called SparkContext just before that, so why am I getting such an error? Please guide me on where I am going wrong.
You should not import SparkContext as sc:
In interactive usage (i.e. the pyspark shell), sc is already initialized, so sc.textFile() should work fine.
In self-contained applications, you should initialize sc first:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
where the arguments in SparkContext() matter - see the provided links for more details.
Finally, Spark 1.x cannot natively read CSV files into dataframes - you will need the Spark CSV external package. You may find a relevant blog post I wrote some time ago for Spark 1.5 useful...
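For completeness, a minimal sketch of reading a CSV with the external spark-csv package on Spark 1.x (the package version in the comment is an assumption, and the file path follows the question):

# submit with, e.g.: spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 app.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "CSV example")
sqlContext = SQLContext(sc)

# spark-csv exposes the 'com.databricks.spark.csv' data source
mydata = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/home/documents/mydata.csv"))
mydata.show()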