How to pass an argument to a spark submit job in airflow - pyspark

I have to trigger a PySpark module from Airflow using a SparkSubmitOperator. The PySpark module needs to take the Spark session variable as an argument, so I used application_args to pass the parameter to the module. But when I run the DAG, the spark-submit task fails, and the parameter I passed is treated as a None-type variable.
I need to know how to pass an argument to a PySpark module triggered through the SparkSubmitOperator.
The DAG code is below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()
spark_config = {
    'conn_id': 'spark_default',
    'driver_memory': '1g',
    'executor_cores': 1,
    'num_executors': 1,
    'executor_memory': '1g'
}
dag = DAG(
    dag_id="spark_session_prgm",
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False)
spark_submit_task1 = SparkSubmitOperator(
    task_id='spark_submit_task1',
    application='/home/airflow_home/dags/tmp_spark_1.py',
    application_args=['spark'],
    **spark_config, dag=dag)
The sample code in tmp_spark_1.py program:

After a bit of debugging, I found the solution to my problem.
argparse was the reason it was not working. Instead, I read the argument with sys.argv[1], and that does the job.
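Note that application_args only carries strings, so application_args=['spark'] delivers the text "spark" to the script, not a SparkSession object; the submitted script has to build its own session. A minimal sketch of what tmp_spark_1.py could look like under that assumption follows. The helper name first_app_arg and the printed message are hypothetical, not taken from the original program:

```python
import sys


def first_app_arg(argv):
    """Return the first application argument, or None if none was passed."""
    # argv[0] is the script path; values from application_args start at argv[1].
    return argv[1] if len(argv) > 1 else None


def main(argv):
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    arg = first_app_arg(argv)
    # The session is created inside the job; it cannot be handed over from the DAG.
    spark = SparkSession.builder.appName("PRJT").enableHiveSupport().getOrCreate()
    print("Received argument:", arg)
    spark.stop()


# In the real script, the last line would call: main(sys.argv)
```

With this layout the DAG file itself does not need to create a SparkSession at all.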

Related

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks, and I'm using imports to bring in functions from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
"spark.driver.memory", "32g").getOrCreate()
The imported function "x" takes as its argument data that was read into a PySpark DataFrame like this:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like:
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install old_file on the cluster for this to work? If so, how would that work, and will the package update if I change old_file again?
Thanks

Pass parameters/arguments to HDInsight/Spark Activity in Azure Data Factory

I have an on-demand HDInsight cluster that is launched from a Spark Activity within Azure Data Factory and runs PySpark 3.1. To test out my code, I normally launch Jupyter Notebook from the created HDInsight Cluster page.
Now, I would like to pass some parameters to that Spark activity and retrieve them from within the Jupyter notebook code. I've tried doing so in two ways, but neither worked for me:
Method A. As Arguments, which I then tried to retrieve using sys.argv[].
Method B. As Spark configuration, which I then tried to retrieve using sc.getConf().getAll().
I suspect that either:
I am not specifying the parameters correctly,
or I am retrieving them the wrong way in the Jupyter notebook code,
or the parameters are only valid for the Python *.py scripts specified in the "File path" field, but not for Jupyter notebooks.
Any pointers on how to pass parameters into HDInsight Spark activity within Azure Data Factory would be much appreciated.
The issue is with the entryFilePath. In the Spark activity for an HDInsight cluster, you must give the entryFilePath as either a .jar file or a .py file. When you do, you can successfully pass arguments, which the script can read using sys.argv.
The following is an example of how you can pass arguments to a Python script.
The code inside nb1.py (sample) is as shown below:
from pyspark import SparkContext
from pyspark.sql import *
import sys
sc = SparkContext()
sqlContext = HiveContext(sc)
# Create an RDD from sample data which is already available
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
# Create a schema for our data
Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID')
# Parse the data and create a schema
hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s[0] != 'Date')
hvac = hvacParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2]), int(p[3]), int(p[6])))
# Infer the schema and create a table
hvacTable = sqlContext.createDataFrame(hvac)
hvacTable.registerTempTable('hvactemptable')
dfw = DataFrameWriter(hvacTable)
# Use the argument passed from the pipeline as the table name.
dfw.saveAsTable(sys.argv[1])
When the pipeline is triggered, it runs successfully and creates the required table (the table name is passed as an argument from the pipeline's Spark activity). We can query this table in the HDInsight cluster's Jupyter notebook using the following query:
select * from new_hvac
NOTE:
Make sure you are passing the arguments to a Python script (.py file), not to a Python notebook.

Error in Pycharm when linking to pyspark: name 'spark' is not defined

When I run the example code in cmd, everything is ok.
>>> import pyspark
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
But when I execute the code in PyCharm, I get an error.
spark.createDataFrame(l).collect()
NameError: name 'spark' is not defined
Maybe something went wrong when I linked PyCharm to PySpark.
[Screenshots: Environment Variable, Project Structure, Project Interpreter]
When you start pyspark from the command line, a SparkSession object and a SparkContext are already available to you as spark and sc respectively.
To use them in PyCharm, you have to create these variables yourself first.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
EDIT:
Please have a look at: Failed to locate the winutils binary in the hadoop binary path

Dataproc: functools.partial no attribute '__module__' error for pyspark UDF

I am using GCP/Dataproc for some Spark/GraphFrames calculations.
On my private Spark/Hadoop standalone cluster, I have no issue using functools.partial when defining a PySpark UDF.
But with GCP/Dataproc, I run into the issue below.
Here is a basic setup to check whether partial works:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
def power(base, exponent):
    return base ** exponent
In the main function, functools.partial works well in ordinary cases as we expect:
# see whether partial works as it is
square = partial(power, exponent=2)
print "*** Partial test = ", square(2)
But if I pass this partial(power, exponent=2) function to a PySpark UDF as below,
testSquareUDF = F.udf(partial(power, exponent=2),T.FloatType())
testdf = inputdf.withColumn('pxsquare',testSquareUDF('px'))
I have this error message:
Traceback (most recent call last):
File "/tmp/bf297080f57a457dba4d3b347ed53ef0/gcloudtest-partial-error.py", line 120, in <module>
testSquareUDF = F.udf(square,T.FloatType())
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1971, in udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1955, in _udf
File "/opt/conda/lib/python2.7/functools.py", line 33, in update_wrapper
setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: 'functools.partial' object has no attribute '__module__'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [bf297080f57a457dba4d3b347ed53ef0] entered state [ERROR] while waiting for [DONE].
=========
I had no such issue with my standalone cluster.
My Spark cluster version is 2.1.1.
GCP Dataproc's is 2.2.x.
Can anyone see what prevents me from passing the partial function to the UDF?
As discussed in the comments, the issue was with Spark 2.2. Since Dataproc also supports Spark 2.3, using --image-version=1.3 when creating the cluster fixes it.
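If changing the image version is not an option, one possible workaround is to wrap the partial in a regular function, since the traceback shows update_wrapper failing while copying function metadata from the partial object. This is a sketch under that assumption and has not been verified on Dataproc; the helper name as_plain_function is hypothetical:

```python
from functools import partial


def power(base, exponent):
    return base ** exponent


def as_plain_function(func):
    """Wrap any callable in a regular function object."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


square = as_plain_function(partial(power, exponent=2))

# A regular function carries the metadata that update_wrapper tries to copy.
print(square(3))        # 9
print(square.__name__)  # wrapper
```

The wrapped square could then be passed to F.udf in place of the bare partial.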

Reading a csv file in pyspark (1.6.0)

Maybe the question is trivial, but I am having issues reading a CSV file from a local directory in PySpark.
I tried:
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark import SparkContext as sc
mydata = sc.textFile("/home/documents/mydata.csv")
newdata = mydata.map(lambda line: line.split(","))
But I get an error like:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)
Now my question is: I have imported SparkContext just before that, so why am I getting this error? Please point out what I am missing.
You should not import SparkContext as sc; that binds sc to the class itself, not to an instance:
In interactive usage (i.e. the pyspark shell), sc is already initialized, so sc.textFile() should work fine.
In self-contained applications, you should initialize sc first:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
where the arguments in SparkContext() matter - see the provided links for more details.
Finally, Spark 1.x cannot natively read CSV files into dataframes; you will need the external Spark CSV package. You may find a relevant blog post I wrote some time ago for Spark 1.5 useful...
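If you stay with the RDD approach from the question, keep in mind that a plain line.split(",") also splits inside quoted fields. A minimal sketch, assuming each RDD element is one complete CSV record with no embedded newlines, is to parse every line with Python's csv module instead. The helper name parse_csv_line is hypothetical:

```python
import csv


def parse_csv_line(line):
    """Parse one CSV record into a list of fields, honoring quotes."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))


# Plain Python check; in a job this would be used as mydata.map(parse_csv_line).
print(parse_csv_line('1,"Smith, John",42'))  # ['1', 'Smith, John', '42']
```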