Use different pandas version for different tasks in the same DAG (Airflow) - virtualenv

Say I have two tasks which uses two versions of, say, pandas
#my_task_one
import pandas as pd #Pandas 1.0.0
def f1(data):
.
.
return 0
and
#my_task_two
import pandas as pd #version 2.0.0
def f2(data):
.
.
return 0
In my airflow (local, no Docker), is there a way to create a venv or requirement-file for each task e.g
#dag.py
t1 = PythonOperator(
task_id = "t1",
python_callable = f1,
requirements = "my_task_one_requirement.txt" #How to set requirements for this task?
)
t2 = PythonOperator(
task_id = "t2",
python_callable = f2,
requirements = "my_task_two_requirement.txt" #How to set requirements for this task?
)
t1>>t2
In case it can't be in the same DAG-file, is there a way to specify the requirements for a given DAG-file e.g placing t1 and t2 in DAG1 and DAG2 respectively, but with different packages/requirement-file?

Airflow has PythonVirtualenvOperator that is suitable for this use case.
t1 = PythonVirtualenvOperator(
task_id="t1",
python_callable=f1,
requirements=["pandas==1.0.0"],
)
t2 = PythonVirtualenvOperator(
task_id="t2",
python_callable=f2,
requirements=["pandas==2.0.0"],
)

Related

Run postgres operator with statement timeout airflow dag

so i've been trying to run 2 postgres operator on my DAG and looks like this:
default_args = {
'owner': 'local',
}
log = logging.getLogger(_name_)
TEMP_SETTLEMENT ="""
set statement_timeout to 0;
select function_a();
"""
VACUM_SETTLEMENT="""
vacuum (verbose, analyze) test_table;
"""
try:
with DAG(
dag_id='temp-test',
default_args=default_args,
schedule_interval=none,
start_date=datetime(2021, 10, 1),
max_active_runs=1,
catchup=False
) as dag:
pg = PostgresOperator(
task_id="data",
postgres_conn_id="connection_1",
database="DB_test",
autocommit=True,
sql=TEMP_SETTLEMENT,
)
vacum = PostgresOperator(
task_id="vacum",
postgres_conn_id="connection_1",
database="DB_test",
autocommit=True,
sql=VACUM_SETTLEMENT, )
pg >> vacum
except ImportError as e:
log.warning("Could not import DAGs: %s", str(e))
i keep getting the statement timeout when i try to run the temp_settlement, is there any way to keep the statement_timeout=0?
Thanks
Update:
Starting apache-airflow-providers-postgres>=4.1.0
you can do:
PostgresOperator(
...,
runtime_parameters={'statement_timeout': '3000ms'},
)
This capability was added in PR that solved issue.
Original Answer:
You didn't mention it by from your description I assume that the timeout comes from Postgres and not from Airflow.
For the moment the PostgresOperator does not allow to override the hook/connection settings.
To solve your issue you will need to edit connection_1 in the extra field as explained in the docs you will need to add statement_timeout:
{'statement_timeout': '3600s'}
I opened https://github.com/apache/airflow/issues/21486 as a followup feature request to allow setting statement_timeout directly from the operator.

Airflow default variables - Incremental load setup

I am trying to implement a incremental data load for a data extract from rds postgres to another postgres rds
I am using airflow, to implement the ETL. So, after reading for a while about airflow macros, I decided I'll set up the incremental flow with airflow default variables.
So, the algorithm is this way,
if my previous execution date is None or null or '':
pick data from the beginning of time(in our case its a year back)
else
pick the previous execution date
end if
Note : the following code is to understand default variables at first, and this is not yet implemented to the problem I have mentioned above
The corresponding code for that is as shown below. When I run the dag for the first time, I always end up printing 'None' for previoussuccessfulexecutiondate variable and never the historical date like what I have mentioned. I am unable to figure this out. Any ideas on this would be of great help
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
default_args={'owner':'airflow','start_date': days_ago(1),'depends_on_past':'False'}
dag = DAG('jinja_trial_10',default_args=default_args,schedule_interval=timedelta(minutes=5))
def printexecutiontimes(**kwargs):
executiondate = kwargs.get('execution_date')
previoussuccessfulexecutiondate = kwargs.get('prev_execution_date_success')
previousexecutiondate = kwargs.get('prev_ds_nodash')
if (previoussuccessfulexecutiondate == 'None' or previoussuccessfulexecutiondate is None):
previoussuccessfulexecutiondate = datetime.strftime(datetime.now() - timedelta(days = 365),'%Y-%m-%d')
print('Execution Date : {0}'.format(executiondate))
print('Previous successful execution date : {0}'.format(previoussuccessfulexecutiondate))
print('Previous execution date : {0}'.format(previousexecutiondate))
print('hello')
task_start = DummyOperator(task_id = 'start',dag=dag)
jinja_task= PythonOperator(task_id = 'TryingoutJinjatemplates',
python_callable =printexecutiontimes,
provide_context = 'True',
dag=dag )
task_end = DummyOperator(task_id = 'end',dag=dag)
task_start >>jinja_task >> task_end
I had to something very similar recently and following code is what i have ended up creating a custom function using DagRun details.
Refer to this answer - if you just want to get last DAG run (irrespective of status).
For me, i had to get the last date of successful run, hence created below function:
def get_last_dag_run(dag_id):
dag_runs = DagRun.find(dag_id=dag_id)
dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
for dag_run in dag_runs:
#print all dag runs - debug only
print(f"All ----- state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
print('Success runs ---------------------------------')
dag_runs = list(filter(lambda x: x.state == 'success', dag_runs))
for dag_run in dag_runs:
#print successfull dag runs - debug only
print(f"Success - state: {dag_run.state} , run_id: {dag_run.run_id} , execution_date: {dag_run.execution_date}")
# return last execution run or default value (01-01-1970)
return dag_runs[0].execution_date if dag_runs else datetime(1970, 1, 1)
After a few experiments and a lot of reading, I came up with the following code and it worked for me
Create a variable in Airflow UI and assign it a value 0
Use Airflow’s predefined variables, to determine whether it is a full
load or a incremental load
Pseudo code -
If
value of Variable created = 0
then
set Variable = 1
set the start data to point in time in the past(a date-time from the inception of a certain process)
set the end date to the value of the "execution_date" (defined as a part of airflow's predefined variables)
else
set the start date to "prev_execution_date_success" (defined as a part of airflow's predefined variables)
set the end date to "execution_date" (defined as a part of airflow's predefined variables)
end
Below is the code snippet for the same
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable
default_args={'owner':'airflow','start_date': datetime(2020,11,3,12,5),'depends_on_past':'False'}
dag = DAG('airflow_incremental_load_setup',default_args=default_args,schedule_interval=timedelta(minutes=5))
def printexecutiontimes(**kwargs):
# Variable to be created before the running of dag
full_load_check = Variable.get('full_load_completion')
print('full_load_check : {0}'.format(full_load_check))
if full_load_check == '0':
print('First execution')
print('Execution date : {0}'.format(kwargs.get('execution_date')))
print('Actual start date : {0}'.format(kwargs.get('ds')))
print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
print('Calculated field : {0}'.format(datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')))
Variable.set('full_load_check', '1')
start_date = datetime.strftime(datetime.now() - timedelta(days=365), '%Y-%m-%d')
end_date = datetime.strftime(kwargs.get('execution_date'), '%Y-%m-%d')
else:
print('After the first execution ..')
print('Execution date : {0}'.format(kwargs.get('execution_date')))
print('Actual start date : {0}'.format(kwargs.get('ds')))
print('Previous successful execution date : {0}'.format(kwargs.get('prev_execution_date_success')))
print('Calculated field : {0}'.format(kwargs.get('prev_execution_date_success')))
start_date = kwargs.get('prev_execution_date_success')
start_date = parse(str(start_date))
end_date = kwargs.get('execution_date')
end_date = parse(str(end_date))
print('Type of start_date_check : {0}'.format(type(start_date)))
start_date = datetime.strftime(start_date, '%Y-%m-%d')
end_date = datetime.strftime(end_date, '%Y-%m-%d')
task_start = DummyOperator(task_id = 'start',dag=dag)
main_task= PythonOperator(task_id = 'IncrementalJobTask',
python_callable =printexecutiontimes,
provide_context = 'True',
dag=dag )
task_end = DummyOperator(task_id = 'end',dag=dag)
task_start >>main_task >> task_end
It helped me:
if isinstance(context['prev_execution_date_success'], type(None)):

Python using list parameter in pyspark.sql like a macro in sas

I have a list and want to use in pyspark.sql statement.
VLIST=['afhjh', 'aikn5','hsa76']
INC=pyspark.sql("select * from table1 where VIG=$VLIST")
I tried to use like a sas statement(using $ instead of &) which failed.
How can I use it correctly.
You can try something a bit different :
VLIST = ('afhjh', 'aikn5','hsa76')
INC = pyspark.sql(f"select * from table1 where VIG in {VLIST}")
or another way :
from pyspark.sql import functions as F
INC = pyspark.table("table1")
INC = INC.where(F.col("VIG").isin(*(F.lit(val) for val in VLIST)))

Exporting PySpark Dataframe to Azure Data Lake Takes Forever

The code below ran perfectly well on the standalone version of PySpark 2.4 on Mac OS (Python 3.7) when the size of the input data (around 6 GB) was small. However, when I ran the code on HDInsight cluster (HDI 4.0, i.e. Python 3.5, PySpark 2.4, 4 worker nodes and each has 64 cores and 432 GB of RAM, 2 header nodes and each has 4 cores and 28 GB of RAM, 2nd generation of data lake) with larger input data (169 GB), the last step, which is, writing data to the data lake, took forever (I killed it after 24 hours of execution) to complete. Given the fact that HDInsight is not popular in the cloud computing community, I could only reference posts that complained about the low speed when writing dataframe to S3. Some suggested to repartition the dataset, which I did, but it did not help.
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
print("Creating Spark Environment...")
spark = SparkSession.builder.appName("Menu").getOrCreate()
print("Spark Environment Created!")
print("Working on Checkpoint1...")
orders = spark.read.csv(order_path)
orders.createOrReplaceTempView("orders")
orders = spark.sql(
"SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
)
orders.createOrReplaceTempView("orders")
beer = spark.read.csv(
beer_path,
header=True
)
beer.createOrReplaceTempView("beer")
beer = spark.sql(
"""
SELECT
order_id AS beer_order_id,
in_menu_id AS beer_in_menu_id,
'-999' AS beer_in_menu_name
FROM beer
"""
)
beer.createOrReplaceTempView("beer")
orders = spark.sql(
"""
WITH orders_beer AS (
SELECT *
FROM orders
LEFT JOIN beer
ON orders.in_menu_id = beer.beer_in_menu_id
)
SELECT
order_id,
in_menu_id,
CASE
WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
WHEN beer_in_menu_name IS NULL THEN in_menu_name
END AS menu_name
FROM orders_beer
"""
)
print("Checkpoint1 Completed!")
print("Working on Checkpoint2...")
corpus = spark.read.csv(
corpus_path,
header=True
)
keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
orders = orders.withColumn(
"keyword",
regexp_extract(
"menu_name",
"(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)",
0
)
)
orders.createOrReplaceTempView("orders")
orders = spark.sql("""
SELECT order_id, in_menu_id, keyword
FROM orders
WHERE keyword != ''
""")
orders.createOrReplaceTempView("orders")
orders = orders.groupBy("order_id").agg(
collect_set("keyword").alias("items")
)
print("Checkpoint2 Completed!")
print("Working on Checkpoint3...")
fpGrowth = FPGrowth(
itemsCol="items",
minSupport=0,
minConfidence=0
)
model = fpGrowth.fit(orders)
print("Checkpoint3 Completed!")
print("Working on Checkpoint4...")
frequency = model.freqItemsets
frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
frequency = frequency.withColumn(
"items",
array_remove("items", "-999")
)
frequency = frequency.filter(size(col("items")) > 0)
frequency = frequency.orderBy(asc("items"), desc("freq"))
frequency = frequency.dropDuplicates(["items"])
frequency = frequency.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(frequency.items)
)
frequency.createOrReplaceTempView("frequency")
lift = model.associationRules
lift = lift.drop("confidence")
lift = lift.filter(col("lift") > LIFT_THRESHOLD)
lift = lift.filter(
udf(
lambda x: x == ["-999"], BooleanType()
)(lift.consequent)
)
lift = lift.drop("consequent")
lift = lift.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(lift.antecedent)
)
lift.createOrReplaceTempView("lift")
result = spark.sql(
"""
SELECT lift.antecedent, freq AS frequency, lift
FROM lift
INNER JOIN frequency
ON lift.antecedent = frequency.antecedent
"""
)
print("Checkpoint4 Completed!")
print("Writing Result to Data Lake...")
result.repartition(1024).write.mode("overwrite").parquet(output_path)
print("All Done!")
def main():
work(
order_path=169.1 GB of txt,
beer_path=4.9 GB of csv,
corpus_path=210 KB of csv,
output_path="final_result.parquet"
)
if __name__ == "__main__":
main()
I first thought this was caused by the file format parquet. However, when I tried csv, I met with the same problem. I tried result.count() to see how many rows the table result has. It took forever to get the row number, just like writing the data to the data lake.
There was a suggestion to use broadcast hash join instead of the default sort-merge join if a large dataset is joined with a small dataset. I thought it was worth trying because the smaller samples in the pilot study told me the row number of frequency is roughly 0.09% of that of lift (See the query below if you have difficulty tracking frequency and lift).
SELECT lift.antecedent, freq AS frequency, lift
FROM lift
INNER JOIN frequency
ON lift.antecedent = frequency.antecedent
With that in mind, I revised my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
print("Creating Spark Environment...")
spark = SparkSession.builder.appName("Menu").getOrCreate()
print("Spark Environment Created!")
print("Working on Checkpoint1...")
orders = spark.read.csv(order_path)
orders.createOrReplaceTempView("orders")
orders = spark.sql(
"SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
)
orders.createOrReplaceTempView("orders")
beer = spark.read.csv(
beer_path,
header=True
)
beer.createOrReplaceTempView("beer")
beer = spark.sql(
"""
SELECT
order_id AS beer_order_id,
in_menu_id AS beer_in_menu_id,
'-999' AS beer_in_menu_name
FROM beer
"""
)
beer.createOrReplaceTempView("beer")
orders = spark.sql(
"""
WITH orders_beer AS (
SELECT *
FROM orders
LEFT JOIN beer
ON orders.in_menu_id = beer.beer_in_menu_id
)
SELECT
order_id,
in_menu_id,
CASE
WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
WHEN beer_in_menu_name IS NULL THEN in_menu_name
END AS menu_name
FROM orders_beer
"""
)
print("Checkpoint1 Completed!")
print("Working on Checkpoint2...")
corpus = spark.read.csv(
corpus_path,
header=True
)
keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
orders = orders.withColumn(
"keyword",
regexp_extract(
"menu_name",
"(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)",
0
)
)
orders.createOrReplaceTempView("orders")
orders = spark.sql("""
SELECT order_id, in_menu_id, keyword
FROM orders
WHERE keyword != ''
""")
orders.createOrReplaceTempView("orders")
orders = orders.groupBy("order_id").agg(
collect_set("keyword").alias("items")
)
print("Checkpoint2 Completed!")
print("Working on Checkpoint3...")
fpGrowth = FPGrowth(
itemsCol="items",
minSupport=0,
minConfidence=0
)
model = fpGrowth.fit(orders)
print("Checkpoint3 Completed!")
print("Working on Checkpoint4...")
frequency = model.freqItemsets
frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
frequency = frequency.withColumn(
"antecedent",
array_remove("items", "-999")
)
frequency = frequency.drop("items")
frequency = frequency.filter(size(col("antecedent")) > 0)
frequency = frequency.orderBy(asc("antecedent"), desc("freq"))
frequency = frequency.dropDuplicates(["antecedent"])
frequency = frequency.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(frequency.antecedent)
)
lift = model.associationRules
lift = lift.drop("confidence")
lift = lift.filter(col("lift") > LIFT_THRESHOLD)
lift = lift.filter(
udf(
lambda x: x == ["-999"], BooleanType()
)(lift.consequent)
)
lift = lift.drop("consequent")
lift = lift.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(lift.antecedent)
)
result = lift.join(
frequency.hint("broadcast"),
["antecedent"],
"inner"
)
print("Checkpoint4 Completed!")
print("Writing Result to Data Lake...")
result.repartition(1024).write.mode("overwrite").parquet(output_path)
print("All Done!")
def main():
work(
order_path=169.1 GB of txt,
beer_path=4.9 GB of csv,
corpus_path=210 KB of csv,
output_path="final_result.parquet"
)
if __name__ == "__main__":
main()
The code ran perfectly well with the same sample data on my Mac OS and as expected took less time (34 seconds vs. 26 seconds). Then I decided to run the code to HDInsight with full datasets. In the last step, which is writing data to the data lake, the task failed and I was told Job cancelled because SparkContext was shut down. I am rather new to big data and have no idea with this means. Posts on the internet said there could be many reasons behind it. Whatever the method I should use, how to optimize my code so I can get the desired output in the data lake within bearable amount of time?
I would try several things, ordered by the amount of energy they require:
Check if the ADL storage is in the same region as your HDInsight cluster.
Add calls for df = df.cache() after heavy calculations, or even write and then read the dataframes into and from a cache storage in between these calculations.
Replace your UDFs with "native" Spark code, since UDFs are one of the performance bad practices of Spark.
I have figured it out after five days' struggle. Here are the approaches that I took to optimize the code. The time of code execution dropped from more than 24 hours to around 10 minutes. Code optimization is really really important.
As David Taub below pointed out, use df.cache() after heavy computation or before feeding the data to the model. I used df.cache().count() since calling .cache() on its own is lazily evaluated but the following .count() forces an evaluation of the entire dataset.
Use flashtext instead of regular expression to extract keywords. This greatly improves code performance.
Be careful with joins / merge. It might get extremely slow due to data skewness. Always think about ways to avoid unnecessary joins.
Set minSupport for FPGrowth. This significantly reduces the time when calling model.freqItemsets.

Spark reading from Postgres JDBC table slow

I am trying to load about 1M rows from a PostgreSQL database into Spark. When using Spark it takes about 10s. However, loading the same query using psycopg2 driver takes 2s. I am using postgresql jdbc driver version 42.0.0
def _loadFromPostGres(name):
url_connect = "jdbc:postgresql:"+dbname
properties = {"user": "postgres", "password": "postgres"}
df = SparkSession.builder.getOrCreate().read.jdbc(url=url_connect, table=name, properties=properties)
return df
df = _loadFromPostGres("""
(SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298) as
user_series_game""")
print measure(lambda : len(df.collect()))
The output is -
--- 10.7214591503 seconds ---
1076131
Using psycopg2 -
import psycopg2
conn = psycopg2.connect(conn_string)
cur = conn.cursor()
def _exec():
cur.execute("""(SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298)""")
return cur.fetchall()
print measure(lambda : len(_exec()))
cur.close()
conn.close()
The output is -
--- 2.27961301804 seconds ---
1076131
The measure function -
def measure(func) :
start_time = time.time()
x = func()
print("--- %s seconds ---" % (time.time() - start_time))
return x
Kindly help me find the cause of this problem.
Edit 1
I did a few more benchmarks. Using Scala and JDBC -
import java.sql._;
import scala.collection.mutable.ArrayBuffer;
def exec() {
val url = ("jdbc:postgresql://prod.caumccqvmegm.ap-southeast-1.rds.amazonaws.com/prod"+
"?tcpKeepAlive=true&prepareThreshold=-1&binaryTransfer=true&defaultRowFetchSize=10000")
val conn = DriverManager.getConnection(url,"postgres","postgres");
val sqlText = """SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298"""
val t0 = System.nanoTime()
val stmt = conn.prepareStatement(sqlText, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
val rs = stmt.executeQuery()
val list = new ArrayBuffer[(Long, Long, Long, Double)]()
while (rs.next()) {
val seriesId = rs.getLong("seriesId")
val companyId = rs.getLong("companyId")
val userId = rs.getLong("userId")
val score = rs.getDouble("score")
list.append((seriesId, companyId, userId, score))
}
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) * 1e-9 + "s")
println(list.size)
rs.close()
stmt.close()
conn.close()
}
exec()
The output was -
Elapsed time: 1.922102285s
1143402
When I did collect() in Spark + Scala -
import org.apache.spark.sql.SparkSession
def exec2() {
val spark = SparkSession.builder().getOrCreate()
val url = ("jdbc:postgresql://prod.caumccqvmegm.ap-southeast-1.rds.amazonaws.com/prod"+
"?tcpKeepAlive=true&prepareThreshold=-1&binaryTransfer=true&defaultRowFetchSize=10000")
val sqlText = """(SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298) as user_series_game"""
val t0 = System.nanoTime()
val df = spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", sqlText)
.option("user", "postgres")
.option("password", "postgres")
.load()
val list = df.collect()
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) * 1e-9 + "s")
print (list.size)
}
exec2()
The output was
Elapsed time: 1.486141076s
1143445
So 4x amount of extra time is spent within Python serialisation. I understand there will be some penalty, but this seems too much.
The reason is really simple and have two simultaneous reasons.
First I will give you a perpective of how psycopg2 works.
This lib psycopg2 works like any other lib to connect to a RDMS. This lib will send the query to the engine of your postgres and it will return the data to you. Straight foward like this.
Conn -> Query -> ReturnData -> FetchData
When you use spark is a little bit different in two ways. Spark is not like a programatic language that run in one single thread. It has a Distributed System to work. Even if you are running in a local machine. See Spark has a basic concept of Driver(Master) and Workers.
The Driver recieve the request to execute the query to the Postgres, the Driver will not request the data for each worker request the information from your Postgres.
If you see the documentation here you will se a note like this:
Don’t create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
This note means that each worker will be responsible to request the data for your postgres. This is a small overhead of starting this process but nothing really big. But have a overhead here, to send the data to each worker.
Seccond point, your collect in this part of code:
print measure(lambda : len(df.collect()))
The collect function will send a command for all of your workers to send the data to your Driver. To store in the memory of your driver, it is like a Reduce, it creates Shuffle in the middle of the process. Shuffle is the step of the process that the data is send to other workers. In the case of collect each worker will send that to your Driver.
So the steps of Spark in JDBC of your code is:
(Workers)Conn -> (Workers)Query -> (Workers)FetchData -> (Driver)
Request the Data -> (Workers) Shuffle -> (Driver) Collect
Well there in a bunch of other stuffs that happens with the Spark, like the QueryPlan, build the DataFrame and other stuffs.
That is the reason that you have faster response in your simple code of Python than Spark.