pyspark udf to process multiple rows at a time - pyspark

Reading this blog:
Introducing Pandas UDF for PySpark
From it I understand that a plain #udf processes one row at a time, whereas a #pandas_udf processes multiple rows at a time (as pandas data) and is much faster.
Why is it necessary to convert the Spark DataFrame into a pandas DataFrame to achieve this (processing multiple rows at a time)? Couldn't #udf simply take a chunk of the Spark DataFrame at a time and avoid that conversion? Is it because Spark DataFrames are not optimized to process multiple rows at a time the way pandas is? If so, why?
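For concreteness, here is a minimal sketch of the two approaches the blog compares (it assumes pyarrow is installed; the column and function names are just for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).toDF("v")

# Plain UDF: invoked once per row; every value is shipped to the Python worker
# and back individually.
@udf("double")
def plus_one(v):
    return float(v) + 1.0

# Pandas (vectorized) UDF: invoked once per batch of rows; v arrives as a
# pandas.Series transferred via Arrow, so the Python work runs on whole batches.
@pandas_udf("double")
def pandas_plus_one(v):
    return v + 1.0

df.select(plus_one("v"), pandas_plus_one("v")).show()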
Thanks~

Related

Spark JDBC Save to HDFS Performance

Below is my problem statement; looking for suggestions.
1) I have 4-5 DataFrames that read data from a Teradata source using the Spark JDBC read API.
2) These 4-5 DataFrames are combined into a final DataFrame, FinalDF, which uses 1000 shuffle partitions.
3) My data volume is really high; currently each task is processing > 2 GB of data.
4) Lastly, I write FinalDF to an ORC file in HDFS.
5) For the queries that populate the DataFrames over JDBC, I pass predicates to the JDBC API for the date ranges (a rough sketch of the pipeline follows the questions below).
My questions are as below :
1) While it writes the DataFrame as ORC, does it internally work like a foreachPartition, i.e. is the action applied per partition as each partition fetches its data from the source via the JDBC call?
2) How can I improve the performance of the process? Currently some of my tasks die because of the large amount of data in the RDDs, and my stages spill to disk.
3) I am limited in how many sessions I can open against the Teradata source, as there is a cap set on the source database; this stops me from running more executors, since I can only stay within the limit of 300 concurrent sessions.
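A rough PySpark-style sketch of the pipeline described above (the connection URL, table names, date predicates, and output path are placeholders, not the actual values):
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-to-orc").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "1000")

# Hypothetical date-range predicates: one JDBC partition (and one Teradata session) per predicate.
predicates = [
    "txn_date >= DATE '2019-01-01' AND txn_date < DATE '2019-02-01'",
    "txn_date >= DATE '2019-02-01' AND txn_date < DATE '2019-03-01'",
]

def read_table(table):
    # One partition is created per entry in `predicates`.
    return spark.read.jdbc(
        url="jdbc:teradata://<host>/DATABASE=mydb",  # placeholder URL
        table=table,
        predicates=predicates,
        properties={"user": "...", "password": "...", "driver": "com.teradata.jdbc.TeraDriver"},
    )

dfs = [read_table(t) for t in ["tab1", "tab2", "tab3", "tab4"]]  # placeholder table names
final_df = reduce(lambda a, b: a.unionByName(b), dfs)  # or joins, depending on how they are combined

# The write is the action; the JDBC reads only run when this job is triggered.
final_df.write.mode("overwrite").orc("hdfs:///data/final_orc")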

Time consuming write process of Spark Dataset into the Oracle DB using JDBC driver

I am using Apache Spark for loading and processing a dataset and then outputting it into the Oracle DB using the JDBC driver.
I am using the Spark JDBC write method for writing the Dataset into the database.
But while writing the Dataset into the DB, it takes roughly the same time to write 10 rows as it does to write 10 million rows into different tables of the database.
I want to know how to tune the performance of this write method in Spark, so that we make good use of the Spark compute engine. Otherwise there is no benefit in using it for fast computation if writing the dataset into the database takes so long.
The code to write the 10 rows and 10M rows is as follows:
with 10 rows to write
finalpriceItemParamsGroupTable.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM).save();
with 10M rows to write
finalPritmOutput.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_TXN_DTL).save();
Attaching a screenshot of the Apache Spark dashboard: Spark Stages screenshot.
If someone can help out, that would be helpful...
You can bulk insert the records rather than inserting 1000 records at a time (the default setting) by adding the batchsize option and increasing its value:
finalPritmOutput.distinct().write()
.mode("append")
.format("jdbc").option("url", connection)
.option("dbtable", CI_TXN_DTL)
.option("batchsize", "100000")
.save()
Refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases for how to configure your JDBC options for better performance.
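Since this page's main question is tagged pyspark, the equivalent write in PySpark would look roughly like this (finalPritmOutput and connection stand in for your own DataFrame and connection string):
(finalPritmOutput.distinct()
    .write
    .mode("append")
    .format("jdbc")
    .option("url", connection)       # your JDBC connection string
    .option("dbtable", "CI_TXN_DTL")
    .option("batchsize", "100000")   # insert 100k rows per JDBC batch instead of the default 1000
    .save())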

How to achieve Task Parallelism in Spark and Scala?

We have a Spark batch job where we read data from an HBase table, apply multiple transformations, and then populate the data into Cassandra (multiple tables).
We have multiple independent tasks that use the same DataFrame (the HBase table data). Basically, we have several dashboards based on the same HBase table data.
Currently everything runs sequentially; how can we run these tasks in parallel?
Is it good practice to use Scala Futures to run the tasks in parallel?
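One common pattern, sketched below in PySpark terms (in Scala, Futures with their own ExecutionContext play the same role), is to kick off the independent actions from separate driver threads so their jobs can be scheduled concurrently; the source path, transformations, and output paths here are made up for illustration:
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-dashboards").getOrCreate()

# Shared, cached source data (stand-in for the HBase table read).
base_df = spark.read.parquet("hdfs:///data/hbase_export").cache()  # placeholder source
base_df.count()  # materialize the cache once before the parallel jobs reuse it

def dashboard_1(df):
    return df.groupBy("region").count()   # placeholder transformation

def dashboard_2(df):
    return df.groupBy("product").count()  # placeholder transformation

def write_out(df, path):
    df.write.mode("overwrite").parquet(path)  # stand-in for the Cassandra writes

# Each submitted call triggers its own Spark job; the jobs run concurrently
# as long as the cluster has spare capacity (FAIR scheduling helps them share it).
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(write_out, dashboard_1(base_df), "hdfs:///out/dash1"),
        pool.submit(write_out, dashboard_2(base_df), "hdfs:///out/dash2"),
    ]
    for f in futures:
        f.result()  # surface any exceptions from the background jobs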

Does Spark do UnionAll in parallel?

I got 10 DataFrames with the same schema which I'd like to combine into one DataFrame. Each DataFrame is constructed using sqlContext.sql("select ... from ...").cache, which means that, technically, the DataFrames are not really calculated until it's time to use them.
So, if I run:
val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...
will Spark calculate all these DataFrames in parallel or one by one (due to the dot operator)?
And also, while we're here: is there a more elegant way to perform a unionAll on several DataFrames than the one I listed above?
unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.
In general Spark is a distributed computation system. Each operation itself is made up of a bunch of tasks that are processed in parallel. So in general you don't have to worry about whether two operations can run in parallel or not. The cluster resources will be well utilized anyway.
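As for a more elegant way to combine several DataFrames, a common pattern is to fold over a list of them; this is still lazy, and nothing runs until an action is called. A small PySpark-flavored sketch (df1 ... df4 stand in for your cached DataFrames):
from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1, df2, df3, df4]                  # your cached DataFrames
df_final = reduce(DataFrame.unionAll, dfs)  # still lazy: no job has run yet

df_final.count()                            # the action triggers the (parallel) computation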

Spark: spark-csv partitioning and parallelism in subsequent DataFrames

I'm wondering how to enforce usage of subsequent, more appropriately partitioned DataFrames in Spark when importing source data with spark-csv.
Summary:
spark-csv doesn't seem to support explicit partitioning on import like sc.textFile() does.
While it gives me an inferred schema "for free", by default the returned DataFrames have only 2 partitions, even though I'm using 8 executors in my cluster.
Even though subsequent DataFrames that have many more partitions are cached via cache() and used for further processing (immediately after import of the source files), the Spark job history still shows an incredible skew in the task distribution: 2 executors have the vast majority of the tasks instead of the more even distribution I expect.
Can't post data, but the code is just some simple joining, adding a few columns via .withColumn(), and then very basic linear regression via spark.mllib.
Below is a comparison image from the Spark History UI showing tasks per executor (the last row is the driver).
Note: I get the same skewed task distribution regardless of calling repartition() on the spark-csv DataFrames or not.
How do I "force" Spark to basically forget those initial DataFrames and start from more appropriately partitioned DataFrames, or force spark-csv to somehow partition its DataFrames differently (without forking it/modifying its source)?
I can resolve this issue using sc.textFile(file, minPartitions), but I'm hoping I don't have to resort to that because of things like the nicely typed schema that spark-csv provides.
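For reference, the two options discussed above expressed as code, using the Spark 1.x-era API and the spark-csv package (paths and the partition count are placeholders):
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Option 1: load with spark-csv, then repartition and cache only the repartitioned
# DataFrame, so later stages are planned against the wider partitioning.
raw = (sqlContext.read
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load("hdfs:///data/input.csv"))  # placeholder path
df = raw.repartition(16).cache()         # placeholder partition count

# Option 2: fall back to sc.textFile with minPartitions and give up the inferred schema.
lines = sc.textFile("hdfs:///data/input.csv", minPartitions=16)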