Does Spark do UnionAll in parallel? - scala

I have 10 DataFrames with the same schema which I'd like to combine into one DataFrame. Each DataFrame is constructed using sqlContext.sql("select ... from ...").cache, which means that technically, the DataFrames are not really calculated until it's time to use them.
So, if I run:
val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...
will Spark calculate all these DataFrames in parallel or one by one (due to the dot operator)?
And also, while we're here: is there a more elegant way to perform a unionAll on several DataFrames than the one I listed above?

unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.
In general Spark is a distributed computation system. Each operation itself is made up of a bunch of tasks that are processed in parallel. So in general you don't have to worry about whether two operations can run in parallel or not. The cluster resources will be well utilized anyway.
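On the second question, a more compact way to union several DataFrames is to reduce over a sequence of them. The following is a minimal sketch, assuming Spark 2.x or later (where `union` is the name for the older `unionAll`) and a local session; `unionMany` and the toy DataFrames are hypothetical names used only for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionSketch {
  // Hypothetical helper: combines DataFrames that share a schema.
  // It only builds a logical plan; nothing runs until an action is called.
  // Note that reduce throws on an empty sequence.
  def unionMany(dfs: Seq[DataFrame]): DataFrame = dfs.reduce(_ union _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("union-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for df1 ... df4 from the question.
    val dfs = (1 to 4).map(i => Seq((i, s"row$i")).toDF("id", "label"))

    val dfFinal = unionMany(dfs) // lazy: no job is submitted here
    println(dfFinal.count())     // action: triggers the computation

    spark.stop()
  }
}
```

The chained form in the question and the reduce form build the same kind of plan, so the choice is mostly about readability.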

Related

pyspark udf to process multiple rows at a time

Reading this blog:
Introducing Pandas UDF for PySpark
I learned that using @udf processes one row at a time, but using @pandas_udf processes multiple rows at a time (as pandas) and is much faster.
Why is it necessary to convert the Spark DataFrame into a pandas DataFrame in order to achieve this (processing multiple rows at a time)? Can't @udf take just a part of the Spark DataFrame at a time and avoid this conversion? Is it because Spark DataFrames are not optimized to process multiple rows at a time like pandas? If so, why?
Thanks~

Parallelised collections in Spark

What is the concept of "parallelized collections" in Spark, and how can this concept improve the overall performance of a job? Besides, how should partitions be configured for that?
Parallel collections are provided in the Scala language as a simple way to parallelize data processing in Scala. The basic idea is that when you perform operations like map, filter, etc. on a collection, the work can be parallelized using a thread pool. This type of parallelization is called data parallelism because it is based on the data itself. It happens locally in the JVM, and Scala will use as many threads as there are cores available to the JVM.
Spark, on the other hand, is based on RDDs, an abstraction that represents a distributed dataset. Unlike Scala parallel collections, these datasets are distributed across several nodes. Spark is also based on data parallelism, but this time it is distributed data parallelism. This allows you to parallelize much more than in a single JVM, but it also introduces other issues related to data shuffling.
In summary, Spark implements a distributed data parallelism system, so every time you execute a map, filter, etc. you are doing something similar to what a Scala parallel collection would do, but in a distributed fashion. Also, the unit of parallelism in Spark is the partition, while in Scala collections it is each element.
You could always use Scala parallel collections inside a Spark task to parallelize within that task, but you won't necessarily see a performance improvement, especially if your data is already evenly distributed in your RDD and each task needs about the same computational resources to execute.
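To make the point about partitions being the unit of parallelism concrete, here is a minimal sketch, assuming a local SparkContext with two worker threads (`local[2]` and the slice count of 4 are arbitrary choices for illustration). The number of partitions is set when the collection is parallelized, and it determines how many tasks Spark schedules per stage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-sketch")
    val sc = new SparkContext(conf)

    // numSlices sets the number of partitions, and so the number of tasks.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    println(rdd.getNumPartitions) // 4

    // mapPartitions calls the function once per partition (one task each),
    // not once per element.
    val sizes = rdd.mapPartitions(it => Iterator(it.size)).collect()
    println(sizes.mkString(", "))

    sc.stop()
  }
}
```

With `local[2]`, only two of the four tasks run at any one time; adding partitions beyond the number of available cores mainly helps to even out the load.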

Best practice for writing to hadoop from spark

I was reviewing some code written by a co-worker, and I found a method like this:
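import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

def writeFile(df: DataFrame,
              partitionCols: List[String],
              writePath: String): Unit = {
  // Shuffle so that rows sharing the same partition-column values end up in the same task.
  val df2 = df.repartition(partitionCols.map(col): _*)
  // partitionBy takes column names (String*), not Column objects.
  val dfWriter = df2.write.partitionBy(partitionCols: _*)
  dfWriter
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .option("compression", "snappy")
    .save(writePath)
}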
Is it generally good practice to call repartition on a predefined set of columns like this, and then call partitionBy, and then save to disk?
Generally you call repartition with the same columns as partitionBy in order to get a single parquet file in each partition. That is what is being achieved here. You could argue that this can make the parquet files very large or, worse, cause memory overflow.
This problem is generally handled by adding a row_number to the DataFrame and then specifying the number of documents that each parquet file can have. Something like
val repartitionExpression = colNames.map(col) :+ floor(col(RowNumber) / docsPerPartition)
// now use this to repartition
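The row_number idea above could be fleshed out roughly as follows. This is a sketch under stated assumptions, not the answerer's actual code: `repartitionWithCap`, the temporary column name `_row_number`, and the choice to order the window by the key columns themselves are all hypothetical choices made for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, row_number}

object CappedRepartitionSketch {
  // Hypothetical helper: repartitions by the key columns plus a bucket index
  // so that each bucket holds roughly docsPerPartition rows per key.
  def repartitionWithCap(df: DataFrame,
                         colNames: Seq[String],
                         docsPerPartition: Int): DataFrame = {
    val keyCols = colNames.map(col)
    // row_number needs an ordered window; ordering by the keys themselves
    // gives an arbitrary but valid numbering within each key group.
    val window = Window.partitionBy(keyCols: _*).orderBy(keyCols: _*)
    val rowNumber = "_row_number" // assumed temporary column name

    val repartitionExpression =
      keyCols :+ floor(col(rowNumber) / docsPerPartition)

    df.withColumn(rowNumber, row_number().over(window))
      .repartition(repartitionExpression: _*)
      .drop(rowNumber)
  }
}
```

The result can then be written with partitionBy on the same key columns. Because repartition is hash-based, two buckets of the same key can still land in the same shuffle partition, so the cap should be treated as approximate rather than guaranteed.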
As for the next part, persisting after partitionBy: that is not needed here, because after partitioning the data is written directly to disk.
To help you understand the differences between partitionBy() and repartition(): repartition on the DataFrame uses a hash-based partitioner that takes the column(s) and the number of partitions, computes a hash value from them, and buckets the data accordingly.
When no partition count is given, repartition() on columns uses spark.sql.shuffle.partitions, which defaults to 200. Because of hash collisions, there is a good chance that records with different keys end up in the same bucket.
partitionBy(), on the other hand, takes the column(s) and creates partitions based purely on the unique key values. The number of partitions is proportional to the number of unique keys in the data.
With repartition() there is a good chance of writing empty files, whereas with partitionBy() there will not be any empty files.
Is your job CPU-bound, memory-bound, network-IO bound, or disk-IO bound?
The first two cases are significant if df2 is sufficiently large, and other answers correctly address those cases.
If your job is disk-IO bound (and you see yourself frequently writing large files to HDFS in the future), many cloud providers will let you pick a faster SSD disk for an extra charge.
Also Sandy Ryza recommends keeping --executor-cores under 5:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.
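As a small illustration of that recommendation, the same limit can be set programmatically. This is only a sketch: the value 4 is an assumed example that stays below five, and the application name is hypothetical.

```scala
import org.apache.spark.SparkConf

object ExecutorCoresSketch {
  // spark.executor.cores is the configuration property behind --executor-cores.
  // The value 4 is an assumed example, chosen to stay below five.
  val conf: SparkConf = new SparkConf()
    .setAppName("hdfs-writer-sketch") // hypothetical application name
    .set("spark.executor.cores", "4")
}
```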

Spark: spark-csv partitioning and parallelism in subsequent DataFrames

I'm wondering how to enforce usage of subsequent, more appropriately partitioned DataFrames in Spark when importing source data with spark-csv.
Summary:
spark-csv doesn't seem to support explicit partitioning on import like sc.textFile() does.
While it gives me an inferred schema "for free", by default the DataFrames it returns normally have only 2 partitions, even though I'm using 8 executors in my cluster.
Even though subsequent DataFrames that have many more partitions are being cached via cache() and used for further processing (immediately after import of the source files), Spark job history is still showing incredible skew in the task distribution - 2 executors will have the vast majority of the tasks instead of a more even distribution that I expect.
Can't post data, but the code is just some simple joining, adding a few columns via .withColumn(), and then very basic linear regression via spark.mllib.
Below is a comparison image from the Spark History UI showing tasks per executor (the last row is the driver).
Note: I get the same skewed task distribution regardless of calling repartition() on the spark-csv DataFrames or not.
How do I "force" Spark to basically forget those initial DataFrames and start from more appropriately partitioned DataFrames, or force spark-csv to somehow partition its DataFrames differently (without forking it/modifying its source)?
I can resolve this issue using sc.textFile(file, minPartitions), but I'm hoping I don't have to resort to that because of things like the nicely typed schema that spark-csv provides.

Understanding parallelism in Spark and Scala

I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many (csv) files from disk, change/process certain columns, and then write them back to disk.
In my experiments, if I use SparkContext's parallelize method only then it does not seem to have any impact on the performance. However simply using Scala's parallel collections (through par) reduces the time almost to half.
I am running my experiments in localhost mode with the arguments local[2] for the spark context.
My question is: when should I use Scala's parallel collections, and when should I use SparkContext's parallelize?
SparkContext does additional processing in order to support the generality of multiple nodes. This overhead is constant with respect to the data size, so it may be negligible for huge data sets. On a single node, this overhead will make it slower than Scala's parallel collections.
Use Spark when
You have more than 1 node
You want your job to be ready to scale to multiple nodes
The Spark overhead on 1 node is negligible because the data is huge, so you might as well choose the richer framework
SparkContext's parallelize makes your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance (local[2]), but then again, you probably get too much overhead from running Spark's task scheduler and all that magic. Of course, Scala's parallel collections should be faster on a single machine.
http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections - Are your files big enough to be automatically split into multiple slices? Did you try setting the number of slices manually?
Did you try running the same Spark job on single core and then on two cores?
Expect the best results from Spark with one really big, uniformly structured file, not with multiple smaller files.
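Setting the number of slices manually, as suggested above, might look like the following sketch. It assumes a `local[2]` context as in the question; `processFile` and the file paths are hypothetical stand-ins for the real per-file work (read, transform, write).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SlicesSketch {
  // Hypothetical stand-in for the real per-file processing.
  def processFile(path: String): Int = path.length

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("slices-sketch")
    val sc = new SparkContext(conf)

    // Hypothetical input paths.
    val paths = Seq("data/a.csv", "data/b.csv", "data/c.csv", "data/d.csv")

    // One slice per file, so each file becomes its own task;
    // with local[2], two of these tasks run at a time.
    val results = sc.parallelize(paths, numSlices = paths.size)
      .map(processFile)
      .collect()
    println(results.mkString(", "))

    sc.stop()
  }
}
```

Running the same job with `local[1]` and then `local[2]` is a simple way to check whether Spark's parallelism is actually being used.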