In Apache Spark, each physical operator in the physical plan has 4 properties:
outputPartitioning
outputOrdering​
requiredChildDistribution
requiredChildOrdering
But aren't outputPartioning and requiredChildDistribution the same? How are they different and what do they fundamentally represent? Same for outputOrdering and requiredChildOrdering?
Related
How PyFlink performance is compared to Flink + Scala?
Big Picture.
The goal is to build Lambda architecture with Cold and Hot Tier.
Cold (Batch) Tier will be implemented with Apache Spark (PySpark).
But with Hot (Streaming) Tier there are different options: Spark Streaming or Flink.
Thus Apache Flink is pure streaming rather then Spark's micro-batches, I tend to choose Apache Flink.
But my only point of concern is performance of PyFlink. Will it have less latency that PySpark streaming? Is it slower then Scala written Flink code? In what cases it's slower?
Thank you in advance!
I had implemented something very similar , and from my experience these are a few things
Performance of the job is completely dependent on the type of code you are writing , if you are using some custom UDFs written in python to run while you extract then the performance is going to be slower than doing the same thing using Scala based code - this happens majorly because of the conversion of python objects to JVM and vice versa . But this will happen while you are using Pyspark .
Flink is true streaming process, the micro batches in spark are not so if your use case does need a true streaming service go ahead with Flink.
If you stick your service to the native functions given in PyFlink you will not observe any noticeable difference in performance .
I am newbie to apache-spark
I have a hard time understanding Data Locality in Apache Spark. I have tried to read this article https://data-flair.training/blogs/apache-spark-performance-tuning/
which says "PROCESS_LOCAL" and "NODE_LOCAL". Are there settings I need to configure?
Can someone take an example and explain it to me?
Thanks,
Padd
Data locality in simple terms means doing computation on the node where data resides.
Just to elaborate:
Spark is cluster computing system. It is not a storage system like HDFS or NOSQL. Spark is used to process the data stored in such distributed system.
Typical installation is like spark is installed on same nodes as that of HDFS/NOSQL.
In case there is a spark application which is processing data stored HDFS. Spark tries to place computation tasks alongside HDFS blocks.
With HDFS the Spark driver contacts NameNode about the DataNodes (ideally local) containing the various blocks of a file or directory as well as their locations (represented as InputSplits), and then schedules the work to the SparkWorkers.
Note : Spark’s compute nodes / workers should be running on storage nodes.
This is how data locality achieved in Spark.
The advantage is performance gain as less data is transferred over the network.
Try the below articles for reading - There are n number of documents availavle in google to search for
http://www.russellspitzer.com/2017/09/01/Spark-Locality/
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html
I have an application that is using Spark (with Spark Job Server) that uses a Cassandra store. My current setup is that of a client mode running with master=local[*]. So there is a single Spark executor which is also the driver process that is using all 8 cores of the machine. I have a Cassandra instance running on the same machine.
The Cassandra tables have a primary key of the form ((datasource_id, date), clustering_col_1...clustering_col_n) where date is a single day of the form "2019-02-07" and is part of a composite partition key.
In my Spark application, I am running a query like so:
df.filter(col("date").isin(days: _*))
In the Spark physical plan, I notice that these filters along with the filter for the "datasource_id" partition key are pushed up to the Cassandra CQL query.
For our biggest datasources, I know that the partitions are around 30MB in size. So I have the following setting in the Spark Job Server configuration:
spark.cassandra.input.split.size_in_mb = 1
However I notice that there is no parallelization in the Cassandra loading step. Though there are multiple Cassandra partitions that are >1MB, there are no additional spark partitions created. There is only a single task that does all the querying on a single core, thus taking ~20 secs to load data for a 1 month date range that corresponds to ~1 million rows.
I have tried the alternative approach below:
df union days.foldLeft(df)((df: DataFrame, day: String) => {
df.filter(col("date").equalTo(day))
})
This does indeed create a spark partition (or task) for every "day" partition in cassandra. However, for smaller datasources where the cassandra partitions are much smaller in size, this method proves to be quite expensive in terms of excessive tasks created and the overhead due to their coordination. For these datasources, it would be totally fine to lump many cassandra partitions into one spark partition. Hence why I thought using the spark.cassandra.input.split.size_in_mb configuration would prove useful in dealing with both small and large datasources.
Is my understanding wrong? Is there something else that I'm missing in order for this configuration to take effect?
P.S. I have also read the answers about using joinWithCassandraTable. However, our code relies on using DataFrame. Also, converting from a CassandraRDD to a DataFrame is not very viable for us since our schema is dynamic and cannot be specified using case classes.
What's the concept of "Paralleled collections" in Spark is, and how this concept can improve the overall performance of a job? Besides, how should partitions be configured for that?
Parallel collections are provided in the Scala language as a simple way to parallelize data processing in Scala. The basic idea is that when you perform operations like map, filter, etc... to a collection it is possible to parallelize it using a thread pool. This type of parallelization is called data parallelization because it is based on the data itself. This is happening locally in the JVM and Scala will use as many threads as cores are available to the JVM.
On the other hand Spark is based on RDD, that are an abstraction that represents a distributed dataset. Unlike the Scala parallel collections this datasets are distributed in several nodes. Spark is also based on data parallelism, but this time is distributed data parallelism. This allows you to parallelize much more than in a single JVM, but it also introduces other issues related with data shuffling.
In summary, Spark implements a distributed data parallelism system, so everytime you execute a map, filter, etc... you are doing something similar to what a Scala parallel collection would do but in a distributed fashion. Also the unit of parallelism in Spark are partitions, while in Scala collections is each row.
You could always use Scala parallel collections inside a Spark task to parallelize within a Spark task, but you won't necessarily see performance improvement, specially if your data was already evenly distributed in your RDD and each task needs about the same computational resources to be executed.
I have setup Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16gb ram) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows in Cassandra:
test(
tid int,
cid int,
pid int,
ev list<double>,
primary key (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra and make a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed in 13 Tasks (in Spark UI, the total Input for the Tasks is 331.6MB)
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to 4, why is creating 13 Tasks? (Also made sure the number of partitions calling rdd.getNumPartitions() on my DataFrame)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As #zero323 suggested, I deployed a external machine (2Gb RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of the df.select().count() was an expected greater latency and overall poorer performance in comparison with my previous test (took about 70 seconds to finish the Job).
Edit: I misunderstood his suggestion. #zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained in here
Also I wanted to point out that I am aware of the inherent anti-pattern of setting a list<double> instead a wide row for this type of data, but my concerns at this moment are more the time spent on retrieval of a large dataset rather than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish but it is not exactly unexpected. In general count is expressed as
SELECT 1 FROM table
followed by Spark side summation. So while it is optimized it still rather inefficient because you have fetch N long integers from the external source just to sum these locally.
As explained by the docs Cassandra backed RDD (not Datasets) provide optimized cassandraCount method which performs server side counting.
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property is used to determine number of partitions for shuffles (when data is aggregated by some set of keys) not for Dataset creation or global aggregations like count(*) (which always use 1 partition for final aggregation).
If you interested in controlling number of initial partitions you should take a look at spark.cassandra.input.split.size_in_mb which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see another factor here is spark.default.parallelism but it is not exactly a subtle configuration so depending on it in general is not an optimal choice.
I see that it is very old question but maybe someone needs it now.
When running Spark on local machine it is very important to set into SparkConf master "local[*]" that according to documentation allows to run Spark with as many worker threads as logical cores on your machine.
It helped me to increase performance of count() operation by 100% on local machine comparing to master "local".