Understanding Spark partitioning - scala

I'm trying to understand how Spark partitions data. Suppose I have an execution DAG like that in the picture (orange boxes are the stages). The two groupBy operations and the join are supposed to be very heavy if the RDDs are not partitioned.
Is it wise then to apply .partitionBy(new HashPartitioner(properValue)) to P1, P2, P3 and P4 to avoid the shuffle? What's the cost of partitioning an existing RDD? When is it not appropriate to partition an existing RDD? Doesn't Spark partition my data automatically if I don't specify a partitioner?
Thank you

tl;dr The answers to your questions respectively: Better to partition at the outset if you can; Probably less than not partitioning; Your RDD is partitioned one way or another anyway; Yes.
This is a pretty broad question. It takes up a good portion of our course! But let's try to address as much about partitioning as possible without writing a novel.
As you know, the primary reason to use a tool like Spark is that you have too much data to analyze on one machine without having the fan sound like a jet engine. The data get distributed among all the cores on all the machines in your cluster, so yes, there is a default partitioning--according to the data. Remember that the data are distributed already at rest (in HDFS, HBase, etc.), so Spark just partitions according to the same strategy by default to keep the data on the machines where they already are--with the default number of partitions equal to the number of cores on the cluster. You can override this default number by configuring spark.default.parallelism, and you generally want it to be 2-3 partitions per core in the cluster.
However, typically you want data that belong together (for example, data with the same key, where HashPartitioner would apply) to be in the same partition, regardless of where they are to start, for the sake of your analytics and to minimize shuffle later. Spark also offers a RangePartitioner, or you can roll your own for your needs fairly easily. But you are right that there is an upfront shuffle cost to go from default partitioning to custom partitioning; it's almost always worth it.
It is generally wise to partition at the outset (rather than delay the inevitable with partitionBy) and then repartition if needed later. Later on you may even choose to coalesce, which merges existing partitions (by default without a full shuffle), to reduce the number of partitions and potentially leave some machines and cores idle, because the gain in network IO (after that upfront cost) is greater than the loss of CPU power.
(The only situation I can think of where you don't partition at the outset--because you can't--is when your data source is a compressed file.)
Note also that you can preserve partitioning during a map-style transformation by using mapValues, or mapPartitions and mapPartitionsWithIndex with preservesPartitioning = true.
Finally, keep in mind that as you experiment with your analytics while you work your way up to scale, there are diagnostic capabilities you can use:
toDebugString to see the lineage of RDDs
getNumPartitions to, shockingly, get the number of partitions
glom to see clearly how your data are partitioned
And if you pardon the shameless plug, these are the kinds of things we discuss in Analytics with Apache Spark. We hope to have an online version soon.

By applying partitionBy preemptively you don't avoid the shuffle. You just push it to another place. This can be a good idea if the partitioned RDD is reused multiple times, but you gain nothing for a one-off join.
Doesn't Spark partition my data automatically if I don't specify a partitioner?
It will partition (a.k.a. shuffle) your data as part of the join and the subsequent groupBy (unless you keep the same key and use a transformation that preserves partitioning).
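To make the "reuse" argument concrete, here is a minimal sketch in plain Scala collections (no Spark dependency). It assumes the hash rule Spark's HashPartitioner applies (non-negative hashCode modulo the partition count); the datasets users and orders and the helper names are hypothetical. Once both sides are split with the same rule, each bucket can be joined with its counterpart only, which is why an already partitioned side does not have to be moved again.

```scala
object CoPartitionedJoin {
  // Same rule as a hash partitioner: non-negative hashCode modulo partition count.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Stand-in for rdd.partitionBy: split key/value pairs into numPartitions buckets.
  def partitionBy[V](data: Seq[(Int, V)], numPartitions: Int): Vector[Seq[(Int, V)]] =
    Vector.tabulate(numPartitions) { p =>
      data.filter { case (k, _) => partitionOf(k, numPartitions) == p }
    }

  // Join bucket i of the left side only with bucket i of the right side.
  def localJoin[A, B](left: Vector[Seq[(Int, A)]],
                      right: Vector[Seq[(Int, B)]]): Seq[(Int, (A, B))] =
    left.zip(right).flatMap { case (l, r) =>
      for ((k1, a) <- l; (k2, b) <- r if k1 == k2) yield (k1, (a, b))
    }

  // Hypothetical sample data.
  val users: Seq[(Int, String)]  = Seq(1 -> "ann", 2 -> "bob", 3 -> "cat", 4 -> "dan")
  val orders: Seq[(Int, Double)] = Seq(1 -> 10.0, 1 -> 12.5, 3 -> 7.0, 5 -> 99.0)

  // The partitioning cost for users is paid once and can be reused by later joins.
  val usersByKey = partitionBy(users, 3)
  val joined     = localJoin(usersByKey, partitionBy(orders, 3))

  // Reference result: compare every user with every order.
  val naive = for ((k1, a) <- users; (k2, b) <- orders if k1 == k2) yield (k1, (a, b))
}
```

The bucket-wise join returns the same pairs as the all-against-all comparison. For a one-off join the split of users still has to happen once, which is the point made above.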

Related

Spark performance: local faster than cluster (very uneven executor load)

Let me start off by saying that I'm relatively new to Spark, so if I'm saying something that doesn't make sense, please correct me.
Summarising the problem: no matter what I do, at certain stages one executor does all the computation, which makes cluster execution slower than local, one-processor execution.
Full story:
I've written a Spark 1.6 application which consists of a series of maps, filters, joins and a short GraphX part. The app uses only one data source - a csv file. For the purpose of development I created a mockup dataset consisting of 100 000 rows, 7MB, with all of the fields having random data with uniform distribution (random sorting in the file as well). The joins are self inner joins on a PairRDD on various fields (the dataset has duplicate keys with ~200 duplicates per key, imitating real data), leading to a cartesian product within each key. Then I perform a number of map and filter operations on the result of the joins, store it as an RDD of some custom-class objects and save everything as a graph at the end.
I developed the code on my laptop and ran it, which took about 5 minutes (Windows machine, local file). To my surprise, when I deployed the jar onto the cluster (master yarn, cluster mode, csv file in HDFS) and submitted it, the code took 8 minutes to execute.
I've run the same experiment with smaller data and the results were 40 seconds locally and 1.1 min on the cluster.
When I looked at the history server I saw that 2 stages are particularly long (almost 4 mins each), and in these stages there is one task that takes >90% of the time. I ran the code multiple times and it was always the same task that took so much time, even though it was deployed on a different data node each time.
To my surprise, when I opened the executors page I saw that one executor does almost all of the work (in terms of time spent) and executes most tasks. In the screenshot provided the second most 'active' executor had 50 tasks, but that's not always the case - in a different submission the second most busy executor had 15 tasks, and the leading one 95.
Moreover, I saw that the 3.9 mins are spent on computation (second screenshot), which is heaviest on the joined data shortly after the map. I thought that the data might not be partitioned equally and that one executor has to perform all the computation. Therefore, I tried to partition the pair RDD manually (using .partitionBy(new HashPartitioner(40))) right before the join (similar execution time) or right after the join (execution even slower).
What could be the issue? Any help will be appreciated.
It's hard to tell without seeing your queries and understanding your Dataset, I'm guessing you didn't include it either because it's very complex or sensitive? So this is a little bit of a shot in the dark, however this looks a lot like a problem we dealt with on my team at work. My rough guess at what is happening is that during one of your joins, you have a key space that has a high cardinality, but very uneven distribution. In our case, we were joining on sources of web traffic, which while we have thousands of possible sources of traffic, the overwhelming majority of the traffic comes from just a few. This caused a problem when we joined. The keys would be distributed evenly among the executors, however since maybe 95% of the data shared maybe 3 or 4 keys, a very small number of executors were doing most of the work. When you find a join that suffers from this, the thing to do is to pick the smaller of the two datasets and explicitly perform a broadcast join. (Spark normally will try to do this, but it's not always perfect at being able to tell when it should.)
To do this, let's say you have two DataFrames. One of them has two columns, number and stringRep, with one row for each integer from 0 to 10000, where stringRep is just a string representation of that number, so "one", "two", "three", etc. We'll call this numToString.
The other DataFrame has a key column called kind to join against number in numToString, some other irrelevant data, and 100,000,000 rows. We'll call this DataFrame ourData. Then let's say that the distribution of the 100,000,000 rows in ourData is: 90% have kind == 1, 5% have kind == 2, and the remaining 5% are distributed pretty evenly among the remaining values of number. When you perform the following code:
val numToString: DataFrame = loadNumToString()
val ourData: DataFrame = loadOurCode()
val joined = ourData.join(numToString).where(ourData("kind") === numToString("number"))
...it is very likely that Spark will send 90% of the data (that which has kind == 1) to one executor, 5% of the data (that which has kind == 2) to another executor, and the remaining 5% smeared across the rest, leaving two executors with huge partitions and the rest with very tiny ones.
The way around this as I mentioned before is to explicitly perform a broadcast join. What this does is take one DataFrame and distribute it entirely to each node. So you would do this instead:
import org.apache.spark.sql.functions.broadcast
val joined = ourData.join(broadcast(numToString)).where(ourData("kind") === numToString("number"))
...which would send numToString to each executor. Assuming that ourData was evenly partitioned beforehand, the data should remain evenly partitioned across executors. This might not be your problem, but it does sound a lot like a problem we were having. Hope it helps!
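The difference between the two strategies can be sketched with plain Scala collections (no Spark dependency). This is only an illustration under stated assumptions: the data are scaled down to 100 rows with the same 90% / 5% / 5% split, the partition count of 4 is arbitrary, and the names mirror the example above. The shuffle variant regroups rows by key hash; the broadcast variant leaves each original partition in place and looks keys up in a local copy of the small table.

```scala
object BroadcastJoinSketch {
  // Hash rule assumed for the shuffle: non-negative hashCode modulo partition count.
  def partitionOf(key: Int, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Small side: stands in for the table that broadcast(numToString) ships to every executor.
  val numToString: Map[Int, String] =
    Map(1 -> "one", 2 -> "two", 3 -> "three", 4 -> "four")

  // Large side, scaled down: 90 rows with kind 1, 5 rows with kind 2, 5 rows with kind 3 or 4.
  val kinds: Vector[Int] = Vector.tabulate(100) { i =>
    if (i < 90) 1 else if (i < 95) 2 else if (i % 2 == 0) 3 else 4
  }

  // The large side starts out evenly split into 4 partitions of 25 rows.
  val ourData: Vector[Vector[Int]] = kinds.grouped(25).toVector

  // Shuffle join: every row moves to the partition chosen by its key.
  val shuffledSizes: Vector[Int] =
    Vector.tabulate(4)(p => kinds.count(k => partitionOf(k, 4) == p))

  // Broadcast join: each partition joins locally against its copy of numToString.
  val broadcastJoined: Vector[Vector[(Int, String)]] =
    ourData.map(_.flatMap(k => numToString.get(k).map(s => (k, s))))
  val broadcastSizes: Vector[Int] = broadcastJoined.map(_.size)
}
```

In this sketch the shuffle leaves one partition with 90 of the 100 rows, while the broadcast variant keeps 25 rows in each partition. The price is one full copy of the small table per executor, so the approach only fits when that table is small enough to hold in memory.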
More information on broadcast joins can be found here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html

Handling Skew data in apache spark production scenario

Can anyone explain how skewed data is handled in production for Apache Spark?
Scenario:
We submitted the Spark job using "spark-submit", and in the Spark UI we observed that a few tasks are taking a long time, which indicates the presence of skew.
Questions:
(1) What steps shall we take (re-partitioning, coalesce, etc.)?
(2) Do we need to kill the job and then include the skew solutions in the jar and re-submit the job?
(3) Can we solve this issue by running commands like (coalesce) directly from the shell without killing the job?
Data skew is primarily a problem when applying non-reducing by-key (shuffling) operations. The two most common examples are:
Non-reducing groupByKey (RDD.groupByKey, Dataset.groupByKey(...).mapGroups, Dataset.groupBy.agg(collect_list)).
RDD and Dataset joins.
Rarely, the problem is related to the properties of the partitioning key and partitioning function, with no pre-existing issue with the data distribution.
// All keys are unique - no obvious data skew
val rdd = sc.parallelize(Seq(0, 3, 6, 9, 12)).map((_, None))
// Drastic data skew
rdd.partitionBy(new org.apache.spark.HashPartitioner(3)).glom.map(_.size).collect
// Array[Int] = Array(5, 0, 0)
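The result above can be reproduced without a cluster. The following is a minimal sketch in plain Scala that assumes the rule used by HashPartitioner (non-negative key.hashCode modulo the number of partitions); partitionSizes is a hypothetical helper standing in for partitionBy(...).glom.map(_.size).collect. It also shows how the same keys spread out under other partition counts.

```scala
object HashPartitionSketch {
  // Non-negative hashCode modulo partition count, as a hash partitioner would compute it.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Number of keys that land in each partition.
  def partitionSizes(keys: Seq[Any], numPartitions: Int): Vector[Int] =
    Vector.tabulate(numPartitions)(p => keys.count(k => partitionOf(k, numPartitions) == p))

  // All keys are unique multiples of 3, as in the example above.
  val keys: Seq[Int] = Seq(0, 3, 6, 9, 12)

  val with3 = partitionSizes(keys, 3) // Vector(5, 0, 0): every key is 0 modulo 3
  val with4 = partitionSizes(keys, 4) // Vector(2, 1, 1, 1)
  val with5 = partitionSizes(keys, 5) // Vector(1, 1, 1, 1, 1)
}
```

The skew here comes purely from the interaction between the keys (all multiples of 3) and the partition count, not from duplicated keys.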
What steps shall we take(re-partitioning,coalesce,etc.)?
Repartitioning (never coalesce) can help you with the latter case by:
Changing the partitioner.
Adjusting the number of partitions to minimize the possible impact of the data (here you can use the same rules as for associative arrays - prime numbers and powers of two should be preferred, although they might not resolve the problem fully, like 3 in the example used above).
The former cases typically won't benefit much from repartitioning, because the skew is naturally induced by the operation itself. Values with the same key cannot be spread across multiple partitions, and the non-reducing character of the process is minimally affected by the initial data distribution.
These cases have to be handled by adjusting the logic of your application. It could mean a number of things in practice, depending on the data or problem:
Removing operation completely.
Replacing exact result with an approximation.
Using different workarounds (typically with joins), for example frequent-infrequent split, iterative broadcast join or prefiltering with probabilistic filter (like Bloom filter).
Do we need to kill the job and then include the skew solutions in the jar and re-submit the job?
Normally you have to at least resubmit the job with adjusted parameters.
In some cases (mostly RDD batch jobs) you can design your application to monitor task execution and to kill and resubmit a particular job in case of possible skew, but it might be hard to implement right in practice.
In general, if data skew is possible, you should design your application to be immune to data skews.
Can we solve this issue by running the commands like (coalesce) directly from shell without killing the job?
I believe this is already answered by the points above, but just to say - there is no such option in Spark. You can of course include these in your application.
We can fine-tune the query to reduce its complexity.
We can try a salting mechanism:
Salt the skewed column with a random number to create a better distribution of data across partitions.
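As a rough illustration of salting, here is a sketch in plain Scala (no Spark dependency) of a two-stage count. The data, the value of saltBuckets and all names are hypothetical, and the salt is assigned round-robin rather than randomly so that the result is reproducible. The pattern only applies directly to aggregations whose partial results can be combined; a salted join additionally needs the other side replicated once per salt value.

```scala
object SaltingSketch {
  val saltBuckets = 4 // hypothetical number of salt values

  // Skewed input: 90 records for the key "hot", 10 for the key "cold".
  val records: Vector[(String, Int)] =
    Vector.fill(90)("hot" -> 1) ++ Vector.fill(10)("cold" -> 1)

  // Without salting, one group holds every "hot" record.
  val largestUnsalted: Int = records.groupBy(_._1).values.map(_.size).max

  // Stage 1: extend the key with a salt so the hot key splits into several groups.
  val salted: Vector[((String, Int), Int)] =
    records.zipWithIndex.map { case ((key, value), i) => ((key, i % saltBuckets), value) }
  val largestSalted: Int = salted.groupBy(_._1).values.map(_.size).max

  // Partial aggregate per (key, salt).
  val partial: Map[(String, Int), Int] =
    salted.groupBy(_._1).map { case (saltedKey, rows) => saltedKey -> rows.map(_._2).sum }

  // Stage 2: drop the salt and combine the partial aggregates per original key.
  val totals: Map[String, Int] =
    partial.toVector
      .groupBy { case ((key, _), _) => key }
      .map { case (key, parts) => key -> parts.map(_._2).sum }
}
```

The largest group shrinks from 90 records to 23, and the second stage only has to combine at most saltBuckets partial results per key.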
Spark 3 enables the Adaptive Query Execution (AQE) mechanism to avoid such scenarios in production.
Below are a couple of Spark properties which we can fine-tune accordingly.
spark.sql.adaptive.enabled=true
spark.databricks.adaptive.autoBroadcastJoinThreshold=30MB # Databricks-specific size threshold; changes sort merge join to broadcast join dynamically, default size = 30 MB
spark.sql.adaptive.coalescePartitions.enabled=true # shuffle partitions are coalesced dynamically
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB # default
spark.sql.adaptive.coalescePartitions.minPartitionSize=<size> # takes a size, not a boolean
spark.sql.adaptive.coalescePartitions.minPartitionNum=<integer> # takes a number, not a boolean; default 2x the number of cores
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5 # default
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB # default

Kafka Streams - reducing the memory footprint for large state stores

I have a topology (see below) that reads off a very large topic (over a billion messages per day). The memory usage of this Kafka Streams app is pretty high, and I was looking for some suggestions on how I might reduce the footprint of the state stores (more details below). Note: I am not trying to scapegoat the state stores, I just think there may be a way for me to improve my topology - see below.
// stream receives 1 billion+ messages per day
stream
.flatMap((key, msg) -> rekeyMessages(msg))
.groupBy((key, value) -> key)
.reduce(new MyReducer(), MY_REDUCED_STORE)
.toStream()
.to(OUTPUT_TOPIC);
// stream the compacted topic as a KTable
KTable<String, String> rekeyedTable = builder.table(OUTPUT_TOPIC, REKEYED_STORE);
// aggregation 1
rekeyedTable.groupBy(...).aggregate(...)
// aggregation 2
rekeyedTable.groupBy(...).aggregate(...)
// etc
More specifically, I'm wondering if streaming the OUTPUT_TOPIC as a KTable is causing the state store (REKEYED_STORE) to be larger than it needs to be locally. For changelog topics with a large number of unique keys, would it be better to stream these as a KStream and do windowed aggregations? Or would that not reduce the footprint like I think it would (e.g. that only a subset of the records - those in the window, would exist in the local state store).
Anyway, I can always spin up more instances of this app, but I'd like to make each instance as efficient as possible. Here are my questions:
Are there any config options, general strategies, etc. that should be considered for a Kafka Streams app with this level of throughput?
Are there any guidelines for how memory intensive a single instance should be? Even if you have a somewhat arbitrary guideline, it may be helpful to share with others. One of my instances is currently utilizing 15GB of memory - I have no idea if that's good/bad/doesn't matter.
Any help would be greatly appreciated!
With your current pattern
stream.....reduce().toStream().to(OUTPUT_TOPIC);
builder.table(OUTPUT_TOPIC, REKEYED_STORE)
you get two stores with the same content. One for the reduce() operator and one for reading the table() -- this can be reduced to one store though:
KTable rekeyedTable = stream.....reduce(.);
rekeyedTable.toStream().to(OUTPUT_TOPIC); // in case you need this output topic; otherwise you can also omit it completely
This should reduce your memory usage notably.
About windowing vs non-windowing:
(1) It's a matter of your required semantics, so simply switching from a non-windowed to a windowed reduce seems questionable.
(2) Even if you can go with windowed semantics, you would not necessarily reduce memory. Note that in the aggregation case, Streams does not store the raw records but only the current aggregate result (i.e., key + currentAgg). Thus, for a single key, the storage requirement is the same for both cases (a single window has the same storage requirement). At the same time, if you go with windows, you might actually need more memory, as you get an aggregate per key per window (while you get just a single aggregate per key in the non-windowed case). The only scenario in which you might save memory is the case in which your 'key space' is spread out over a long period of time. For example, you might not get any input records for some keys for a long time. In the non-windowed case, the aggregate(s) of those records will be stored all the time, while in the windowed case the key/agg record will be dropped and new entries will be re-created if records with this key occur again later on (but keep in mind that you lose the previous aggregate in this case -- cf. (1)).
Last but not least, you might want to have a look into the guidelines for sizing an application: http://docs.confluent.io/current/streams/sizing.html

Spark dataframe saveAsTable is using a single task

We have a pipeline for which the initial stages are properly scalable - using several dozen workers apiece.
One of the last stages is
dataFrame.write.format(outFormat).mode(saveMode).
partitionBy(partColVals.map(_._1): _*).saveAsTable(tname)
For this stage we end up with a single worker. This clearly does not work for us - in fact the worker runs out of disk space - on top of being very slow.
Why would that command end up running on a single worker/single task only?
Update The output format was parquet. The number of partition columns did not affect the result (tried one column as well as several columns).
Another update None of the following conditions (as posited by an answer below) held:
coalesce or partitionBy statements
window / analytic functions
Dataset.limit
sql.shuffle.partitions
The problem is unlikely to be related in any way to saveAsTable.
A single task in a stage indicates that the input data (Dataset or RDD) has only one partition. This is in contrast to cases where there are multiple tasks but one or more have significantly higher execution time, which normally corresponds to partitions containing positively skewed keys. Also, you should not confound a single-task scenario with low CPU utilization. Low CPU utilization is usually a result of insufficient IO throughput (high CPU wait times are the most obvious indication of that), but in rare cases can be traced to the usage of shared objects with low-level synchronization primitives.
Since standard data sources don't shuffle data on write (including cases where partitionBy and bucketBy options are used) it is safe to assume that data has been repartitioned somewhere in the upstream code. Usually it means that one of the following happened:
Data has been explicitly moved to a single partition using coalesce(1) or repartition(1).
Data has been implicitly moved to a single partition for example with:
Dataset.limit
Window function applications with window definition lacking PARTITION BY clause.
df.withColumn(
"row_number",
row_number().over(Window.orderBy("some_column"))
)
The spark.sql.shuffle.partitions option is set to 1 and the upstream code includes a non-local operation on a Dataset.
The Dataset is the result of applying a global aggregate function (without a GROUP BY clause). This is usually not an issue, unless the function is non-reducing (collect_list or comparable).
While there is no evidence that it is the problem here, in the general case you should also consider the possibility that the data contains only a single partition all the way back to the source. This usually happens when input is fetched using the JDBC source, but third-party formats can exhibit the same behavior.
To identify the source of the problem you should either check the execution plan for the input Dataset (explain(true)) or check SQL tab of the Spark Web UI.

What is RDD in spark

Definition says:
RDD is immutable distributed collection of objects
I don't quite understand what that means. Is it like data (partitioned objects) stored on a hard disk? If so, then how come RDDs can hold user-defined classes (such as in Java, Scala or Python)?
From this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch03.html It mentions:
Users create RDDs in two ways: by loading an external dataset, or by
distributing a collection of objects (e.g., a list or set) in their
driver program
I am really confused about RDDs in general and in relation to Spark and Hadoop.
Can someone please help?
An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any datasource, e.g. text files, a database via JDBC, etc.
The formal definition is:
RDDs are fault-tolerant, parallel data structures that let users
explicitly persist intermediate results in memory, control their
partitioning to optimize data placement, and manipulate them using a
rich set of operators.
If you want the full details on what an RDD is, read one of the core Spark academic papers, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
An RDD is a logical reference to a dataset which is partitioned across many server machines in the cluster. RDDs are immutable and recover themselves in case of failure.
The dataset could be data loaded externally by the user. It could be a JSON file, a CSV file or a text file with no specific data structure.
UPDATE: Here is the paper that describes RDD internals:
Hope this helps.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.
RDDs have the following properties –
Immutability and partitioning:
RDDs are composed of a collection of records which are partitioned. A partition is the basic unit of parallelism in an RDD, and each partition is one logical division of data which is immutable and created through some transformations on existing partitions. Immutability helps to achieve consistency in computations.
Users can define their own criteria for partitioning based on keys on which they want to join multiple datasets if needed.
Coarse grained operations:
Coarse grained operations are operations which are applied to all elements in datasets. For example – a map, or filter or groupBy operation which will be performed on all elements in a partition of RDD.
Fault Tolerance:
Since RDDs are created through a set of transformations, Spark logs those transformations rather than the actual data. The graph of the transformations that produce an RDD is called its lineage graph.
For example –
val firstRDD = sc.textFile("hdfs://...")
val secondRDD = firstRDD.filter(someFunction)
val thirdRDD = secondRDD.map(someFunction)
val result = thirdRDD.count()
If we lose some partition of an RDD, we can replay the transformations in the lineage on that partition to achieve the same computation, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus achieves faster computations.
Lazy evaluations:
Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations. So, in the above example, the RDDs will be evaluated only when the count() action is invoked.
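Lineage and lazy evaluation can be imitated in a few lines of plain Scala. The sketch below is not how Spark is implemented; MiniRDD is a hypothetical stand-in that stores only a description of its transformations and a recipe to compute its elements. The counter linesRead shows that nothing is read until an action runs, and that a result which was not persisted is rebuilt by replaying the recipe.

```scala
object LazyLineageSketch {
  var linesRead = 0 // how many source elements have actually been read

  // Hypothetical stand-in for an RDD: lineage description plus a recipe to (re)compute it.
  final case class MiniRDD[T](lineage: List[String], compute: () => Iterator[T]) {
    // Transformations only record a new recipe; they do not touch the data.
    def filter(p: T => Boolean): MiniRDD[T] =
      MiniRDD("filter" :: lineage, () => compute().filter(p))
    def map[U](f: T => U): MiniRDD[U] =
      MiniRDD("map" :: lineage, () => compute().map(f))
    // Action: runs the whole recipe and consumes every element.
    def count(): Int = compute().foldLeft(0)((n, _) => n + 1)
  }

  val source = Vector("a", "", "bb", "ccc")

  val firstRDD  = MiniRDD(List("textFile"), () => source.iterator.map { s => linesRead += 1; s })
  val secondRDD = firstRDD.filter(_.nonEmpty)
  val thirdRDD  = secondRDD.map(_.length)

  val readsBeforeAction = linesRead        // still 0: only recipes exist so far
  val result            = thirdRDD.count() // the action triggers the read
  val readsAfterAction  = linesRead

  // Nothing was persisted, so a second action replays the lineage from the source.
  val recomputed        = thirdRDD.count()
  val readsAfterReplay  = linesRead
}
```

Here thirdRDD.lineage is List("map", "filter", "textFile"), which is all the information needed to rebuild the data after a loss.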
Persistence:
Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on Disk etc.)
These properties of RDDs make them useful for fast computations.
Resilient Distributed Dataset (RDD) is the way Spark represents data. The data can come from various sources :
Text File
CSV File
JSON File
Database (via JDBC driver)
RDD in relation to Spark
Spark is simply an implementation of RDD.
RDD in relation to Hadoop
The power of Hadoop resides in the fact that it lets users write parallel computations without having to worry about work distribution and fault tolerance. However, Hadoop is inefficient for applications that reuse intermediate results. For example, iterative machine learning algorithms, such as PageRank, K-means clustering and logistic regression, reuse intermediate results.
RDDs allow intermediate results to be stored in RAM. Hadoop would have to write them to an external stable storage system, which generates disk I/O and serialization. With RDDs, Spark is up to 20X faster than Hadoop for iterative applications.
Further implementation details about Spark
Coarse-Grained transformations
The transformations applied to an RDD are Coarse-Grained. This means that the operations on a RDD are applied to the whole dataset, not on its individual elements. Therefore, operations like map, filter, group, reduce are allowed, but operations like set(i) and get(i) are not.
The inverse of coarse-grained is fine-grained. A fine-grained storage system would be a database.
Fault Tolerant
RDDs are fault tolerant, which is a property that enables the system to continue working properly in the event of the failure of one of its components.
The fault tolerance of Spark is strongly linked to its coarse-grained nature. The only way to implement fault tolerance in a fine-grained storage system is to replicate its data or log updates across machines. However, in a coarse-grained system like Spark, only the transformations are logged. If a partition of an RDD is lost, the RDD has enough information to recompute it quickly.
Data storage
The RDD is "distributed" (separated) into partitions. Each partition can be present in the memory or on the disk of a machine. When Spark wants to launch a task on a partition, it sends the task to the machine containing the partition. This is known as "locality-aware scheduling".
Sources :
Great research papers about Spark :
http://spark.apache.org/research.html
Including the paper suggested by Ewan Leith.
RDD = Resilient Distributed Dataset
Resilient (Dictionary meaning) = (of a substance or object) able to recoil or spring back into shape after bending, stretching, or being compressed
RDD is defined as (from LearningSpark - OREILLY): The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.
This means 'data' is surely available at all times. Also, Spark can run without Hadoop and hence data is NOT replicated. One of the best characteristics of Hadoop 2.0 is 'High Availability' with the help of a passive standby NameNode. The same is achieved by RDDs in Spark.
A given RDD (data) can span various nodes in a Spark cluster (like in a Hadoop-based cluster).
If any node crashes, Spark can recompute the RDD and load the data on some other node, so the data is always available.
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel (http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)
To compare an RDD with a Scala collection, below are a few differences:
Same, but runs on a cluster
Lazy in nature, whereas Scala collections are strict
An RDD is always immutable, i.e., you cannot change the state of the data in the collection
RDDs recover themselves, i.e., they are fault-tolerant
RDD (Resilient Distributed Datasets) are an abstraction for representing data. Formally they are a read-only, partitioned collection of records that provides a convenient API.
RDD provide a performant solution for processing large datasets on cluster computing frameworks such as MapReduce by addressing some key issues:
data is kept in memory to reduce disk I/O; this is particularly relevant for iterative computations -- not having to persist intermediate data to disk
fault-tolerance (resilience) is obtained not by replicating data but by keeping track of all transformations applied to the initial dataset (the lineage). This way, in case of failure lost data can always be recomputed from its lineage and avoiding data replication again reduces storage overhead
lazy evaluation, i.e. computations are carried out only when they're needed
RDDs have two main limitations:
they're immutable (read-only)
they only allow coarse-grained transformations (i.e. operations that apply to the entire dataset)
One nice conceptual advantage of RDD's is that they pack together data and code making it easier to reuse data pipelines.
Sources: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, An Architecture for Fast and General Data Processing on Large Clusters
An RDD is a way of representing data in Spark. The source of the data can be JSON, CSV, a text file or some other source.
An RDD is fault tolerant, which means that it stores data in multiple locations (i.e. the data is stored in distributed form), so if a node fails the data can be recovered.
In an RDD, data is available at all times.
However, RDDs are slow and hard to code against, and hence outdated.
They have been replaced by the concepts of DataFrame and Dataset.
RDD
is a Resilient Distributed Dataset.
It is a core part of Spark.
It is a low-level API of Spark.
DataFrames and Datasets are built on top of RDDs.
RDDs are nothing but row-level data that sits on n executors.
RDDs are immutable, which means you cannot change an RDD. But you can create a new RDD using transformations.
Resilient Distributed Datasets (RDDs)
Resilient: If an operation is lost while performing on a node in spark, the dataset can be reconstituted from history.
Distributed: Data in RDDs is divided into one or many partitions and distributed as in-memory collections of objects across worker nodes in the cluster.
Dataset: RDDs are datasets that consist of records; records are uniquely identifiable data collections within a dataset.