I am using Scala on Flink with DataSet API.
I want to re-partition my data across the nodes. Spark has a function that lets the user re-partition the data with a given numberOfPartitions parameter (link), and I believe Flink does not support such a function.
Thus, I wanted to achieve this by implementing a custom partitioning function.
My data is of type DataSet[(Double, SparseVector)].
An example line from the data:
(1.0 SparseVector((2024,1.0), (2025,1.0), (2030,1.0), (2045,1.0), (2046,1.41), (2063,1.0), (2072,1.0), (3031,1.0), (3032,1.0), (4757,1.0), (4790,1.0), (177196,1.0), (177197,0.301), (177199,1.0), (177202,1.0), (1544177,1.0), (1544178,1.0), (1544179,1.0), (1654031,1.0), (1654190,1.0), (1654191,1.0), (1654192,1.0), (1654193,1.0), (1654194,1.0), (1654212,1.0), (1654237,1.0), (1654238,1.0)))
Since my "Double" is binary (1 or -1), I want to partition my data according to the length of the SparseVector.
My custom partitioner is as follows:
class myPartitioner extends Partitioner[SparseVector] {
  override def partition(key: SparseVector, numPartitions: Int): Int = {
    key.size % numPartitions
  }
}
I call this custom partitioner as follows:
data.partitionCustom(new myPartitioner(),1)
Can somebody please help me understand how to specify the number of partitions as the "numPartitions" argument when calling myPartitioner in Scala?
Thank you.
In Flink you can set the parallelism for a single operator with setParallelism, or for all operators with environment.setParallelism. I hope this link will help you.
Spark uses repartition(n: Int) function to redistribute data into n partitions, which will be processed by n tasks. From my perspective, this includes two changes: data redistribution and number of downstream tasks.
Therefore, in Apache Flink, I think that the Partitioner is mapped to data redistribution and the parallelism is mapped to the number of downstream tasks, which means you can use setParallelism to determine the "numPartitions".
I'm assuming you're using the length of the SparseVector just to have something that gives you relatively random values to use for partitioning. If that's true, then you can just do a DataSet.rebalance(). If you follow that by any operator (including a Sink) where you set the parallelism to numPartitions, then you should get nicely repartitioned data.
But your description of "...want to re-partition my data across the nodes" makes me think that you're trying to apply Spark's concept of RDDs to Flink, which isn't really valid. E.g. assuming you have numPartitions parallel operators processing the (re-partitioned) data in your DataSet, these operators will be running in slots provided by the available TaskManagers, and these slots might or might not be on different physical servers.
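To make the relationship between the partitioner and the parallelism concrete, here is a minimal standalone sketch with no Flink dependency. VectorSizePartitioner and distribute are hypothetical names that only mirror the body of the question's myPartitioner. The assumption, taken from the answers above, is that the framework passes the downstream parallelism in as numPartitions, so the caller never supplies it directly.

```scala
// Standalone simulation; VectorSizePartitioner is a hypothetical stand-in
// for the question's myPartitioner and does not use any Flink classes.
object VectorSizePartitioner {
  // Same rule as partition(key, numPartitions) in the question.
  // numPartitions is supplied by the framework, not by the caller.
  def partition(vectorSize: Int, numPartitions: Int): Int =
    vectorSize % numPartitions

  // Groups sample vector sizes by the partition they would be routed to
  // for a given downstream parallelism.
  def distribute(sizes: Seq[Int], parallelism: Int): Map[Int, Seq[Int]] =
    sizes.groupBy(size => partition(size, parallelism))
}
```

With a parallelism of 2 the sizes 27, 28, 30 and 31 fall into two groups; with a parallelism of 4 the same rule spreads them over up to four partitions, without any change to the partitioner itself.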
I have two pair RDDs with the structure RDD[String, Int], called rdd1 and rdd2.
Each of these RDDs is grouped by its key, and I want to execute a function over its values (so I will use the mapValues method).
Does the "groupByKey" method create a new partition for each key, or do I have to specify this manually using "partitionBy"?
I understand that the partitions of a RDD won't change if I don't perform operations that change the key, so if I perform a mapValues operation on each RDD or if I perform a join operation between the previous two RDDs, the partitions of the resulting RDD won't change. Is it true?
Here we have a code example. Notice that "function" is not defined because it is not important here.
val lvl1rdd=rdd1.groupByKey()
val lvl2rdd=rdd2.groupByKey()
val lvl1_lvl2=lvl1rdd.join(lvl2rdd)
val finalrdd=lvl1_lvl2.mapValues(value => function(value))
If I join the previous RDDs and execute a function over the values of the resulting RDD (mapValues), all the work is done on a single worker instead of being distributed over the different worker nodes of the cluster. The desired behaviour is to execute the function passed to mapValues in parallel, on as many nodes as the cluster allows.
1) Avoid groupByKey operations as they act as bottleneck for network I/O and execution performance.
Prefer reduceByKey Operation in this case as the data shuffle is comparatively less than groupByKey and we can witness the difference much better if it is a larger Dataset.
// reduceByKey expects a binary combine function of type (V, V) => V
val lvl1rdd = rdd1.reduceByKey((a, b) => function(a, b))
val lvl2rdd = rdd2.reduceByKey((a, b) => function(a, b))
//perform the Join Operation on these resultant RDD's
Applying the function to the RDDs separately and then joining them is far better than joining the RDDs and applying a function using groupByKey().
This will also ensure the tasks get distributed among different executors and execute in parallel
Refer to this link.
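The difference in shuffle volume can be illustrated without Spark. The following is a rough sketch using plain Scala collections, where each inner Seq stands in for one partition. ShuffleSketch and its methods are hypothetical names, and the "records shuffled" counts are a simplification of what Spark actually moves over the network.

```scala
// Plain-collections simulation of groupByKey-style versus reduceByKey-style
// shuffling. All names here are hypothetical; this is not Spark code.
object ShuffleSketch {
  type Partition = Seq[(String, Int)]

  // groupByKey-style: every record crosses the shuffle unchanged.
  def recordsShuffledByGroup(partitions: Seq[Partition]): Int =
    partitions.map(_.size).sum

  // reduceByKey-style: each partition first combines its own values per key.
  def combineLocally(p: Partition): Map[String, Int] =
    p.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // After the local combine, at most one record per key per partition moves.
  def recordsShuffledByReduce(partitions: Seq[Partition]): Int =
    partitions.map(p => combineLocally(p).size).sum

  // Final merge of the per-partition partial results.
  def reduceByKey(partitions: Seq[Partition]): Map[String, Int] =
    partitions.flatMap(combineLocally).groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).sum }
}
```

For two partitions of three records each with keys "a" and "b", the grouped variant moves all six records while the reduced variant moves four partial sums, and the gap grows with the number of repeated keys per partition.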
2) The underlying partitioning technique is the hash partitioner. If we assume that our data is initially located in n partitions, then the groupByKey operation will use the hash mechanism:
partition = key.hashCode() % numPartitions
This will create a fixed number of partitions, which can be more than the initial number when you use the groupByKey operation. We can also customize the number of partitions to be made. For example:
val result_rdd = rdd1.partitionBy(new HashPartitioner(2))
This will create 2 partitions and in this way we can set the number of partitions.
For deciding the optimal number of partitions refer this answer https://stackoverflow.com/a/40866286/7449292
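One detail worth spelling out: on the JVM, key.hashCode() % numPartitions can be negative, so a hash partitioner has to map the remainder into the range [0, numPartitions). The sketch below shows the idea in plain Scala. HashPartitionSketch is a hypothetical name and this is an illustration of the technique, not Spark's source.

```scala
// Illustration of hash partitioning with a non-negative modulo.
// HashPartitionSketch is a hypothetical name; no Spark classes are used.
object HashPartitionSketch {
  // The % operator keeps the sign of the dividend, so shift negative
  // remainders into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Null keys go to partition 0; everything else is routed by hashCode.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)
}
```

Equal keys always land in the same partition, which is why records with the same key end up on one task after partitionBy(new HashPartitioner(n)).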
I have a mix-and-match Scala topology where the main worker is a PAPI processor, and other parts are connected through DSL.
EventsProcessor:
INPUT: eventsTopic
OUTPUT: visitorsTopic (and others)
Data throughout the topics (incl. original eventsTopic) is partitioned through a, let's call it DoubleKey that has two fields.
Visitors are sent to visitorsTopic through a Sink:
.addSink(VISITOR_SINK_NAME, visitorTopicName,
  DoubleKey.getSerializer(), Visitor.getSerializer(),
  visitorSinkPartitioner, EVENT_PROCESSOR_NAME)
In the DSL, I create a KV KTable over this topic:
val visitorTable = builder.table(
  visitorTopicName,
  Consumed.`with`(DoubleKey.getKafkaSerde(), Visitor.getKafkaSerde()),
  Materialized.as(visitorStoreName))
which I later connect to the EventProcessor:
topology.connectProcessorAndStateStores(EVENT_PROCESSOR_NAME, visitorStoreName)
Everything is co-partitioned (via DoubleKey). visitorSinkPartitioner performs a typical modulo operation:
Math.abs(partitionKey.hashCode % numPartitions)
In the PAPI processor EventsProcessor, I query this table to see whether there are existing Visitors already.
However, in my tests (using EmbeddedKafka, but that should not make a difference), everything is fine with one partition: the EventsProcessor checks the KTable for two events with the same DoubleKey, and on the second event (after some delay) it sees the existing Visitor in the store. With a higher number of partitions, the EventsProcessor never sees the value in the store.
However, if I check the store via the API (iterating store.all()), the record is there. So I understand it must be going to a different partition.
Since the KTable should work on the data on its partition, and everything is sent to the same partition, (using explicit partitioners calling the same code), the KTable should get that data on the same partition.
Are my assumptions correct? What could be happening?
KafkaStreams 1.0.0, Scala 2.12.4.
PS. Of course it would work doing the puts on the PAPI creating the store through PAPI instead of StreamsBuilder.table(), since that would definitely use the same partition where the code runs, but that's out of the question.
Yes, the assumptions were correct.
In case it helps anyone:
I had a problem when passing the Partitioner to the Scala EmbeddedKafka library. In one of the test suites it was not done right.
Now, following the ever-healthy practice of refactoring, I have this method used in all the suites of this topology.
def getEmbeddedKafkaTestConfig(zkPort: Int, kafkaPort: Int): EmbeddedKafkaConfig = {
  val producerProperties = Map(
    ProducerConfig.PARTITIONER_CLASS_CONFIG -> classOf[DoubleKeyPartitioner].getCanonicalName)
  EmbeddedKafkaConfig(kafkaPort = kafkaPort, zooKeeperPort = zkPort,
    customProducerProperties = producerProperties)
}
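Why the bug only showed up with more than one partition can be sketched in plain Scala. Everything below is hypothetical: the DoubleKey fields are invented for illustration, and otherPartition is merely "some different function of the key", not Kafka's real default partitioner. The point is only that two different partitioning rules always agree when there is a single partition and may disagree otherwise.

```scala
// Hypothetical sketch of a co-partitioning mismatch; no Kafka classes used.
object CoPartitionSketch {
  // Invented two-field key standing in for the question's DoubleKey.
  final case class DoubleKey(tenant: String, visitor: String)

  // The custom rule from the question: modulo of the key's hashCode.
  def customPartition(key: DoubleKey, numPartitions: Int): Int =
    Math.abs(key.hashCode % numPartitions)

  // Stand-in for a producer that was NOT given the custom partitioner.
  // This is not Kafka's algorithm, just a different function of the key.
  def otherPartition(key: DoubleKey, numPartitions: Int): Int =
    Math.abs(key.visitor.length % numPartitions)

  // Keys that the two rules would route to different partitions.
  def mismatches(keys: Seq[DoubleKey], numPartitions: Int): Seq[DoubleKey] =
    keys.filter(k => customPartition(k, numPartitions) != otherPartition(k, numPartitions))
}
```

With numPartitions = 1 both rules return 0 for every key, so mismatches is always empty, which matches the observation that the single-partition test passed.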
I am using structured streaming with Spark 2.1.1. I need to apply some business logic to incoming messages (from Kafka source).
Essentially, I need to pick up the message, get some key values, look them up in HBase and perform some more business logic on the dataset. The end result is a string message that needs to be written out to another Kafka queue.
However, since the abstraction for incoming messages is a dataframe (unbounded table - structured streaming), I have to iterate through the dataset received during a trigger through mapPartitions (partitions due to HBase client not being serializable).
During my process, I need to iterate through each row to execute the business process for it.
Is there a better approach that could help me avoid the dataFrame.mapPartitions call? I feel it's sequential and iterative!
Structured streaming basically forces me to generate an output data frame out of my business process, whereas there is none to start with. What other design pattern can I use to achieve my end goal ?
Would you recommend an alternative approach ?
When you talk about working with Dataframes in Spark, speaking very broadly, you can do one of 3 things
a) Generate a Dataframe
b) Transform a data frame
c) Consume a data frame
In structured streaming, a streaming DataFrame is generated using a DataSource. Normally you create sources using the methods exposed by sparkSession.readStream. This method returns a DataStreamReader, which has several methods for reading from various kinds of input. All of these return a DataFrame. Internally it creates a DataSource. Spark allows you to implement your own DataSource, but they recommend against it because, as of 2.2, the interface is considered experimental.
You transform data frames mostly using map or reduce, or using Spark SQL. There are different flavors of map (map, mapPartitions, mapPartitionsWithIndex, etc.). All of them basically take a row and return a row. Internally Spark does the work of parallelizing the calls to your map method. It partitions the data, spreads it around on executors on the cluster, and calls your map method in the executor. You don't need to worry about parallelism. It's built under the hood. mapPartitions is not "sequential". Yes, rows within a partition are processed sequentially, but multiple partitions are processed in parallel. You can easily control the degree of parallelism by partitioning your dataframe. If you have 5 partitions, you will have 5 tasks running in parallel. If you have 200, you can have 200 of them running in parallel if you have 200 cores.
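The reason mapPartitions suits a non-serializable client can be shown with plain Scala collections. In this sketch each inner Seq stands in for one partition, and LookupClient is a hypothetical stand-in for something like an HBase connection. It is a simulation of the pattern, not Spark code, and it runs the partitions one after another even though Spark would run them in parallel.

```scala
// Plain-collections simulation of the mapPartitions pattern.
// LookupClient is a hypothetical stand-in for a non-serializable client.
object MapPartitionsSketch {
  final class LookupClient {
    def lookup(key: String): String = key.toUpperCase
  }

  // Returns the transformed rows plus the number of clients created.
  def processPerPartition(partitions: Seq[Seq[String]]): (Seq[String], Int) = {
    var clientsCreated = 0
    val out = partitions.flatMap { rows =>
      val client = new LookupClient // created once per partition, on the worker
      clientsCreated += 1
      rows.map(client.lookup)       // rows inside one partition run sequentially
    }
    (out, clientsCreated)
  }
}
```

Three rows in two partitions need only two clients, whereas a per-row map would create three. In Spark the partitions would additionally be processed in parallel.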
Note that there is nothing stopping you from going out to external systems that manage state inside your transformation. However, your transformations should be idempotent. Given a set of input, they should always generate the same output, and leave the system in the same state over time. This can be difficult if you are talking to external systems inside your transformation. Structured Streaming provides an at-least-once guarantee. This means that the same row might be transformed multiple times. So, if you are doing something like adding money to a bank account, you might find that you have added the same amount of money twice to some of the accounts.
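The bank account example can be made concrete with a small sketch. Deposit, applyNaive and applyOnce are hypothetical names, and the assumption that each event carries a unique id is mine, not something Structured Streaming gives you for free. Tracking seen ids is one common way to get idempotence; it is not the only one.

```scala
// Sketch of a non-idempotent versus an idempotent update under redelivery.
// All names are hypothetical; each event is assumed to carry a unique id.
object IdempotentSinkSketch {
  final case class Deposit(eventId: String, account: String, amount: Long)

  // Not idempotent: replaying an event adds the amount again.
  def applyNaive(balances: Map[String, Long], d: Deposit): Map[String, Long] =
    balances.updated(d.account, balances.getOrElse(d.account, 0L) + d.amount)

  // Idempotent: events whose id has already been seen are ignored.
  def applyOnce(state: (Map[String, Long], Set[String]),
                d: Deposit): (Map[String, Long], Set[String]) = {
    val (balances, seen) = state
    if (seen(d.eventId)) state
    else (applyNaive(balances, d), seen + d.eventId)
  }
}
```

If the same deposit of 100 is delivered twice, folding with applyNaive leaves the account at 200, while folding with applyOnce leaves it at 100.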
Data is consumed by sinks. Normally, you add a sink by calling writeStream.format(...) on a DataFrame and then calling start. Structured Streaming has a handful of built-in sinks which (except for one) are more or less useless. You can create your own custom Sink, but again it's not recommended because the interface is experimental. The only useful sink is the one you implement yourself. It is called ForeachSink. Spark will call your foreach sink with all the rows in your partition. You can do whatever you want with the rows, which includes writing them to HBase. Note that because of the at-least-once nature of Structured Streaming, the same row might be fed to your ForeachSink multiple times. You are expected to implement it in an idempotent manner. Also, if you have multiple sinks, data is written to the sinks in parallel. You cannot control the order in which the sinks are called. It can happen that one sink is getting data from one micro batch while another sink is still processing data from the previous micro batch. Essentially, the sinks are eventually consistent, not immediately consistent.
Generally, the cleanest way to build your code is to avoid going to outside systems inside your transformations. Your transformations should purely transform data in data frames. If you want data from HBase, get it into a data frame, join it with your streaming data frame, and then transform it. This is because when you go to outside systems, it becomes difficult to scale. You want to scale up your transformations by increasing partitioning on your data frames and adding nodes. However, too many nodes talking to external systems can increase the load on the external systems and cause bottlenecks. Separating transformation from data retrieval allows you to scale them independently.
BUT!!!! there are big buts here......
1) When you talk about Structured Streaming, there is no way to implement a Source that can selectively get data from your HBase based on the data in your input. You have to do this inside a map(-like) method. So, IMO, what you have is perfectly fine if the data in HBase changes or there is a lot of data that you don't want to keep in memory. If your data in HBase is small and unchanging, then it's better to read it into a batch data frame, cache it and then join it with your streaming data frame. Spark will load all the data into its own memory/disk storage, and keep it there. If your data is small and changing very frequently, it's better to read it into a data frame, not cache it, and join it with a streaming data frame. Spark will load the data from HBase every time it runs a micro batch.
2) There is no way to order the execution of 2 separate sinks. So, if you need to write to a database and to Kafka, and you want to guarantee that a row is written to Kafka only after the row is committed in the database, then the only way to do that is to
a) do both writes in a foreach sink, or
b) write to one system in a map-like function and to the other in a foreach sink.
Unfortunately, if you have a requirement that requires you to read data from a streaming source, join it with data from batch source, transform it, write it to database, call an API, get the result from the API and write the result of the API to Kafka, and those operations have to be done in exact order, then the only way you can do this is by implementing sink logic in a transformation component. You have to make sure you keep the logic separate in separate map functions, so you can parallelize them in an optimal manner.
Also, there is no good way to know when a micro-batch is completely processed by your application, especially if you have multiple sinks.
Try ForeachWriter. In ForeachWriter, the process() method receives a single row from the data frame, and you can process the data as you want. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/ForeachWriter.html
I have data already sorted by key into my Spark Streaming partitions by virtue of Kafka, i.e. keys found on one node are not found on any other nodes.
I would like to use redis and its incrby (increment by) command as a state engine and to reduce the number of requests sent to redis, I would like to partially reduce my data by doing a word count on each worker node by itself. (The key is tag+timestamp to obtain my functionality from word count).
I would like to avoid shuffling and let redis take care of adding data across worker nodes.
Even when I have checked that data is cleanly split among worker nodes, .reduce(_ + _) (Scala syntax) takes a long time (several seconds vs. sub-second for map tasks), as the HashPartitioner seems to shuffle my data to a random node to add it there.
How can I write a simple word count reduce on each partition without triggering the shuffling step in Scala with Spark Streaming?
Note DStream objects lack some RDD methods, which are available only through the transform method.
It seems I might be able to use combineByKey. I would like to skip the mergeCombiners() step and instead leave accumulated tuples where they are.
The book "Learning Spark" enigmatically says:
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function (appending to a list) does not save any space. If we want to disable map-side combines, we need to specify the partitioner; for now you can just use the partitioner on the source RDD by passing rdd.partitioner.
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
The book then continues to supply no syntax for how to do this, nor have I had any luck with Google so far.
What is worse, as far as I know, the partitioner is not set for DStream RDDs in Spark Streaming, so I don't know how to supply a partitioner to combineByKey that doesn't end up shuffling data.
Also, what does "map-side" actually mean and what consequences does mapSideCombine = false have, exactly?
The Scala implementation for combineByKey can be found at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
Look for combineByKeyWithClassTag.
If the solution involves a custom partitioner, please include also a code sample for how to apply that partitioner to the incoming DStream.
This can be done using mapPartitions, which takes a function that maps an iterator of the input RDD on one partition to an iterator over the output RDD.
To implement a word count, I map to _._2 to remove the Kafka key and then perform a fast iterator word count using foldLeft, initializing a mutable.HashMap, which then gets converted to an Iterator to form the output RDD.
import scala.collection.mutable

val myDstream = messages
  .mapPartitions(it =>
    it.map(_._2) // drop the Kafka key, keep the value
      .foldLeft(new mutable.HashMap[String, Int])(
        (count, key) => count += (key -> (count.getOrElse(key, 0) + 1))
      ).toIterator
  )
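The per-partition counting step can be tried out on its own, without Spark or Kafka. The sketch below is a minimal standalone version of the same foldLeft idea; PartitionWordCount and countPartition are hypothetical names, and the input is assumed to be the already extracted message values of one partition.

```scala
import scala.collection.mutable

// Standalone version of the per-partition word count; no Spark classes used.
// PartitionWordCount is a hypothetical name.
object PartitionWordCount {
  // Counts the words seen by one partition's iterator. Nothing is merged
  // across partitions, so no shuffle would be needed for this step.
  def countPartition(words: Iterator[String]): Iterator[(String, Int)] =
    words.foldLeft(mutable.HashMap.empty[String, Int]) { (counts, word) =>
      counts(word) = counts.getOrElse(word, 0) + 1
      counts
    }.iterator
}
```

Each (word, count) pair emitted by a partition is only a partial count, which fits the original plan of letting Redis add the partial counts together with incrby.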
What is the difference between reduce vs. fold with respect to their technical implementation?
I understand that they differ in their signature, as fold accepts an additional parameter (i.e. an initial value) which gets added to each partition's output.
Can someone describe the use cases for these two actions?
Which would perform better in which scenario, assuming 0 is used as the initial value for fold?
Thanks in advance.
There is no practical difference when it comes to performance whatsoever:
RDD.fold action is using fold on the partition Iterators which is implemented using foldLeft.
RDD.reduce is using reduceLeft on the partition Iterators.
Both methods keep a mutable accumulator and process partitions sequentially using simple loops, with foldLeft implemented like this:
foreach (x => result = op(result, x))
and reduceLeft like this:
for (x <- self) {
  if (first) {
    ...
  }
  else acc = op(acc, x)
}
The practical difference between these methods in Spark is only related to their behavior on empty collections and the ability to use a mutable buffer (arguably, that is related to performance). You'll find some discussion in Why is the fold action necessary in Spark?
Moreover there is no difference in the overall processing model:
Each partition is processed sequentially using a single thread.
Partitions are processed in parallel using multiple executors / executor threads.
Final merge is performed sequentially using a single thread on the driver.
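The processing model above, and the behavior on empty input, can be sketched with plain Scala collections. In this simulation each inner Seq stands in for one partition. FoldVsReduceSketch is a hypothetical name, and returning None from reduce is my stand-in for the error Spark raises when reduce is called on an empty RDD.

```scala
// Plain-collections simulation of the fold and reduce processing model.
// FoldVsReduceSketch is a hypothetical name; this is not Spark code.
object FoldVsReduceSketch {
  // Per-partition foldLeft, then a final merge that starts from zero again,
  // so zero is applied once per partition plus once for the merge.
  def fold[A](partitions: Seq[Seq[A]], zero: A)(op: (A, A) => A): A =
    partitions.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  // Per-partition reduceLeft on non-empty partitions, then a final reduceLeft.
  // None stands in for the failure on an entirely empty collection.
  def reduce[A](partitions: Seq[Seq[A]])(op: (A, A) => A): Option[A] =
    partitions.filter(_.nonEmpty).map(_.reduceLeft(op)).reduceLeftOption(op)
}
```

With a neutral zero such as 0 for addition, both give the same sum. With a non-neutral zero the fold result depends on the number of partitions, which is why the zero value should be the identity of the operation.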