There is a very small but very powerful detail in the Kafka org.apache.kafka.clients.producer.internals.DefaultPartitioner implementation that bugs me a lot.
It is this line of code:
return DefaultPartitioner.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
To be more precise, the trailing % numPartitions. I keep asking myself: what is the reason for introducing such a strong constraint by making the partition ID a function of the number of existing partitions? Just for the convenience of having small numbers (human readable/traceable?) compared to the total number of partitions? Does anyone have broader insight into this?
I'm asking this because in our implementation, the key we use to store data in Kafka is domain-sensitive and we use it to retrieve information from Kafka based on that. For instance, we have consumers that need to subscribe ONLY to the partitions that are of interest to them, and the way we establish that link is through such keys.
Would it be safe to use a custom partitioner that doesn't do that modulo operation? Should we expect any performance degradation? Does this have any implications on the producer and/or consumer side?
Any ideas and comments are welcome.
Partitions in a Kafka topic are numbered from 0 to N-1. Thus, if a key is hashed to determine a partition, the resulting hash value must fall in the interval [0, N) -- it must be a valid partition number.
Using the modulo operation is a standard technique in hashing.
Normally you take the hash modulo the table size to make sure that the entry fits in the hash range.
Say you have a hash range of 5:
---------------------
| 0 | 1 | 2 | 3 | 4 |
---------------------
If the hash code of your entry happens to be 6, you take it modulo the number of available buckets so that it falls in the range: 6 % 5 = 1, so it goes to bucket 1 in this case.
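In Java terms, a minimal illustration of that mapping (the masking mirrors what DefaultPartitioner.toPositive does, so that a negative hash code cannot produce a negative bucket):

int numBuckets = 5;
int hash = 6;                                   // pretend hash code of the entry
int bucket = (hash & 0x7fffffff) % numBuckets;  // 6 % 5 = 1 -> bucket 1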
An even more important case is when you decide to add or remove a bucket from the range. Say you decrease the size of the hash map to 4 buckets; then the last bucket becomes inactive and you have to rehash the values in bucket #4 to the next bucket in the clockwise direction (I'm talking about consistent hashing here).
Also, newly arriving hashes need to be distributed among the 4 active buckets, because the 5th one goes away; this is what the modulo takes care of.
The same concept is used in distributed systems for rehashing, which happens when you add or remove a node from your cluster.
Kafka's DefaultPartitioner uses the modulo for the same purpose. Adding or removing partitions is a fairly common case, if you ask me: for example, during a high volume of incoming messages I might want to add more partitions so that I achieve higher write throughput and also higher read throughput, since I can consume the partitions in parallel.
You can override the partitioning algorithm based on your business logic by choosing a key in your message that makes sure the messages are distributed within the range [0...n).
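For example, a minimal sketch of such a custom partitioner, assuming the Java client's org.apache.kafka.clients.producer.Partitioner interface and keys that are numeric strings already encoding the desired partition (the class name and key format are made up for illustration):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical partitioner: the record key is expected to already be the
// desired partition number (as a numeric string), so no hashing/modulo is applied.
public class KeyIsPartitionPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        int target = Integer.parseInt((String) key);   // assumes a non-null numeric string key
        if (target < 0 || target >= numPartitions) {
            throw new IllegalArgumentException(
                "Key " + key + " does not map to a valid partition of " + topic);
        }
        return target;
    }

    @Override
    public void close() { }
}

It would be plugged in via the partitioner.class producer config (ProducerConfig.PARTITIONER_CLASS_CONFIG). Whatever logic you put in partition(), the returned value still has to be a valid partition number for the topic.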
The performance impact of using a custom partitioner entirely depends on your implementation of it.
I'm not entirely sure what you're trying to accomplish though. If I understand your question correctly, you want to use the value of the message key as the partition number directly, without doing any modulo operation on it to determine a partition?
In that case all you need to do is use the overloaded constructor ProducerRecord(java.lang.String topic, java.lang.Integer partition, K key, V value) when producing a message to a Kafka topic, passing in the desired partition number.
This way all the default partitioning logic is bypassed entirely and the message goes to the specified partition.
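A minimal sketch of that (the topic name, partition number, key and value are placeholders; producer is an already-configured KafkaProducer<String, String>):

import org.apache.kafka.clients.producer.ProducerRecord;

// The explicit partition argument (here: 3) means the partitioner is never consulted.
ProducerRecord<String, String> record =
        new ProducerRecord<>("my-topic", 3, "my-key", "my-value");
producer.send(record);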
We are working with Spark Streaming.
Our DataFrame contains the following columns:
[unitID, source, avrobyte, schemeType]
The unitID values are [10, 76, 510, 269, 7, 0, 508, 509, 511, 507].
We execute the following code:
val dfGrouped: KeyValueGroupedDataset[Int, Car] = dfSource.groupByKey(car1 => car1.unitID)
val afterLogic: Dataset[CarLogic] = dfGrouped.flatMapGroups {
  case (unitID: Int, messages: Iterator[Car]) => performeLogic(...)
}
We allocate 8 Spark executors.
In our Dataset we have 10 different units, so we have 10 different unitID values, and we expected the processing to be split across all the executors roughly equally. But when we look at executor activity in the UI, we see that only 2 executors are working and all the others are idle for the duration of the job.
What are we doing wrong? Or how can we divide the job over all the executors more or less equally?
What you are seeing can be explained by the low cardinality of your key space. Spark uses a HashPartitioner (by default) to assign keys to partitions (by default 200 partitions). On a low cardinality key space this is rather problematic and requires careful attention as each collision has a massive impact. Even further, these partitions then have to be assigned to executors. At the end of this process it's not surprising to end up with a rather sub-optimal distribution of data.
You have a few options:
If applicable, attempt to increase the cardinality of your keys, e.g. by salting them (temporarily appending some randomness). That has the advantage that you can also better handle skew in the data (when the amount of data per key is not equally distributed). In a subsequent step you can then remove the random part again and combine the partial results.
If you absolutely require a partition per key (and the key space is static and well-known), you should configure spark.sql.shuffle.partitions to match the cardinality n of your key space and assign each key a partition id in [0, n) ahead of time (to avoid collisions when hashing). Then you can use this partition id in your groupBy.
Just for completeness, using the RDD API you could provide your own custom partitioner that does the same as described above: rdd.partitionBy(n, customPartitioner)
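For illustration, here is a rough sketch of such a partitioner against the Java RDD API, assuming the 10 unit IDs from the question are the complete, static key space (the class and variable names are made up):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.Partitioner;

// One partition per known unitID -- no hashing, hence no collisions.
public class UnitIdPartitioner extends Partitioner {
    private static final List<Integer> UNIT_IDS =
            Arrays.asList(10, 76, 510, 269, 7, 0, 508, 509, 511, 507);
    private final Map<Object, Integer> partitionByKey = new HashMap<>();

    public UnitIdPartitioner() {
        for (int i = 0; i < UNIT_IDS.size(); i++) {
            partitionByKey.put(UNIT_IDS.get(i), i);
        }
    }

    @Override
    public int numPartitions() {
        return UNIT_IDS.size();
    }

    @Override
    public int getPartition(Object key) {
        return partitionByKey.get(key);
    }
}

It could then be applied as pairRdd.partitionBy(new UnitIdPartitioner()), where pairRdd is a JavaPairRDD keyed by unitID.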
Though, one final word: Even following one of the latter two options above, using 8 executors for 10 keys (equals 10 non-empty partitions) is a poor choice. If your data is equally distributed, you will still end up with 2 executors doing double the work. If your data is skewed things might even be worse (or you are accidentally lucky) - in any case, it's out of your control.
So it's best to make sure that the number of partitions can be equally distributed among your executors.
We have been using Kafka for various use cases and have solved various problems. But one problem we frequently face is that messages for some particular key are suddenly produced at a much higher rate, or take a few seconds to process, so that messages for the other keys in the queue are processed with a delay.
We have implemented various ways to detect those keys and offload them to a separate queue backed by a pool of topics. But the number of topics in the pool keeps growing, and we find that we are not using the topic resources efficiently.
If we have 100 such keys, then we need to create 100 such topics, and this does not seem to be an optimal solution.
In these kinds of cases, should we store the data in a DB where the particular key's data resides and implement our own queue based on the data in that table, or is there some other mechanism with which we can solve this problem?
This problem only occurs for keys with a high data rate and high processing time (3 to 5 s). Can anyone suggest a better architecture for these kinds of cases?
I have a topology (see below) that reads off a very large topic (over a billion messages per day). The memory usage of this Kafka Streams app is pretty high, and I was looking for some suggestions on how I might reduce the footprint of the state stores (more details below). Note: I am not trying to scapegoat the state stores; I just think there may be a way for me to improve my topology -- see below.
// stream receives 1 billion+ messages per day
stream
.flatMap((key, msg) -> rekeyMessages(msg))
.groupBy((key, value) -> key)
.reduce(new MyReducer(), MY_REDUCED_STORE)
.toStream()
.to(OUTPUT_TOPIC);
// stream the compacted topic as a KTable
KTable<String, String> rekeyedTable = builder.table(OUTPUT_TOPIC, REKEYED_STORE);
// aggregation 1
rekeyedTable.groupBy(...).aggregate(...)
// aggregation 2
rekeyedTable.groupBy(...).aggregate(...)
// etc
More specifically, I'm wondering if streaming the OUTPUT_TOPIC as a KTable is causing the state store (REKEYED_STORE) to be larger than it needs to be locally. For changelog topics with a large number of unique keys, would it be better to stream these as a KStream and do windowed aggregations? Or would that not reduce the footprint like I think it would (e.g. that only a subset of the records - those in the window, would exist in the local state store).
Anyways, I can always spin up more instances of this app, but I'd like to make each instance as efficient as possible. Here are my questions:
Are there any config options, general strategies, etc that should be considered for Kafka Streams app with this level of throughput?
Are there any guidelines for how memory-intensive a single instance should be? Even if you have a somewhat arbitrary guideline, it may be helpful to share with others. One of my instances is currently utilizing 15GB of memory -- I have no idea if that's good/bad/doesn't matter.
Any help would be greatly appreciated!
With your current pattern
stream.....reduce().toStream().to(OUTPUT_TOPIC);
builder.table(OUTPUT_TOPIC, REKEYED_STORE)
you get two stores with the same content. One for the reduce() operator and one for reading the table() -- this can be reduced to one store though:
KTable rekeyedTable = stream.....reduce(...);
rekeyedTable.toStream().to(OUTPUT_TOPIC); // in case you need this output topic; otherwise you can also omit it completely
This should reduce your memory usage notably.
About windowing vs non-windowing:
it's a matter of your required semantics, so simply switching from a non-windowed to a windowed reduce seems questionable.
Even if you can also go with windowed semantics, you would not necessarily reduce memory. Note that in the aggregation case, Streams does not store the raw records but only the current aggregate result (i.e., key + currentAgg). Thus, for a single key, the storage requirement is the same for both cases (a single window has the same storage requirement). At the same time, if you go with windows, you might actually need more memory, as you get an aggregate per key per window (while you get just a single aggregate per key in the non-windowed case).
The only scenario in which you might save memory is the case where your key space is spread out over a long period of time. For example, you might not get any input records for some keys for a long time. In the non-windowed case, the aggregates for those keys are stored the whole time, while in the windowed case the key/agg record will be dropped and a new entry will be re-created if records with this key occur again later on (but keep in mind that you lose the previous aggregate in this case -- cf. the point about semantics above).
Last but not least, you might want to have a look into the guidelines for sizing an application: http://docs.confluent.io/current/streams/sizing.html
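As a final aside on the config part of the question: one setting that commonly influences memory usage is the size of the record caches, cache.max.bytes.buffering (a rough sketch, assuming Kafka Streams 0.10.1 or later; the application id, bootstrap servers and the value are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");    // placeholder
// Upper bound (in bytes) for record caching across all threads of this instance.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 256L * 1024 * 1024);

For finer-grained control over the RocksDB stores themselves, a custom rocksdb.config.setter can also be supplied.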
I'm trying to understand how Spark partitions data. Suppose I have an execution DAG like the one in the picture (orange boxes are the stages). The two groupBy operations and the join are expected to be very heavy if the RDDs are not partitioned.
Is it wise then to use .partitionBy(new HashPartitioner(properValue)) on P1, P2, P3 and P4 to avoid shuffles? What is the cost of partitioning an existing RDD? When is it not appropriate to partition an existing RDD? Doesn't Spark partition my data automatically if I don't specify a partitioner?
Thank you
tl;dr The answers to your questions respectively: Better to partition at the outset if you can; Probably less than not partitioning; Your RDD is partitioned one way or another anyway; Yes.
This is a pretty broad question. It takes up a good portion of our course! But let's try to address as much about partitioning as possible without writing a novel.
As you know, the primary reason to use a tool like Spark is that you have too much data to analyze on one machine without the fan sounding like a jet engine. The data get distributed among all the cores on all the machines in your cluster, so yes, there is a default partitioning--according to the data. Remember that the data are already distributed at rest (in HDFS, HBase, etc.), so Spark just partitions according to the same strategy by default to keep the data on the machines where they already are--with the default number of partitions equal to the number of cores on the cluster. You can override this default number by configuring spark.default.parallelism, and you typically want this number to be 2-3 partitions per CPU core.
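For example, a minimal way to set that (the app name and the value are placeholders for, say, a 16-core cluster with 3 tasks per core):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("my-app")                         // placeholder
        .set("spark.default.parallelism", "48");      // ~3 x 16 cores
JavaSparkContext sc = new JavaSparkContext(conf);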
However, typically you want data that belong together (for example, data with the same key, where HashPartitioner would apply) to be in the same partition, regardless of where they are to start, for the sake of your analytics and to minimize shuffle later. Spark also offers a RangePartitioner, or you can roll your own for your needs fairly easily. But you are right that there is an upfront shuffle cost to go from default partitioning to custom partitioning; it's almost always worth it.
It is generally wise to partition at the outset (rather than delay the inevitable with partitionBy) and then repartition if needed later. Later on you may even choose to coalesce, which avoids a full shuffle, to reduce the number of partitions and potentially leave some machines and cores idle, because the gain in network IO (after that upfront cost) is greater than the loss of CPU power.
(The only situation I can think of where you don't partition at the outset--because you can't--is when your data source is a compressed file.)
Note also that you can preserve partitioning during a map-like transformation with mapPartitions and mapPartitionsWithIndex (by setting their preservesPartitioning flag).
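A small sketch of that pattern with the Java API (Spark 2.x; pairRdd, the types and the partition count are placeholders):

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

// Pay the partitioning shuffle once...
JavaPairRDD<String, Long> partitioned =
        pairRdd.partitionBy(new HashPartitioner(16));
// ...then keep that partitioning through per-partition work by passing
// preservesPartitioning = true.
JavaPairRDD<String, Long> mapped =
        partitioned.mapPartitionsToPair(iter -> iter, true);   // identity work as a stand-in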
Finally, keep in mind that as you experiment with your analytics while you work your way up to scale, there are diagnostic capabilities you can use:
toDebugString to see the lineage of RDDs
getNumPartitions to, shockingly, get the number of partitions
glom to see clearly how your data are partitioned
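For example (rdd stands for whatever RDD you are inspecting):

System.out.println(rdd.toDebugString());     // lineage
System.out.println(rdd.getNumPartitions());  // number of partitions
System.out.println(rdd.glom().collect());    // elements grouped per partition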
And if you pardon the shameless plug, these are the kinds of things we discuss in Analytics with Apache Spark. We hope to have an online version soon.
By applying partitionBy preemptively you don't avoid the shuffle. You just push it somewhere else. This can be a good idea if the partitioned RDD is reused multiple times, but you gain nothing for a one-off join.
Doesn't Spark partition my data automatically if I don't specify a partitioner?
It will partition (a.k.a. shuffle) your data as part of the join and the subsequent groupBy (unless you keep the same key and use a transformation that preserves partitioning).
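To make the "reused multiple times" case concrete, a rough sketch (big, lookupA, lookupB and the partition count are placeholders):

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;

// Pay the shuffle once, then reuse the partitioned RDD for several joins.
JavaPairRDD<String, Long> bigPartitioned =
        big.partitionBy(new HashPartitioner(200))
           .persist(StorageLevel.MEMORY_AND_DISK());
bigPartitioned.join(lookupA);   // bigPartitioned itself is not re-shuffled
bigPartitioned.join(lookupB);   // bigPartitioned itself is not re-shuffled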
Imagine I have an RDD with 100 records and I partition it into 10 partitions, so each partition now has 10 records. I convert the RDD to a key-value pair RDD and save it to a file, and my output data is now divided into 10 partitions, which is fine for me. But is it best practice to use the coalesce function before saving the output data to a file? For example, rdd.coalesce(1) gives just one file as output -- doesn't that shuffle data between nodes? I want to know where coalesce should be used.
Thanks
Avoid coalesce if you don't need it. Only use it to reduce the number of files generated.
As with anything, it depends on your use case; coalesce() can be used to either increase or decrease the number of partitions, but there is a cost associated with it.
If you are attempting to increase the number of partitions (in which case the shuffle parameter must be set to true), you will incur the cost of redistributing data through a HashPartitioner. If you are attempting to decrease the number of partitions, the shuffle parameter can be set to false, but the number of nodes actively pulling from the current set of partitions will equal the number of partitions you are coalescing to. For example, if you are coalescing to 1 partition, only 1 node will be active in pulling data from the parent partitions (this can be dangerous if you are coalescing a large amount of data).
Coalescing can be useful though as sometimes you can make your job run more efficiently by decreasing your partition set size (e.g. after a filter or a sparse inner join).
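Two small illustrations of that (input and the numbers are placeholders):

import org.apache.spark.api.java.JavaRDD;

// After a selective filter, shrink the partition count without a full shuffle.
JavaRDD<String> compacted = input
        .filter(line -> line.contains("ERROR"))
        .coalesce(10);                  // shuffle flag defaults to false

// Increasing the partition count needs a shuffle; repartition(n) is shorthand
// for coalesce(n, true).
JavaRDD<String> widened = input.repartition(200);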
You can simply use it like this:
rdd.coalesce(numberOfPartitions)
It doesn't shuffle data if you decrease the number of partitions, but it does shuffle data if you increase it. Which to use depends on your use case. Be careful with it, though: if you decrease the number of partitions below the number of cores in your cluster, you can't use the full resources of the cluster. On the other hand, decreasing the RDD's partition count to roughly the number of cores can mean less shuffled data and less network IO, which can improve the performance of your job.