I need to get the count of rows in the store, with store being maintained in the low level processor API's. I see that the method "approximateNumEntries()" can provide an approximate count of key-value mappings in this store. Can you please clarify on % of accuracy, meaning if there are 100 rows in the store will we get 95 as the approximate count OR could it get even lower than 50 at times? Just trying to understand the factors that can influence the count accuracy.
Note: Assume that the stream application consumes a single topic and runs on a single instance. Stores are being accessed through low level processor API's, not sure if there are any caching applied by default. The commit frequency remains default.
It depends on the store. If you are using default RocksDB store, the method internally returns "rocksdb.estimate-num-keys" from RocksDB (cf. https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ) -- not sure what the error bounds are.
For in-memory stores, the count is actually exact in the current implementation (current release 1.1).
Related
How to define window of fixed size (fixed number of items) in Apache Beam?
I know that we have
FixedWindows.of(Duration.standardMinutes(10))
but I do not care about time, only about the number of items.
More details:
I am writing a significant amount of data (53 gigabytes) to S3. Currently my process uses
FileIO.<KV<...>writeDynamic()
.by(kv -> kv.getKey())
(grouping by key). This causes a severe performance bottleneck because of a skewed key distribution. My total data size is 53 GB, but the data for one key alone is 37 GB. This single key takes an hour to write (writing occurs on a single executor, in a single thread, while the rest of the cluster sits idle).
I do not need any special grouping. Ideally I want uniform distribution of data, so writing will happen concurrently and finish as soon as possible.
Guaranteeing exactly equal sized grouping is fairly hard, but you can get pretty close by using hashes of your data modulo some constant as the keys. For example:
FileIO.<KV<...>writeDynamic()
.by(kv -> Math.floorMod(kv.hashCode(), 530))  // floorMod keeps the key non-negative even when hashCode() is negative
This will give roughly equal 100MB partitions.
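As a rough illustration of the hash-modulo idea, the following stand-alone sketch assigns synthetic records to 530 shards and reports how even the split is. The class name, the method name, and the "record-N" strings are made up for the example; it does not use Beam at all. It uses Math.floorMod because hashCode() can be negative and the plain % operator would then yield a negative shard.

```java
import java.util.stream.IntStream;

// Sketch: map arbitrary hash codes onto a fixed number of shards.
final class ShardKeys {
    // Math.floorMod never returns a negative value for a positive modulus,
    // unlike the % operator, which keeps the sign of the dividend.
    static int shardFor(Object element, int numShards) {
        return Math.floorMod(element.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 530; // 53 GB over 530 shards is about 100 MB per shard
        int[] counts = new int[numShards];
        // Hypothetical records standing in for the pipeline's elements.
        IntStream.range(0, 1_000_000)
                 .forEach(i -> counts[shardFor("record-" + i, numShards)]++);
        int min = IntStream.of(counts).min().getAsInt();
        int max = IntStream.of(counts).max().getAsInt();
        System.out.println("smallest shard: " + min + ", largest shard: " + max);
    }
}
```

How even the shards are in practice depends on the quality of the elements' hashCode() and on the record sizes being similar.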
Additionally, if you are using the DataflowRunner, you don't need to specify keys at all; the system will automatically group up the data, and dynamically rebalance the load to avoid stragglers. For this, use FileIO.write() instead of FileIO.writeDynamic().
I am doing an aggregation on a Kafka topic stream and saving to an in memory state store. I would like to know the exact size of the accumulated in memory data, is this possible to find?
I looked through the jmx metrics on jconsole and Confluent Control Centre but nothing seemed relevant, is there anything I can use to find this out please?
You can get the number of stored key-value-pairs of an in-memory store, via KeyValueStore#approximateNumEntries() (for the default in-memory-store implementation, this number is actually accurate). If you can estimate the byte size per key-value pair, you can do the math.
However, estimating the byte size of an object is pretty hard to do in general in Java. The problem is, that Java does not provide any way to receive the actual size of an object. Also, objects can be nested making it even harder. Finally, besides the actual data, there is always some metadata overhead per object, and this overhead is JVM implementation dependent.
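A minimal sketch of "doing the math", assuming String keys and values whose serialized form is UTF-8. The class, the method names, and the sample numbers are hypothetical; in a real application the entry count would come from approximateNumEntries(). It counts serialized payload only and ignores the per-object JVM overhead mentioned above, so it is a lower bound rather than an exact figure.

```java
import java.nio.charset.StandardCharsets;

// Sketch: back-of-the-envelope payload size of a key-value store.
final class StoreSizeEstimate {
    // Serialized size of a String when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Entry count times the average serialized bytes per key-value pair.
    static long estimateBytes(long numEntries, long avgKeyBytes, long avgValueBytes) {
        return numEntries * (avgKeyBytes + avgValueBytes);
    }

    public static void main(String[] args) {
        long numEntries = 1_000_000L;            // e.g. from approximateNumEntries()
        long keyBytes = utf8Bytes("user-42");    // hypothetical typical key
        long valueBytes = 200;                   // assumed average value size
        System.out.println(estimateBytes(numEntries, keyBytes, valueBytes) + " bytes of payload");
    }
}
```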
Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that allows us to implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queryable views of our data (by key). That's OK for getting an entity by id, but what about complex queries (joins)? Do we need to generate a state store per query? For instance, one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point: we can't do dynamic queries (like the JPA Criteria API) anymore. This leads to CQRS maybe? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, with enough memory.
Loading Current State: The mechanism described in the blog, re-creating the current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this -- of course, it comes with certain memory costs.
Consistent writes: Kafka Streams parallelizes processing across different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread, and I don't see why there should be inconsistent writes.
Authentication & Security: I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security though: (a) encrypt the local disk: this might be the simplest thing to do to protect data; (b) encrypt messages within the business logic, before you put them into the store.
Queries: Interactive Queries offers limited support for many reasons (I don't want to go into details), and it was never designed with the goal of supporting complex queries. The idea is eager computation of results that can be retrieved with simple lookups. As you pointed out, this is not very scalable (it is cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this at the moment -- however, there is no reason not to combine both.
Data growth: By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more active instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
I have a topology (see below) that reads off a very large topic (over a billion messages per day). The memory usage of this Kafka Streams app is pretty high, and I was looking for some suggestions on how I might reduce the footprint of the state stores (more details below). Note: I am not trying to scapegoat the state stores, I just think there may be a way for me to improve my topology - see below.
// stream receives 1 billion+ messages per day
stream
.flatMap((key, msg) -> rekeyMessages(msg))
.groupBy((key, value) -> key)
.reduce(new MyReducer(), MY_REDUCED_STORE)
.toStream()
.to(OUTPUT_TOPIC);
// stream the compacted topic as a KTable
KTable<String, String> rekeyedTable = builder.table(OUTPUT_TOPIC, REKEYED_STORE);
// aggregation 1
rekeyedTable.groupBy(...).aggregate(...)
// aggregation 2
rekeyedTable.groupBy(...).aggregate(...)
// etc
More specifically, I'm wondering if streaming the OUTPUT_TOPIC as a KTable is causing the state store (REKEYED_STORE) to be larger than it needs to be locally. For changelog topics with a large number of unique keys, would it be better to stream these as a KStream and do windowed aggregations? Or would that not reduce the footprint like I think it would (e.g. that only a subset of the records - those in the window, would exist in the local state store).
Anyways, I can always spin up more instances of this app, but I'd like to make each instance as efficient as possible. Here's my question:
Are there any config options, general strategies, etc that should be considered for Kafka Streams app with this level of throughput?
Are there any guidelines for how memory-intensive a single instance should be? Even if you have a somewhat arbitrary guideline, it may be helpful to share with others. One of my instances is currently utilizing 15GB of memory - I have no idea if that's good/bad/doesn't matter.
Any help would be greatly appreciated!
With your current pattern
stream.....reduce().toStream().to(OUTPUT_TOPIC);
builder.table(OUTPUT_TOPIC, REKEYED_STORE)
you get two stores with the same content. One for the reduce() operator and one for reading the table() -- this can be reduced to one store though:
KTable rekeyedTable = stream.....reduce(.);
rekeyedTable.toStream().to(OUTPUT_TOPIC); // in case you need this output topic; otherwise you can also omit it completely
This should reduce your memory usage notably.
About windowing vs non-windowing:
(1) It's a matter of your required semantics, so simply switching from a non-windowed to a windowed reduce seems questionable.
(2) Even if you can go with windowed semantics, you would not necessarily reduce memory. Note that in the aggregation case, Streams does not store the raw records but only the current aggregate result (i.e., key + currentAgg). Thus, for a single key, the storage requirement is the same in both cases (a single window has the same storage requirement). At the same time, if you go with windows, you might actually need more memory, as you get one aggregate per key per window (while you get just a single aggregate per key in the non-windowed case). The only scenario in which you might save memory is when your key space is spread out over a long period of time. For example, you might not get any input records for some keys for a long time. In the non-windowed case, the aggregates of those records will be stored all the time, while in the windowed case the key/agg record will be dropped, and new entries will be re-created if records with this key occur again later on (but keep in mind that you lose the previous aggregate in this case -- cf. (1)).
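To make the entry-count argument concrete, here is a small stand-alone sketch (plain Java collections, not the Kafka Streams API; class and method names are made up). It counts how many aggregate entries would be kept for the same input in the non-windowed case (one per key) and with tumbling windows (one per key per window). Retention-based dropping of old windows is deliberately not modelled.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: number of stored aggregates, non-windowed vs. tumbling windows.
final class AggregateFootprint {
    // Non-windowed: exactly one aggregate entry per distinct key.
    static int nonWindowedEntries(String[] keys) {
        Set<String> stored = new HashSet<>();
        for (String key : keys) {
            stored.add(key);
        }
        return stored.size();
    }

    // Windowed: one aggregate entry per distinct (key, window start) pair.
    static int windowedEntries(String[] keys, long[] timestampsMs, long windowSizeMs) {
        Set<String> stored = new HashSet<>();
        for (int i = 0; i < keys.length; i++) {
            long windowStart = timestampsMs[i] - (timestampsMs[i] % windowSizeMs);
            stored.add(keys[i] + "@" + windowStart);
        }
        return stored.size();
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "b", "a"};
        long[] timestampsMs = {0L, 30_000L, 40_000L, 70_000L};
        System.out.println("non-windowed entries: " + nonWindowedEntries(keys));
        System.out.println("windowed entries (60 s windows): "
                + windowedEntries(keys, timestampsMs, 60_000L));
    }
}
```

In this input, key "a" appears in two different windows, so the windowed variant holds one more entry than the non-windowed one until older windows are expired.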
Last but not least, you might want to have a look into the guidelines for sizing an application: http://docs.confluent.io/current/streams/sizing.html
There is a very small but very powerful detail in the Kafka org.apache.kafka.clients.producer.internals.DefaultPartitioner implementation that bugs me a lot.
It is this line of code:
return DefaultPartitioner.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
to be more precise, the last % numPartitions. I keep asking myself what is the reason behind introducing such a huge constraint by making the partition ID a function of the number of existent partitions? Just for the convenience of having small numbers (human readable/traceable?!) in comparison to the total number of partitions? Does anyone here have a broader insight into the issue?
I'm asking this because in our implementation, the key we use to store data in kafka is domain-sensitive and we use it to retrieve information from kafka based on that. For instance, we have consumers that need to subscribe ONLY to partitions that present interest to them and the way we do that link is by using such keys.
Would it be safe to use a custom partitioner that doesn't do that modulo operation? Should we expect any performance degradation? Does this have any implications on the Producer and/or Consumer side?
Any ideas and comments are welcome.
Partitions in a Kafka topic are numbered from 0 to N-1, where N is the number of partitions. Thus, if a key is hashed to determine a partition, the resulting value must fall in the interval [0, N-1]; that is, it must be a valid partition number.
Using modulo operation is a standard technique in hashing.
Normally you do modulo on hash to make sure that the entry will fit in the hash range.
Say you have hash range of 5.
-------------------
| 0 | 1 | 2 | 3 | 4 |
-------------------
If the hash code of an entry happens to be 6, you take it modulo the number of available
buckets (6 % 5 = 1) so that it fits in the range, which means bucket 1 in this case.
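The bucket computation can be sketched in a few lines. The class and method names are illustrative; clearing the sign bit with `& 0x7fffffff` mirrors the idea behind the toPositive call in the line quoted from DefaultPartitioner, so that negative hash codes also land in a valid bucket.

```java
// Sketch: fit an arbitrary int hash into the bucket range [0, numBuckets - 1].
final class BucketIndex {
    static int bucketFor(int hash, int numBuckets) {
        // Clear the sign bit so the value is non-negative, then take the remainder.
        return (hash & 0x7fffffff) % numBuckets;
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(6, 5)); // hash 6 with 5 buckets lands in bucket 1
    }
}
```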
An even more important point is what happens when you add or remove a bucket from the range.
Say you decrease the size of the hash map to 4 buckets: the last bucket becomes inactive, and
you have to rehash the values in bucket #4 to the next bucket in the clockwise direction (I'm talking
about consistent hashing here).
Also, incoming hashes need to be distributed among the 4 active buckets, because the 5th one goes away; this is taken care of by the modulo.
The same concept is used in distributed systems for rehashing which happens when you add or remove node to your cluster.
The Kafka DefaultPartitioner uses modulo for the same purpose: the result is a valid partition number whatever the current partition count is. Changing the partition count is a very usual case if you ask me; for example, during a high volume of incoming messages I might want to add more partitions to achieve higher write throughput and also higher read throughput, since partitions can be consumed in parallel.
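One consequence worth seeing in numbers: with hash-modulo partitioning, changing the partition count changes where many existing keys map. The sketch below is only an illustration; it uses String.hashCode() as a stand-in for the murmur2 hash of the key bytes, and the class name, method names, and "key-N" strings are made up.

```java
// Sketch: how many keys map to a different partition after the count changes.
final class RepartitionEffect {
    // Stand-in for the default partitioning formula: positive hash modulo partition count.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Counts the synthetic keys whose partition differs between the two counts.
    static int countMoved(int numKeys, int oldPartitions, int newPartitions) {
        int moved = 0;
        for (int i = 0; i < numKeys; i++) {
            String key = "key-" + i;
            if (partitionFor(key, oldPartitions) != partitionFor(key, newPartitions)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int moved = countMoved(10_000, 5, 6);
        System.out.println(moved + " of 10000 keys map to a different partition");
    }
}
```

If consumers rely on a key always living in the same partition, as described in the question, this remapping is the thing to plan for before adding partitions.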
You can override the partitioning algorithm based on your business logic by choosing some key in your message that makes sure the messages are distributed uniformly across the valid partition range.
The performance impact of using a custom partitioner entirely depends on your implementation of it.
I'm not entirely sure what you're trying to accomplish though. If I understand your question correctly, you want to use the value of the message key as the partition number directly, without doing any modulo operation on it to determine a partition?
In that case all you need to do is use the overloaded constructor for the ProducerRecord(java.lang.String topic, java.lang.Integer partition, K key, V value) when producing a message to a kafka topic, passing in the desired partition number.
This way all the default partitioning logic will be bypassed entirely and the message will go to specified partition.