SnappyData table definitions using partition keys - streaming

Reading through the documentation (http://snappydatainc.github.io/snappydata/streamingWithSQL/), I had a question about this item:
"Reduced shuffling through co-partitioning: With SnappyData, the partitioning key used by the input queue (e.g., for Kafka sources), the stream processor and the underlying store can all be the same. This dramatically reduces the need to shuffle records."
If we are using Kafka and partition our data in a topic using a single-value key, is it possible to map this single key from Kafka to the multiple partitioning columns defined in the SnappyData table?
Is there a hash of some sort to turn multiple keys into a single key?
The benefit of reduced shuffling seems significant, and we are trying to understand the best practice here.
thanks!

With a DirectKafka stream, each partition pulls data from its own designated topic partition. If no partitioning is specified for the storage table, then each DirectKafka partition will write only to local storage buckets, and everything will line up well without requiring anything extra. The only thing to take care of is having enough topic partitions for good concurrency -- ideally at least as many as the total number of processor cores in the cluster, so all cores stay busy.
When partitioning storage tables explicitly, SnappyData's store has been adjusted to use the same hashing as Spark's HashPartitioning (for the "PARTITION_BY" option of both column and row tables), since that is the one used at the Catalyst SQL execution layer. So execution and storage are always collocated.
However, aligning that with ingestion from DirectKafka partitions requires some manual work (align the Kafka topic partitioning with HashPartitioning, then have the preferred locations for each DirectKafka partition match the storage). This will be simplified in coming releases.
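As a rough sketch of that manual alignment, the producer side could use a custom Kafka partitioner that applies the same hash the store uses. The hash below is a placeholder (a non-negative key.hashCode() mod), not SnappyData's actual function -- swap in the store's real hashing, and size the topic with exactly as many partitions as the table has buckets:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Sketch: route each record to the Kafka partition whose index matches the
// storage bucket its key hashes to, so ingestion lines up with storage.
public class StoreAlignedPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Placeholder hash: replace with the exact hash the store applies
        // (per the answer above, Spark's HashPartitioning) for true alignment.
        int hash = (key == null) ? 0 : key.hashCode();
        return (hash % numPartitions + numPartitions) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

Producers would register it via the partitioner.class config; the preferred-location step on the DirectKafka side still has to be handled separately, as noted above.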

Related

Guaranteed ordering of messages across a Kafka cluster

I have read dozens of articles about Kafka message ordering and still don't see an out-of-the-box solution to my very common need - publishing messages with a sequentially-incrementing ID and consuming them in that same order.
Kafka preserves message order within a partition. But what enterprise-grade solution would ever use a single partition for critical data (single point of data loss failure, reduced throughput without parallelism, etc.)? So the challenge is how to consume messages in order across a multi-partitioned topic.
Doing blockchain analytics, we harvest sequentially-incrementing blocks of data from blockchain nodes and then publish them to our Kafka topic. Key = block number, Value = block data. Block numbers start at 0 and increment by 1 for eternity.
Our analytics code needs to consume those messages IN ORDER (block 1, block 2, block 3, etc.). If a Smart contract get created on a blockchain in block 2 and then a transaction on it occurs in block 3, our analytics code would fail if we processed block 3 before block 2 ("no contract found error", for example).
Some more info about our use case.
The topic with block data will never be purged. It will grow to several TB and will hold millions of messages. Though most consumers won't use this directly, it still serves as our off-chain copy of a blockchain and may fulfill future needs within our software.
We have a SQL database table which stores the stateful information about how much of a blockchain we've analyzed (example, highest block # is 25,555,555).
For guaranteed ordering, most articles recommend Kafka Streams and KTables. If we use in-memory KTables, we face major challenges (we can't store TBs of data in memory, rebuilding the KTable at startup would take days, etc.).
If we use persisted KTables, then we're bloating our disk usage (several TB of data duplicated across the source topic and the KTable).
We could create a secondary "operational" single-partition topic [with a relatively short data retention time], stream the data to it in order, and have our consumers pull data from that topic. But this is exactly the opposite of out-of-the-box, and we'd like to avoid doing it for the hundreds of blockchains and messaging needs we have. It would become an administrative debacle.
This seems like a technical need that thousands of companies have had since the creation of Kafka (like what messaging queues have done for decades). Is there no out-of-the-box solution for a KafkaListener to receive messages in order based on a numeric Key [in a multi-partition topic]?
publishing messages with a sequentially-incrementing ID and consuming them in that same order
A single partition is the only way to accomplish this when using Kafka.
One alternative design, from a blockchain perspective, would be to key by wallet address, for example; then you have ordered events per wallet. But if you have transactions between wallets, there is no guarantee the "other wallet" from that withdrawal/deposit event-value will exist, so you will need some other state store (e.g., a KTable) of all known wallet addresses before fully processing such events.
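On the producer side, that alternative is just a keyed send; Kafka's default partitioner then keeps each wallet's events in a single partition and hence in order per wallet. A minimal sketch (topic name and payload are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WalletEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String walletAddress = "0xabc123";             // key: wallet address
            String eventJson = "{\"type\":\"deposit\"}";   // illustrative payload
            // Same key -> same partition -> per-wallet ordering (not global).
            producer.send(new ProducerRecord<>("wallet-events", walletAddress, eventJson));
        }
    }
}
```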
The topic with block data will never be purged. This will grow to several TB
Partition segments are not distributed. If you had one partition, that means you're limited to the size of one HDD.
Similarly, RocksDB or in-memory state stores will have the same problem. But the interfaces for those are pluggable and can be replaced, with some tradeoffs for processing-order guarantees.

Can I use Kafka for multiple independent consumers sequential reads?

I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting with "earliest")?
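For concreteness, the consumer I have in mind would look roughly like this: each student runs under a unique group.id with auto.offset.reset=earliest, so every one of them independently reads the whole topic (broker and topic names are illustrative):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StudentReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "student-" + args[0]); // unique group per student
        props.put("auto.offset.reset", "earliest");  // start from the beginning
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("dataset-topic"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> {
                    // each student's own processing goes here
                });
            }
        }
    }
}
```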
An alternative is to set up HDFS storage (we use Azure, so it's called a Storage Account) and simply supply a mount point. However, I do not have control over the throughput in this case.
Throughput calculation:
Let's say 25 consumers run concurrently, each reading at 30MB/s -> 750MB/s.
Assuming data is read from disk, and disk rate is 50MB/s, I need to read concurrently from 750/50 = 15 disks.
Does it mean I need to have 15 brokers? I did not see how one broker can allocate partitions to several disks attached to it.
similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
Processing will probably be in Spark, but not mandatory
An alternative is to set an HDFS storage (we use Azure)
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
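For example, a plain Spark batch read might look like the sketch below (assuming Parquet files; the wasbs:// path, account, and container are placeholders, and the hadoop-azure connector plus credentials configuration are omitted):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BlobRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dataset-read")
                .getOrCreate();

        // Placeholder path: requires hadoop-azure on the classpath and the
        // storage-account key/SAS configured in the Spark/Hadoop conf.
        Dataset<Row> df = spark.read()
                .parquet("wasbs://container@account.blob.core.windows.net/dataset/");

        df.show(10);
        spark.stop();
    }
}
```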
If you want to use Kafka, don't base the consumption rate on disk speed alone, especially since Kafka can do zero-copy transfers. Use the kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than the timestamp that you can order by, use that.
It's not really clear whether all 50 students do the same processing on the data set, or whether some pre-computation can be shared. If it can, and the data is all streamed through a topic, Kafka Streams KTables can be set up to aggregate some static statistics of the data; that way you can distribute the load for those queries and not need 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, Timescale, or InfluxDB, or maybe Druid, which could also be used with Spark or queried directly.
If you are using Apache Spark 3.0+, there are ways around the one-consumer-per-partition bound, as Spark can use more executor threads than there are partitions, so it's mostly about how fast your network and disks are.
Kafka keeps recently written data in memory (the OS page cache), so for your use case most reads will probably be served from memory.
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html
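A sketch of that option for this use case (names are illustrative): with 50 topic partitions, a minPartitions of 200 asks Spark to split each Kafka partition into roughly four offset ranges, so more executor threads can read in parallel than there are partitions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaMinPartitions {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-fanout-read")
                .getOrCreate();

        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092") // illustrative
                .option("subscribe", "dataset-topic")              // illustrative
                .option("startingOffsets", "earliest")
                .option("minPartitions", "200") // hint: ~4 Spark tasks per Kafka partition
                .load();

        stream.writeStream().format("console").start().awaitTermination();
    }
}
```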

Apache Flink - how to align Flink and Kafka sharding

I am developing a DataStream-based Flink application for a high-volume streaming use case (tens of millions of events per second). The data is consumed from a Kafka topic and is already sharded according to a certain key. My intention is to create key-specific states on the Flink side to run custom analytics. The main problem I can't wrap my head around is how to create the keyed states without the reshuffling of the incoming data that keyBy() imposes.
I can guarantee that the maximum parallelism of the Flink job will be less than or equal to the number of partitions in the source Kafka topic, so logically the shuffling is not necessary. The answer to this StackOverflow question suggests that it may be possible to write the data into Kafka in a way that is compatible with the expectations of Flink and then use reinterpretAsKeyedStream(). I would be happy to do it for this application. Would someone be able to share the necessary steps?
Thank you in advance.
What you need to do is to ensure that each event is written to the Kafka partition that will be read by the same task slot to which the key for that event will be assigned.
Here's what you need to know to make that work:
(1) Kafka partitions are assigned in round-robin fashion to task slots: partition 0 goes to slot 0, partition 1 to slot 1, etc., wrapping back around to slot 0 if there are more partitions than slots.
(2) Keys are mapped to key groups, and key groups are assigned to slots. The number of key groups is determined by the maximum parallelism (which is a configuration parameter; the default is 128).
The key group for a key is computed via
keygroupId = MathUtils.murmurHash(key.hashCode()) % maxParallelism
and then the slot is assigned according to
slotIndex = keygroupId * actualParallelism / maxParallelism
(3) Then you'll need to use DataStreamUtils.reinterpretAsKeyedStream to get Flink to treat the pre-partitioned streams as keyed streams.
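Putting (1) through (3) together, the producer can precompute the slot for every key before writing to Kafka. A minimal sketch using Flink's own KeyGroupRangeAssignment utilities (from flink-runtime), assuming partition i is indeed read by slot i as described in (1):

```java
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public final class FlinkAlignedRouting {

    // Returns the Kafka partition to write this key's events to, so that the
    // slot owning the key's key group is the one reading that partition.
    public static int targetPartition(Object key, int maxParallelism, int parallelism) {
        // keygroupId = murmurHash(key.hashCode()) % maxParallelism
        int keyGroupId = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        // slotIndex = keygroupId * parallelism / maxParallelism
        return KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                maxParallelism, parallelism, keyGroupId);
    }
}
```

A custom Kafka producer partitioner can call targetPartition(), and the consuming job then applies DataStreamUtils.reinterpretAsKeyedStream as in (3).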
One effect of adopting this approach is that it will be painful if you ever need to change the parallelism.

Combining data coming from multiple kafka to single kafka

I have N Kafka topics, each with data and a timestamp. I need to combine them into a single topic in sorted timestamp order, where the data is sorted inside the partition. I have one way to do that:
Combine all the Kafka topic data in Cassandra (because of its fast writes) with clustering order DESCENDING. It will combine them all, but the limitation is that if data arrives late, after the timed window of accumulation, it won't be sorted.
Is there any other appropriate way to do this? If not, is there any room for improvement in my solution?
Thanks
Not clear why you need Kafka to sort on timestamps. Typically this is done only at consumption time for each batch of messages.
For example, create a Kafka Streams process that reads from all topics, creates a GlobalKTable, and enables Interactive Queries.
When you query, then you sort the data on the client side, regardless of how it is ordered in the topic.
This way, you are not limited to a single, ordered partition.
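A minimal sketch of that setup, assuming String keys/values and illustrative topic/store names (the store can only be queried once the instance reaches RUNNING):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class MergedView {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // One globalTable per source topic, each materialized for querying.
        builder.globalTable("events-a",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("events-store"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "merged-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query: scan the store and sort by timestamp client-side.
        ReadOnlyKeyValueStore<String, String> store =
                streams.store(StoreQueryParameters.fromNameAndType(
                        "events-store", QueryableStoreTypes.keyValueStore()));
    }
}
```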
Alternatively, I would write to something other than Cassandra (due to my lack of deep knowledge of it). For example, Couchbase or CockroachDB.
Then, when you query those later, run an ORDER BY on the timestamp.

Kafka Streams processors - state store and input topic partitioning

I would like to fully understand the rules that kafka-streams processors must obey with respect to partitioning of a processor's input and its state(s). Specifically I would like to understand:
Whether or not it is possible and what are the potential consequences of using a key for the state store(s) that is not the same as the key of the input topic
Whether or not state store keys are shared across partitions, i.e. whether or not I will get the same value if I try to access the same key in a processor while it is processing records belonging to two different partitions
I have been doing some research on this and the answers I found seem not to be very clear and sometimes contradictory: e.g. this one seems to suggest that the stores are totally independent and you can use any key while this one says that you should never use a store with a different key than the one in the input topic.
Thanks for any clarification.
You have to distinguish between input partitions and store shards/changelog topic partitions for a complete picture. Also, it depends on whether you use the DSL or the Processor API, because the DSL does some auto-repartitioning but the Processor API doesn't. Because the DSL compiles down to the Processor API, I'll start with that.
If you have a topic with, let's say, 4 partitions and you create a stateful processor that consumes it, you will get 4 tasks, each running a processor instance that maintains one shard of the store. Note that the overall state is split into 4 shards and each shard is basically isolated from the others.
From a Processor API runtime point of view, the input topic partitions and the state store shards (including their corresponding changelog topic partitions) are a unit of parallelism. Hence, the changelog topic for the store is created with 4 partitions, and changelog-topic-partition-X is mapped to input-topic-partition-X. Note that Kafka Streams does not use hash-based partitioning when writing into a changelog topic; it provides the partition number explicitly, to ensure that "processor instance X", which processes input-topic-partition-X, only reads/writes from/into changelog-topic-partition-X.
Thus, the runtime is agnostic to keys if you wish.
If your input topic is not partitioned by key, messages with the same key will be processed by different tasks. Depending on the program, this might be ok (e.g., filtering) or not (e.g., a count per key).
Similarly for state: you can put any key into a state store, but that key is "local" to the corresponding shard. Other tasks will never see it. Thus, if you use the same key in a store on different tasks, the entries will be completely independent of each other (as if they were two different keys).
Using the Processor API, it's your responsibility to partition the input data correctly and to use stores correctly, depending on the operator semantics you need.
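For illustration, here is a Processor API processor (in the newer org.apache.kafka.streams.processor.api style) writing an arbitrary, shard-local key into a store; the store name is illustrative and must be attached to this processor in the topology:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class LocalKeyProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;
    private KeyValueStore<String, Long> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("my-store"); // must be added to the topology
    }

    @Override
    public void process(Record<String, String> record) {
        // Any key may be used, but it is visible only to this task's shard;
        // the same key on another task is an entirely separate entry.
        String localKey = "seen-" + record.value();
        Long count = store.get(localKey);
        store.put(localKey, count == null ? 1L : count + 1L);
        context.forward(record);
    }
}
```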
At the DSL level, Kafka Streams will make sure that data is partitioned correctly to ensure correct operator semantics. First, it's assumed that input topics are partitioned by key. If the key is modified, for example via selectKey(), and a downstream operator is an aggregation, Kafka Streams repartitions the data first to ensure that records with the same key end up in the same topic partition. This guarantees that each key is used in a single store shard; the DSL always partitions the data such that one key is never processed on different shards.
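A short DSL sketch of that behavior (topic names are illustrative, default String serdes assumed): selectKey() invalidates the input partitioning, so the aggregation below makes the DSL insert an internal repartition topic.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class RepartitionExample {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> input = builder.stream("input-topic");
        KTable<String, Long> counts = input
                .selectKey((key, value) -> value) // new key, no longer aligned
                .groupByKey()                     // DSL repartitions here
                .count();                         // each key now in exactly one shard
        counts.toStream().to("counts-by-value");  // needs a Long value serde in practice
    }
}
```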