How to modify the configuration of Kafka to process large amounts of data

I am using kafka_2.10-0.10.0.1. I have two questions:
- I want to know how I can modify the default configuration of Kafka to process large amounts of data with good performance.
- Is it possible to configure Kafka to process the records in memory without storing in disk?
Thank you.

Is it possible to configure Kafka to process the records in memory without storing in disk?
No. Kafka is all about storing records reliably on disk, and then reading them back quickly off of disk. In fact, its documentation says:
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
You can read more about its design here: https://kafka.apache.org/documentation/#design. The implementation section is also quite interesting: https://kafka.apache.org/documentation/#implementation.
That said, Kafka is also all about processing large amounts of data with good performance. In 2014 it could handle 2 million writes per second on three cheap instances: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines. More links about performance:
https://docs.confluent.io/current/kafka/deployment.html
https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
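For a concrete starting point, below is a minimal sketch of producer-side settings that are commonly tuned for throughput. The broker address, topic name and values are purely illustrative, not recommendations, and broker-side and consumer-side settings matter as well; check the property names against your Kafka version.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batch more records per request instead of sending each one immediately.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // bytes per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);         // wait up to 20 ms to fill a batch
        // Compress batches to trade a little CPU for less network and disk I/O.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // acks=1 acknowledges once the leader has written; use "all" when durability
        // matters more than latency.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value")); // illustrative topic
        }
    }
}
```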

Related

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, quite a lot of data is duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested in the real-world pros and cons to expect when using, e.g., the pull queries feature to use Kafka as a distributed database [1].
However, I'm not sure what to expect in terms of performance and scalability, especially compared to Cassandra, and what the unknown pitfalls might be.
What are the tradeoffs when using Kafka as a distributed DB, and how would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
For pure KV lookups, Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to keep the state of those stores on persistent volumes. Otherwise, when a container moves to a fresh host, the Streams changelog topic has to be read from the very beginning, giving you a "cold-start" problem during which you cannot query.
Any database with persistent storage will not have this problem and can always be queried immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.
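For context, a pure KV lookup against a local Kafka Streams state store via Interactive Queries looks roughly like the sketch below. The store name is illustrative, the exact API differs between Kafka Streams versions, and routing a lookup to whichever application instance actually hosts the key is something you have to implement yourself.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class KvLookupSketch {

    // "streams" must be a running KafkaStreams instance whose topology materialized
    // a key-value store under the (illustrative) name "kv-store".
    static String lookup(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, String> store =
            streams.store(StoreQueryParameters.fromNameAndType(
                "kv-store", QueryableStoreTypes.keyValueStore()));

        // Served from the local state store; returns null if the key is not hosted here.
        return store.get(key);
    }
}
```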

Redis Streams for implementing a Messaging System (chat) app versus traditional approaches

I'm implementing a chat app, which will support both one-on-one conversation and Group conversations.
So far the direction has been to use Redis Pub/Sub with PostgreSQL as the cold storage and WebSocket as the transport.
Every user will fetch the history from PostgreSQL upon launch (up until the timestamp of the WebSocket+Redis connection) and then subscribe to channels named after their own user_id.
However, a round trip to a DBMS for each new message sounds a bit strange, even though it is definitely doable and legitimate.
So I decided to examine other approaches. One possible approach was to use Kafka and eliminate the need for a DBMS altogether.
It sounds viable and comes with its own set of advantages.
But it turns out there's a new kid on the block: Redis Streams.
From what I gather, it is actually quite similar to Kafka in this specific scenario (chat).
It has many nice features that sound very convenient for implementing a chat system.
And now I am trying to understand whether Redis Streams + disk persistence is the wise way to go, versus Kafka, versus PostgreSQL + Redis Pub/Sub (a rough sketch of the Streams approach follows the list of aspects below).
The main aspects in consideration are:
Performance. Postgres and Kafka both operate on disk, which is slower than Redis's in-memory operations. On the other hand, the messages obviously must be persisted and available at all times, so Redis would have to be persisted to disk as well. Wouldn't that negate the whole in-memory performance gain?
And even if not, would the performance gain be noticeable under peak load and with a large database?
Memory / Costs. With Redis these two are closely tied together. As a small startup, our efforts are focused on being ready to cope with sudden scale peaks (up to a million users), but at the same time the costs should be minimized.
Is storing millions of messages in Streams going to be so memory-costly that it becomes financially costly?
Recovery, Reliability & Availability, Persistence. With Postgres, even a single instance can handle a large traffic load, and it also offers master-slave setups and consistency. Can Redis match that? Also, with a DBMS I can be assured that the data is there to stay. Can I know that with Redis?
Scaling.
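For what it's worth, here is a rough sketch of what the Redis Streams approach could look like for a single conversation, assuming the Jedis client (3.x-style API); the host, key naming and field names are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.StreamEntry;
import redis.clients.jedis.StreamEntryID;

public class ChatStreamSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {   // illustrative host/port
            String conversation = "chat:room:42";            // hypothetical key-naming scheme

            // Append a message; Redis assigns a monotonically increasing entry ID,
            // so the stream doubles as an ordered, replayable message history.
            Map<String, String> message = new HashMap<>();
            message.put("from", "user_17");
            message.put("text", "hello");
            jedis.xadd(conversation, StreamEntryID.NEW_ENTRY, message);

            // Fetch recent history on (re)connect, much like reading a Kafka topic from
            // a stored offset. A null start/end stands for the open range ("-" to "+").
            StreamEntryID from = null;
            StreamEntryID to = null;
            List<StreamEntry> history = jedis.xrange(conversation, from, to, 100);
            for (StreamEntry entry : history) {
                System.out.println(entry.getID() + " " + entry.getFields());
            }
        }
    }
}
```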

How does write ahead logging improve IO performance in Postgres?

I've been reading through the WAL chapter of the Postgres manual and was confused by a portion of the chapter:
Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.
How is it that continuously writing to the WAL is more performant than simply writing to the table/index data itself?
As I see it (forgetting for now the resiliency benefits of WAL), Postgres needs to complete two disk operations: first it needs to commit to the WAL on disk, and then it still needs to change the table data to be consistent with the WAL. I'm sure there's a fundamental aspect of this I've misunderstood, but it seems like adding an additional step between a client transaction and the final state of the table data couldn't actually increase overall performance. Thanks in advance!
You are fundamentally right: the extra writes to the transaction log will per se not reduce the I/O load.
But a transaction will normally touch several files (tables, indexes etc.). If you force all these files out to storage (“sync”), you will incur more I/O load than if you sync just a single file.
Of course all these files will have to be written and sync'ed eventually (during a checkpoint), but often the same data are modified several times between two checkpoints, and then the corresponding files will have to be sync'ed only once.
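To make the arithmetic concrete, here is a small toy model (not PostgreSQL code, just counting under the stated assumptions) of why one log flush per commit plus a periodic checkpoint beats forcing every touched file to disk at every commit:

```java
// Toy model: count the commit-time fsyncs for N transactions that each touch the
// same heap file plus two index files, with a single checkpoint at the end.
public class WalSyncCount {
    public static void main(String[] args) {
        int transactions = 1_000;
        int filesTouchedPerTx = 3; // heap file + two index files (illustrative)

        // Without WAL, every touched file would have to be forced to disk at each commit.
        long withoutWal = (long) transactions * filesTouchedPerTx;

        // With WAL there is one sequential flush of the log per commit (group commit can
        // reduce this further), plus one sync per distinct file at the checkpoint,
        // no matter how many transactions modified it in between.
        long withWal = transactions + filesTouchedPerTx;

        System.out.println("fsyncs without WAL: " + withoutWal); // 3000
        System.out.println("fsyncs with WAL:    " + withWal);    // 1003
    }
}
```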

In-memory vs persistent state stores in Kafka Streams?

I've read the stateful stream processing overview and, if I understand correctly, one of the main reasons why RocksDB is used as the default implementation of the key-value store is the fact that, unlike in-memory collections, it can handle data larger than the available memory, because it can flush to disk. Both types of stores can survive application restarts, because the data is backed up as a Kafka topic.
But are there other differences? For example, I've noticed that my persistent state store creates some .log files for each topic partition, but they're all empty.
In short, I'm wondering what are the performance benefits and possible risks of replacing persistent stores with in-memory ones.
I've got a very limited understanding of the internals of Kafka Streams and the different use cases of state stores, esp. in-memory vs persistent, but what I've managed to learn so far is that a persistent state store is one that is stored on disk (hence the name) for a StreamTask.
That does not say much, as the names in-memory vs persistent already suggest as much, but something I found quite illuminating is that Kafka Streams tries to assign partitions to the same Kafka Streams instances that had those partitions assigned before a restart or a crash.
That said, an in-memory state store has to be recreated (replayed from the changelog) on every restart, which takes time before a Kafka Streams application is up and running, while a persistent state store is already materialized on disk, and all the Kafka Streams instance has to do to re-create the state store is load the files from disk (rather than replay the changelog topic, which takes longer).
I hope that helps and I'd be very glad to be corrected if I'm wrong (or partially correct).
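To make that concrete, here is a minimal sketch using the Kafka Streams DSL (topic and store names are illustrative) of choosing between the default persistent RocksDB-backed store and an in-memory one:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

public class StoreChoiceSketch {

    // Builds a trivial topology that materializes a table either in RocksDB (persistent)
    // or purely in memory.
    public static Topology build(boolean inMemory) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.table("input-topic",
            Materialized.<String, String>as(
                    inMemory
                        ? Stores.inMemoryKeyValueStore("my-store")     // replayed from the changelog on every restart
                        : Stores.persistentKeyValueStore("my-store"))  // reloaded from local disk, only the tail is replayed
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.String()));

        return builder.build();
    }
}
```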
I don't see any real reason to swap out the current RocksDB store. In fact, RocksDB is one of the fastest key-value stores:
Percona benchmarks (based on RocksDB)
As for replacing it with in-memory ones: RocksDB already acts largely in-memory, with some LRU algorithms involved:
RocksDB architecture
The three basic constructs of RocksDB are memtable, sstfile and logfile. The memtable is an in-memory data structure - new writes are inserted into the memtable and are optionally written to the logfile.
But there is one more notable reason for choosing this implementation:
RocksDB source code
If you look at the source code, a lot of Java API is exposed from the C++ code. So it's much simpler to integrate this product into the existing Java-based Kafka ecosystem, with comprehensive control over the store via the exposed API.
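As an illustration of that Java API (a minimal sketch against the rocksdbjni bindings; the database path is illustrative):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbJavaApiSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary(); // loads the bundled native (C++) library via JNI

        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-example")) { // illustrative path
            db.put("key".getBytes(), "value".getBytes());
            byte[] value = db.get("key".getBytes());
            System.out.println(new String(value));
        }
    }
}
```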

Redis versus Cassandra (Bigtable data model)

Suppose I need to do the following operations intensively:
put(key, value)
where value is a map of <column name, column value>.
I haven't known NoSQL for long; what I know is that both Cassandra's insert (which conforms to the API defined in the Bigtable paper) and Redis's "HSET" command could do that. But what are the pros and cons of each way? Is there any performance or scalability difference?
EDIT :
My requirement is something like an IM server: I need to store session data, and I want all of it to be in memory so that low latency can be easily achieved. A session lasts at most 2 hours. There are no consistency requirements to consider yet, and disk is only for fail-over. Loss of data is not terrible. All I need is low latency. Operations per second: the more, the better.
Both Redis and Cassandra can be used as a key-value store. The difference is in speed, scale and reliability.
Redis works best as a single server, where the entire data set resides in memory.
Cassandra can handle data sets that don't fit in memory, and data sets that don't fit on a single machine. As part of distributing over multiple machines, Cassandra is much more reliable. Cassandra can handle machine failures, rebuilding machines, and adding capacity to the cluster when needed.
Because Redis is entirely in memory, and reads/writes are served by a single machine (a single Cassandra write will typically talk to multiple machines), Redis will most likely be faster.
If your primary goal is speed, you don't need to store data reliably, and your data set fits in memory, then Redis would probably be a better solution.
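As a sketch of the Redis side of that session use case (assuming the Jedis client; key names and field values are made up):

```java
import java.util.HashMap;
import java.util.Map;

import redis.clients.jedis.Jedis;

public class SessionStoreSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {  // illustrative host/port
            String key = "session:abc123";                  // hypothetical session key

            // HSET with a map of column name -> column value, analogous to the
            // put(key, map) operation from the question (older Jedis versions use hmset).
            Map<String, String> fields = new HashMap<>();
            fields.put("user_id", "42");
            fields.put("last_seen", "1700000000");
            jedis.hset(key, fields);

            // Sessions last at most 2 hours, so let Redis expire them automatically.
            jedis.expire(key, 2 * 60 * 60);

            System.out.println(jedis.hgetAll(key));
        }
    }
}
```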