Akka Distributed Pub/Sub and number of named topics - scala

I would like to create a named topic per online user in my system using Akka clustering. Does having tens of thousands of named topics at a time impact performance negatively?

I would not recommend it. Topic information is represented by a service key in the Receptionist. Between 10k and 100k topics is probably OK; above that you will most likely run into performance issues.
Depending on what you need, using cluster sharding might be a better fit.
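As a rough illustration of the sharding alternative, here is a minimal sketch using Akka Typed Cluster Sharding with one entity per online user; the `UserSession` protocol, entity type key and delivery logic are hypothetical placeholders, not part of the original question.

```scala
// Minimal sketch (assumption): one sharded entity per online user instead of one topic per user.
import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object UserSession {
  // hypothetical per-user protocol
  sealed trait Command
  final case class Publish(message: String) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("UserSession")

  def apply(userId: String): Behavior[Command] =
    Behaviors.receiveMessage { case Publish(message) =>
      // placeholder for real delivery, e.g. pushing to the user's WebSocket connection
      println(s"to $userId: $message")
      Behaviors.same
    }
}

object UserSessionSharding {
  def init(system: ActorSystem[_]): Unit = {
    val sharding = ClusterSharding(system)
    // entities are distributed over the cluster and started on demand
    sharding.init(Entity(UserSession.TypeKey)(ctx => UserSession(ctx.entityId)))

    // sending to a user only requires the user id; nothing is registered in the Receptionist
    sharding.entityRefFor(UserSession.TypeKey, "user-42") ! UserSession.Publish("hello")
  }
}
```

With this approach the sender only needs the user id, no service keys accumulate in the Receptionist, and idle entities can be passivated.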

Related

How to Partition a Queue in a distributed system

This problem occurred to me a while ago; unfortunately, I could not find the answer I was looking for on the web. Here is the problem statement:
Consider a simple producer-consumer environment where we only have one producer writing to a queue and one consumer reading from it. Since the objects written to the queue are quite large and the resources available on our current machine are limited, we decided to implement a distributed queue system where the data inside the queue is partitioned among multiple nodes. It is important to us that total ordering is preserved while pushing and popping the data, meaning that from the point of view of a user this distributed queue acts just like a single unified queue.
Before giving a solution to this problem we have to ask whether high availability or partition tolerance is more important to us. I believe both versions pose interesting challenges, and I thought such a question must surely have been raised before; however, after searching for existing solutions I could not find a complete and well-thought-out answer from an algorithmic or scientific point of view. Most of what I found were engineering and high-level approaches leveraging tools like Kafka, RabbitMQ, Redis, etc.
So the problem remains, and I would be thankful if you could share your designs, algorithms and thoughts on this problem, or point me to a scientific journal or article that has already tackled it.
This is one way in which the above can be achieved: the partitioning is done in a round-robin fashion, with a route table tracking where each item went (see the sketch after the pros and cons below).
To achieve high availability, you can have partition replicas.
Pros:
By adding replicas, the system becomes highly available.
Multi-consumer groups can be implemented.
Cons:
The route table becomes a single point of failure; redundancy can be achieved by using DynamoDB with consistent reads here.
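To make the idea concrete, here is a minimal single-process sketch of round-robin partitioning with a route table that preserves total ordering; the `Partition` trait and all names are hypothetical, and a real system would place the partitions on separate nodes and keep the route table durable and redundant (e.g. the DynamoDB approach mentioned above).

```scala
// Hypothetical single-process sketch; names and the Partition abstraction are illustrative only.
import scala.collection.mutable

trait Partition[A] {
  def push(item: A): Unit
  def pop(): A
}

final class InMemoryPartition[A] extends Partition[A] {
  private val q = mutable.Queue.empty[A]
  def push(item: A): Unit = q.enqueue(item)
  def pop(): A = q.dequeue()
}

final class DistributedQueue[A](partitions: Vector[Partition[A]]) {
  // route table: remembers, in push order, which partition holds each item
  private val routeTable = mutable.Queue.empty[Int]
  private var next = 0

  def push(item: A): Unit = {
    partitions(next).push(item)          // write to the next partition (round-robin)
    routeTable.enqueue(next)             // record where the item went, preserving total order
    next = (next + 1) % partitions.size
  }

  def pop(): A = {
    val p = routeTable.dequeue()         // the partition holding the oldest item
    partitions(p).pop()
  }
}
```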

Best Architecture for offloading high usage Key in Kafka

We have been using Kafka for various use cases and have solved various problems. But one problem we frequently face is that messages with a particular key are suddenly produced in much higher volume, or take a few seconds each to process, so that messages for the other keys in the queue are processed with delay.
We have implemented various ways to detect those keys and offload them to a separate queue backed by a topic pool. But the number of topics in the pool keeps growing, and we find that we are not using the topic resources efficiently.
If we have 100 such keys, then we need to create 100 such topics, which does not seem like an optimised solution.
In these kinds of cases, should we store the particular key's data in a database and implement our own queue on top of that table, or is there some other mechanism with which we can solve this problem?
This problem only affects keys with a high data rate and high processing time (3 to 5 s). Can anyone suggest a better architecture for these kinds of cases?
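For illustration only, this is roughly what the "topic pool" offloading described above could look like on the producer side; the hot-key set, topic names and hashing scheme are assumptions rather than anything from the question.

```scala
// Hypothetical producer-side routing for a "topic pool" of overflow topics.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object HotKeyRouter {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  private val hotKeys: Set[String] = Set("key-with-spikes")     // discovered by monitoring
  private val poolTopics = Vector("overflow-0", "overflow-1")   // fixed-size topic pool

  def send(key: String, value: String): Unit = {
    val topic =
      if (hotKeys.contains(key))
        poolTopics((key.hashCode & Int.MaxValue) % poolTopics.size) // spread hot keys over the pool
      else
        "main-topic"
    producer.send(new ProducerRecord[String, String](topic, key, value))
  }
}
```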

What is a better way to derive statistical information from events in Kafka?

I have a project where I need to provide statistical information via an API to external services. In this service I use only Kafka as "storage". When the application starts, it reads one week of events from the cluster and counts some values, and it then actively listens for new events to keep the information up to date. An example of such information is "how many times item x was sold", etc.
Startup of the application takes a long time and brings other problems with it. It is a Kubernetes service, and the readiness probe fails from time to time because reading the last week of events takes so long.
Two alternatives came to my mind to replace the entire logic:
Kafka Streams or KSQL (I'm not sure whether I would need the same amount of memory and compute here)
A cache database
I'm wondering which idea would be better here, or is there a better approach than either of them?
First, I hope you are reading a compacted topic; otherwise your "x times" counts will be misleading once data is deleted from the topic.
Any option you choose will require reading from the beginning of the topic, so the solution comes down to starting a persistent consumer that either:
stores data on disk, such as a Kafka Streams or KSQL KTable backed by RocksDB (see the sketch below), or
uses some other database of your choice. Redis would be a good option, but so would Couchbase if you want to use Memcached.
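As a sketch of the first option, here is a minimal Kafka Streams application (Scala DSL) that keeps a running count per item in a RocksDB-backed state store; the topic names, application id and serdes are assumptions, and the serde import path assumes a recent Kafka version.

```scala
// Minimal sketch, assuming a recent Kafka Streams release and the Scala DSL.
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object SalesCounter extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-counter")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // "sales" is assumed to be keyed by item id; each record is one sale event
  val counts = builder
    .stream[String, String]("sales")
    .groupByKey
    .count() // materialized in a RocksDB-backed state store by default

  // the counts survive restarts (changelog topic + local state), so there is
  // no need to re-read a week of events on startup
  counts.toStream.to("sales-counts")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```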

Kafka for a real-time trading platform handling $mns of transactions per minute?

Are there any folks familiar with both Kafka and the real-time trade booking function of a bank? Can you recommend Kafka for 24/7 uptime, high durability, 10k messages per second, multi-region operation, and unlimited data retention (as the source of truth) for a trading system that handles financial transactions worth $billions a week?
There are many banks and financial services firms known to be using Kafka; GS, for example, is one of them.
What do you mean by 24/7 uptime? If you mean 24/7 availability, yes. Kafka, like many other distributed systems, offers replication to achieve this, and with some care it is possible to withstand even an entire datacenter outage.
Unlimited data retention is in fact configurable: set the topic-level retention.ms (or the broker-level log.retention.ms) to -1 so that no time limit is applied, or simply set log.retention.hours to a very high number (see the sketch after this answer). But message brokers are not necessarily the best source of truth.
Do you want random access to your data by key? Kafka is a poor choice. Do you have complex range queries? Same. But you can still use Kafka as your primary datastore and use a cache or indexed KV store as secondary storage. So it really depends on your use cases, query patterns, how you are going to replay the data on demand, and so on.
Whether you want to use Kafka as a source of truth should be decided based on your overall architecture, not the other way around.
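As an aside on the retention point, here is a minimal sketch of setting effectively unlimited retention on a topic via the Kafka AdminClient; the topic name and bootstrap address are assumptions.

```scala
// Minimal sketch: disable time-based deletion for one topic via the AdminClient.
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object UnlimitedRetention extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  val admin = AdminClient.create(props)

  // retention.ms = -1 means no time limit is applied to this topic
  val resource = new ConfigResource(ConfigResource.Type.TOPIC, "trades")
  val op = new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET)

  val configs = new java.util.HashMap[ConfigResource, java.util.Collection[AlterConfigOp]]()
  configs.put(resource, Collections.singletonList(op))

  admin.incrementalAlterConfigs(configs).all().get()
  admin.close()
}
```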

Kafka: from what data volume is it worth using?

I work on a log centralization project.
I'm working with ELK to collect/aggregate/store/visualize my data. I see that Kafka can be useful for large volumes of data, but I cannot find information on the volume of data at which it becomes worthwhile to use it.
10 GB of logs per day? Less, more?
Thanks for your help.
Let's approach this in two ways.
What volumes of data is Kafka suitable for? Kafka is used at both large scale (Netflix, Uber, PayPal, Twitter, etc.) and small.
You can start with a cluster of three brokers handling a few MB if you want, and scale out from there as required. 10 GB of data a day would be perfectly reasonable in Kafka, but so would ten times less or ten times more.
What is Kafka suitable for? In the context of your question, Kafka serves as an event-driven integration point between systems. It can be a "dumb" pipeline, but because it persists data, that data can be re-consumed elsewhere. It also offers native stream-processing capabilities and integration with other systems.
If all you are doing is getting logs into Elasticsearch, then Kafka may be overkill. But if you want to use that log data in another place as well (e.g. HDFS, S3, etc.), or process it for patterns, or filter it on conditions and route it elsewhere, then Kafka is a sensible option to route it through. This talk explores some of these concepts.
In terms of ELK and Kafka specifically, Logstash and Beats can write to Kafka as an output, and there is a Kafka Connect connector for Elasticsearch.
Disclaimer: I work for Confluent.