We are planning to run a Kafka Streams application distributed across two machines. Each instance stores its KTable data on its own machine.
The challenge we face is this:
We have a million records pushed to the KTable. We need to iterate over the whole KTable (RocksDB) data and generate a report.
Let's say 500K records are stored in each instance. It's not possible to get all records from the other instance in a single GET over HTTP (unless there is some streaming TCP technique available). Basically, we need the data from both instances in a single call to generate the report.
Proposed Solution:
We are thinking of having a shared location (state.dir) for these two instances, so that both store their KTable data in the same directory. The idea is to get all the data from a single instance, without interactive queries, by just calling:
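    // "store-name" is a placeholder: the store name is missing from the original snippet.
    final ReadOnlyKeyValueStore<Key, Result> allDataFromTwoInstance =
        streams.store("store-name",
            QueryableStoreTypes.<Key, Result>keyValueStore());
    KeyValueIterator<Key, Result> iterator = allDataFromTwoInstance.all();
    while (iterator.hasNext()) {
        KeyValue<Key, Result> entry = iterator.next();
        // append to excel report
    }
    iterator.close();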
Will the above solution work without any issues? If not, is there any alternative solution for this?
Please suggest. Thanks in Advance

GlobalKTable is the most natural first choice, but it means each node where the global table is defined contains the entire dataset.
The other alternative that comes to mind is indeed to stream the data between the nodes on demand. This makes sense especially if creating the report is an infrequent operation or when the dataset cannot fit a single node. Basically, you can follow the documentation guidelines for querying remote Kafka Streams nodes here:
and for RPC use a framework that supports streaming, e.g. akka-http.
Server-side streaming:
Consuming a streaming response:

This will not work. Even if you have a shared state.dir, each instance only loads its own share/shard of the data and is not aware of the other instance's data.
I think you should use GlobalKTable to get a full local copy of the data.


There is a Kafka stream component that fetches JSON data from a topic. Now I have to do the following:
1. Parse that input JSON data and fetch the value of a certain ID (identifier) attribute
2. Do a lookup against a particular table in an Oracle database
3. Enrich that input JSON with data from the lookup table
4. Publish the enriched JSON data to another topic
What is the best design approach to achieve Step#2? I have a fair idea on how I can do the other steps. Any help is very much appreciated.
Depending on the size of the dataset you're talking about, and of the volume of the stream, I'd try to cache the database as much as possible (assuming it doesn't change that often). Augmenting data by querying a database on every record is very expensive in terms of latency and performance.
The way I've done this before is to instantiate a thread whose only task is to maintain a fresh local cache (usually a ConcurrentHashMap) and make that available to the process that requires it. In this case, you'll probably want to create a processor, give it the reference to the ConcurrentHashMap described above, and when the Kafka record comes in, look up the data with the key, augment the record, and send it to either a Sink processor or another Streams processor, depending on what you want to do with it.
If the lookup fails, you can fall back to querying the database on demand, but you probably want to test and profile this, because in the worst-case scenario of 100% cache misses, you're going to be querying the database a lot.
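To make the cache idea above concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions: the class name LookupCache and the two loader callbacks (loadAll, loadOne) are hypothetical stand-ins for your own Oracle queries, and no Kafka Streams API is involved. A processor would simply hold a reference to this object and call lookup() per record.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

/** Local lookup cache refreshed by a background thread, with on-demand fallback. */
class LookupCache<K, V> implements AutoCloseable {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Supplier<Map<K, V>> loadAll;      // hypothetical: bulk query over the lookup table
    private final Function<K, Optional<V>> loadOne; // hypothetical: single-row query used on a miss
    private final ScheduledExecutorService refresher =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "lookup-cache-refresher");
                t.setDaemon(true);
                return t;
            });

    LookupCache(Supplier<Map<K, V>> loadAll, Function<K, Optional<V>> loadOne,
                long refreshPeriod, TimeUnit unit) {
        this.loadAll = loadAll;
        this.loadOne = loadOne;
        refresh(); // warm the cache before the first record arrives
        refresher.scheduleAtFixedRate(this::refreshQuietly, refreshPeriod, refreshPeriod, unit);
    }

    /** Reloads every entry; rows deleted in the database stay cached until restart. */
    void refresh() {
        cache.putAll(loadAll.get());
    }

    private void refreshQuietly() {
        try {
            refresh();
        } catch (RuntimeException e) {
            // Keep serving the previous data; an uncaught exception would cancel the schedule.
        }
    }

    /** Cache hit if possible, otherwise one query against the backing store. */
    Optional<V> lookup(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<V> loaded = loadOne.apply(key);
        loaded.ifPresent(v -> cache.put(key, v));
        return loaded;
    }

    @Override
    public void close() {
        refresher.shutdownNow();
    }
}
```

Counting how often loadOne is actually invoked is an easy way to profile the miss rate mentioned above before deciding whether the fallback is affordable.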

Has anyone tried to look up external APIs using PySpark?

Checking to see if anyone has used PySpark to distribute data and use it to look up an external API and store the results.
I am working on this problem:
I have a source file with, say, 100,000 rows of User Agents. I have to look up an external API (using requests), get the characteristics of each User Agent, and store them. I was able to accomplish this with Queues and Threads in a reasonable manner.
However, I noticed that the 100k row count might grow to a million.
I was wondering whether I could use Spark to distribute this data and perform the API lookup in parallel using executors.
Has anyone accomplished this?

CQRS, Event Sourcing and Scaling

It's clear that a system based on these patterns is easily scalable. But I would like to ask, how exactly? I have a few questions regarding scalability:
How to scale aggregates? If I create multiple instances of aggregate A, how do I sync them? If one of the instances processes a command and creates an event, should this event be propagated to every instance of that aggregate?
Shouldn't there be some business logic that decides which instance of the aggregate to request? So if I am issuing multiple commands that apply to aggregate A (ORDERS) and to one specific order, it makes sense to deliver them to the same instance. Or?
In this article: https://initiate.andela.com/event-sourcing-and-cqrs-a-look-at-kafka-e0c1b90d17d8,
they are using Kafka with partitioning. So the user management service (the aggregate) is scaled, but each instance is subscribed only to a specific partition of the topic, which contains all events of a particular user.
How to scale aggregates?
Choose aggregates carefully and make sure your commands spread reasonably among many aggregates. You don't want an aggregate that is likely to receive a high number of commands from concurrent users.
Serialize commands sent to an aggregate instance. This can be done with an aggregate repository and a command bus/queue. But for me, the simplest way is to use optimistic locking with aggregate versioning, as described in this post by Michiel Rook.
Which instance of the aggregate to request?
In our reSolve framework we create an instance of the aggregate on every command and don't keep it between requests. This works surprisingly fast: it is faster to fetch 100 events and reduce them to aggregate state than to find the right aggregate instance in a cluster.
This approach is scalable and lets you go serverless: one lambda invocation per command and no shared state in between. The rare cases when an aggregate has too many events are solved by snapshots.
How to scale aggregates?
The Aggregate instances are represented by their stream of events. Every Aggregate instance has its own stream of events. Events from one Aggregate instance are NOT used by other Aggregate instances. For example, if Order Aggregate with ID=1 creates an OrderWasCreated event with ID=1001, that Event will NEVER be used to rehydrate other Order Aggregate instances (with ID=2,3,4...).
That being said, you scale the Aggregates horizontally by creating shards on the Event store based on the Aggregate ID.
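As a rough illustration of sharding by Aggregate ID, here is a small sketch. ShardRouter is a hypothetical name, and hashing with CRC32 plus a modulo is just one simple choice (not something any particular event store prescribes); the only property that matters is that the same ID always maps to the same shard.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Maps an aggregate ID to one of N event-store shards. */
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Same ID always yields the same shard, so one aggregate's stream stays together. */
    int shardFor(String aggregateId) {
        CRC32 crc = new CRC32();
        crc.update(aggregateId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount); // getValue() is never negative
    }
}
```

Note that plain modulo sharding moves most keys when the shard count changes, so a real deployment would likely need a resharding strategy on top of this.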
If I create multiple instances of aggregate A, how do I sync them? If one of the instances processes a command and creates an event, should this event be propagated to every instance of that aggregate?
You don't. Each Aggregate instance is completely separated from other instances.
To be able to scale command processing horizontally, it is recommended to load the Aggregate instance from the Event store each time, by replaying all its previously generated events. There is one optimization that you can do to boost performance: Aggregate snapshots, but it is recommended only if it's really needed. This answer could help.
Shouldn't there be some business logic that decides which instance of the aggregate to request? So if I am issuing multiple commands that apply to aggregate A (ORDERS) and to one specific order, it makes sense to deliver them to the same instance. Or?
You assume that the Aggregate instances are running continuously in some servers' RAM. You could do that, but such an architecture is very complex. For example, what happens when one of the servers goes down and must be replaced by another? It's hard to determine which instances were living there and to restart them. Instead, you could have many stateless servers that can handle commands for any of the aggregate instances. When a command arrives, you identify the Aggregate ID, load the Aggregate from the Event store by replaying all its previous events, and then it can execute the command. After the command is executed and the new events are persisted to the Event store, you can discard the Aggregate instance. The next command that arrives for the same Aggregate instance could be handled by any other stateless server. So, scalability is dictated only by the scalability of the Event store itself.
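The load, replay, execute, persist, discard cycle described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the in-memory event store, events modelled as plain strings, the Order aggregate, and the AddItemHandler are all hypothetical. The expected-version check on append is one simple way to get the optimistic locking that keeps two stateless servers from overwriting each other.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for an event store with per-stream optimistic versioning. */
final class InMemoryEventStore {
    private final Map<String, List<String>> streams = new ConcurrentHashMap<>();

    List<String> load(String aggregateId) {
        List<String> stream = streams.get(aggregateId);
        if (stream == null) {
            return List.of();
        }
        synchronized (stream) {
            return List.copyOf(stream);
        }
    }

    /** Appends only if nobody else appended since the caller loaded the stream. */
    void append(String aggregateId, int expectedVersion, List<String> newEvents) {
        List<String> stream = streams.computeIfAbsent(aggregateId, id -> new ArrayList<>());
        synchronized (stream) {
            if (stream.size() != expectedVersion) {
                throw new ConcurrentModificationException("stale version for " + aggregateId);
            }
            stream.addAll(newEvents);
        }
    }
}

/** Hypothetical Order aggregate whose state is a fold over its own events. */
final class Order {
    boolean created;
    int itemCount;
    int version; // number of events applied so far

    static Order rehydrate(List<String> history) {
        Order order = new Order();
        history.forEach(order::apply);
        return order;
    }

    void apply(String event) {
        if (event.equals("OrderWasCreated")) {
            created = true;
        } else if (event.equals("ItemWasAdded")) {
            itemCount++;
        }
        version++;
    }

    /** Command handler: validates against current state and returns the new events. */
    List<String> addItem() {
        if (!created) {
            throw new IllegalStateException("order does not exist");
        }
        return List.of("ItemWasAdded");
    }
}

/** Stateless handler: any server can run this for any aggregate ID. */
final class AddItemHandler {
    static void handle(InMemoryEventStore store, String orderId) {
        Order order = Order.rehydrate(store.load(orderId)); // replay all previous events
        List<String> events = order.addItem();              // execute the command
        store.append(orderId, order.version, events);       // persist, guarded by version
        // The Order instance is discarded here; nothing is kept between commands.
    }
}
```

If append throws because another server won the race, the caller can simply reload and retry the command.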
How to scale aggregates?
Each piece of information in the system has a single logical authority. Multiple authorities for a single piece of data get you contention. You scale the writes by creating smaller, non-overlapping boundaries: each authority has a smaller area of responsibility.
To borrow from your example, an example of smaller responsibilities would
be to shift from one aggregate for all ORDERS to one aggregate for _each_
ORDER.
It's analogous to the difference between having a key value store with
all ORDERS stored in a document under one key, vs each ORDER being stored
using its own key.
Reads are safe, you can scale them out with multiple copies. Those copies are only eventually consistent, however. This means that if you ask "what is the bid price of FCOJ now?" you may get different answers from each copy. Alternatively, if you ask "what was the bid price of FCOJ at 10:09:02?" then each copy will either give you a single answer or say "I don't know yet".
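A toy model of the difference between the two questions, under the assumption that each copy applies price updates in timestamp order and merely lags behind. PriceReplica and its method names are hypothetical, invented for this sketch.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Read replica that has applied price updates up to some point in time. */
final class PriceReplica {
    private final TreeMap<Long, Integer> bidsByTime = new TreeMap<>();
    private long appliedUpTo = Long.MIN_VALUE; // timestamp of the last update this copy has seen

    /** Updates arrive in timestamp order, possibly later than on other copies. */
    void apply(long timestamp, int bid) {
        bidsByTime.put(timestamp, bid);
        appliedUpTo = timestamp;
    }

    /** "What is the bid now?" depends on how far this copy has caught up. */
    Optional<Integer> bidNow() {
        Map.Entry<Long, Integer> last = bidsByTime.lastEntry();
        return last == null ? Optional.empty() : Optional.of(last.getValue());
    }

    /** "What was the bid at time t?" has one answer, or none yet on a lagging copy. */
    Optional<Integer> bidAt(long timestamp) {
        if (timestamp > appliedUpTo) {
            return Optional.empty(); // "I don't know yet"
        }
        Map.Entry<Long, Integer> entry = bidsByTime.floorEntry(timestamp);
        return entry == null ? Optional.empty() : Optional.of(entry.getValue());
    }
}
```

Two replicas at different lag can disagree on bidNow(), but wherever both return a value from bidAt(t), it is the same value.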
But if the granularity is already one command per aggregate, which in my opinion is not often possible, and you really have many concurrent accesses, how do you solve it? How do you spread the load and avoid conflicts as much as possible?
Rough sketch: each aggregate is stored via a key that can be computed from the contents of the command message. An update to the aggregate is achieved by a compare-and-swap operation using that key.
1. Acquire a message
2. Compute the storage key
3. Load a versioned representation from storage
4. Compute a new versioned representation
5. Store: compare-and-swap the new representation for the old
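The five steps above can be sketched as a retry loop. This is only an illustration: a ConcurrentHashMap stands in for external storage that supports compare-and-swap per key, and the names Versioned, CasProcessor, keyOf and decide are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Immutable versioned representation of one aggregate. */
final class Versioned<S> {
    final long version;
    final S state;

    Versioned(long version, S state) {
        this.version = version;
        this.state = state;
    }
}

/** Stateless command processing on top of a compare-and-swap key-value store. */
final class CasProcessor<M, S> {
    // Stand-in for external storage that offers compare-and-swap per key.
    private final ConcurrentHashMap<String, Versioned<S>> storage = new ConcurrentHashMap<>();
    private final Function<M, String> keyOf;  // storage key computed from the message
    private final BiFunction<S, M, S> decide; // pure transition; receives null state for a new key

    CasProcessor(Function<M, String> keyOf, BiFunction<S, M, S> decide) {
        this.keyOf = keyOf;
        this.decide = decide;
    }

    /** Step 1 (acquiring the message) is done by the caller. */
    Versioned<S> process(M message) {
        String key = keyOf.apply(message);                  // step 2
        while (true) {
            Versioned<S> current = storage.get(key);        // step 3
            S oldState = current == null ? null : current.state;
            long oldVersion = current == null ? 0 : current.version;
            Versioned<S> next =
                    new Versioned<>(oldVersion + 1, decide.apply(oldState, message)); // step 4
            boolean swapped = current == null
                    ? storage.putIfAbsent(key, next) == null
                    : storage.replace(key, current, next);  // step 5
            if (swapped) {
                return next;
            }
            // Another writer won the race: reload and recompute.
        }
    }
}
```

Because the processor keeps no state of its own between calls, any number of these can run side by side against the same storage.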
To provide additional traffic throughput, you add more stateless compute.
To provide storage throughput, you distribute the keys across more storage appliances.
A routing layer can be used to group messages together: the router uses the same storage key calculation as before, but uses it to choose where in the compute farm to forward the message. The compute can then check each batch of messages it receives for duplicate keys and process those messages together (trading some extra compute to reduce the number of compare-and-swaps).
Sane message protocols are important; see Marc de Graauw's Nobody Needs Reliable Messaging.

Using MongoDB to store immutable data?

We are investigating options to store and read a lot of immutable data (events), and I'd like some feedback on whether MongoDB would be a good fit.
We'll need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB. Would it be fine to store all of these events in the same collection?
A really important requirement is that we need to be able to replay all events in order. I've read here that MongoDB has a limit of 32 MB when sorting documents using cursors. For us it would be fine to read all data in insertion order (like a table scan), so an explicit sort might not be necessary? Are cursors the way to go, and would they be able to fulfill this requirement?
If MongoDB would be a good fit for this, is there some configuration or setting one can tune to increase performance or reliability for immutable data?
This is very similar to storing logs: lots of writes, and the data is read back in order. Luckily the Mongo Site has a recipe for this:
Regarding immutability of the data, that's not a problem for MongoDB.
Edit 2022-02-19:
Replacement link:
Snippet of content from page:
This document outlines the basic patterns and principles for using
MongoDB as a persistent storage engine for log data from servers and
other machine data.
Problem
Servers generate a large number of events (i.e. logging,) that
contain useful information about their operation including errors,
warnings, and users behavior. By default, most servers, store these
data in plain text log files on their local file systems.
While plain-text logs are accessible and human-readable, they are
difficult to use, reference, and analyze without holistic systems for
aggregating and storing these data.
Solution
The solution described below assumes that each server
generates events also consumes event data and that each server can
access the MongoDB instance. Furthermore, this design assumes that the
query rate for this logging data is substantially lower than common
for logging applications with a high-bandwidth event stream.
This case assumes that you’re using a standard uncapped collection for
this event data, unless otherwise noted. See the section on capped
Schema Design
The schema for storing log data in MongoDB depends on
the format of the event data that you’re storing. For a simple
example, consider standard request logs in the combined format from
the Apache HTTP Server. A line from these logs may resemble the

Apache spark streaming - cache dataset for joining

I'm considering using Apache Spark streaming for some real-time work but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be json records coming from Kafka that contain an Id, I want to translate that id into a name using a lookup dataset. The lookup dataset resides in Mongo Db but I want to be able to cache it inside the spark process as the dataset changes very rarely (once every couple of hours) so I don't want to hit mongo for every input record or reload all the records in every spark batch but I need to be able to update the data held in spark periodically (e.g. every 2 hours).
What is the best way to do this?
I've thought long and hard about this myself. In particular, I've wondered whether it is possible to actually implement a database of sorts in Spark.
Well the answer is kind of yes. First you want a program that first caches the main data set into memory, then every couple of hours does an optimized join-with-tiny to update the main data set. Now apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny, by taking the periodic bi-hourly dataset and turning it into a HashMap then broadcasting it as a broadcast variable. What this means is that the HashMap will be copied, but only once per node (compare this with just referencing the Map - it would be copied once per task - a much greater cost). Then you take your main dataset and add on the new records using the broadcasted map. You can then periodically (nightly) save to hdfs or something.
So here is some scruffy pseudo code to elucidate:
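    // Pseudo code: parse, update and everyTwoHoursDo are hypothetical helpers.
    var mainDataSet: RDD[(KeyType, DataType)] =
      sc.textFile("/path/to/main/dataset").map(parse).cache()

    everyTwoHoursDo {
      // Collect the small dataset on the driver and broadcast it (copied once per node).
      val newData = sc.broadcast(
        sc.textFile("/path/to/last/two/hours").map(parse).collect().toMap)

      val mainDataSetNew = mainDataSet.map { case (key, oldValue) =>
        // Keep the old value when the key has no update.
        (key, newData.value.get(key).map(update(oldValue, _)).getOrElse(oldValue))
      }.cache()

      mainDataSetNew.count() // an action, to force execution
      mainDataSet = mainDataSetNew
    }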
I've also thought that you could be very clever and use a custom partitioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a submap. Then you can skip updating partitions that you know won't hold any keys that occur in the newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is that your dataset is already in memory, ready for some analysis / machine learning. The downside is that you're kind of reinventing the wheel a bit. It might be a better idea to look at using Cassandra, as Datastax is partnering with Databricks (the people who make Spark) and might end up supporting something like this out of the box.
Here is a fairly simple work-flow:
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df). Then cache, m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.