Kafka stream enrichment - Sourcing a lookup table [duplicate] - apache-kafka

This question already has an answer here:
Is it a good practice to do sync database query or restful call in Kafka streams jobs?
(1 answer)
There is a Kafka Streams component that fetches JSON data from a topic. Now I have to do the following:
1. Parse that input JSON and fetch the value of a certain ID (identifier) attribute
2. Do a lookup against a particular table in an Oracle database
3. Enrich the input JSON with data from the lookup table
4. Publish the enriched JSON data to another topic
What is the best design approach to achieve step 2? I have a fair idea of how to do the other steps. Any help is very much appreciated.

Depending on the size of the dataset you're talking about, and the volume of the stream, I'd try to cache the database as much as possible (assuming it doesn't change that often). Augmenting data by querying a database on every record is very expensive in terms of latency and throughput.
The way I've done this before is to instantiate a thread whose only task is to maintain a fresh local cache (usually a ConcurrentHashMap) and make it available to the process that requires it. In this case, you'll probably want to create a processor, give it a reference to that ConcurrentHashMap, and when a Kafka record comes in, look up the data by key, augment the record, and send it either to a sink processor or to another Streams processor, depending on what you want to do with it.
In case the lookup fails, you can fall back to doing an on-demand query against the database, but you probably want to test and profile this, because in the worst-case scenario of 100% cache misses you're going to be querying the database a lot.
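A minimal sketch of that approach, assuming the newer Kafka Streams Processor API (3.x). The class name EnrichmentProcessor and the extractId/queryDatabase/enrich helpers are placeholders for your own JSON parsing, JDBC lookup, and merging logic; the ConcurrentHashMap is assumed to be kept fresh by the separate background thread described above.

import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class EnrichmentProcessor implements Processor<String, String, String, String> {

    private final ConcurrentHashMap<String, String> cache; // refreshed by a background thread
    private ProcessorContext<String, String> context;

    public EnrichmentProcessor(ConcurrentHashMap<String, String> cache) {
        this.cache = cache;
    }

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> record) {
        String id = extractId(record.value());   // parse the JSON and pull out the ID attribute
        String lookup = cache.get(id);           // cheap local lookup
        if (lookup == null) {
            lookup = queryDatabase(id);          // fallback: on-demand query (profile this path)
            if (lookup != null) {
                cache.put(id, lookup);
            }
        }
        // forward the enriched record to the sink / next processor
        context.forward(record.withValue(enrich(record.value(), lookup)));
    }

    // The three methods below stand in for your JSON parsing, JDBC lookup and JSON merging logic.
    private String extractId(String json) {
        throw new UnsupportedOperationException("parse the JSON and return the ID attribute");
    }

    private String queryDatabase(String id) {
        throw new UnsupportedOperationException("JDBC lookup against the Oracle table");
    }

    private String enrich(String json, String lookup) {
        throw new UnsupportedOperationException("merge the lookup data into the JSON payload");
    }
}

The processor itself never blocks on the cache refresh; only the miss path touches the database.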

Related

Spring batch -preload chunk related data

I am reading records from a file and I need to associate them with records that are already in the database.
The related database record is specified within each line of the file (the line contains that record's id). Each item read should have one related record in the database. I do not want to read a single record from the database per item because of the performance issues that might cause.
Therefore I would like to read, in one query, all database records related to the lines currently being processed within the chunk. Is there a way to do that? Or is there a way to access all items being processed as part of a single chunk (they should all be in memory anyway)?
I know I could load all records that are likely to be needed, but assume there are millions of such records in the database and I am only processing a file with a few thousand lines.
This is clearly a case for a custom reader. Remember that Spring Batch is simply a framework that gives structure to your code and infrastructure; it doesn't impose many restrictions on the logic you write yourself, as long as it conforms to the interfaces.
Having said that, if you are not transforming the read items in an ItemProcessor, the List of items read from the file as part of the chunk is available to the ItemWriter.
If your file is really small, you can read all items in one go using a custom file reader/parser instead of reading them one by one with the API-provided reader, and then load only those items from the database in one go.
Instead of a single-step job, you can have a two-step job where the first step dumps the records read from the file into a database table, and the second step does a SQL join between the two tables to find the common records.
These are simply broad ideas, and the implementation is up to you. It gets hard if you start looking for ready-made APIs for every custom case encountered in practical scenarios.
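A rough sketch of the first idea (working with the whole chunk as it arrives at the ItemWriter), assuming the Spring Batch 4 ItemWriter signature. ChunkLookupWriter, FileRecord and related_table are illustrative names, not from the original post.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class ChunkLookupWriter implements ItemWriter<ChunkLookupWriter.FileRecord> {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public ChunkLookupWriter(NamedParameterJdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends FileRecord> chunk) {
        // Collect the ids referenced by this chunk and fetch all related rows in one query.
        List<Long> relatedIds = chunk.stream()
                .map(r -> r.relatedId)
                .collect(Collectors.toList());

        List<Map<String, Object>> relatedRows = jdbcTemplate.queryForList(
                "SELECT * FROM related_table WHERE id IN (:ids)",
                new MapSqlParameterSource("ids", relatedIds));

        // Associate each item with its related row here and write the combined
        // result to wherever it needs to go.
    }

    // Minimal domain type assumed by this sketch.
    public static class FileRecord {
        public final long relatedId;
        public FileRecord(long relatedId) { this.relatedId = relatedId; }
    }
}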
I do not want to read a single record from the database per item because of the performance issues that might cause.
What if you read all related items at once for the current item? You can achieve that using the driving query pattern. The idea is to use an item processor that queries the database to fetch all records related to the current item.
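A hedged sketch of such a driving-query processor; DrivingQueryProcessor, FileRecord, EnrichedRecord and related_table are hypothetical names used only for illustration.

import java.util.List;
import java.util.Map;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class DrivingQueryProcessor
        implements ItemProcessor<DrivingQueryProcessor.FileRecord, DrivingQueryProcessor.EnrichedRecord> {

    private final JdbcTemplate jdbcTemplate;

    public DrivingQueryProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public EnrichedRecord process(FileRecord item) {
        // One query per item, but it fetches every related row for that item at once.
        List<Map<String, Object>> related = jdbcTemplate.queryForList(
                "SELECT * FROM related_table WHERE parent_id = ?", item.relatedId);
        return new EnrichedRecord(item, related);
    }

    // Minimal domain types assumed by this sketch.
    public static class FileRecord {
        public final long relatedId;
        public FileRecord(long relatedId) { this.relatedId = relatedId; }
    }

    public static class EnrichedRecord {
        public final FileRecord item;
        public final List<Map<String, Object>> related;
        public EnrichedRecord(FileRecord item, List<Map<String, Object>> related) {
            this.item = item;
            this.related = related;
        }
    }
}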

Understanding CQRS and EventSourcing

I have read several blogs and watched videos about the usefulness of CQRS and ES, but I am left with implementation confusion.
CQRS: when using separate tables, one for write/update/delete and another for read operations, how does the data get synced from the write table to the read table? Do we need a cron job to sync data from the write table to the read-only table, or are there other options?
Event Sourcing: Do we store only the immutable, sequential operations as records (one for each update after the record is created) in one store? Or do we also store a mutable record, i.e. the same record updated in place, in another store?
Also, please explain where RDBMS, NoSQL, and messaging are used and how they fit in.
When using separate tables, one for write/update/delete and another for read operations, how does the data get synced from the write table to the read table?
You design an asynchronous process that understands how to transform the data from its "write" representation to its "read" representation, and you design a scheduler to decide when that asynchronous process runs.
Part of the point is that it's just plumbing, and you can choose whatever plumbing you want that satisfies your operational needs.
Event Sourcing
On the happy path, each "event stream" is an append-only sequence of immutable events. In the case where you are enforcing a domain invariant over the contents of the stream, you'll normally have a "first writer wins" conflict policy.
But "the" stream is the authoritative copy of the events. There may also be non-authoritative copies (for instance, events published to a message bus). They are typically all immutable.
In some domains, where you have to worry about privacy and "the right to be forgotten", you may need affordances that allow you to remove information from a previously stored event. Depending on your design choices, you may need mutable events there.
RDBMS
For many sorts of queries, especially those which span multiple event streams, being able to describe the desired results in terms of relations makes the programming task much easier. So a common design is to have asynchronous processes that read information from the event streams and update the RDBMS. The usual derived benefit is that you get low-latency queries (but the data returned by those queries may be stale).
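As a rough illustration of such a projection (the EventStore and Event interfaces and the order_summary table are assumptions of this sketch, not a particular library's API):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class OrderSummaryProjection implements Runnable {

    private final EventStore eventStore;    // wherever the authoritative events live
    private final Connection readDb;        // RDBMS holding the derived read model
    private long lastProcessedPosition = 0; // checkpoint so the projection can resume

    public OrderSummaryProjection(EventStore eventStore, Connection readDb) {
        this.eventStore = eventStore;
        this.readDb = readDb;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Read the next batch of events past our checkpoint and apply them.
            List<Event> batch = eventStore.readAfter(lastProcessedPosition, 100);
            for (Event event : batch) {
                apply(event);
                lastProcessedPosition = event.position();
            }
            // Run on whatever schedule meets your staleness tolerance.
            try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
        }
    }

    private void apply(Event event) {
        // Translate the "write" representation (an event) into the "read" representation (a row).
        if ("OrderPlaced".equals(event.type())) {
            try (PreparedStatement ps = readDb.prepareStatement(
                    "INSERT INTO order_summary (order_id, status) VALUES (?, 'PLACED')")) {
                ps.setString(1, event.aggregateId());
                ps.executeUpdate();
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }

    // Minimal interfaces assumed by this sketch.
    interface EventStore { List<Event> readAfter(long position, int limit); }
    interface Event { long position(); String type(); String aggregateId(); }
}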
An RDBMS can also be used as the core of the design of the event store / message store itself. Events are commonly written as blob data, with interesting metadata exposed as additional columns. The message store used by eventide-project is based on PostgreSQL.
NoSQL
Again, can potentially be used as your cache of readable views, or as your message store, depending on your needs. Event Store would be an example of a NoSQL message store.
Messaging
Messaging is a pattern for temporal decoupling; the ability to store/retrieve messages in a stable central area affords the ability to shut down a message producer without blocking the message consumer, and vice versa. Message stores also afford some abstraction - the producer of a message doesn't necessarily know all of the consumers, and the consumer doesn't necessarily know all of the producers.
My question is about Event Sourcing. Do we need to store only the immutable sequence of events, and where should they be stored?
In event sourcing, the authoritative representation of the state is the sequence of events - your durable copy of that event sequence is the book of truth.
As for where they go? Well, that is going to depend on your architecture and storage choices. You could manage files on disk yourself, you could write them in to your own RDBMS; you could use an RDBMS designed by somebody else, you could use a NoSQL document store, you could use a dedicated message store.
There could be multiple stores -- for instance, in a micro service architecture, the service that accepts orders might be different from the service that tracks order fulfillment, and they could each be writing events into different storage appliances.

Kafka Interactive Queries - Accessing large data across instances

We are planning to run a Kafka Streams application distributed across two machines. Each instance stores its KTable data on its own machine.
The challenge we face here is:
We have a million records pushed to the KTable. We need to iterate over the whole KTable (RocksDB) data and generate the report.
Let's say 500K records are stored in each instance. It's not possible to get all records from the other instance in a single GET over HTTP (unless there is some streaming TCP technique available). Basically we need the data of both instances in a single call to generate the report.
Proposed Solution:
We are thinking of having a shared location (state.dir) for these two instances, so that both instances store their KTable data in the same directory. The idea is to get all the data from a single instance, without interactive queries, by just calling:
final ReadOnlyKeyValueStore<Key, Result> allDataFromTwoInstance =
    streams.store("result", QueryableStoreTypes.<Key, Result>keyValueStore());

KeyValueIterator<Key, Result> iterator = allDataFromTwoInstance.all();
while (iterator.hasNext()) {
    KeyValue<Key, Result> entry = iterator.next();
    // append entry to the Excel report
}
iterator.close();
Question:
Will the above solution work without any issues? If not, is there any alternative solution for this?
Please suggest. Thanks in Advance
GlobalKTable is the most natural first choice, but it means each node where the global table is defined contains the entire dataset.
The other alternative that comes to mind is indeed to stream the data between the nodes on demand. This makes sense especially if creating the report is an infrequent operation or when the dataset cannot fit on a single node. Basically, you can follow the documentation guidelines for querying remote Kafka Streams nodes here:
http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_discovery
and for RPC use a framework that supports streaming, e.g. akka-http.
Server-side streaming:
http://doc.akka.io/docs/akka-http/current/java/http/routing-dsl/source-streaming-support.html
Consuming a streaming response:
http://doc.akka.io/docs/akka-http/current/java/http/implications-of-streaming-http-entity.html#client-side-handling-of-streaming-http-entities
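For the discovery side, a hedged sketch in plain Java (the /store/result HTTP endpoint is an assumption; you would expose it yourself with whatever RPC framework you pick, e.g. akka-http as linked above):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collection;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.StreamsMetadata;

public class ReportCollector {

    public static void collect(KafkaStreams streams) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Discover every instance that hosts a shard of the "result" store.
        Collection<StreamsMetadata> instances = streams.allMetadataForStore("result");

        for (StreamsMetadata instance : instances) {
            // Pull that instance's shard of the data over the hypothetical endpoint.
            URI uri = URI.create("http://" + instance.host() + ":" + instance.port() + "/store/result");
            HttpResponse<String> response = http.send(
                    HttpRequest.newBuilder(uri).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // Append this instance's data to the report.
            System.out.println("Fetched " + response.body().length() + " bytes from " + uri);
        }
    }
}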
This will not work. Even if you have a shared state.dir, each instance only loads its own share/shard of the data and is not aware of the other instance's data.
I think you should use GlobalKTable to get a full local copy of the data.
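A minimal GlobalKTable sketch, assuming Kafka Streams 2.5+ and String serdes for brevity; the topic and store names ("result-topic", "result-global-store") are illustrative.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class GlobalTableReport {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "report-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Each instance materializes the entire topic into its own local store.
        StreamsBuilder builder = new StreamsBuilder();
        builder.globalTable("result-topic",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("result-global-store")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // In real code, wait until the instance reaches the RUNNING state before querying.

        // Any single instance can now iterate the complete dataset locally.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("result-global-store",
                        QueryableStoreTypes.<String, String>keyValueStore()));
        try (KeyValueIterator<String, String> iterator = store.all()) {
            while (iterator.hasNext()) {
                KeyValue<String, String> entry = iterator.next();
                // append entry.key / entry.value to the report
            }
        }

        streams.close();
    }
}

The trade-off is the one mentioned above: every instance holds (and keeps up to date) a full copy of the dataset.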

For extensive Read and write operation MongoDB vs Cassandra

I have used MongoDB but am new to Cassandra. I have worked on applications that use MongoDB and are not very large, and their read and write operations are not very intensive. MongoDB worked well for me in that scenario. Now I am building a new application (with some features like Stack Overflow: voting, total views, suggestions, comments, etc.) with lots of concurrent write operations on the same item in the database (in the future!). According to the information I gathered online, MongoDB is not the best choice for that (but Cassandra is). The problem I am finding with Cassandra is picking the right data model:
Construct models around your queries. Not around relations and objects.
I also looked at the option of using Mongo + Redis. Is it efficient to update the Mongo database first and then update Redis for multiple write requests to the same data item?
I want to verify which one will best solve this problem: Mongo + Redis, or Cassandra?
Any help would be highly appreciated.
Picking a database is very subjective. I'd say that modern MongoDB 3.2+ using the new WiredTiger Storage Engine handles concurrency pretty well.
When selecting a distributed NoSQL (or SQL) datastore, you can generally only pick two of these three:
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
This is called the CAP Theorem.
MongoDB has C and P, Cassandra has A and P. Cassandra is also a Column-Oriented Database, and will take a bit of a different approach to storing and retrieving data than, say, MongoDB does (which is a Document-Oriented Database). The reality is that either database should be able to scale to your needs easily. I would worry about how well the data storage and retrieval semantics fit your application's data model, and how useful the features provided are.
Deciding which database is best for your app is highly subjective, and borders on an "opinion-based question" on Stack Overflow.
Using Redis as an LRU cache is definitely a component of an effective scaling strategy. The typical model is, when reading cacheable data, to first check if the data exists in the cache (Redis), and if it does not, to query it from the database, store the result in the cache, and return it. While maybe appropriate in some cases, it's not common to just write everything to both Redis and the database. You need to figure out what's cacheable and how long each cached item should live, and either cache it at read time as I explained above, or at write time.
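As a rough illustration of that cache-aside pattern, here is a hedged sketch using the Jedis client; the key format, the TTL value and the loadFromDatabase placeholder are assumptions of this example.

import redis.clients.jedis.Jedis;

public class CacheAsideRepository {

    private final Jedis jedis = new Jedis("localhost", 6379);
    private static final int TTL_SECONDS = 300; // how long each cached item should live

    public String findItem(String itemId) {
        String cacheKey = "item:" + itemId;

        // 1. Check the cache first.
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached;
        }

        // 2. On a miss, query the database...
        String fromDb = loadFromDatabase(itemId);

        // 3. ...store the result in the cache with an expiry, and return it.
        if (fromDb != null) {
            jedis.setex(cacheKey, TTL_SECONDS, fromDb);
        }
        return fromDb;
    }

    private String loadFromDatabase(String itemId) {
        // Placeholder for the MongoDB (or other database) query.
        throw new UnsupportedOperationException("query the database for " + itemId);
    }
}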
It really depends on what your application is for. For write-intensive apps it is way better to go with Cassandra.

Using MongoDB to store immutable data?

We are investigating options to store and read a lot of immutable data (events), and I'd like some feedback on whether MongoDB would be a good fit.
Requirements:
We'll need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB. Would it be fine to store all of these events in the same collection?
A really important requirement is that we need to be able to replay all events in order. I've read here that MongoDB has a limit of 32 MB when sorting documents using cursors. For us it would be fine to read all data in insertion order (like a table scan), so an explicit sort might not be necessary. Are cursors the way to go, and would they be able to fulfill this requirement?
If MongoDB would be a good fit for this, is there some configuration or setting one can tune to increase performance or reliability for immutable data?
This is very similar to storing logs: lots of writes, and the data is read back in order. Luckily the Mongo Site has a recipe for this:
https://docs.mongodb.org/ecosystem/use-cases/storing-log-data/
Regarding immutability of the data, that's not a problem for MongoDB.
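As a rough sketch of the append-and-replay side with the MongoDB Java driver (the collection name, the explicit seq field and its index are assumptions of this example, not something the linked recipe mandates):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class EventLog {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = client.getDatabase("eventstore")
                    .getCollection("events");

            // Append an event; ~1 KB documents are fine in a single collection.
            events.insertOne(new Document("seq", 1L)
                    .append("type", "OrderPlaced")
                    .append("payload", "{}"));

            // Replay all events in order. An index on "seq" lets the sort use the
            // index instead of sorting in memory, which sidesteps the in-memory sort limit.
            events.createIndex(new Document("seq", 1));
            try (MongoCursor<Document> cursor =
                         events.find().sort(Sorts.ascending("seq")).iterator()) {
                while (cursor.hasNext()) {
                    Document event = cursor.next();
                    // apply the event to rebuild state
                }
            }
        }
    }
}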
Edit 2022-02-19:
Replacement link:
https://web.archive.org/web/20150917095005/docs.mongodb.org/ecosystem/use-cases/storing-log-data/
Snippet of content from the page:
This document outlines the basic patterns and principles for using MongoDB as a persistent storage engine for log data from servers and other machine data.
Problem
Servers generate a large number of events (i.e. logging) that contain useful information about their operation, including errors, warnings, and user behavior. By default, most servers store these data in plain-text log files on their local file systems.
While plain-text logs are accessible and human-readable, they are difficult to use, reference, and analyze without holistic systems for aggregating and storing these data.
Solution
The solution described below assumes that each server generates events, also consumes event data, and that each server can access the MongoDB instance. Furthermore, this design assumes that the query rate for this logging data is substantially lower than is common for logging applications with a high-bandwidth event stream.
NOTE
This case assumes that you're using a standard uncapped collection for this event data, unless otherwise noted. See the section on capped collections.
Schema Design
The schema for storing log data in MongoDB depends on the format of the event data that you're storing. For a simple example, consider standard request logs in the combined format from the Apache HTTP Server. A line from these logs may resemble the following: