ksqlDB streams flow visualisation - apache-kafka

ksqlDB allows stream joins, which is quite handy. As the queries get more complicated over time, it would be useful to see how the data flow is designed in a visual manner.
My question is: is there any existing tooling that can visualise the current message flow defined by the ksqlDB queries? Even a visualisation of the underlying Kafka Streams topology would be a good start.

Related

Distributed database which allows custom CRDT merging

I'm rather new to distributed databases, though I have already studied related literature (e.g. CAP theorem, CRDT) and implemented a proof of concept to allow scaling my application horizontally.
Now, however, I face a challenging problem. In order to scale the app horizontally, communication between services is done via a distributed queue. As background, I require a custom CRDT method to keep the data eventually consistent, and I require my application to work like a cache (remotely related to Redis).
The challenge now is that I also need to persist the data. That requires me to keep the data in the application cache and in the database eventually consistent. I've checked Cassandra and found a ticket [1] where somebody tried to add support for custom CRDT merge functions (which, as mentioned, I require for a reason). That never made it into Cassandra and seems to have a few unresolved issues.
What are my options, either in the form of a concrete distributed database engine that allows custom merging, or an algorithm that could help solve the problem (e.g. in the form of a DB trigger or something similar)?
[1] https://issues.apache.org/jira/browse/CASSANDRA-6412
As far as I know, very few databases let you specify your own conflict resolution algorithm. To be honest, the only one I found (disclaimer: I'm not a Microsoft advocate) is Azure Cosmos DB. It has a MongoDB-compatible API and can be configured to use a master-master replication strategy, where you need to specify your own conflict resolution algorithm (using JavaScript). You can use it to define your own merge operation.
If you look beyond database-native solutions to application-level ones, there are several tools, e.g. Akka (available for both the JVM and .NET), which lets you write custom CRDTs in its distributed-data module. The JVM version additionally supports multi-datacenter persistence, which is conceptually closer to how commutative CRDTs work and can be integrated with a Cassandra backend.
I've implemented a MerkleClock CRDT in my merkle-crdt repository.
One approach: when you update a database record's column, fetch the column's current value, merge it with the CRDT holding your current state, then serialise the merged CRDT as JSON and store it back in the database.
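A minimal sketch of that read-merge-write cycle, under stated assumptions: the CRDT is a simple grow-only counter (not the MerkleClock from the repository mentioned above), SQLite stands in for the real database, and the names `merge_gcounter`, `save_merged` and the `crdt` table are hypothetical. With concurrent writers, a real database would also need a transaction or compare-and-set around the read and the write to avoid lost updates.

```python
import json
import sqlite3


def merge_gcounter(a, b):
    """Merge two grow-only counters by taking the per-node maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def save_merged(conn, key, local_state):
    """Fetch the stored CRDT, merge it with the local state, store the result as JSON."""
    row = conn.execute("SELECT state FROM crdt WHERE key = ?", (key,)).fetchone()
    stored = json.loads(row[0]) if row else {}
    merged = merge_gcounter(stored, local_state)
    with conn:  # commits the write on success, rolls it back on error
        conn.execute(
            "INSERT OR REPLACE INTO crdt (key, state) VALUES (?, ?)",
            (key, json.dumps(merged, sort_keys=True)),
        )
    return merged


# SQLite in memory is only a stand-in for the persistent store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crdt (key TEXT PRIMARY KEY, state TEXT NOT NULL)")

save_merged(conn, "page_views", {"node_a": 3, "node_b": 1})
result = save_merged(conn, "page_views", {"node_b": 4, "node_c": 2})
total = sum(result.values())
print(sorted(result.items()), total)
```

Because the merge takes the per-node maximum, applying the same update twice or in a different order gives the same stored state, which is the property that makes the stored column eventually consistent with the cache.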

What is the equivalent of a Kafka table on Azure Service Bus?

Kafka has the concept of streams and tables. Streams represent events as they happen, whereas tables represent the current state.
Is there a corresponding entity among the Azure resources? I looked at Event Hubs, Event Grid, and Service Bus queues and topics, but can't seem to find an equivalent.
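To make the distinction concrete, here is a small plain-Python illustration (not Kafka client code): a table can be seen as the result of folding a stream of keyed events into the latest value per key. The function name `materialize` and the sample data are made up for illustration, and treating a `None` value as a delete mirrors how null-valued records act as tombstones in Kafka.

```python
def materialize(events):
    """Fold a stream of (key, value) events into a table of the latest value per key."""
    table = {}
    for key, value in events:
        if value is None:
            table.pop(key, None)  # a None value is treated as a delete (tombstone)
        else:
            table[key] = value
    return table


# The stream keeps every event; the table keeps only the current state.
stream = [("alice", "Berlin"), ("bob", "Lima"), ("alice", "Rome"), ("bob", None)]
table = materialize(stream)
print(table)  # {'alice': 'Rome'}
```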
As far as I know there is no equivalent.
I'd say the service that comes closest to Kafka is Azure Event Hubs, especially for real-time event processing, which Azure Service Bus is not designed for. Still, it does not provide the same features; e.g. there are no materialized views (or at least I have not found anything about them so far), which are what Kafka tables correspond to. There is an article on Kafka integration with Azure Event Hubs that also explains the equivalents and the unsupported features:
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
Whenever I come across an article from Microsoft or a blog post about event streaming (or event sourcing) and materialized views, the materialized view part is handled by an Azure storage service such as Table Storage, Cosmos DB or similar, in combination with Azure Event Hubs or even Service Bus.
So I am afraid there is no comparable cloud managed service on Azure as a direct substitute to Kafka at the moment. Would be glad to see another answer that proves me wrong as I would also be interested in that myself.
During my search I came across some interesting GitHub projects that try to address similar problems on the Azure and .NET stack:
https://github.com/Lokad/AzureEventStore
https://github.com/AzureCosmosDB/solution-accelerator
https://github.com/dannormington/azure-cqrs

Possible to search all topics data in Kafka?

I need a solution, preferably something built in (rather than creating my own application), that would help management search through multiple or all topics in Kafka. We are using Confluent Platform. Basically, a user should be able to enter a keyword in a UI, and it should search the current log of multiple or all Kafka topics and return the matching data. All the topics in our environment use JSON to communicate.
This search would let us track a flow. For example, multiple microservices send data from one system to another, and this flow can be tracked via a correlation id that is present in all the JSON messages. So if someone searches for this correlation id, they should see the messages involved in the flow. This search would have more use cases later on.
We need a solution which would have minimal coding involved. We would prefer to use a UI like Kibana.
From some basic reading I suspect the solutions below might work, but I am not sure, as I am new to Confluent (I used open-source Apache Kafka earlier):
Sol 1: Use ksqlDB. (I need more help on how to use it.)
Sol 2: Stream all topic data to Elasticsearch using Kafka Connect with the built-in plugin, and use Kibana on top of Elasticsearch.
Kindly help me find the best alternative.
You could use Elastic, sure.
You could also use Splunk, though.
There is also the pdk tool offered by Pilosa that creates a distributed index over Kafka events. (no affiliation)
Another option would be distributed tracing using interceptors between clients rather than searching "on all topics"; that sounds like what you actually need.

Using Apache Kafka as an alternative to FTP

I'm new to open-source technology.
I want to know whether we can use Apache Kafka as an alternative to our regular FTP setup, where we keep files at a certain location from which the end user accesses them.
The source of my files will mainly be SAP HANA. From there I want to push files into Kafka, from which the end user will be able to consume them.
Can someone suggest where to start, or list the steps to achieve this?
Kafka is not a 1:1 replacement, no.
Can you use Apache Kafka for streaming data integration between systems, in a more scalable and less brittle way than FTP? Sure. Can you just switch one out for the other? No.
Have a look at these resources to understand more about what Kafka can be used for and how to use it:
http://go.rmoff.net/devoxx18-embrace-the-anarchy
http://go.rmoff.net/devoxx18-build-streaming-pipeline
http://rmoff.dev/ksny19-no-more-silos
Kafka is typically not meant for large payloads such as files. I suppose you would want to do some operations on those files. One way to do this is to pass references to those files to a Kafka topic and let your consumers read the data from the files using those references.
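A rough sketch of that idea (often called the claim-check pattern), with loud assumptions: `queue.Queue` stands in for a Kafka topic, a temporary directory stands in for storage that both producer and consumer can reach, and the function names and message fields are hypothetical. A real setup would use a Kafka client and shared storage instead.

```python
import json
import queue
import tempfile
from pathlib import Path

topic = queue.Queue()  # stand-in for a Kafka topic; not a real Kafka client


def publish_file_reference(path):
    """Producer side: send a small JSON message that points at the file."""
    message = {"path": str(path), "size_bytes": path.stat().st_size}
    topic.put(json.dumps(message).encode("utf-8"))


def consume_file():
    """Consumer side: read the reference, then load the file it points to."""
    message = json.loads(topic.get().decode("utf-8"))
    return Path(message["path"]).read_bytes()


# The temporary directory plays the role of storage shared by both sides.
with tempfile.TemporaryDirectory() as shared_dir:
    export = Path(shared_dir) / "export.csv"
    export.write_bytes(b"id,amount\n1,9.99\n")
    publish_file_reference(export)
    content = consume_file()

print(content.decode("utf-8"))
```

Only the small reference travels through the topic, so message size stays independent of file size; the trade-off is that consumers need access to wherever the files live.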
I don't know much about SAP HANA, but you may be interested in the SAP HANA Connector for Kafka.

NoSQL Database for Blog / Content Management System? (MongoDB / Cassandra)

My company has been using Oracle for a long time, but we would like to find a NoSQL database as a replacement, for faster querying and flexible schema design.
I have tried MongoDB, which is probably the most popular NoSQL database nowadays. I connected it via Spring Data to run some simple queries; it was quite easy to set up and code against. Since we use Spring MVC for web development, Spring Data seems well suited for integration.
However, I have heard that Cassandra has better write and read performance, especially in large-scale systems. I am not sure whether it is worth moving to Cassandra, or how to compare the performance of MongoDB and Cassandra.
Here are some requirements for my system:
focusing on article fetching
tagging of articles so users can easily search for their favourites or related articles
non-distributed system, but have load-balancing and fail-over
Java based, Spring MVC for web development
articles would be stored as XML
probably provide user-defined tables (collections) and fields (keys)
Therefore I would like to raise some questions:
Which Database is the most suitable for my case? You may also raise other databases apart from MongoDB and Cassandra.
If I use Cassandra, which framework would be suitable for integrating to Spring MVC?
Thank you so much in advance.
I have experience using Spring and Cassandra together, but I have always written my own data access layer.
Using the ORMs out there for Cassandra will not allow you to leverage its full power, and you will, most likely, introduce bugs because your SQL background will make you expect certain behaviours that are just not what Cassandra will give you.
My advice: write the code that accesses Cassandra yourself and do not be afraid to denormalize A LOT. Think more about how you want to query (or find) your data than about the format in which you want to save it.
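As a rough illustration of designing around queries, here is a plain-Python sketch in which two dictionaries stand in for two Cassandra tables, one per query pattern. The names `articles_by_id`, `articles_by_tag` and `save_article` are hypothetical, and this is not Cassandra driver code; it only shows the shape of the denormalization.

```python
from collections import defaultdict

articles_by_id = {}                  # serves "fetch one article by id"
articles_by_tag = defaultdict(list)  # serves "list articles for a tag"


def save_article(article_id, title, body, tags):
    """Write the article once per query pattern instead of normalizing it."""
    articles_by_id[article_id] = {"title": title, "body": body, "tags": tags}
    for tag in tags:
        # Duplicate the fields the tag listing needs, so no join is required.
        articles_by_tag[tag].append({"article_id": article_id, "title": title})


save_article(1, "Intro to CRDTs", "<article>...</article>", ["databases", "crdt"])
save_article(2, "Kafka tables", "<article>...</article>", ["databases", "kafka"])

titles = [row["title"] for row in articles_by_tag["databases"]]
print(titles)  # ['Intro to CRDTs', 'Kafka tables']
```

Each read hits exactly one structure; the cost is that every write has to update all of them.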
I also strongly recommend reading this amazing article: Cassandra Data Modeling Best Practices part 1 part 2
Another DB which might suit your application better is CouchDB (I like using BigCouch). It is another document-based NoSQL database and is, in my opinion, superior to MongoDB. It offers a better solution for scaling and emphasises availability (just like Cassandra).
I'd like to point you to this question about the difference between CouchDB and MongoDB.
As far as frameworks go, the Play framework has a lot of plugins for working with NoSQL systems, so you might give it a try. You could try playorm, which is the last one I experimented with.
EDIT: I forgot to mention Kundera, which is also an ORM for Cassandra.
Choosing between Cassandra and MongoDB depends on the type of storage you need. MongoDB is primarily for document-based storage, where you get an edge from its various SQL-like features.
If you require a wide-column database with high availability and multi-DC replication, go for Cassandra.
http://db-engines.com/en/system/Cassandra%3BHBase%3BMongoDB