ksqlDB recommendations for deploying a large set of queries - apache-kafka

I am running a ksqlDB streaming application that consists of a large number of queries (>60 queries), including many joins and aggregations. My data comes from various sources, and requires plenty of manipulation to produce the desired processed data, hence the large number of queries. I've run this set of queries on a single machine, using interactive mode, and it produces the right results. But I observe an increasing consumer lag when I increase the amount of data fed into the application.
I read on ksqlDB's Capacity Planning page that I can scale by adding more servers, which is what I plan to do.
Under Important Sizing Factors, it's also stated that "You should avoid running a large number of queries on one ksqlDB cluster. Instead, use interactive mode to play with your data and develop sets of queries that function together. Then, run these in their own headless cluster." However, I am unsure how to do this, since my queries are all dependent on each other.
Does anyone have any general recommendations on how to deploy a large number of interdependent ksql queries? As an added requirement, the data is refreshed each day and is independent for each new day, so I need to do some sort of refresh of the queries each day.

I think that recommendation applies when you can group the queries that depend on each other and then run each group on its own headless-mode servers.
Another way, if you use interactive mode, is to partition your topics and add more ksql servers to your cluster. This allows ksql to split the workload across the cluster, with each server consuming and processing its share of the partitions. Say you have 4 partitions per topic and 2 servers: one server will process 2 partitions and the other server the other 2. This should decrease the workload on each server.
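To make the 4-partitions/2-servers example concrete, here is a minimal sketch of how partitions could be spread over servers. It uses a plain round-robin rule and hypothetical server names; the real assignment is done by Kafka's consumer group protocol and may pick different partitions for each server, so treat this only as an illustration of the split.

```python
def assign_partitions(num_partitions: int, servers: list[str]) -> dict[str, list[int]]:
    """Spread partition ids over servers round-robin (illustrative only)."""
    assignment: dict[str, list[int]] = {server: [] for server in servers}
    for partition in range(num_partitions):
        owner = servers[partition % len(servers)]
        assignment[owner].append(partition)
    return assignment


# Hypothetical server names; 4 partitions and 2 servers as in the answer.
two_servers = assign_partitions(4, ["ksql-server-1", "ksql-server-2"])
print(two_servers)

# With more servers than partitions, the extra servers receive nothing,
# so the partition count caps how far adding servers can help.
six_servers = assign_partitions(4, [f"ksql-server-{i}" for i in range(1, 7)])
idle = [name for name, parts in six_servers.items() if not parts]
print(idle)
```

The second call shows why the topics need enough partitions before adding servers: a server without an assigned partition has no work to do for that topic.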
Another improvement is to reduce the number of Kafka Streams threads. Each query you create runs with 4 Kafka Streams threads by default. More threads mean more parallel work on the server, but with a large number of queries the threads compete for the same resources, so performance decreases and lag increases. Try 1 thread and see if that works. Set ksql.streams.num.stream.threads=1 in ksql-server.properties to configure it.
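As a rough back-of-the-envelope check, the thread count on a single server can be estimated as below. This is only a sketch: it takes the figure of 4 threads per query from this answer and the "more than 60 queries" from the question, and the function name is made up for the example.

```python
def total_stream_threads(num_queries: int, threads_per_query: int) -> int:
    """Estimate stream threads on one server if every query runs there."""
    return num_queries * threads_per_query


QUERIES = 60  # the question mentions more than 60 queries

default_total = total_stream_threads(QUERIES, 4)  # threads per query as stated above
reduced_total = total_stream_threads(QUERIES, 1)  # with num.stream.threads set to 1

print(f"default: {default_total} threads, reduced: {reduced_total} threads")
```

On a machine with a handful of cores, a couple of hundred threads mostly wait on each other, which is why trying 1 thread per query is worth a test.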


Do cross-partition queries break infinite CosmosDB horizontal scalability?

As I understand, when you perform a query that doesn't filter by one primary key, you perform a cross-partition query. For this to be executed, the query is sent to all physical partitions of your CDB collection, executed in parallel in each of them, and then returned.
As you scale to tens of thousands of requests per second, that means that each of the tens of thousands of requests is executed on each physical partition.
Does this mean that eventually each partition will reach the limit of requests per second it can serve, and horizontal scaling will no longer give any benefit? Because every new physical partition CDB adds will need to serve all incoming requests, so it's not adding new throughput capacity, only storage.
The downstream implication being that even if at a small scale you're ok with incurring the increased RU cost for cross-partition queries, to truly be able to scale indefinitely your data model should ensure queries hit only one partition (possibly by denormalizing it).
Yes, cross-partition queries will not allow a database like Cosmos DB (or any horizontally scalable database) to scale.
Databases like Cosmos DB provide unlimited scale because they scale horizontally. The objective of your partition strategy should be to answer your high-volume queries with one partition or, at a minimum, a bounded set of partitions. The effort around partition strategy is to choose a property that is nearly always passed in queries. Denormalization is generally more a function of modeling data around requests. It has less to do with partitioning directly.
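The scaling argument can be sketched with some simple arithmetic. The model below is an assumption made for illustration, not Cosmos DB's actual behavior in detail: it assumes single-partition queries are spread evenly over the physical partitions and that a cross-partition query is executed on every partition.

```python
def per_partition_rps(total_rps: float, partitions: int, cross_partition: bool) -> float:
    """Requests per second each physical partition has to serve (simplified model)."""
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    if cross_partition:
        # Fan-out: every partition executes every query.
        return float(total_rps)
    # Single-partition queries, assumed evenly distributed over partitions.
    return total_rps / partitions


for partitions in (10, 20, 40):
    single = per_partition_rps(10_000, partitions, cross_partition=False)
    fan_out = per_partition_rps(10_000, partitions, cross_partition=True)
    print(f"{partitions} partitions: single-partition={single:.0f} rps, "
          f"cross-partition={fan_out:.0f} rps")
```

In this model, adding partitions lowers the per-partition load only for queries that target one partition; the fan-out load stays constant no matter how many partitions exist.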
If you would like to learn more about partitioning and modeling with Cosmos DB, I highly recommend watching this video, which presents the topics very well: Data modeling & partitioning: What every relational database dev needs to know

Do partitions increase performance when only using one computer/node?

I know that partitions will boost the performance by doing parallel tasks on different nodes in a cluster. But will partitions help me get better performance when I am only using one single computer? I am using Spark and Scala.
Yes, it will increase performance.
Make sure your CPU has more than one core.
When you create your local SparkSession, make sure to use multiple cores:
use local to run locally with one thread, or local[N] to run locally with N threads. I suggest using local[*].
Also make sure your RDD/Dataset has multiple partitions. A good number of partitions is 2 to 4 times the number of cores.
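Here is a small sketch of the two rules of thumb above, in plain Python so it runs without Spark installed. The helper names are made up for this example; in a real job the master string would go to SparkSession.builder.master(...) and the partition count to repartition(...).

```python
import os
from typing import Optional


def local_master(threads: Optional[int] = None) -> str:
    """Build a local master URL: local[*] for all cores, local[N] for N threads."""
    return "local[*]" if threads is None else f"local[{threads}]"


def suggested_partitions(cores: int, factor: int = 3) -> int:
    """Apply the '2 to 4 times the number of cores' rule of thumb."""
    if not 2 <= factor <= 4:
        raise ValueError("factor should be between 2 and 4")
    return cores * factor


cores = os.cpu_count() or 1  # os.cpu_count() can return None
print(local_master())  # use every core on this machine
print(f"{cores} cores -> between {suggested_partitions(cores, 2)} "
      f"and {suggested_partitions(cores, 4)} partitions")
```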
Apache Spark scales both vertically (CPU, RAM, ...) and horizontally (nodes). I assume that your computer/node has a CPU with more than one core. The partitions are then processed in parallel.

Apache Nifi : Oracle To Mongodb data transfer

I want to transfer data from Oracle to MongoDB using Apache NiFi. Oracle has a total of 9 million records.
I have created a NiFi flow using the QueryDatabaseTable and PutMongoRecord processors. This flow is working fine but has some performance issues.
After starting the nifi flow, records in the queue for SplitJson -> PutMongoRecord are increasing.
Is there any way to slow down the rate at which the SplitJson processor puts records into the queue?
Or to increase the rate of insertion in PutMongoRecord?
Right now, 100k records are inserted in 30 minutes. How can I speed up this process?
@Vishal, the solution you are looking for is to increase the concurrency of PutMongoRecord.
You can also experiment with the batch size in the configuration tab.
You can also reduce the execution time of SplitJson. However, remember that this processor is going to take 1 flowfile and make a lot of flowfiles regardless of the timing.
How much you can increase concurrency is going to depend on how many nifi nodes you have, and how many CPU Cores each node has. Be experimental and methodical here. Move up in single increments (1-2-3-etc) and test your file in each increment. If you only have 1 node, you may not be able to tune the flow to your performance expectations. Tune the flow instead for stability and as fast as you can get it. Then consider scaling.
How much you can increase concurrency and batch size is also going to depend on the MongoDB data source and the total number of connections you can get from NiFi to Mongo.
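To get a feel for the numbers in the question (100k records in 30 minutes, 9 million records in total), here is a quick estimate. The speedup parameter is purely hypothetical: it assumes throughput scales linearly with the number of concurrent tasks, which is optimistic, so measure each increment as described above rather than trusting it.

```python
def estimated_hours(total_records: int, inserted: int, window_minutes: float,
                    speedup: float = 1.0) -> float:
    """Hours needed to load total_records at the observed rate times speedup."""
    minutes = total_records * window_minutes / inserted / speedup
    return minutes / 60


TOTAL = 9_000_000   # records in Oracle, from the question
OBSERVED = 100_000  # records inserted ...
WINDOW = 30         # ... in this many minutes

baseline = estimated_hours(TOTAL, OBSERVED, WINDOW)
print(f"current rate: about {baseline:.1f} hours")

for tasks in (2, 4, 8):
    hours = estimated_hours(TOTAL, OBSERVED, WINDOW, speedup=tasks)
    print(f"{tasks} concurrent tasks (if scaling were linear): about {hours:.1f} hours")
```

At the current rate the full load takes roughly 45 hours, so even a modest gain from concurrency or batch size saves many hours.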
In addition to Steven's answer, there are two properties on QueryDatabaseTable that you should experiment with:
Max Results Per Flowfile
Use Avro logical types
With the latter, you might be able to do a direct shift from Oracle to MongoDB because it'll convert Oracle date types into Avro ones, and those should in turn be converted directly into proper Mongo date types. Max Results Per Flowfile should also allow you to specify appropriate batching without having to use the extra processors.

Can Triggers be used in Cassandra for production for a multi datacenter environment?

I have a multi-datacenter (DC1, DC2) environment with 3 nodes in each datacenter and RF=3 per datacenter.
Wanted to know if triggers can be used in production in a multi-datacenter environment. If so, how can this be achieved?
Case A: If I start inserting the data into DC1, it would have 3 replicas within DC1, and DC1 is responsible for replicating the data to the other datacenter, DC2. Every time an insert into DC2 takes place, I would like a trigger event to occur and notify the application about the latest inserted value. Is it possible?
Case B: If not point 2, is it good to insert the data simultaneously into the two datacenters DC1 and DC2 (pointing to a single table) and avoid the triggers concept?
Will it have any impact on the network traffic? Based on the latest timestamp, the table would have the last insert to the table, which serves the purpose when queried from either of the regions.
Consistency level as LOCAL_QUORUM for Read
Consistency level as ONE for write
dse 4.8.2
With these consistency levels, good consistency can be achieved while lowering the latency of write operations across the datacenters.
We have an application (2 domains) for two different regions (DC1 &
DC2). Users of the DC1 region use domain 1 to access the application and
users of the DC2 region use domain 2 for the same. The data is ingested
into DC1 for the same region and, when this replicates in its DC, the
coordinator of DC1 replicates the data to the other DC (DC2). The
moment DC2 receives the data from DC1, we want to let the application
know about the latest information (Polling_ available using some
trigger event mechanism. Just wanted to know if this can be
implemented with cassandra triggers.
Can someone give feedback on Case A and Case B, and on which would be more efficient in production?
In either case stated above I am not sure why you want to use a trigger to notify your application that a value was inserted. In the scenario as I understand it your application already knows the newest value. Once the write has been successful you can notify your application with the newest value.
In both cases A and B you are working against some of the basic principles of how Cassandra functions. At an application level you should not need to worry about ensuring replication or eventual consistency of your data across multiple nodes and data centers. That is a large part of what Cassandra brings to the table.
In both Case A and B you are going to get multiple inserts of the same data for each write, on each node it is replicated to, in both data centers. As you write to DC1 it will also be written to DC2. If you then write to DC2 it will be written back to DC1. This will end with a large number of rows containing the same data and will increase disk requirements and compaction frequency. This will also increase network traffic as the two DCs talk back and forth to reach eventual consistency.
From what I can see here I also have to ask why you are doing an RF=3 on a 3 node cluster. This means that each node in each data center will have all the data essentially making each server a complete replica of the others. This seems like it may be overkill (depending on the data of course) as you are not going to get a lot of the scalability benefits that Cassandra offers.
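To illustrate the RF=3 on 3 nodes point, here is a rough calculation of how much of the data each node ends up holding. It is a simplified sketch that assumes data is spread evenly over the nodes of a datacenter; the function name is made up for the example.

```python
def data_share_per_node(nodes_in_dc: int, replication_factor: int) -> float:
    """Fraction of the full data set stored on each node of one datacenter."""
    if nodes_in_dc < 1 or replication_factor < 1:
        raise ValueError("both arguments must be >= 1")
    # A datacenter cannot hold more distinct replicas than it has nodes.
    copies = min(replication_factor, nodes_in_dc)
    return copies / nodes_in_dc


for nodes in (3, 6, 12):
    share = data_share_per_node(nodes, replication_factor=3)
    print(f"{nodes} nodes, RF=3: each node holds about {share:.0%} of the data")
```

With 3 nodes every node stores everything, so adding capacity only starts to pay off once the node count grows beyond the replication factor.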
Cassandra will handle the syncing of data between the data centers and across nodes so your application does not need to worry about this.
One other quick note: currently your writes are using CL=ONE. This means that you may end up with cross-DC latency on a write request. If you change this to LOCAL_ONE, the write is acknowledged as soon as one of the nodes in the local DC has written the value, instead of possibly waiting for a node in the other DC. Cassandra will still handle the replication and syncing of the data.
Generally, the multiple data center concept is used for workload separation (say, different DCs for real-time queries, analytics, and search). Cassandra by itself takes care of replicating the data across multiple DCs.
So, coming to your question, Case B doesn't seem to be the right option because:
Cassandra automatically replicates data across multiple DCs
Case A is feasible: alerts/notifications using triggers
Hope, it will be helpful.

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
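The trade-off between these three settings can be sketched as follows. The class below is hypothetical and is not Voldemort's API; the replication factor of 3 is an assumed value for illustration. The overlap check is the usual rule of thumb for this family of stores: a read is only guaranteed to see the latest successful write when required-reads plus required-writes exceeds the replication factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreSettings:
    """Hypothetical holder for the three settings described above."""
    replication_factor: int
    required_reads: int
    required_writes: int

    def tolerated_node_failures(self) -> int:
        # A replication factor of n tolerates up to n - 1 failures without data loss.
        return self.replication_factor - 1

    def reads_overlap_writes(self) -> bool:
        # True when every read set must share at least one node with every write set.
        return self.required_reads + self.required_writes > self.replication_factor


fast_local = StoreSettings(replication_factor=3, required_reads=1, required_writes=1)
quorum_style = StoreSettings(replication_factor=3, required_reads=2, required_writes=2)

for name, settings in (("fast_local", fast_local), ("quorum_style", quorum_style)):
    print(name, settings.tolerated_node_failures(), settings.reads_overlap_writes())
```

The fast_local settings match the suggestion above: reads and writes return after one node responds, and the missing overlap is exactly the risk of stale reads that was mentioned.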
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "Network Topology Strategy"). For example, you can specify that there should always be two replicas in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
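Timestamp-based "last write wins" resolution can be sketched like this. It is a minimal illustration, not Cassandra's implementation: the type and function names are made up, and breaking timestamp ties by comparing values is simply a choice made here so that the result does not depend on argument order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    """A value together with the client-supplied write timestamp."""
    value: str
    timestamp_micros: int


def resolve(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Keep the write with the newer timestamp; ties fall back to the value."""
    return max((a, b), key=lambda v: (v.timestamp_micros, v.value))


# Two writes to the same key arriving from different datacenters.
from_dc1 = VersionedValue("address=Berlin", timestamp_micros=1_000)
from_dc2 = VersionedValue("address=Tokyo", timestamp_micros=2_000)

winner = resolve(from_dc1, from_dc2)
print(winner.value)
```

Because the timestamps come from the clients, this only behaves sensibly when the client clocks are reasonably well synchronized.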
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).