Remote chunking with Spring Batch job distribution - spring-batch

I have a technical issue running my Spring Batch jobs.
The job simply reads records from the DB (MongoDB), makes some calculations on the records (aggregations), and writes the result to another table.
Reading A, processing A, writing the result to record B.
B is an aggregation of many A records.
I want to use remote chunking to scale my system so that the processing part is distributed and quick.
The problem I face is that I need to synchronize the A records so that processing them does not conflict when writing the result to B.
If I distribute 10 A records to 4 slaves, they will conflict when writing the aggregate result to B.
Any idea how to add a synchronization policy when sending messages from the master to the slaves?
Thanks in advance ...

If you need to synchronize data like you're describing, I'd recommend not going with remote chunking and using partitioning instead. This would allow you to partition by A and eliminate the synchronization issues you're facing. It would also provide additional throughput as you'd be running one processor per slave (same as in remote chunking).
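To make that concrete, here is a minimal sketch (not from the original answer) of a Partitioner that hands each worker one aggregation key, so every A record that contributes to the same B row is processed by the same worker. The class name, context key, and the idea of pre-loading the distinct keys from MongoDB are assumptions for the example:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Sketch only: one partition per aggregation key (the key of the target B row),
// so all A records that feed the same B row land on the same worker.
public class AggregationKeyPartitioner implements Partitioner {

    private final List<String> aggregationKeys; // e.g. distinct keys loaded from MongoDB up front

    public AggregationKeyPartitioner(List<String> aggregationKeys) {
        this.aggregationKeys = aggregationKeys;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String key : aggregationKeys) {
            ExecutionContext context = new ExecutionContext();
            context.putString("aggregationKey", key);
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}
```

Each worker step's reader would then filter A by #{stepExecutionContext['aggregationKey']}, which removes the need for any synchronization when writing B.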

Related

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of data, and sends it to a RESTful API, which is our BI platform. The Spring Batch job breaks this data down into chunks by date, not by size. It may happen that on a particular day one chunk consists of a million rows. We run the complete end-to-end flow in the following way:
Control-M sends a trigger to the Load Balancer at a scheduled time
Through the Load Balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads the data for that day in chunks from the Oracle database
Chunks are then sent to the target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage increases and at some point chunks get processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
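As a rough sketch of the wiring (not taken from the linked samples), assuming a datePartitioner bean that puts a distinct date range into each worker's execution context; names are placeholders:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedStepConfig {

    // Manager step: asks the partitioner for partitions (e.g. one date range each)
    // and runs one execution of the worker step per partition.
    @Bean
    public Step managerStep(StepBuilderFactory steps,
                            Partitioner datePartitioner, // assumed bean that puts startDate/endDate into each context
                            Step workerStep) {           // a regular chunk-oriented step (reader/processor/writer)
        return steps.get("managerStep")
                .partitioner("workerStep", datePartitioner)
                .step(workerStep)
                .gridSize(4) // number of partitions to create
                .taskExecutor(new SimpleAsyncTaskExecutor("partition-")) // local partitioning; use the remote-partitioning setup from the first link to fan out across instances
                .build();
    }
}
```

The worker step's reader would be declared @StepScope and bind #{stepExecutionContext['startDate']} / #{stepExecutionContext['endDate']}, so each worker reads a distinct, non-overlapping slice and no row is processed twice.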

Apache NiFi: Oracle to MongoDB data transfer

I want to transfer data from Oracle to MongoDB using Apache NiFi. Oracle has a total of 9 million records.
I have created a NiFi flow using the QueryDatabaseTable and PutMongoRecord processors. This flow is working fine but has some performance issues.
After starting the NiFi flow, records in the queue between SplitJson -> PutMongoRecord keep increasing.
Is there any way to slow down the rate at which the SplitJson processor puts records into the queue?
OR
Increase the rate of insertion in PutMongoRecord?
Right now, 100k records are inserted in 30 minutes. How can I speed this process up?
@Vishal: the solution you are looking for is to increase the concurrency of PutMongoRecord.
You can also experiment with the batch size in the configuration tab.
You can also reduce the execution time of SplitJson. However, you should remember that this processor is going to take one flowfile and make a lot of flowfiles regardless of the timing.
How much you can increase concurrency is going to depend on how many NiFi nodes you have and how many CPU cores each node has. Be experimental and methodical here. Move up in single increments (1, 2, 3, etc.) and test your file at each increment. If you only have one node, you may not be able to tune the flow to your performance expectations. In that case, tune the flow for stability and as fast as you can get it, then consider scaling.
How much you can increase concurrency and batch size is also going to depend on the MongoDB data source and the total number of connections you can get from NiFi to Mongo.
In addition to Steven's answer, there are two properties on QueryDatabaseTable that you should experiment with:
Max Results Per Flowfile
Use Avro logical types
With the latter, you might be able to do a direct shift from Oracle to MongoDB because it'll convert Oracle date types into Avro ones, and those should in turn be converted directly into proper Mongo date types. Max Results Per Flowfile should also allow you to specify appropriate batching without having to use the extra processors.

Spring Batch - Combining remote partitioning with remote chunking

I'm trying to see if I can design a job where I need both partitioning and remote chunking. We could have something like this: Table A holds rows (one of the columns in Table A will be the partition key), and for every row in Table A, Table B contains many child records for that foreign/partition key. We would need to run a query that filters the partition keys from Table A, and for every partition key, process all the child records in Table B (here again we would have several million records in Table B, so we would need parallelism for record processing and hence remote chunking).
What would be the right way to think through the Spring Batch job design for something like that?
"so we would need parallelism for record processing and hence remote chunking"
Not necessarily. Nothing prevents you from using remote chunking in the workers of a partitioned step, but IMO this would complicate things.
A simpler approach is to use multiple jobs. Each job would handle a different partition and process items in parallel using a multi-threaded step. In other words, the partition key is a job parameter here. This approach has the following advantages:
Easier to scale, since you have parallelism at two levels:
run multiple jobs in parallel using multiple JVMs (either on the same machine or on different machines)
and within each JVM, use multiple threads to process items in parallel.
Easier to implement: remote partitioning and remote chunking are not the easiest setups to configure. Running multiple jobs, where each one reads its items with select * from TableA where partitionKey = ? and uses a multi-threaded step (this requires a single line of code, adding a task executor: .taskExecutor(taskExecutor)), is much easier; see the sketch below.
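A minimal sketch of that suggestion, assuming the hypothetical TableA/partitionKey schema from the answer and Spring Batch Java configuration (bean, column, and type names are placeholders, not from the original answer):

```java
import java.util.Collections;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class PerPartitionJobConfig {

    // The partition key arrives as a job parameter, so each job instance
    // only ever sees the rows of "its" partition of TableA.
    @Bean
    @StepScope
    public JdbcPagingItemReader<Map<String, Object>> tableAReader(
            DataSource dataSource,
            @Value("#{jobParameters['partitionKey']}") String partitionKey) {
        return new JdbcPagingItemReaderBuilder<Map<String, Object>>()
                .name("tableAReader")
                .dataSource(dataSource)
                .selectClause("select *")
                .fromClause("from TableA")
                .whereClause("where partitionKey = :pk")
                .parameterValues(Collections.<String, Object>singletonMap("pk", partitionKey))
                .sortKeys(Collections.singletonMap("id", Order.ASCENDING))
                .rowMapper(new ColumnMapRowMapper())
                .pageSize(500)
                .build();
    }

    // Multi-threaded step: the single .taskExecutor(...) line is what enables parallel
    // chunk processing inside this JVM. A paging reader is used because cursor-based
    // readers are not safe to share between threads.
    @Bean
    public Step processPartitionStep(StepBuilderFactory steps,
                                     JdbcPagingItemReader<Map<String, Object>> tableAReader) {
        return steps.get("processPartitionStep")
                .<Map<String, Object>, Map<String, Object>>chunk(100)
                .reader(tableAReader)
                // processor omitted for brevity; plug in the per-record calculations here
                .writer(items -> { /* write results, e.g. to the target table */ })
                .taskExecutor(new SimpleAsyncTaskExecutor("worker-"))
                .build();
    }
}
```

Each JVM would then launch this job with a different partitionKey job parameter, giving the two levels of parallelism described above.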

System architecture - ETL

We are in the process of designing an ETL process where we'll be getting a daily account file (maybe half a million records, could grow) from a client, and we'll be loading that file into our database.
Our current process splits the file into smaller files and loads them into staging... sometimes, or if the process fails, we try to figure out how many records we have processed and then start again from that point. Is there a better alternative to this problem?
We are thinking about using Kafka. I'm pretty new to Kafka. I would really appreciate some feedback on whether Kafka is the way to go or we're just over-engineering a simple ETL process where we just load the data into a staging table and finally into the destination table.
Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
Building real-time streaming data pipelines that reliably get data between systems or applications
Building real-time streaming applications that transform or react to the streams of data
https://kafka.apache.org/intro
If you encounter errors which make you check the last committed record in your staging database and need the system to manage this automatically, Kafka can help you ease the process.
Though Kafka is built to work with massive data loads and spread across a cluster, you certainly can use it for smaller problems and utilize its queuing functionality and offset management, even with one broker (server) and a low number of partitions (level of parallelism).
If you don't anticipate any scale at all, I would suggest you consider RabbitMQ.
RabbitMQ is message-queueing software, also known as a message broker or queue manager. Simply said, it is software where queues are defined, to which applications connect in order to transfer a message or messages.
https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html
“How to know if Apache Kafka is right for you” by Amit Rathi
https://link.medium.com/enGzNaNvT4
In case you choose Kafka:
When you receive a file, create a process which iterates over all of its lines and sends them to Kafka (Kafka producer).
Create another process which continuously receives events from Kafka (Kafka consumer) and writes them in mini-batches to the database (similar to your small files). A sketch of both processes follows the links below.
Setup Kafka:
https://dzone.com/articles/kafka-setup
Kafka Consumer/Producer simple example:
http://www.stackframelayout.com/programowanie/kafka-simple-producer-consumer-example/
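A minimal sketch of those two processes with the plain Kafka Java client (the topic name, file path, group id, and staging-table step are made-up placeholders):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.stream.Stream;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AccountFileKafkaSketch {

    // Producer: read the daily file line by line and publish each record.
    static void produce() throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Paths.get("daily-account-file.csv"))) {
            lines.forEach(line -> producer.send(new ProducerRecord<>("account-records", line)));
        }
    }

    // Consumer: poll in mini-batches, load each batch into the staging table,
    // then commit offsets so a restart resumes from the last committed record.
    static void consume() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "staging-loader");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("account-records"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // insert record.value() into the staging table (omitted here)
                }
                consumer.commitSync(); // offsets track how far the load has progressed
            }
        }
    }
}
```

With auto-commit turned off and commitSync() called after each mini-batch, a restart simply resumes from the last committed offset, which replaces the manual "figure out how many records we have processed" step.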
Don't assume importing data is as easy as dumping it in your database and having the computer handle all the processing work. As you've discovered, an automated load can have problems.
First, database ELT processes wear down the hard drive. Do not stage the data into one table prior to inserting it into its native table. Your process should only import the data once, into its native table, to protect hardware.
Second, you don't need third-party software to middle-man the work. You need control so you're not manually inspecting what was inserted. This means your process is to first clean / transform the data prior to import. You want to prevent all problems prior to load by cleaning and structuring and even processing the data. The load should only be an SQL insert script. I have torn apart many T-SQL scripts where someone thought it convenient to integrate processing with database commands. Don't do it.
Here's how I manage imports from spreadsheet reports. Excel formulas are better than learning ETL tools like SSIS. I use cell formulas to validate whether the record is valid to go into our system. This result is its own column, and then, if that column is true, a concatenation column displays an insert script.
=if(J1, concatenate("('", A1, "', ", B1, "),"), "")
If the column is false, the concat column shows nothing. This allows me to copy/paste the inserts into SSMS and conduct mass inserts via "insert into table values" scripts.
If this is actually updating existing records, as your comment appears to suggest, then you need to master the data, organizing what's changed in logs for your users.
Synchronization steps:
Log what is there before you update
Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
Log what you're updating with, and timestamp when you're updating it
Save and close the logs
Only when 1-4 are done should you post an update to production
My guide to synchronizing data sources and handling Creates/Updates/Deletes:
sync local files with server files

Can triggers be used in Cassandra in production in a multi-datacenter environment?

I have a multi-datacenter (DC1, DC2) environment with 3 nodes in each datacenter and RF=3 per datacenter.
I wanted to know if triggers can be used in production in a multi-datacenter environment. If so, how can this be achieved?
Case A: If I start inserting data into DC1, it would have 3 replicas within DC1 and would be responsible for replicating the data to the other datacenter, DC2. Every time an insert into DC2 takes place, I would like a trigger event to occur and notify the application about the latest inserted value. Is that possible?
Case B: If Case A is not possible, is it good to insert the data simultaneously into both datacenters DC1 and DC2 (pointing to a single table) and avoid the triggers concept?
Would this have any impact on network traffic? Based on the latest timestamp, the table would hold the last insert, which serves the purpose when queried from either region.
Consistency level LOCAL_QUORUM for reads
Consistency level ONE for writes
DSE 4.8.2
With these consistency levels, good consistency can be achieved while lowering the latency of write operations across the datacenters.
Use case:
We have an application (2 domains) for two different regions (DC1 & DC2). Users of the DC1 region use domain 1 to access the application and users of the DC2 region use domain 2 for the same. The data is ingested into DC1 for that region, and when this replicates in its DC, the coordinator of DC1 would replicate the data to the other DC (DC2). The moment DC2 receives the data from DC1, we want to let the application know about the latest information (polling) available, using some trigger event mechanism. I just wanted to know if this can be implemented with Cassandra triggers.
Can someone give feedback on Case A and Case B, and which would be more efficient in production?
Thanks
In either case stated above, I am not sure why you want to use a trigger to notify your application that a value was inserted. In the scenario as I understand it, your application already knows the newest value. Once the write has been successful, you can notify your application with the newest value.
In both cases A and B you are working against some of the basic principles of how Cassandra functions. At an application level you should not need to worry about ensuring replication or eventual consistency of your data across multiple nodes and data centers. That is a large part of what Cassandra brings to the table.
In both Case A and B you are going to get multiple inserts of the same data for each write, in each node it is replicated to, in both data centers. As you write to DC1 it will also be written to DC2. If you then write to DC2 it will be written back to DC1. This will end up with a large number of rows containing the same data and will increase disk requirements and compaction frequency. It will also increase network traffic as the two DCs talk back and forth to gain eventual consistency.
From what I can see here I also have to ask why you are doing RF=3 on a 3-node cluster. This means that each node in each data center will have all the data, essentially making each server a complete replica of the others. This seems like it may be overkill (depending on the data, of course), as you are not going to get a lot of the scalability benefits that Cassandra offers.
Cassandra will handle the syncing of data between the data centers and across nodes so your application does not need to worry about this.
One other quick note: currently your writes are using CL=ONE. This means that you may end up with cross-DC latency on a write request. If you change this to LOCAL_ONE, then your write only waits for one of the nodes in the local DC to acknowledge the value, instead of possibly a node in the other DC. Cassandra will still handle the replication and syncing of the data.
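For example, with the DataStax Java driver (3.x-era API; the contact point, keyspace, and table are made up here), the consistency level can be set per statement:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalConsistencyExample {

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("dc1-node1")   // contact point in the local DC
                .build();
             Session session = cluster.connect("my_keyspace")) {

            // Write: wait only for one replica in the local DC, avoiding cross-DC latency.
            SimpleStatement insert = new SimpleStatement(
                    "INSERT INTO my_table (id, value) VALUES (?, ?)", 1, "hello");
            insert.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
            session.execute(insert);

            // Read: quorum of replicas in the local DC only, matching the
            // LOCAL_QUORUM reads mentioned in the question.
            SimpleStatement select = new SimpleStatement(
                    "SELECT value FROM my_table WHERE id = ?", 1);
            select.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            ResultSet rs = session.execute(select);
            System.out.println(rs.one());
        }
    }
}
```

Either way, the driver only changes how many acknowledgements the coordinator waits for; Cassandra still replicates the data to the other DC in the background.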
Generally, the multi-datacenter concept is used for workload separation (say, different DCs for real-time queries, analytics, and search). Cassandra by itself takes care of replicating the data across multiple DCs.
So, coming to your question, Case B doesn't seem the right option because:
Cassandra automatically replicates data across multiple DCs (link)
Case A is feasible: alerts/notifications using triggers
Hope it will be helpful.