Increase Spring data ingest speed - spring-data-jpa

We have an IoT app that receives data on Kafka, processes it, and saves it in an RDBMS. The database we are using (MemSQL) supports more than 20000 inserts per second, but with my regular repository.save method I have been able to achieve only 50 inserts per second. I have written some simple code that I am testing on an AWS EC2 instance with high network speed:
void saveA() {
    for (int i = 0; i < 1000; i++) {
        A obj = new A();
        aRepository.save(obj);
    }
}
This takes 20 seconds to complete. I wish to achieve around 1000k inserts per sec. How do I increase this ingestion speed? Should I create a thread pool of size 1000 and call save from separate threads? In that case, do I need to care about properties like spring.datasource.tomcat.max-active to increase the number of connections in the pool? Would Spring Data automatically pick a separate connection from the pool for each thread?
I can't do batch inserts as I am reading data from Kafka one record at a time, and also because there could be some duplicate data that I need to catch as DataIntegrityViolationException and update.

You don't describe how complex the objects you are saving are, but it sounds like they are fairly simple, i.e. the ratio of inserts per save operation is close to 1, and you also don't seem to do many updates, if any at all.
If that is the case I'd recommend ditching JPA and going straight for JDBC (using the JdbcTemplate).
The reason is that JPA does a lot of things to make the typical JPA process work: load an entity graph, manipulate it, and flush it back to the database.
But you don't do that, so JPA might not help much, and it makes your life harder because you need to tune both JPA and JDBC.
Start with performing the inserts directly using JdbcTemplate.
The next step would be to perform batch inserts.
You write you can't do that but I don't see why you can't collect a couple of rows before writing them to the database.
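Collecting rows before writing them could look roughly like the sketch below. It is a minimal illustration, not the poster's code: BatchBuffer is a hypothetical helper, and the sink is a plain callback that in a real application would presumably call something like JdbcTemplate.batchUpdate. Handling of duplicate rows is left out and would have to happen inside the sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers rows and hands them to a sink in batches.
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // assumed to perform the actual batch write
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Add one row; flush automatically once the batch is full.
    synchronized void add(T row) {
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Write whatever is buffered; also call this on a timer and on shutdown.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        sink.accept(batch);
    }
}
```

Each record read from Kafka would be passed to add, and a periodic flush keeps rows from waiting too long when traffic is low.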

Related

What is the correct streaming pattern to replace database table polling?

I am trying to architect an event streaming system to replace our existing database table polling mechanism. We currently have a process where Application ABC will query/scan the entire XYZ (MySQL) table every 5 minutes so that we may get any updates to our data and cache them on Application ABC. As our data grows this will not be scalable or performant.
Instead, I want to have Application ABC read from a Kafka stream that contains any new events around the XYZ table, and use that to modify Application ABC's in-memory cache.
Where I'm having a hard time formulating a good solution is the initial database table load onto the Kafka stream. Since all the XYZ data that would be consumed by Application ABC is cached, we lose that data when we redeploy all of the Application ABC nodes. So we would need some kind of mechanism to be able to get all the XYZ data from the initial load onto the stream. I know Kafka streams are supposed to allow for infinite retention but I'm not sure if infinite retention is a realistic solution in this case due to cost.
What's the usually prescribed solution around this initial load case where Application ABC would need to reload the entire database again off of the stream (every time a new instance is spun up)? Also trying to think about what is the most performant solution here so that Application ABC has the lowest latency to be able to gather all the data it needs from XYZ Table.
Another constraint to mention is that Application ABC needs to have this data in memory for performance reasons. We need to be able to iterate over the entire XYZ data set at all times. We cannot do simple queries by ID.
There is a bit to unpack here, but here is some info.
Instead of polling the DB, consider using a source connector to get the data into Kafka. Debezium is made for this. You haven't specified what type of database you are using, but it does support quite a few variants. The mechanism is called CDC - Change Data Capture, and it needs to be enabled on the database and each of the tables first.
As for the Application ABC side - consider using a distributed cache with persistence enabled. Redis is a good option for this. This way it will retain the data even if your application is restarted. Reloading all the data back from Kafka is not a good idea: it will take a long time (depending on the amount of data), and the application will be unavailable for that duration after a restart.
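Whichever cache is used, applying change events to it follows the same pattern. The sketch below is a simplified illustration: ChangeEvent and XyzCache are hypothetical names, an in-process map stands in for the cache, and the single-letter operation codes are assumed to follow Debezium's convention (c, u, d, and r for snapshot reads). Real events carry more fields than shown here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified change event for one row of the XYZ table.
class ChangeEvent {
    final String op;  // assumed codes: "c" create, "u" update, "d" delete, "r" snapshot read
    final long id;    // primary key of the row
    final String row; // row payload; null for deletes

    ChangeEvent(String op, long id, String row) {
        this.op = op;
        this.id = id;
        this.row = row;
    }
}

// Hypothetical cache keyed by primary key; a map stands in for the real store.
class XyzCache {
    private final Map<Long, String> rows = new ConcurrentHashMap<>();

    // Apply one event from the stream to the cached copy of the table.
    void apply(ChangeEvent e) {
        switch (e.op) {
            case "c":
            case "u":
            case "r":
                rows.put(e.id, e.row);
                break;
            case "d":
                rows.remove(e.id);
                break;
            default:
                throw new IllegalArgumentException("unknown op: " + e.op);
        }
    }

    String get(long id) {
        return rows.get(id);
    }

    int size() {
        return rows.size();
    }
}
```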

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API, which is our BI platform. The Spring Batch job breaks this data down into chunks by date and not by size. It may happen that on a particular day one chunk consists of a million rows. We are running the complete end-to-end flow in the following way:
Control-M sends a trigger to Load Balancer at a scheduled time
Through Load Balancer request lands on to an instance of Spring Batch app
Spring Batch reads data for that day in chunks from Oracle database
Chunks are then sent to target API
My problems are:
The chunks can get heavy. If one contains a million rows, the instance's heap usage increases and at some point chunks are processed at a trickling pace
One instance bears the load of entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable and if yes then how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
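The core of a partitioned step is assigning each worker a distinct slice of the data. A minimal sketch of that split, assuming the rows can be addressed by a numeric id column (an assumption, not something stated in the question): RangePartitioner is a hypothetical helper, and in Spring Batch its output would be placed into the ExecutionContext entries returned by a Partitioner implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits an inclusive id range into non-overlapping slices.
class RangePartitioner {

    // Returns at most gridSize slices, each as {firstId, lastId}, covering [minId, maxId].
    static List<long[]> partition(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("invalid range or grid size");
        }
        long total = maxId - minId + 1;
        long sliceSize = (total + gridSize - 1) / gridSize; // ceiling division
        List<long[]> slices = new ArrayList<>();
        for (long start = minId; start <= maxId; start += sliceSize) {
            long end = Math.min(start + sliceSize - 1, maxId);
            slices.add(new long[] {start, end});
        }
        return slices;
    }
}
```

Because the slices never overlap, a worker that reads only the rows with ids between its firstId and lastId cannot read a row that another worker also reads.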

KDB: parallel insertion to table

I created multiple connections from Java threads to KDB and have records inserted into a single table concurrently.
But it seems that the sum of the individual duration and the overall duration is almost the same as if no concurrent insertion happened.
Would you know if KDB supports parallel insertion?
If so, is there any setting I should do?
Does it have a record-level or table-level locking?
kdb does not support parallel inserts into in-memory tables. In fact updates to in-memory data may only be made from the q main thread. This means that tables are 'locked' (can't be amended) essentially to all clients if a q server is started with a negative port, and the issue is irrelevant if the q session is in single threaded mode (as most sessions tend to be). The situation is a little different for tables stored on disk (I can expand on that later if required).
In order to accelerate your inserts I would suggest looking at the following:
a) Are the inserts batched, rather than sent as a series of single inserts? One insert of 1k rows will take much less time than 1k inserts of one row.
b) Are the inserts sent async or sync? Changing between these two may speed up insertion rates but at the cost of knowing if the inserts executed correctly.
Can you share more about your use case? Is your Java client sending market data? If so, would a TP style setup be more appropriate? See kdb+ tick and its derivatives such as TorQ (note that TorQ is developed by my employer).
A KDB process is a single-threaded process in general (except when running in multiple slave thread/process mode) https://code.kx.com/q/ref/cmdline/#-s-slaves
Although you have multiple Java threads writing data to the q process, the data is written in KDB sequentially, hence there is no performance benefit. For the same reason it does not need table-level or row-level locking.
I would still recommend that you stream the data in async mode (negative handle). This lets your Java threads return quickly rather than waiting for KDB to complete the operation, which will improve performance on the writing side.
When using parallel processing mode (slave threads, positive number), the slave threads are not allowed to write to global tables/variables; you would need to use multi-process mode to achieve that (negative number when launching the q process).

PagingItemReader vs CursorItemReader in Spring batch

I have a Spring Batch job with multiple steps, some sequential and some parallel. Some of these steps involve fetching millions of rows, and the query has multiple joins and left joins. I tried using JdbcPagingItemReader but the order by clause simply hangs the query. I don't get results even after 30 minutes of waiting. So I switched to JdbcCursorItemReader.
Is that approach fine? I understand that the JdbcCursorItemReader fetches all the data at once and writes it out based on the commit interval. Is there any option to tell the reader to fetch, say, 50000 records at a time, so that my application and the system are not overloaded?
Thank you for your response, Michael. I have 22 customized Item readers which are extended from jdbcCursorItemReader. If there are multiple threads, how would the Spring batch handle the resultset? Is there a possibility of multiple threads reading from the same resultset in this case, also?
The JdbcCursorItemReader lets you configure the fetchSize (how many records are returned from the db with each request), however whether it is honored depends on your database and its configuration. For most databases you can configure the fetch size and it's honored. However, MySql requires you to set the fetch size to Integer.MIN_VALUE in order to stream results. Sqlite is another that has special requirements.
That being said, it is important to know that JdbcCursorItemReader is not thread safe (multiple threads would be reading from the same ResultSet).
I personally would advocate for tuning your query but assuming the above conditions, you should be able to use the JdbcCursorItemReader fine.
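If several threads do end up sharing one cursor-based reader, access to it has to be serialized. The sketch below shows the idea with plain Java only: SynchronizedReader is a hypothetical wrapper, and an Iterator stands in for the cursor over the ResultSet. Spring Batch ships a comparable wrapper, SynchronizedItemStreamReader, which is the more likely choice in a real job.

```java
import java.util.Iterator;

// Hypothetical wrapper: serializes access to a reader that is not thread safe.
class SynchronizedReader<T> {
    private final Iterator<T> delegate; // stands in for a cursor over a ResultSet

    SynchronizedReader(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    // Returns the next item, or null once the input is exhausted.
    // Only one thread at a time may advance the underlying cursor.
    synchronized T read() {
        return delegate.hasNext() ? delegate.next() : null;
    }
}
```

Each item is handed to exactly one caller, but the order in which threads receive items is not deterministic.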

Read from mongodb without lock

We're using MongoDB 2.2.0 at work. The DB contains about 51GB of data (at the moment) and I'd like to do some analytics on the user data that we've collected so far. Problem is, it's the live machine and we can't afford another slave at the moment. I know MongoDB has a read lock which may affect any writes that happen especially with complex queries. Is there a way to tell MongoDB to treat my (particular) query with the lowest priority?
In MongoDB reads and writes do affect each other. Read locks are shared, but read locks block write locks from being acquired and of course no other reads or writes are happening while a write lock is held. MongoDB operations yield periodically to keep other threads waiting for locks from starving. You can read more about the details of that here.
What does that mean for your use case? There is no way to tell MongoDB to access the data without a read lock, nor is there a way to prioritize requests (at least not yet). So whether the reads significantly affect the performance of your writes depends on how much "headroom" you have available while write activity is going on.
One suggestion I can make is when figuring out how to run analytics, rather than scanning the entire data set (i.e. doing an aggregation query over all historical data) try running smaller aggregation queries on short time slices. This will accomplish two things:
read jobs will be shorter lived and will therefore finish quicker, which gives you a chance to assess what impact the queries have on your "live" performance.
you won't be pulling all old data into RAM at once - by spacing out these analytical queries over time you will minimize the impact it will have on current write performance.
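Splitting the history into short time slices can be sketched as below. TimeSlices is a hypothetical helper; each window it returns would be used as a range filter on whatever timestamp field the collection has (an assumption about the schema), with the queries spaced out over time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a time range into short consecutive windows.
class TimeSlices {

    // Splits [from, to) into half-open windows {start, end} of at most `step` each.
    static List<Instant[]> slices(Instant from, Instant to, Duration step) {
        if (step.isZero() || step.isNegative()) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<Instant[]> out = new ArrayList<>();
        Instant start = from;
        while (start.isBefore(to)) {
            Instant end = start.plus(step);
            if (end.isAfter(to)) {
                end = to; // the last window may be shorter than step
            }
            out.add(new Instant[] {start, end});
            start = end;
        }
        return out;
    }
}
```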
Depending on what it is you can't afford about getting another server - you might consider getting a short lived AWS instance which may be not very powerful but would be available to run a long analytical query against a copy of your data set. Just be careful when making it a copy of your data - doing a full sync off of the production system will place a heavy load on it (more effective way would be to use a recent backup/file snapshot to resume from).
Such operations are best left for slaves of a replica set. For one thing, read locks can be shared to allow many reads at once, but write locks will block reads. And, while you can't prioritize queries, MongoDB yields long-running read/write queries. Their concurrency docs should help.
If you can't afford another server, you can set up a slave on the same machine, provided you have some spare RAM/disk headroom and you use the slave lightly/occasionally. You must be careful though: your disk I/O will increase significantly.