Unloading huge data from Cassandra table - spring-batch

We have a table with 15 million records, and one of the columns stores a huge XML document. The requirement is to generate 30 different text files, each containing different fields of the XML, covering all the data (15+ million rows) from the table.
All 30 of these jobs will run at the same time.
We are frequently running into ReadTimeoutException. Due to time constraints, we cannot consider caching solutions.
How can we mitigate these read timeout exceptions? Any help would be greatly appreciated.
Below are the Spring Batch and Cassandra versions used:
Cassandra 3.11, with Spring Batch 3.x as the unload framework.
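
A minimal sketch of one common mitigation, assuming the readers sit on the DataStax Java driver 3.x (the contact point and the values below are placeholders, not recommendations): raise the client-side read timeout and shrink the page size so each page of the 15M-row scan returns well within the timeout.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.SocketOptions;

public class UnloadClusterFactory {

    // Hypothetical factory used by the unload jobs; values are illustrative.
    public static Cluster buildCluster() {
        return Cluster.builder()
                .addContactPoint("cassandra-host")          // placeholder host
                .withSocketOptions(new SocketOptions()
                        .setReadTimeoutMillis(120_000))     // driver default is 12 s
                .withQueryOptions(new QueryOptions()
                        .setFetchSize(1_000))               // smaller pages per round trip
                .build();
    }
}
```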

Related

Optimizing save time for bulk save in Spring Data JPA

I have a use case (a common one, I guess) where I have to insert 300k records as part of a daily refresh. I have used Spring Data JPA save (with batching), and it currently takes more than an hour to save all the records.
Batching did not help much; the database is MariaDB.
Is there a better approach to optimize the save time?
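
For reference, a minimal sketch of the usual JDBC-batching shape (the entity and service names are placeholders, and it assumes hibernate.jdbc.batch_size is set to match the chunk size): persist in fixed-size chunks, then flush and clear the persistence context so Hibernate can actually batch the INSERTs and the session stays small.

```java
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class DailyRefreshWriter {

    @PersistenceContext
    private EntityManager entityManager;

    // Hypothetical example, not the asker's actual code.
    @Transactional
    public void saveInChunks(List<RefreshRecord> records, int chunkSize) {
        for (int i = 0; i < records.size(); i++) {
            entityManager.persist(records.get(i));
            if ((i + 1) % chunkSize == 0) {
                entityManager.flush();  // send the batched INSERTs to MariaDB
                entityManager.clear();  // detach written entities to keep memory flat
            }
        }
        entityManager.flush();
        entityManager.clear();
    }
}
```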

Is it possible to configure Hibernate for flush only but never commit (a kind of commit simulation)?

I need to migrate from an old PostgreSQL database with an old schema (58 tables) to a new database with a new schema (40 tables). The schemas are completely different.
It is not a simple migration (copy and paste), but rather a copy-transform-paste.
I decided to write a batch job using Spring Batch, Spring Data, and JPA. So I have two DataSources and a chained transaction. My Spring configuration mainly consists of a chunk-oriented step with a JpaPagingItemReader and an ItemWriterAdapter.
For performance, I also configured a Partitioner, which lets me split my source tables into several sub-tables, and a chunk size of 500000.
Everything works smoothly, but given the size of my old tables it takes me a week to migrate all the data.
I would like to run a test that consists of running my batch without committing: Hibernate would generate all the SQL statements into a ".sql" file but never commit the data to the database.
This would allow me to see whether the commit is costly in execution time.
Is it possible to configure Hibernate to flush only but never commit? A kind of commit simulation?
Thanks.
Usually, the costly part is foreign key and unique key checks as well as index maintenance, but since you don't say how you fetch the data, it could very well be the case that you are accessing your data in an inefficient manner.
In general, I would recommend creating a dump with pg_dump, restoring it, and then doing the migration in an SQL-only way. That way, no data has to flow back and forth; it can stay on the database machine, which is generally much more efficient.

Spring Batch 2.2.0 not writing data to file

We have a Spring Batch application which inserts data into a few tables and then selects data from a few tables based on multiple business conditions and writes the data to a feed file (flat text file). When run, the application generates an empty feed file with only headers and no data. The select query, when run separately in SQL Developer, runs for 2 hours and fetches the data (approx. 50 million records). We are using the following components in the application: JdbcCursorItemReader and FlatFileItemWriter. Below are the configuration details used:
maxBatchSize=100
fileFetchSize=1000
commitInterval=10000
There are no errors or exceptions while the application runs. We wanted to know if we are missing anything here, or if any Spring Batch components are not being used properly. Any pointers in this regard would be really helpful.
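
For context, a minimal sketch of how those two components are typically wired in Java config (the SQL, row mapper, and type names are placeholders, not the actual configuration): with a cursor-based reader, the fetch size controls how many rows the driver streams per round trip, while the commit interval controls how many items go into each chunk handed to the FlatFileItemWriter.

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FeedStepConfig {

    // Hypothetical reader definition, only to illustrate where fetchSize fits.
    @Bean
    public JdbcCursorItemReader<FeedRow> feedReader(DataSource dataSource) {
        JdbcCursorItemReader<FeedRow> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT ... FROM ...");      // real query elided
        reader.setFetchSize(1000);                 // rows streamed per round trip
        reader.setRowMapper(new FeedRowMapper());  // hypothetical RowMapper
        return reader;
    }
}
```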

How do I use Redshift Database for Transformation and Reporting?

I have 3 tables in my Redshift database, and data is arriving from 3 different CSV files in S3 every few seconds. One table has ~3 billion records and the other 2 have ~100 million records each. For near-real-time reporting purposes, I have to merge these tables into 1 table. How do I achieve this in Redshift?
Near Real Time Data Loads in Amazon Redshift
I would say that the first step is to consider whether Redshift is the best platform for this workload. Redshift is not an optimal platform for streaming data.
Redshift's architecture is better suited to batch inserts than streaming inserts, and COMMITs are "costly" in Redshift.
You also need to consider the performance impact of VACUUM and ANALYZE if those operations are going to compete for resources with streaming data.
It might still make sense to use Redshift for your project depending on the overall requirements and workload, but bear in mind that you are going to have to engineer around it, and probably change your workload from a "near-real-time" to a micro-batch architecture.
This blog post details all the recommendations for micro-batch loads in Redshift. Read the micro-batch article here.
To summarize:
Break input files --- Break your load files into several smaller files that are a multiple of the number of slices.
Column encoding --- Have column encodings pre-defined in your DDL.
COPY settings --- Ensure COPY does not attempt to evaluate the best encoding for each load.
Load in SORT key order --- If possible, your input files should have the same "natural order" as your sort key.
Staging tables --- Use multiple staging tables and load them in parallel.
Multiple time-series tables --- This is the documented approach for dealing with time series in Redshift.
ELT --- Do transformations in-database using SQL to load into the main fact table.
Of course all the recommendations for data loading in Redshift still apply. Look at this article here.
Last but not least, enable Workload Management to ensure the online queries can access the proper amount of resources. Here is an article on how to do it.

Getting data from DB in Spring Batch and storing it in memory

In the Spring Batch program, I am reading records from a file and checking against the DB whether the data (say, column1 from the file) already exists in table1.
Table1 is fairly small and static. Is there a way to fetch all the data from table1 and store it in memory in the Spring Batch code? Right now, for every record in the file, the select query hits the DB.
The file has 3 columns delimited with "|".
The file I am reading has on average 12 million records, and it takes around 5 hours to complete the job.
Preload the data in memory using StepExecutionListener.beforeStep (or @BeforeStep).
With this trick, the data is loaded once before the step executes.
This also works when the step is restarted.
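
A minimal sketch of that approach, assuming a chunk-oriented step with an ItemProcessor (the table, column, and class names are placeholders): the @BeforeStep callback runs once per step execution and fills an in-memory set, so the per-record lookup no longer hits the database.

```java
import java.util.HashSet;
import java.util.Set;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// FileRecord is a hypothetical item type mapped from the "|"-delimited file.
public class FileRecordProcessor implements ItemProcessor<FileRecord, FileRecord> {

    private final JdbcTemplate jdbcTemplate;
    private Set<String> existingKeys;

    public FileRecordProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @BeforeStep
    public void loadTable1(StepExecution stepExecution) {
        // Runs once before the step starts (and again on restart):
        // pull the small, static table1 into memory.
        existingKeys = new HashSet<>(
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
    }

    @Override
    public FileRecord process(FileRecord item) {
        // In-memory lookup instead of one SELECT per file record; here items
        // already present in table1 are filtered out (invert the check if the
        // requirement is the opposite).
        return existingKeys.contains(item.getColumn1()) ? null : item;
    }
}
```

Depending on how the step is configured, the processor may also need to be registered as a step listener for the @BeforeStep annotation to be picked up.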
I'd use caching like in a standard web app. Add service-level caching using Spring's caching abstraction, and that should take care of it, IMHO.
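
A small sketch of that suggestion (the service and cache names are hypothetical, and it assumes @EnableCaching plus a configured CacheManager): each distinct column1 value hits the database once, and repeated lookups are served from the cache.

```java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class Table1LookupService {

    private final JdbcTemplate jdbcTemplate;

    public Table1LookupService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // First call for a given value queries the DB; later calls are cached.
    @Cacheable("table1")
    public boolean existsInTable1(String column1Value) {
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM table1 WHERE column1 = ?", Integer.class, column1Value);
        return count != null && count > 0;
    }
}
```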
Load the static table in JobExecutionListener.beforeJob(..) and keep it in the job ExecutionContext; you can then access it across multiple steps using 'Late Binding of Job and Step Attributes'.
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
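
A minimal sketch of that pattern, with placeholder table and key names (not the asker's schema): the listener loads table1 once in beforeJob and stores it in the job ExecutionContext, and step-scoped components can pick it up via late binding.

```java
import java.util.HashSet;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class Table1PreloadListener implements JobExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public Table1PreloadListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Load the small static table once and stash it in the job ExecutionContext.
        // Note: the job context is persisted by the job repository, so the value
        // must be serializable and should stay small.
        HashSet<String> keys = new HashSet<>(
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
        jobExecution.getExecutionContext().put("table1Keys", keys);
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // nothing to clean up
    }
}
```

A step-scoped bean can then receive the set with @Value("#{jobExecutionContext['table1Keys']}").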