Is loading 2 million records into memory for a batch job okay? - mongodb

I have to run a Spring Batch job. I have to read around 2 million documents from MongoDB. Each document has 15 fixed fields containing strings, dates and an _id.
My question is: what is the best way to process this? Should I do it in a single step or spread it across many steps? What is the best practice? Isn't loading 2 million records into memory bad? I know that when loading records through Apache Spark, it streams the data, which is good. But I am not using Apache Spark.

The best way is to use a chunk-oriented step. See the chunk-oriented processing section of the docs.
Loading 2 million records in memory is not a good idea (even if you can manage to do it by adding more memory to your JVM) because you would have a single transaction to handle those 2 million records. If your job crashes, say after processing 1 million records, the processing of that first half would be lost. The idea is to process documents in chunks and commit a transaction for every chunk. This type of processing is:
efficient: since it does not load the whole input data set in memory at once
robust: since a job crash would not require you to reprocess the already processed documents
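For illustration, here is a minimal sketch of such a chunk-oriented step reading from MongoDB with Spring Batch 4 (the bean names, the collection name, the Document target type and the commit interval of 1000 are assumptions; the writer is left to you):

import java.util.Collections;
import org.bson.Document;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.data.MongoItemReader;
import org.springframework.batch.item.data.builder.MongoItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;

@Configuration
@EnableBatchProcessing
public class ChunkJobConfig {

    @Bean
    public MongoItemReader<Document> reader(MongoTemplate mongoTemplate) {
        return new MongoItemReaderBuilder<Document>()
                .name("documentReader")
                .template(mongoTemplate)
                .collection("documents")                                  // assumed collection name
                .jsonQuery("{}")                                          // read all documents
                .targetType(Document.class)
                .sorts(Collections.singletonMap("_id", Sort.Direction.ASC))
                .pageSize(1000)                                           // fetched page by page, never all at once
                .build();
    }

    @Bean
    public Step chunkStep(StepBuilderFactory steps,
                          MongoItemReader<Document> reader,
                          ItemWriter<Document> writer) {
        return steps.get("chunkStep")
                .<Document, Document>chunk(1000)                          // one transaction per 1000 documents
                .reader(reader)
                .writer(writer)
                .build();
    }
}

With a commit interval of 1000, a crash after 1 million documents lets a restart resume from the last committed chunk instead of reprocessing everything.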
Hope this helps.

Related

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API, which is our BI platform. The Spring Batch job breaks this data down into chunks by date rather than by size, so it may happen that on a particular day one chunk consists of a million rows. We run the complete end-to-end flow in the following way:
Control-M sends a trigger to the load balancer at a scheduled time
Through the load balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads the data for that day in chunks from the Oracle database
The chunks are then sent to the target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage grows and at some point the chunks get processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
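As a rough illustration of the local case, here is a minimal Partitioner sketch (the numeric key range, partition naming and context keys are assumptions) that hands each worker a distinct id range via its ExecutionContext, so no two instances read the same rows:

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class IdRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rangeSize = (maxId - minId) / gridSize + 1;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("fromId", minId + i * rangeSize);                           // inclusive lower bound
            context.putLong("toId", Math.min(minId + (i + 1) * rangeSize - 1, maxId));  // inclusive upper bound
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}

Each worker's step-scoped reader then filters on #{stepExecutionContext['fromId']} and #{stepExecutionContext['toId']}; with remote partitioning the same ExecutionContexts are shipped to the worker instances over the messaging middleware.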

Difference between DynamoDbEnhancedClient.batchWriteItem (DynamoDB Version 2) and DynamoDBMapper.batchSave (Version 1)

I have upgraded to DynamoDB Version 2 from Version 1.
In Version 1, the DynamoDBMapper.batchSave() method has no batch size limitation as far as I can tell: even if I pass 100+ records, it runs successfully.
In Version 2, I'm using DynamoDbEnhancedClient.batchWriteItem(), which does have a batch size limitation: only up to 25 records are processed per batch. So, for processing 100+ records, I'm doing the iterations myself (sketched below).
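The iteration looks more or less like this; it is a simplified sketch, and the generic helper and names are just for illustration, not my exact code:

import java.util.List;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.model.BatchWriteItemEnhancedRequest;
import software.amazon.awssdk.enhanced.dynamodb.model.WriteBatch;

public class BatchWriteHelper {

    private static final int MAX_BATCH_SIZE = 25; // BatchWriteItem limit per request

    public static <T> void saveAll(DynamoDbEnhancedClient enhancedClient,
                                   DynamoDbTable<T> table,
                                   Class<T> itemClass,
                                   List<T> items) {
        for (int from = 0; from < items.size(); from += MAX_BATCH_SIZE) {
            List<T> chunk = items.subList(from, Math.min(from + MAX_BATCH_SIZE, items.size()));

            WriteBatch.Builder<T> batch = WriteBatch.builder(itemClass)
                    .mappedTableResource(table);
            chunk.forEach(batch::addPutItem);

            enhancedClient.batchWriteItem(BatchWriteItemEnhancedRequest.builder()
                    .writeBatches(batch.build())
                    .build());
            // In real code, unprocessed items returned by the call should be retried as well.
        }
    }
}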
Reference documentation on Batch Size limitations:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
Does anyone have an idea why Version 1 handles this automatically, while in Version 2 we have to do the iterations ourselves?
Are there any other, more efficient ways to do batch operations in Version 2?

Spring Batch partitioning: can the master read the database and pass data to workers?

I am new to Spring Batch and trying to design a new application which has to read 20 million records from a database and process them.
I don't think we can do this with a single job and step (sequentially, with one thread).
I was thinking we could do this with partitioning, where the step is divided into a master and multiple workers (each worker is a thread which does its own processing and can run in parallel).
We have to read an existing table which has 20 million records and process them, but this table has no auto-generated sequence number; its primary key is something like a 10-digit employer number.
I checked a few sample codes for partitioning where a range is passed to each worker and each worker processes its given range, e.g. worker1 handles 1 to 100 and worker2 handles 101 to 200… but in my case this is not going to work because we don't have a sequence number to pass as a range to each worker.
With partitioning, can the master read the data from the database (say 1000 records at a time) and pass it to each worker in place of sending a range?
Or, for the above scenario, do you suggest any other, better approach?
In principle any query that returns result rows in a deterministic order is amenable to partitioning as in the examples you mentioned by means of OFFSET and LIMIT options. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key then this effect should be less noticeable as the table's index will already be ordered. So I would give this approach a try first, as it is the most elegant IMHO.
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass the reference to that resource in a parameter to the worker step which then reads and processes it.
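Coming back to the OFFSET/LIMIT approach, here is a minimal sketch of the worker side (the table and column names, the OFFSET ... FETCH syntax and the Map-based row type are assumptions): the partitioner puts an offset and a limit into each partition's ExecutionContext, and a step-scoped reader picks them up, so every worker sees a disjoint window of the deterministically ordered result set:

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class WorkerReaderConfig {

    // Each partition gets its own offset/limit window from the step execution context,
    // so no two workers read the same rows.
    @Bean
    @StepScope
    public JdbcCursorItemReader<Map<String, Object>> workerReader(
            DataSource dataSource,
            @Value("#{stepExecutionContext['offset']}") Long offset,
            @Value("#{stepExecutionContext['limit']}") Long limit) {
        return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                .name("workerReader")
                .dataSource(dataSource)
                .sql("SELECT * FROM employer ORDER BY employer_number "
                        + "OFFSET " + offset + " ROWS FETCH NEXT " + limit + " ROWS ONLY")
                .rowMapper(new ColumnMapRowMapper())
                .build();
    }
}

The corresponding Partitioner only has to compute the windows (for example totalRows / gridSize rows per partition) and put an offset and a limit into each ExecutionContext.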

Writing more than 50 million rows from a PySpark DataFrame to PostgreSQL: most efficient approach

What would be the most efficient way to insert millions of records, say 50 million, from a Spark DataFrame into Postgres tables?
I have done this from Spark to MSSQL in the past by making use of the bulk copy and batch size options, which was successful.
Is there something similar that can be done for Postgres?
Adding the code I have tried and the time it took to run the process:
from timeit import default_timer as timer
from datetime import timedelta

def inserter():
    start = timer()
    # JDBC write with 5 parallel partitions and a 200k batch size
    sql_res.write.format("jdbc").option("numPartitions", "5").option("batchsize", "200000") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load").option("user", "root").option("password", "password") \
        .save()
    end = timer()
    print(timedelta(seconds=end - start))

inserter()
So I did the above approach for 10 million records and had 5 parallel connections as specified in numPartitions and also tried batch size of 200k.
The total time it took for the process was 0:14:05.760926 (fourteen minutes and five seconds).
Is there any other efficient approach which would reduce the time?
What would be an efficient or optimal batch size to use? Will increasing my batch size do the job quicker? Or would opening more connections, i.e. more than 5, make the process quicker?
On average, 14 minutes for 10 million records is not bad, but I'm looking for people out there who have done this before and can help answer this question.
I actually did kind of the same work a while ago but using Apache Sqoop.
I would say that to answer this question we have to optimize the communication between Spark and PostgreSQL, specifically the data flowing from Spark to PostgreSQL.
But be careful not to forget the Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the maximum number of connections PostgreSQL supports; if you have too many partitions and open a connection for each one, you will probably get the following error: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
In order to tune the insertion process, I would approach the problem with the following steps:
Remember that the number of partitions is important. Check the number of partitions and then adjust it based on the number of parallel connections you want to have. You might want to have one connection per partition, so I would suggest checking coalesce, as mentioned here (see the sketch after these steps).
Check the maximum number of connections your PostgreSQL instance supports, and whether you want to increase that number.
For inserting data into PostgreSQL, using the COPY command is recommended. Here is also a more elaborate answer about how to speed up PostgreSQL insertion.
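The question uses PySpark, but the relevant options are the same from any Spark API; here is a rough sketch using the Java Dataset API (the partition count, batch size and the reWriteBatchedInserts flag are assumptions to experiment with, not measured recommendations), lining the number of partitions up with the number of connections the database can take:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PostgresWriter {

    public static void write(Dataset<Row> df) {
        df.coalesce(8)                                   // ~8 partitions => ~8 parallel JDBC connections
          .write()
          .mode(SaveMode.Append)
          .format("jdbc")
          .option("url", "jdbc:postgresql://xyz.com:5435/abc_db")
          .option("dbtable", "public.full_load")
          .option("user", "root")
          .option("password", "password")
          .option("batchsize", "100000")                 // rows per JDBC batch
          .option("reWriteBatchedInserts", "true")       // pgjdbc rewrites batches into multi-row inserts
          .save();
    }
}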
Finally, there is no silver bullet for this job. You can use all the tips I mentioned above, but it will really depend on your data and use cases.

s3 parquet write - too many partitions, slow writing

I have a Scala Spark job that writes to S3 as Parquet files. It is 6 billion records so far and it will keep growing daily. As per the use case, our API will query the Parquet data based on id. So, to make the query results faster, I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so this creates 1,330,360 Parquet partitions while writing, and the write step is very slow: it has been writing for the past 9 hours and is still running.
output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")
Is there any way I can reduce the number of partitions and still keep the read queries fast? Or is there any better way to handle this scenario? Thanks.
EDIT: id is an integer column with random numbers.
You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest anything specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
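As a rough illustration of the first idea (shown with the Spark Java API; the Scala call chain is nearly identical, and the bucket count of 1024 is an assumption to tune), you can derive a coarse id_bucket column so the write creates a bounded number of directories instead of 1,330,360:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.pmod;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class BucketedParquetWriter {

    public static void write(Dataset<Row> output) {
        output.withColumn("id_bucket", pmod(col("id"), lit(1024)))   // id modulo 1024 => at most 1024 directories
              .write()
              .mode("append")
              .partitionBy("id_bucket")
              .parquet("s3a://datalake/db/");
    }
}

On the read side, the API then filters on both id_bucket = id % 1024 and id = <value>, so partition pruning still narrows the scan to a single directory.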