How can I read a larger dataset alongside several smaller reads from a database using a chunk-based approach in Spring Batch? - spring-batch

I have to make several reads from a database, including one larger dataset in between, and write the content to a file after processing.
Example
| <- Read account data from the database
| <- Process & read a smaller account dataset from the database
| <- Process & read a smaller data subset from the database based on the above data
| <- Process & read a larger dataset from the database (a chunk-based approach is preferred)
| <- Process & read a smaller data subset from the database based on the above data
| -> Process & write all of the above collected/processed data to a file
Multi-step processing is possible, but it has a lot of overhead in step handling, including inter-step data transfer, since we have to create a single file from all of the above datasets.
Caching the dataset and using it while processing is not possible for the larger dataset.
Simple (generic) data reads within processors for the larger dataset consume a lot of time and memory.
What are the possible approaches for converting this into a Spring Batch-based processing service?

I think the chunk-oriented processing model provided by Spring Batch is not suitable for your use case.
In my opinion, multi-step processing is the way to go, because that is what your requirement almost dictates; at least, that is how I would see it implemented even without Spring Batch.

Related

Why Parquet over some RDBMS like Postgres

I'm working on building a data architecture for my company: a simple ETL pipeline over internal and external data, with the aim of building static dashboards and other tools for trend analysis.
I'm trying to think through every step of the ETL process one by one, and now I have questions about the Load part.
I plan to use Spark (a local executor in dev and a managed service on Azure in production), so I started thinking about writing Parquet to a blob storage service. I know all the advantages of Parquet over CSV and other storage formats, and I really love this piece of technology. Most of the articles I read about Spark finish with df.write.parquet(...).
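For reference, the write path I have in mind looks roughly like this (a minimal PySpark sketch; the session settings, input path, and blob container are placeholders):

from pyspark.sql import SparkSession

# Local session for dev; in production this would point at the Azure-hosted service.
spark = SparkSession.builder.master("local[*]").appName("etl-load").getOrCreate()

# Pretend this DataFrame is the output of the earlier transform steps (placeholder input).
df = spark.read.csv("raw/input.csv", header=True, inferSchema=True)

# The write the articles end with: Parquet into blob storage (placeholder container/path).
df.write.mode("overwrite").parquet("wasbs://container@account.blob.core.windows.net/curated/events")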
But I cannot figure out why I couldn't just stand up a Postgres instance and save everything there. I understand that we are not producing 100 GB of data per day, but I want to build something future-proof in a fast-growing company that is going to produce exponentially more data, both from the business and from the logs and metrics we are starting to record more and more of.
Any pros/cons from more experienced devs?
EDIT: What also makes me question this is this tweet: https://twitter.com/markmadsen/status/1044360179213651968
The main trade-off is one of cost and transactional semantics.
Using a DBMS means you can load data transactionally. You also pay for both storage and compute on an ongoing basis. The storage costs for the same amount of data are going to be higher in a managed DBMS than in a blob store.
It is also harder to scale out processing on a DBMS (it appears the largest Postgres server Azure offers has 64 vCPUs). By storing data in an RDBMS you are likely to run up against I/O or compute bottlenecks more quickly than you would with Spark + blob storage. However, for many datasets this might not be an issue, and as the tweet points out, if you can accomplish everything inside the DB with SQL then it is a much simpler architecture.
If you store Parquet files on a blob store, updating existing data is difficult without regenerating a large segment of your data (and, although I don't know the details of Azure, it generally can't be done transactionally). The compute costs are separate from the storage costs.
Storing data in Hadoop using raw file formats is terribly inefficient. Parquet is a columnar file format well suited to querying large amounts of data quickly. As you said above, writing data to Parquet from Spark is pretty easy. Writing data with a distributed processing engine (Spark) to a distributed file system (HDFS, as Parquet files) also makes the entire flow seamless. This architecture is well suited to OLAP-type data.
Postgres, on the other hand, is a relational database. While it is good for storing and analyzing transactional data, it cannot be scaled horizontally as easily as HDFS can. Hence, when writing or querying large amounts of data from Spark to/on Postgres, the database can become a bottleneck. But if the data you are processing is OLTP-type, then you can consider this architecture.
Hope this helps
One of the issues I have with a dedicated Postgres server is that it's a fixed resource that's on 24/7. If it's idle for 22 hours per day and under heavy load for 2 hours per day (in particular if those hours aren't contiguous and are unpredictable), then the server sizing during those 2 hours is going to be too low, whereas during the other 22 hours it's too high.
If you store your data as Parquet on Azure Data Lake Storage Gen2 and then use Synapse serverless SQL pools for queries, you don't pay for anything on a 24/7 basis. When under heavy load, everything scales automatically.
The other benefit is that Parquet files are compressed, whereas Postgres doesn't compress ordinary table data.
The downside is "latency" (probably not the right term, but it's how I think of it). If you want to query a small amount of data then, in my experience, the file + serverless approach is slower than a well-indexed, clustered, or partitioned Postgres table. Additionally, coming from the dedicated-server model, it's really hard to forecast your bill with the serverless model. There are definitely going to be usage patterns where serverless is more expensive than a dedicated server, in particular if you run a lot of queries that have to read all or most of the data.
It's easier and faster to save a Parquet file than to do a lot of inserts. This is a double-edged sword, because the database gives you ACID guarantees, whereas saving Parquet files doesn't.
Parquet storage optimization is its own task; Postgres has autovacuum to take care of that side for you. If the data you're consuming is published daily but you want it laid out on a node/attribute/feature partition scheme, then you need to do that yourself (perhaps with Spark pools).
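To illustrate, the manual repartitioning mentioned above might look roughly like this in PySpark (a sketch only; the paths and the node/attribute columns are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the daily drop as published (placeholder path).
daily = spark.read.parquet("abfss://raw@account.dfs.core.windows.net/drop/2023-01-01")

# Rewrite it into the layout you actually want to query on,
# e.g. partitioned by hypothetical node and attribute columns.
(daily.write
      .mode("append")
      .partitionBy("node", "attribute")
      .parquet("abfss://curated@account.dfs.core.windows.net/events"))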

What are some of the most efficient workflows for processing "big data" (250+ GB) from PostgreSQL databases?

I am constructing a script that will process well over 250 GB of data from a single PostgreSQL table. The table's shape is roughly 150 columns x 74M rows. My goal is to sift through all the data and make sure that each cell entry meets certain criteria that I will be tasked with defining. After the data has been processed, I want to pipe it into an AWS instance. Here are some scenarios I will need to consider:
How can I ensure that each cell entry meets the criteria of the column it resides in? For example, all entries in the 'Date' column should be in the format 'yyyy-mm-dd', etc.
What tools/languages are best for handling such large data? I use Python and the pandas module often for DataFrame manipulation, and I am aware of the read_sql function, but I think that this much data will simply take too long to process in Python.
I know how to manually process the data chunk by chunk in Python; however, I think that this is probably too inefficient and the script could take well over 12 hours.
Simply put (TL;DR): I'm looking for a simple, streamlined solution for manipulating and performing QC analysis on PostgreSQL data.
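To illustrate the chunk-by-chunk approach I mean, here is a rough sketch (the connection string, table, column name, and the date check are just examples):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table/column names.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")

bad_rows = 0
# Stream the table in chunks instead of loading all ~74M rows at once.
for chunk in pd.read_sql("SELECT * FROM big_table", engine, chunksize=100_000):
    # Example criterion: every value in the 'date' column matches yyyy-mm-dd.
    is_valid = chunk["date"].astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$")
    bad_rows += int((~is_valid).sum())
    # ...apply the other per-column checks here, then write the clean chunk onward.

print(f"Rows failing the date check: {bad_rows}")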

Maximum number of rows with web data connector as data source

How many rows can the web data connector handle to import data into Tableau? Or what is the maximum number of rows which I can generally import?
There are no limitations on how many rows of data you can bring back with your web data connector; performance scales pretty well as you bring back more and more rows, so it's really just a matter of how much time you're OK with waiting.
The total performance will be a combination of:
1. The time it takes for you to retrieve data from the API.
2. The time it takes our database to create an extract with that data once your web data connector passes it back to Tableau.
#2 will be comparable to the time it would take to create an extract from an Excel file with the same schema and size as the data in your web data connector.
On a related note, the underlying database used (the Tableau Data Engine) handles a large number of rows well, but is not as well suited to handling a large number of columns, so our guidance is to bring back fewer than 60 columns if possible.

How to load more data into a database through Talend ETL

I have 6.5 GB of data, consisting of 900,000 rows, in my input table (tPostgresqlInput), and I am trying to load the same data into my output table (tPostgresqlOutput). While running the job, I am not getting any response from my input table. Is there any solution to load the data? Please refer to my attachment.
You may need to develop a strategy to retrieve more manageable chunks of data, for example dividing up the data based on row IDs. That way, it does not take as much memory or time to retrieve the data.
You could also increase the default memory limit for the job from 1 GB to a higher number.
If your job runs on the same network as your database server, that can also improve performance.
Make sure you enable Use Cursor in the input component's Advanced settings. The default value of 1,000 is fine.
Also enable Batch Size on the output, which does something similar.
With these enabled, Talend will work with 1,000 records at a time.
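For comparison, outside of Talend the same cursor idea looks roughly like this at the driver level (a psycopg2 sketch; the connection details and table name are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me host=dbhost")  # placeholder connection details

# A named cursor is a server-side cursor: Postgres streams the rows in batches
# instead of materialising the whole result set in the client's memory.
with conn, conn.cursor(name="big_read") as cur:
    cur.itersize = 1000  # roughly the same idea as Talend's 1k cursor setting
    cur.execute("SELECT * FROM input_table")  # placeholder table
    for row in cur:
        pass  # hand each row to the output/insert side here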
If these two tables are in the same database, you can try the Talend ELT components, which push your processing down to the database. Take a look at the following set of components:
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide60EN/tELTPostgresqlInput
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide60EN/tELTPostgresqlMap
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide60EN/tELTPostgresqlOutput

Storing lots of files

I need to store a lot of files (millions per day). On average, a file is 20 KB. I also need to store some metadata for these files (date, source, classification, etc.), and I need to be able to access and retrieve the files according to queries on that metadata (no joins, only filtering with WHERE clauses). Writes must be fast; read times are not as important.
As far as I understand, I have three possible ways of storing the data:
Use an RDBMS (e.g. PostgreSQL) to store the metadata and file paths, execute queries there, then read the matching files from the file system.
Use only Cassandra (my company uses Cassandra) and store both the metadata and the file content in Cassandra.
Use Postgres + Cassandra together: store the metadata and Cassandra keys in Postgres, query Postgres to retrieve the Cassandra keys, then get the actual file content from Cassandra.
What are the advantages and disadvantages of these options? I am thinking I should go with option 2 but cannot be sure.
Thanks
It really depends on the size of your files. Storing large files in Cassandra generally isn't the best solution: you'd have to chunk the files at some point and store the content across separate columns using wide rows. In that case it would be better to use a distributed file system like Ceph.
But if files are only 20 KB, the overhead of a distributed FS will not be worth it, and Cassandra will do a good job storing this amount of data as a blob in a single column. You just need to be aware of the memory footprint while working with those rows: each time you retrieve such a file from Cassandra, the whole content is loaded into the heap unless you chunk it using a clustering key.
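For what it's worth, a minimal sketch of the single-column blob approach with the Python driver might look like this (the keyspace, table, and column names are made up):

import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # placeholder contact point
session = cluster.connect("filestore")  # assumes a 'filestore' keyspace already exists

# Metadata you filter on goes into the primary key; the ~20 KB payload is a single blob column.
# For bigger files you would add a chunk_index clustering column and split the content.
session.execute("""
    CREATE TABLE IF NOT EXISTS files (
        source text,
        created_on date,
        id uuid,
        classification text,
        content blob,
        PRIMARY KEY ((source, created_on), id)
    )
""")

with open("example.bin", "rb") as f:
    payload = f.read()

session.execute(
    "INSERT INTO files (source, created_on, id, classification, content) "
    "VALUES (%s, %s, %s, %s, %s)",
    ("sensor-a", "2023-01-01", uuid.uuid4(), "internal", payload),
)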