I am working on a Spring Boot + Spring Batch example, where I want to read Employee data from an Oracle datasource and Department data from a CSV file and load them into MongoDB, since the Employee schema holds embedded Department details.
I am not sure whether AbstractItemStreamItemReader is a good choice here?
An item reader is designed to read from a single source. You can't read from two sources at the same time unless you are ready to write a custom reader. What you can do is create a staging table where you load the content of the file, then join the employee and department tables and write the result to MongoDB.
A similar question, but for two files, is here: Combine rows from 2 files and write to DB using Spring Batch.
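If you do decide to go the custom-reader route instead, a minimal sketch could delegate to a JdbcCursorItemReader for the employees and enrich each one from a department map preloaded from the CSV. Employee, Department, and their accessors here are your own domain classes (assumed for illustration), and the CSV is assumed small enough to hold in memory:

```java
import java.util.Map;

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamReader;
import org.springframework.batch.item.database.JdbcCursorItemReader;

// Sketch only: delegates employee reading to a JDBC reader and embeds the
// matching Department (preloaded from the CSV) before the item reaches the
// MongoDB writer. Employee/Department and getDeptCode()/setDepartment() are
// assumed domain types, not Spring Batch classes.
public class EmployeeDepartmentReader implements ItemStreamReader<Employee> {

    private final JdbcCursorItemReader<Employee> delegate;   // reads employees from Oracle
    private final Map<String, Department> departmentsByCode; // parsed from the CSV up front

    public EmployeeDepartmentReader(JdbcCursorItemReader<Employee> delegate,
                                    Map<String, Department> departmentsByCode) {
        this.delegate = delegate;
        this.departmentsByCode = departmentsByCode;
    }

    @Override
    public Employee read() throws Exception {
        Employee employee = delegate.read();
        if (employee != null) {
            // embed the department so the Mongo writer persists it inline
            employee.setDepartment(departmentsByCode.get(employee.getDeptCode()));
        }
        return employee;
    }

    // delegate the ItemStream callbacks so restart state is still tracked
    @Override
    public void open(ExecutionContext executionContext) { delegate.open(executionContext); }

    @Override
    public void update(ExecutionContext executionContext) { delegate.update(executionContext); }

    @Override
    public void close() { delegate.close(); }
}
```

How the CSV gets into the map is up to you; parsing it once up front (for example in a listener that runs before the step) keeps the per-item work to a simple map lookup.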
I am using Spring Batch to fetch, process, and write some data, with Postgres as the database. I want to fetch a list of records instead of fetching them one by one, and pass that same list to the processor. Is this possible in Spring Batch?
Chunk-oriented processing is designed to read one item at a time. Now, how to define an "item" is up to you: if you consider a "list of items" to be one item, you can still use that model, but you would need a custom item reader that returns a list.
All item readers provided by Spring Batch are designed to return a single piece of data: a line in a flat file, a record in a database table, a document from a NoSQL collection, a message from a queue, a file in a folder, etc. Anything that deviates from that requires a custom item reader.
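A minimal sketch of such a reader follows; the class name and the grouping behaviour are my own (there is no such built-in), and it simply wraps any existing single-item reader and groups its output into lists:

```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemReader;

// Sketch only: wraps a single-item reader and returns a List<T> as one "item".
// If the delegate is an ItemStream (e.g. a file or cursor reader), register it
// as a stream on the step so restart state is still saved.
public class ListGroupingItemReader<T> implements ItemReader<List<T>> {

    private final ItemReader<T> delegate;
    private final int listSize;

    public ListGroupingItemReader(ItemReader<T> delegate, int listSize) {
        this.delegate = delegate;
        this.listSize = listSize;
    }

    @Override
    public List<T> read() throws Exception {
        List<T> group = new ArrayList<>();
        T item;
        while (group.size() < listSize && (item = delegate.read()) != null) {
            group.add(item);
        }
        return group.isEmpty() ? null : group; // null signals end of input
    }
}
```

Keep in mind that the processor then receives a List<T> per call, and the writer receives a chunk of lists, so both have to be written with that shape in mind.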
We have a Spring Batch application which inserts data into a few tables, then selects data from a few tables based on multiple business conditions and writes it to a feed file (flat text file). When the application runs, it generates an empty feed file with only headers and no data. The select query, when run separately in SQL Developer, runs for 2 hours and fetches the data (approx. 50 million records). We are using JdbcCursorItemReader and FlatFileItemWriter, with the configuration details below.
maxBatchSize=100
fileFetchSize=1000
commitInterval=10000
There are no errors or exceptions while the application runs. We wanted to know if we are missing anything here, or if any Spring Batch component is not being used properly. Any pointers in this regard would be really helpful.
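For reference, this is roughly how those two components are usually wired together in Java config (Spring Batch 4.x style). The FeedRecord type, bean names, SQL, column names, and file path below are placeholders, while the fetch size and chunk size mirror the fileFetchSize and commitInterval values quoted above:

```java
import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
public class FeedJobConfig {

    @Bean
    public JdbcCursorItemReader<FeedRecord> feedReader(DataSource dataSource) {
        JdbcCursorItemReader<FeedRecord> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT col1, col2, col3 FROM ..."); // placeholder for the long-running select
        reader.setFetchSize(1000);                          // fileFetchSize
        reader.setRowMapper(new BeanPropertyRowMapper<>(FeedRecord.class));
        return reader;
    }

    @Bean
    public FlatFileItemWriter<FeedRecord> feedWriter() {
        return new FlatFileItemWriterBuilder<FeedRecord>()
                .name("feedWriter")
                .resource(new FileSystemResource("feed.txt"))
                .headerCallback(w -> w.write("COL1|COL2|COL3")) // header line
                .delimited()
                .delimiter("|")
                .names(new String[] {"col1", "col2", "col3"})
                .build();
    }

    @Bean
    public Step feedStep(StepBuilderFactory steps,
                         JdbcCursorItemReader<FeedRecord> reader,
                         FlatFileItemWriter<FeedRecord> writer) {
        return steps.get("feedStep")
                .<FeedRecord, FeedRecord>chunk(10_000) // commitInterval
                .reader(reader)
                .writer(writer)
                .build();
    }
}
```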
I am new to Talend and need guidance on the scenario below:
We have a set of 10 JSON files with different structures/schemas that need to be loaded into 10 different tables in a Redshift DB.
Is there a way we can write a generic script/job which can iterate through each file and load it into the database?
For example:
File Name: abc_<date>.json
Table Name: t_abc
File Name: xyz<date>.json
Table Name: t_xyz
and so on..
Thanks in advance
With the Talend Enterprise version you can benefit from dynamic schema. However, based on my experience, JSON files are usually somewhat nested structures, so you'd have to figure out how to flatten them; once that's done it becomes a 1:1 load. With Open Studio, however, this will not work because dynamic schema is missing.
Basically, what you could do is write some Java code that transforms your JSON into CSV. Then use either psql from the command line, or, if your Talend ships with a new enough PostgreSQL JDBC driver, invoke the client-side \COPY from it to load the data. If your file and the database table column order match, it should work without needing to specify how many columns you have, so it's dynamic, but the data never "flows" through Talend.
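A rough sketch of that JSON-to-CSV step, assuming Jackson is on the classpath, one flat JSON object per line of the input file, and simple values without embedded delimiters (nested objects would need flattening first, as noted above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only: args[0] = input JSON file (e.g. abc_<date>.json),
// args[1] = output CSV file to feed to psql \COPY.
public class JsonToCsv {

    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        List<String> header = new ArrayList<>();
        List<String> rows = new ArrayList<>();

        for (String line : Files.readAllLines(Path.of(args[0]))) {
            JsonNode node = mapper.readTree(line);
            List<String> values = new ArrayList<>();
            for (Iterator<Map.Entry<String, JsonNode>> it = node.fields(); it.hasNext(); ) {
                Map.Entry<String, JsonNode> field = it.next();
                if (rows.isEmpty()) {
                    header.add(field.getKey()); // column order taken from the first record
                }
                values.add(field.getValue().asText());
            }
            rows.add(String.join(",", values));
        }

        List<String> out = new ArrayList<>();
        out.add(String.join(",", header));
        out.addAll(rows);
        Files.write(Path.of(args[1]), out); // ready for psql \COPY
    }
}
```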
A less elegant but theoretically possible solution: if Redshift supports JSON (Postgres does), you can create a staging table with 2 columns, filename and content. Once the whole content is in this staging table, an INSERT ... SELECT statement could transform the JSON into a tabular format that can be inserted into the final table.
However, with your toolset you probably have no other choice than to load these files with one job per file, and I'd suggest one dedicated job for each file. They would each look for their own files and be triggered/scheduled individually, or be part of a bigger job where you scan the folders and trigger the right job for the right file.
I need to import data from inner-joined tables of one DB into a table of another DB using the Talend ETL tool. How can I do that?
I am just new to Talend.
How can I inner join the tables using a condition in Talend?
Based on your requirement, there are multiple ways to achieve this.
One approach -
Use tMSSqlInput (for SQL Server; this would change based on your source database) and set the required attributes to make the connection. In the "Query" section, write your complete query involving the three different tables.
Once done, use tMap (to transform your data as per the destination table) if required, and then tMSSqlOutput (for SQL Server; this would change based on your destination database) to write the data to the table residing in the other database. In the connection properties, make sure you configure the database correctly.
For tMSSqlOutput, do check the properties "Use Batch"/"Batch Size" and "Commit every".
Sample job flow: tMSSqlInput -> tMap -> tMSSqlOutput
Now, another approach could be to use the bulk feature: use tMSSqlOutputBulk to output the data retrieved from your source database into a file, and then tMSSqlBulkExec to bulk load the data from that file into your destination table in the destination database.
Sample flow: tMSSqlOutputBulk -> tMSSqlBulkExec
Note: always determine which solution is the best fit by comparing the performance of all the available options.
In my Spring Batch program, I am reading records from a file and checking against the DB whether the data (say, column1 from the file) already exists in table1.
Table1 is fairly small and static. Is there a way I can fetch all the data from table1 and store it in memory in the Spring Batch code? Right now, for every record in the file, the select query hits the DB.
The file has 3 columns delimited with "|".
The file I am reading has, on average, 12 million records, and the job takes around 5 hours to complete.
Preload the table into memory using StepExecutionListener.beforeStep() (or @BeforeStep).
With this trick the data is loaded once, before the step executes.
This also works when the step is restarted.
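A minimal sketch of that idea, assuming the lookup column is table1.column1, that FileRecord is your own mapped file record type, and that the processor is registered on the step so the @BeforeStep callback fires:

```java
import java.util.Set;

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: loads the small, static table1 once before the step, then
// filters file records in memory instead of querying the DB per record.
public class ExistingKeyFilterProcessor implements ItemProcessor<FileRecord, FileRecord> {

    private final JdbcTemplate jdbcTemplate;
    private Set<String> existingKeys;

    public ExistingKeyFilterProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @BeforeStep
    public void loadTable1(StepExecution stepExecution) {
        // runs once before the step execution
        existingKeys = Set.copyOf(
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
    }

    @Override
    public FileRecord process(FileRecord item) {
        // returning null filters the item out; no per-record DB round trip
        return existingKeys.contains(item.getColumn1()) ? null : item;
    }
}
```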
I'd use caching like in a standard web app: add service-level caching using Spring's caching abstraction and that should take care of it, IMHO.
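A sketch of what that could look like, assuming @EnableCaching is set up on a configuration class and a lookup service along these lines (the service name, cache name, and query are illustrative):

```java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

// Sketch only: the first lookup per key hits the DB, later lookups for the
// same key are served from the "table1Keys" cache.
@Service
public class Table1LookupService {

    private final JdbcTemplate jdbcTemplate;

    public Table1LookupService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Cacheable("table1Keys")
    public boolean exists(String column1) {
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM table1 WHERE column1 = ?", Integer.class, column1);
        return count != null && count > 0;
    }
}
```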
Load the static table in JobExecutionListener.beforeJob() and keep it in the job ExecutionContext; you can then access it from multiple steps using 'Late Binding of Job and Step Attributes'.
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
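A rough sketch of that approach; the listener class, the query, and the context key "table1Keys" are illustrative only:

```java
import java.util.List;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: loads table1 once before the job starts and stashes it in the
// job ExecutionContext so any step of the job can read it.
public class Table1PreloadListener implements JobExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public Table1PreloadListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        List<String> keys =
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class);
        jobExecution.getExecutionContext().put("table1Keys", keys);
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // nothing to clean up
    }
}
```

A step-scoped component can then pull the list back out with late binding, for example `@Value("#{jobExecutionContext['table1Keys']}") List<String> table1Keys`.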