Talend Open Studio Big Data - Iterate and load multiple files in DB - talend

I am new to talend and need guidance on below scenario:
We have set of 10 Json files with different structure/schema and needs to be loaded into 10 different tables in Redshift db.
Is there a way we can write generic script/job which can iterate through each file and load it into database?
For e.g.:
File Name: abc_< date >.json
Table Name: t_abc
File Name: xyz< date >.json
Table Name: t_xyz
and so on..
Thanks in advance

With Talend Enterprise version one can benefit of dynamic schema. However based on my experiences with json-s they are somewhat nested structures usually. So you'd have to figure out how to flatten them, once thats done it becomes a 1:1 load. However with open studio this will not work due to the missing dynamic schema.
Basically what you could do is: write some java code that transforms your JSON into CSV. Use either psql from commandline or if your Talend contains new enough PostgreSQL JDBC driver then invoke the client side \COPY from it to load the data. If your file and the database table column order matches it should work without needing to specify how many columns you have, so its dynamic, but the data newer "flows" through talend.
Really not cool but also theoretically possible solution: If Redshift supports JSON (Postgres does) then one can create a staging table, with 2 columns: filename, content. Once the whole content is in this staging table, INSERT-SELECT SQL could be created that transforms the JSON into tabular format that can be inserted into the final table.
However, with your toolset you probably have no other choice than to load these files with 1 job per file. And I'd suggest 1 dedicated job to each file. They would each look for their own files and triggered / scheduled individually or be part of a bigger job where you scan the folders and trigger the right job for the right file.

Related

Import from one DB to another DB

I need to import data from the inner joined tables of one db to a table of another db using Talend ETL tool.How can i do that?
Iam just new to talend.
How can i inner join the tables using condition in talend
Based on your requirement there would be multiple ways to achieve this.
One approach -
Use tMSSqlInput (for Sql Server - this would change based on your source database) and mention the required attributes to make the connection. In "Query" section - write your complete query involving the three different tables -
Once done, use tMap (to transform your data as per destination table) if required and then tMSSqlOutput (for Sql Server - this would change based on your destination database) to write the data to your table which would reside in another database. In connection properties make sure your configure the database correctly.
For tMSSqlOutput do check the properties - Use Batch/Batch Size & Commit every.
Sample job flow -
Now, another approach could be using bulk feature. You would be able to use tMSSqlOutputBulk to output the data retrieved from your source database into a file and then use tMSSqlBulkExec to bulk load the data from file into your destination table in your destination database.
Sample flow -
Note: Always compare which would be solution would be best fit by comparing the performance of all the available solutions.

Loadind multiple table from source into multiple table into target in Talend

I have around 25 tables to load to target with same structure and which use the same logic for loading. I have prepared one job which does that, but it's a long process to design all the tables.
Is there any way to pass the table name and load to target, basically a small job (in size).
I am using Talend open studio.
Check my answer to a similar question where I proposed a generic solution for loading a MySQL table to another MySQL table.
You just need to modify the queries that retrieve the tables' metadata (columns) depending on your database type.

How to copy a database to sap hana from postgresql with talend?

well my problem is, how could i copy a database with talend from postgresql to sap hana without needing to write a job for every table ?
The reason for this is, because it could take some long time to prepare all those jobs, while taking in consideration, having at least 200 tables, which at least have 30 columns.
I tried tTransferDatabase plugin, but i can't success to transfer it to sap hana, it gives me an error that it can't copy schema (while it successfully worked copying it to other database in postgresql), and i am sure that the schemas names are right.
here is the error:
Exception in component tTransferDatabase_1
java.lang.NullPointerException
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:86)
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:124)
at com.devjpcb.transferdatabase.TransferDatabase.getPlatformDestine(TransferDatabase.java:179)
at com.devjpcb.transferdatabase.TransferDatabase.copySchemaToDatabase(TransferDatabase.java:249)
at local_project.aaasa_0_1.aaasa.tTransferDatabase_1Process(aaasa.java:836)
at local_project.aaasa_0_1.aaasa.runJobInTOS(aaasa.java:1130)
at local_project.aaasa_0_1.aaasa.main(aaasa.java:951)
Is there maybe a chance to do sth like .. for each table in connection, table guess schema, copy columns from table to other side of tmap, run ?
Any advice would be helpful ;), Thank you !
With some work, you could use the example job created by rbaldwin on Talend Exchange; note that it starts with files, not a database. But you could easily create a job that loops through all your database tables and does an extract to file, to then use as the starting point.
Another option is Bekwam's solution

Dynamically create table from csv

I am faced with a situation where we get a lot of CSV files from different clients but there is always some issue with column count and column length that out target table is expecting.
What is the best way to handle frequently changing CSV files. My goal is load these CSV files into Postgres database.
I checked the \COPY command in Postgres but it does have an option to create a table.
You could try creating a pg_dump compatible file instead which has the appropriate "create table" section and use that to load your data instead.
I recommend using an external ETL tool like CloverETL, Talend Studio, or Pentaho Kettle for data loading when you're having to massage different kinds of data.
\copy is really intended for importing well-formed data in a known structure.

Loading DB2 table rows as Marklogic documents

Is there any tool to quickly convert a DB2 table rows into collection of XML documents that we can load to Marklogic?
DB2 supports the SQL/XML publishing extensions that were introduced in SQL:2003. These functions include XMLSERIALIZE, XMLELEMENT, XMLATTRIBUTE, and XMLFOREST, and are easily added to a SQL SELECT statement to produce a simple, well-formed XML document for each row in the result set. By writing queries that retrieve the table names and column layouts from DB2's catalog views, it is possible to automate the creation of the XML-publishing SELECT statements for a large number of tables.
One way of doing this would be to use the MLSQL toolkit ( http://developer.marklogic.com/code/mlsql ). It allows accessing relational databases from within your XQuery code in MarkLogic. Not sure how the returned data actually looks like, but it should be easy to process it within XQuery, and insert your data as XML into MarkLogic.
Just make sure not to try to load a million records in one statement, but instead try to spawn batches of lets say 1000 records at a time. Spawning will also allow for handling it with multiple threads, so should be faster for that reason too..
HTH!
Do you need to stream from DB2 to MarkLogic? Or can you temporarily dump all the documents to an intermediary filesystem and then read them in? If you can dump, then simply use some DB2 tooling (like #Fred's answer above) to export the rows to a bunch of XML documenets in a filesystem and use one of many methods for reading in a directory full of XML files into MarkLogic (like Information Studio (UI or apis), RecordLoader, and so on).
If you have don't want to store them in the filesystem as an intermediary, then you could write an InformationStudio plugin for MarkLogic that will pull out each row and insert a document into MarkLogic. You'd like need some web-service or rest endpoint that the plugin could call to extract the document data from DB2.
Alternatively, I suspect you could use the DB2 tooling (described by #Fred) that will let you execute some code per row of your table. If you can do that in Java (or .Net), then pull in the MarkLogic XCC APIs which will give you the ability to write documents into MarkLogic.