Load COPY (COBOL) file in Talend tool

I would like to load a file in Talend that is supposed to contain compressed data. I don't know how to do this; I don't know how to load either a plain COPY file or a COPY file with compressed data. Can someone help me, please?
These are sample files (one of them is the schema): https://www.dropbox.com/sh/bqvcw0dk56hqhh2/AABbs1GRKjo7rycQrcUM_dgta?dl=0
P.S.: I know how to load csv, Excel, and data from SQL databases, among others. However, I don't know how to handle this kind of file.
Thanks in advance.

Related

Can I use a sql query or script to create format description files for multiple tables in an IBM DB2 for System I database?

I have an AS400 with an IBM DB2 database and I need to create a Format Description File (FDF) for each table in the DB. I can create the FDF file using the IBM Export tool but it will only create one file at a time which will take several days to complete. I have not found a way to create the files systematically using a tool or query. Is this possible or should this be done using scripting?
First of all, to correct a misunderstanding...
A Format Description File has nothing at all to do with the format of a Db2 table. It actually describes the format of the data in a stream file that you are uploading into the Db2 table. Sure you can turn on an option during the download from Db2 to create the FDF file, but it's still actually describing the data in the stream file you've just downloaded the data into. You can use the resulting FDF file to upload a modified version of the downloaded data or as the starting point for creating an FDF file that matches the actual data you want to upload.
Which explains why there's no built-in way to create an appropriate FDF file for every table on the system.
I question why you think you actually need to generate an FDF file for every table.
As I recall, the format of the FDF (or its newer variant, FDFX) is pretty simple; it shouldn't be all that difficult to generate if you really wanted to. But I don't have one handy at the moment, and my Google-FU has failed me.
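As a rough starting point (not the FDF syntax itself, which you would still need to copy from a sample export), here is a hypothetical JDBC sketch that pulls every table's column layout from the QSYS2.SYSCOLUMNS catalog view. The host, credentials, and the MYLIB library name are placeholders.

    // Hypothetical sketch: dump each table's column layout from the Db2 for i
    // catalog view QSYS2.SYSCOLUMNS as raw material for building format
    // description files. Host, credentials and the library name MYLIB are
    // placeholders; the exact FDF/FDFX syntax still has to be copied from a
    // sample file exported by the IBM tool.
    import java.sql.*;

    public class DumpTableLayouts {
        public static void main(String[] args) throws Exception {
            // jt400 (JTOpen) must be on the classpath
            Class.forName("com.ibm.as400.access.AS400JDBCDriver");
            try (Connection con = DriverManager.getConnection(
                    "jdbc:as400://MYHOST", "USER", "PASSWORD");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, LENGTH " +
                         "FROM QSYS2.SYSCOLUMNS WHERE TABLE_SCHEMA = 'MYLIB' " +
                         "ORDER BY TABLE_NAME, ORDINAL_POSITION")) {
                while (rs.next()) {
                    // One line per column; reshape this output into the layout
                    // you captured from a sample FDF export.
                    System.out.printf("%s,%s,%s,%d%n",
                            rs.getString("TABLE_NAME"),
                            rs.getString("COLUMN_NAME"),
                            rs.getString("DATA_TYPE"),
                            rs.getInt("LENGTH"));
                }
            }
        }
    }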

Can (Open Studio) Talend be used to automate a data load from a folder to vertica?

I have been looking for a way to automate my data loads into Vertica instead of manually exporting flat files each time, and stumbled upon the ETL tool Talend.
I have been working with a test folder containing multiple csv files, and am attempting to find a way to build a job so the files can be loaded into Vertica.
However, I see that in the Open Studio (free) version, if your files do not have the same schema, this becomes next to impossible without the dynamic schema option, which is only in the Enterprise version.
I start with tFileList and attempt to iterate through tFileInputDelimited, but the schemas are not uniform, so of course it will stop the processing.
So, long story short, am I correct in assuming that there is no way to automate data loads in the free version of Talend if you have a folder consisting of files with different schemas?
If anyone has suggestions for other open-source ETL tools to look at, or a possible solution, that would be great.
You can access the CURRENT_FILE variable from a tFileList component and then send a file down a different route depending on the file name. You'd then create a tFileInputDelimited for each file. For example, if you had two files named file1.csv and file2.csv, right-click the tFileList and choose Trigger > Run If. In the Run If condition type ((String)globalMap.get("tFileList_1_CURRENT_FILE")).toLowerCase().matches("file1.csv") and drag it to the tFileInputDelimited set up to handle file1.csv. Do the same for file2.csv, changing the file name in the Run If condition.
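For reference, the two Run If conditions would look something like this, assuming the example file names above and a tFileList named tFileList_1:

    // Run If condition on the link to the tFileInputDelimited handling file1.csv
    ((String)globalMap.get("tFileList_1_CURRENT_FILE")).toLowerCase().matches("file1.csv")

    // Run If condition on the link to the tFileInputDelimited handling file2.csv
    ((String)globalMap.get("tFileList_1_CURRENT_FILE")).toLowerCase().matches("file2.csv")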

Tableau TDE or connect to files directly?

I have a personal license for Tableau. I am using it to connect to .csv and .xlsx files currently but am running into some issues.
1) The .csv files are massive (10+ GB)
2) The Excel files are starting to hit the 1-million-row limit
3) I sometimes need to add certain columns to the .csv files (like a unique ID and a few formulas), which means I need to open sections of them in Excel, modify what I need to, then save a new file
Would it be better to create an extract for each of these files and then connect the Tableau Workbook to the extract instead of the file? Currently I am connected directly to the files, extract data from there, and refresh every day.
I don't know about others, but I'm using exactly that guideline. I'll have some Workbooks that simply serve to extract data from some data source (be it SQL, xlsx, csv, mdb, or any other), and all analysis is performed in other Workbooks, which connect only to the tdes.
The advantages are:
1) Whenever you need to update a data source, you only need to update it once (and replace the tde file) and all your workbooks will be up to date. If you connect to the same data source and extract to different tde files, you'll have to refresh all of those tde files (and worry about whether the extract in that specific Workbook has been updated). And even if you extract to the same tde (which doesn't make much sense), it can be confusing (am I connected to the tde or to the file? Did the extract I made in the other workbook update this one too? Well, yes it did, but it can be confusing).
2) You don't have to worry about replacing a datasource, especially when it's a csv, xlsx or mdb file. You can keep many different versions of those files, and choose which one is the best one. For instance, I'll have table_v1.mdb, table_v2.mdb, ..., and a single table_v1.tde, which will be the extract of one of those mdb files. And I still have the previous versions in case I need them.
3) When you have a SQL connection, or anything that is not a file (csv, xlsx, mdb), extracts are very handy for basically the same reasons above, with (at least) one upside. You don't need to connect to a server every time you want to perform an analysis. That means you can do everything offline, and the person using Tableau doesn't need to have access to the SQL table (or any other source).
One good practice is always keeping a back-up when updating a tde (because, well, shit happens)
10 gig csv, wow. Yes, you should absolutely use a data extract; that would be much quicker. For that much data you could also look at other connections such as MS Access or a SQL instance.
If your data have that many rows, I would try to set up a small MySQL instance on your local machine and keep the data there instead. You would be able to connect Tableau directly to the MySQL instance and would be able to easily edit the source data.
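If you go the local-MySQL route, a minimal sketch of the bulk load might look like the following; the database, table, file path, and credentials are placeholders, the target table must already exist with matching columns, and LOCAL INFILE has to be enabled on both the server and the JDBC connection:

    // Hypothetical sketch: bulk-load a large csv into a local MySQL instance so
    // Tableau can connect to MySQL instead of the raw file. Database, table,
    // file path and credentials are placeholders; MySQL Connector/J must be on
    // the classpath and the table has to exist already.
    import java.sql.*;

    public class LoadCsvIntoMySql {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/analytics?allowLoadLocalInfile=true";
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 Statement st = con.createStatement()) {
                // LOAD DATA LOCAL INFILE is far faster than row-by-row inserts
                st.execute(
                    "LOAD DATA LOCAL INFILE 'C:/data/big_extract.csv' " +
                    "INTO TABLE big_extract " +
                    "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' " +
                    "LINES TERMINATED BY '\\n' IGNORE 1 LINES");
            }
        }
    }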

Filling table from several input files

I have the following scenario: several csv files contain different columns of the same table. Can I somehow fill the Redshift table from them, ideally with the help of Data Pipeline? I couldn't find a way to achieve this. Can anyone help with a solution, or maybe a simple example, if it's possible?
You can do it by converting your csv files into JSON format before loading them. Then, when a particular JSON key is not present in a file, COPY will simply skip it and leave that column empty.
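As a rough illustration (the table name, S3 path, and IAM role ARN are placeholders), the load could then be a single COPY with JSON 'auto', issued here through JDBC:

    // Hypothetical sketch: after converting the csv files to JSON and putting
    // them in S3, issue one COPY with JSON 'auto' so each column is matched by
    // its JSON key and missing keys load as NULL. Table name, bucket and IAM
    // role ARN are placeholders; the Redshift JDBC driver must be on the classpath.
    import java.sql.*;

    public class CopyJsonIntoRedshift {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb";
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 Statement st = con.createStatement()) {
                st.execute(
                    "COPY my_table " +
                    "FROM 's3://my-bucket/converted-json/' " +
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' " +
                    "FORMAT AS JSON 'auto'");
            }
        }
    }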

Fastest way to upload text files into HDFS (Hadoop)

I am trying to upload 1 million text files into HDFS.
Uploading those files using Eclipse is taking around 2 hours.
Can anyone suggest a faster way to do this?
What I am thinking of is: zip all the text files into a single archive, upload that into HDFS, and finally extract the files onto HDFS using some unzipping technique.
Any help will be appreciated.
Distcp is a good way to upload files to HDFS, but for your particular use case (you want to upload local files to a single-node cluster running on the same computer) the best thing is not to upload the files to HDFS at all. You can use localfs (file://a_file_in_your_local_disk) instead of HDFS, so there is no need to upload the files.
See this other SO question for examples on how to do this.
Try DistCp. DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. You can use it to copy data from your local FS to HDFS as well.
Example: bin/hadoop distcp file:///Users/miqbal1/dir1 hdfs://localhost:9000/
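If you would rather script the upload than go through Eclipse, a minimal sketch using the HDFS Java API could look like this; the fs.defaultFS address, the local directory, and the target path are placeholders:

    // Hypothetical sketch: upload a whole local directory of text files with the
    // HDFS Java API instead of Eclipse. The fs.defaultFS address, local directory
    // and HDFS target path are placeholders; the Hadoop client jars must be on
    // the classpath.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.File;

    public class BulkUploadToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            File localDir = new File("/data/text-files");
            File[] files = localDir.listFiles();
            Path[] sources = new Path[files.length];
            for (int i = 0; i < files.length; i++) {
                sources[i] = new Path(files[i].getAbsolutePath());
            }

            // One call copies all files; delSrc=false keeps the local copies,
            // overwrite=true replaces anything already at the target path.
            fs.copyFromLocalFile(false, true, sources, new Path("/user/hadoop/text-files"));
            fs.close();
        }
    }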