Multiple sources to Hive database - StreamSets

I am receiving data from multiple sources like the below. How can I process it into a Hive table using a StreamSets pipeline?
Ex:
1st day - 10 flat files (.csv format)
2nd day - 10 flat files and 10 pdf files
3rd day - 10 oracle tables and 10 flat files
Using StreamSets, I need to process this data into Hive with dynamic sources.

You will need a separate pipeline for each source. Also, there is no off-the-shelf origin for PDF files; you would need to look at using a scripting origin or a custom Java origin to do that.
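For illustration, here is a minimal sketch of the kind of PDF-to-record extraction a scripting or custom origin would have to perform, written as a standalone Python script. The pypdf library, the file paths, and the CSV output layout are assumptions for the sketch, not part of any StreamSets API.

```python
# Minimal sketch: extract text from PDF files and emit CSV-style records
# that a downstream pipeline could consume. pypdf is an assumed dependency
# (pip install pypdf); all paths are placeholders.
import csv
import glob

from pypdf import PdfReader


def pdfs_to_csv(input_glob: str, output_path: str) -> None:
    """Read every PDF matching input_glob and write one row per page."""
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["source_file", "page_number", "text"])
        for pdf_path in glob.glob(input_glob):
            reader = PdfReader(pdf_path)
            for page_number, page in enumerate(reader.pages, start=1):
                # extract_text() can return None for image-only pages.
                writer.writerow([pdf_path, page_number, page.extract_text() or ""])


if __name__ == "__main__":
    pdfs_to_csv("/data/incoming/*.pdf", "/data/staging/pdf_text.csv")
```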

Related

ETL with Dataprep - Union Dataset

I'm a newcomer to GCP; I'm learning every day and loving this platform.
I'm using GCP's Dataprep to join several CSV files (with the same column structure), treat some data, and write to a BigQuery table.
I created a storage bucket to put all 60 CSV files inside. In Dataprep, can I define a dataset to be the union of all these files? Or do I have to create a dataset for each file?
Thank you very much for your time and attention.
If you have all your files inside a directory in GCS, you can import that directory as a single dataset. The process is the same as importing single files. You have to make sure, though, that the column structure is exactly the same for all the files inside the directory.
If you create a separate dataset for each file, you have more flexibility on the structure of each one when you use the UNION page to concatenate them.
However, if your use case is just to load all the files (~60) into a single table in BigQuery without any transformation, I would suggest just using a BigQuery load job. You can use a wildcard in the Cloud Storage URI to specify the files you want. Currently, BigQuery load jobs are free of charge, so it would be a very cost-effective solution compared to using Dataprep.
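As a rough sketch of that approach, a single load job with a wildcard URI can be submitted from the google-cloud-bigquery Python client; the bucket, dataset, and table names below are placeholders, and the autodetect/header settings are assumptions about your files.

```python
# Sketch: load all CSV files under a GCS prefix into one BigQuery table
# with a single load job. Bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes each file has a header row
    autodetect=True,      # let BigQuery infer the schema
)

# The wildcard picks up every CSV under the prefix (e.g. all ~60 files).
load_job = client.load_table_from_uri(
    "gs://my-bucket/csv-files/*.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("my-project.my_dataset.my_table").num_rows, "rows loaded")
```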

Loop through .csv files using Talend

Complete noob here to Talend/data integration in general. I have done simple things like loading a CSV into an Oracle table using Talend. Below is the requirement now; I'm looking for some ideas to get started, please.
Request:
I have a folder in a Unix environment where the source application is pushing out .csv files daily at 5 AM. They are named as below:
Filename_20200301.csv
Filename_20200302.csv
Filename_20200303.csv
.
.
and so on till current day.
I have to create a Talend job to parse through these CSV files every morning and load them into an Oracle table where my BI/reporting team can consume the data. This table will be used as a lookup table, and the source is making sure not to send duplicate records in the CSVs.
The files usually have about 250-300 rows per day. The plan is to keep an eye on it, and if the row volume increases in the future, consider limiting the time frame to a rolling 12 months.
Currently I have files from March 1st, 2020 up to today.
The destination Oracle schema/table is always the same.
Tools: Talend Data Fabric 7.1
I can think of the steps below, but I have no idea how to get started on steps 1) and 2):
1) Connect to a Unix server/shared location. I have the server details/username/password, but which component should I use in Metadata?
2) Parse through the files in the above location. Should I use tFileList? Where does tFileInputDelimited come in?
3) Maybe use tMap for some cleanup/changing data types before using tDBOutput to push into Oracle. I have used these components in the past; I just have to figure out how to insert into the Oracle table instead of truncate/load.
Any thoughts/other cool ways of doing it, please? Am I going down the right path?
For Step 1, you can use tFTPGet, which will save your files from the Unix server/shared location to your local machine or job server.
Then for Step 2, as you mentioned, you can use a combination of tFileList and tFileInputDelimited:
Point tFileList to the directory where your files are now saved (based on Step 1).
tFileList will iterate through the files found in the directory.
Next, tFileInputDelimited will parse each CSV one by one.
After that you can flow the data through a tMap to do whatever transformation you need and write it into your Oracle DB. An optional additional step is a tUnite, so you write into your DB all in one go.
Hope this helps.
Please use the flow below:
tFTPFileList --> tFileInputDelimited --> tMap --> tOracleOutput
If you are picking the files up from the local server rather than a remote one, please use tFileList instead of tFTPFileList.
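For anyone curious about the shape of this iterate-and-load pattern outside of the Talend components, here is a rough Python sketch of the same logic: glob the dated CSVs, parse each one, and insert the rows into the Oracle table. The python-oracledb library, connection details, and table/column names are assumptions, not something Talend generates.

```python
# Sketch of the tFileList --> tFileInputDelimited --> tOracleOutput pattern
# outside Talend: iterate over the dated CSVs and insert rows into Oracle.
# python-oracledb, connection details, and table/column names are assumed.
import csv
import glob

import oracledb


def load_daily_files(pattern: str, dsn: str, user: str, password: str) -> None:
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        cursor = conn.cursor()
        for path in sorted(glob.glob(pattern)):  # Filename_20200301.csv, ...
            with open(path, newline="", encoding="utf-8") as f:
                rows = [(r["id"], r["value"]) for r in csv.DictReader(f)]
            # Insert (not truncate/load) so earlier days' rows are kept.
            cursor.executemany(
                "INSERT INTO lookup_table (id, value) VALUES (:1, :2)", rows
            )
        conn.commit()


if __name__ == "__main__":
    load_daily_files("/data/inbound/Filename_*.csv",
                     dsn="dbhost/orclpdb1", user="etl_user", password="secret")
```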

How to create an external table in Redshift Spectrum when the file location changes every day?

We are planning to source data from another AWS account's S3 using AWS Redshift Spectrum. However, the source informed us that the bucket key will change every day, and the latest data will be available under the key with the latest timestamp.
Can anyone suggest what is the best way to create this external table?
An external table in Spectrum can either be configured to point to a prefix in S3 (kind of like a folder in a normal filesystem), or you can use a manifest file to specify the exact list of files the table should comprise (they can even reside in different S3 buckets).
So you will have to create the table every day and point it to the correct location. If all the files end up under the same S3 prefix, you will have to use a manifest file to specify the current ones.
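To make the manifest approach concrete, here is a hedged Python sketch that finds the latest-timestamped key, writes a Spectrum manifest pointing at it, and prepares the external table DDL. The bucket names, external schema, column list, and file format are all placeholders for your actual setup.

```python
# Sketch: point a Spectrum external table at the latest file via a manifest.
# Bucket names, schema/table names, and column definitions are placeholders.
import json

import boto3

s3 = boto3.client("s3")

# 1) Find the object with the latest timestamp in the source bucket.
resp = s3.list_objects_v2(Bucket="source-bucket", Prefix="daily-export/")
latest = max(resp["Contents"], key=lambda obj: obj["LastModified"])

# 2) Write a manifest listing the file(s) the external table should cover.
manifest = {
    "entries": [
        {
            "url": f"s3://source-bucket/{latest['Key']}",
            "mandatory": True,
            "meta": {"content_length": latest["Size"]},
        }
    ]
}
s3.put_object(Bucket="my-bucket", Key="manifests/latest.manifest",
              Body=json.dumps(manifest))

# 3) (Re)create the external table pointing at the manifest. Run this DDL
#    through whatever Redshift connection you use (psycopg2, Data API, ...).
create_sql = """
DROP TABLE IF EXISTS spectrum_schema.daily_data;
CREATE EXTERNAL TABLE spectrum_schema.daily_data (
    id      BIGINT,
    payload VARCHAR(4096)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/manifests/latest.manifest';
"""
```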
A hint not directly related to the question:
What you could also do is create tables daily with a timestamp in the name, and every day create a view pointing to the latest table. This way it will be easy to look at the historical data, or, if you use the data for e.g. machine learning, to pin the input to an immutable version of the data so that you can reproducibly fetch training data - but this of course depends on your requirements.

Talend Open Studio Big Data - Iterate and load multiple files in DB

I am new to Talend and need guidance on the scenario below:
We have a set of 10 JSON files with different structures/schemas that need to be loaded into 10 different tables in a Redshift DB.
Is there a way we can write generic script/job which can iterate through each file and load it into database?
For example:
File Name: abc_<date>.json
Table Name: t_abc
File Name: xyz<date>.json
Table Name: t_xyz
and so on..
Thanks in advance
With the Talend Enterprise version, one can benefit from dynamic schema. However, based on my experience with JSON files, they are usually somewhat nested structures, so you'd have to figure out how to flatten them; once that's done it becomes a 1:1 load. With Open Studio, however, this will not work due to the missing dynamic schema support.
Basically, what you could do is write some Java code that transforms your JSON into CSV. Then use either psql from the command line or, if your Talend contains a new enough PostgreSQL JDBC driver, invoke the client-side \COPY from it to load the data. If your file and the database table column order match, it should work without needing to specify how many columns you have, so it's dynamic, but the data never "flows" through Talend.
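A minimal sketch of that flattening step, done in Python rather than Java for brevity: the input/output paths and the assumption that each file holds a JSON array of (possibly nested) objects are mine, not part of the original answer.

```python
# Sketch: flatten a JSON file (assumed to be an array of possibly nested
# objects) into a CSV that \COPY or COPY can load. Paths are placeholders.
import csv
import json


def flatten(obj: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def json_to_csv(json_path: str, csv_path: str) -> None:
    with open(json_path, encoding="utf-8") as f:
        records = [flatten(rec) for rec in json.load(f)]
    # Column order must match the target table for the \COPY trick to work.
    columns = sorted({col for rec in records for col in rec})
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=columns)
        writer.writeheader()
        writer.writerows(records)


if __name__ == "__main__":
    json_to_csv("abc_20200301.json", "abc_20200301.csv")
```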
A not-so-elegant but theoretically possible solution: if Redshift supports JSON (Postgres does), you can create a staging table with 2 columns: filename and content. Once the whole content is in this staging table, an INSERT-SELECT SQL statement could transform the JSON into a tabular format that can be inserted into the final table.
However, with your toolset you probably have no other choice than to load these files with one job per file, and I'd suggest one dedicated job for each file. They would each look for their own files and be triggered/scheduled individually, or be part of a bigger job where you scan the folders and trigger the right job for the right file.

AWS Glue, data filtering before loading into a frame, naming S3 objects

I have 3 questions for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each. How can I specify the batch size of a job?
Thanks in advance
I have also posted this question in the AWS Glue forum; here is a link to that thread: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, currently it works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
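For the partitioned-S3 case the pushdown feature does cover, the predicate is passed when creating the DynamicFrame from the Glue Catalog; a rough sketch follows, where the database, table, and partition column names are placeholders.

```python
# Sketch: read only selected partitions of a Glue Catalog table backed by
# partitioned S3 data. Database, table, and partition names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read from S3.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    push_down_predicate="year = '2020' and month >= '03'",
)
print(frame.count())
```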
It's not possible to specify the names of the output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
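If renaming after the job finishes is acceptable, it boils down to a copy-then-delete with boto3; a small sketch follows, where the bucket, prefix, and target naming scheme are assumptions.

```python
# Sketch: "rename" Glue output objects on S3 by copying each one to the
# desired key and deleting the original. Bucket, prefix, and the naming
# scheme are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-output-bucket"

listing = s3.list_objects_v2(Bucket=bucket, Prefix="glue-output/run-")
for index, obj in enumerate(listing.get("Contents", [])):
    new_key = f"glue-output/export_part_{index:04d}.csv"
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": obj["Key"]})
    s3.delete_object(Bucket=bucket, Key=obj["Key"])
```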
You can't really control the size of the output files. There is an option to control the minimum number of files using coalesce, though. Also, starting from Spark 2.2, it's possible to set the maximum number of records per file by setting the config spark.sql.files.maxRecordsPerFile.
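A short sketch of both options in a Glue/Spark script; the DynamicFrame argument, the output path, and the numbers used are placeholders.

```python
# Sketch: two ways to limit the number/size of output files from a Glue job.
# The DynamicFrame argument, output path, and numbers are placeholders.
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import SparkSession


def write_with_fewer_files(frame: DynamicFrame) -> None:
    spark = SparkSession.builder.getOrCreate()
    # Spark >= 2.2: cap the number of records written per output file.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)

    # coalesce() caps the number of output files (here: at most 4).
    frame.toDF() \
        .coalesce(4) \
        .write \
        .mode("overwrite") \
        .csv("s3://my-output-bucket/glue-output/")
```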