I will have 5-6 files on HDFS every day which I have to parse and load into Hive tables.
How can I iterate over each file and pass it to the next component (tFileInputPositional)?
I can't find tFileList/tLoop in a Big Data Batch job in Talend Studio.
Please help me.
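Not a Talend component answer, but since Big Data Batch jobs run on Spark under the hood, the usual Spark-level equivalent is to skip the per-file loop entirely and read the whole HDFS directory with a glob path. A rough sketch follows; the paths, column offsets, and Hive table name are made-up placeholders, not from the original job:

```python
# Hypothetical sketch: read every file under the daily HDFS directory in one pass,
# then slice each line by position, similar to what tFileInputPositional does.
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = (
    SparkSession.builder
    .appName("hdfs-positional-load")
    .enableHiveSupport()
    .getOrCreate()
)

# The glob picks up all 5-6 daily files at once; no explicit loop is needed.
raw = spark.read.text("hdfs:///data/incoming/2023-09-01/*")

# Fixed-width parsing: the column positions below are assumed, adjust to the real layout.
parsed = raw.select(
    substring("value", 1, 10).alias("col_a"),   # characters 1-10
    substring("value", 11, 5).alias("col_b"),   # characters 11-15
)

parsed.write.mode("append").saveAsTable("my_db.my_hive_table")  # hypothetical Hive table
```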
I have an S3 folder which has multiple records:
s3://bucket/app-events/year=201909
This folder has multiple Parquet files, and reading is driven by a watermark file which contains:
{"last_modified_time": "2021-09-01 00:00:00", "end_time": "2022-09-02 23:59:59"}
This decides from which day and time the files should start being read. We are reading the files one by one in a loop and writing each one as we go, which takes a long time with huge data. Is there any technique to merge the records instead of reading and writing them one by one?
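One approach is to let Spark read the whole prefix in one go and filter by the watermark window, so there is a single parallel read and a single write instead of a per-file loop. A minimal sketch, assuming a watermark file location, a record-level timestamp column, and an output path that are not in the original post:

```python
# Read the entire partition at once, filter by the watermark window, write once.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-app-events").getOrCreate()

# Load the watermark file (its location here is an assumption).
watermark = json.loads(
    spark.sparkContext.wholeTextFiles("s3://bucket/app-events/watermark.json").values().first()
)

# Spark parallelizes the read across all Parquet files under the prefix.
df = spark.read.parquet("s3://bucket/app-events/year=201909/")

# Assuming each record carries an event timestamp column named "event_time".
filtered = df.where(
    (df.event_time >= watermark["last_modified_time"]) & (df.event_time <= watermark["end_time"])
)

# One write of the merged result instead of many small per-file writes.
filtered.write.mode("overwrite").parquet("s3://bucket/app-events-merged/year=201909/")
```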
I am running a PySpark Glue job with 10 DPUs; the data in S3 is around 45 GB, split into 6 .csv files.
First question:
It's taking a lot of time to write data to Redshift from Glue even though I am running 10 DPUs.
Second:
How can I make it faster and more efficient? Should I write the transformed data back to S3 in Parquet format and then maybe use a COPY command to load it directly into Redshift?
Please suggest the best ideas and approaches.
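For the second question, here is a hedged sketch of the stage-to-S3-then-COPY approach described above. The bucket, table, and IAM role names are placeholders, and `transformed_df` stands for the DataFrame produced by the job's transformations:

```python
# Placeholder staging location, not from the original post.
staging_path = "s3://my-staging-bucket/events_parquet/"

# 1) Write the transformed DataFrame to S3 as Parquet (columnar, compressed,
#    and splittable, so Redshift can load it in parallel).
transformed_df.write.mode("overwrite").parquet(staging_path)

# 2) Load it into Redshift with a single COPY, run through any SQL client or a
#    post-job step. COPY ... FORMAT AS PARQUET loads all files under the prefix
#    in parallel across Redshift slices.
copy_sql = """
COPY analytics.events
FROM 's3://my-staging-bucket/events_parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS PARQUET;
"""
```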
We have a Spring Batch application which inserts data into a few tables, then selects data from a few tables based on multiple business conditions and writes it to a feed file (flat text file). When run, the application generates an empty feed file with only headers and no data. The select query, when run separately in SQL Developer, runs for 2 hours and fetches the data (approx. 50 million records). We are using the following components in the application: JdbcCursorItemReader and FlatFileItemWriter. Below are the configuration details used.
maxBatchSize=100
fileFetchSize=1000
commitInterval=10000
There are no errors or exceptions while the application runs. I wanted to know if we are missing anything here, or whether any Spring Batch component is not being used properly. Any pointers in this regard would be really helpful.
I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each. How can I specify the batch size in a job?
Thanks in advance
I have also posted this question in the AWS Glue forum; here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
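For illustration, a minimal sketch of a pushdown predicate against a partitioned catalog table (the database, table, and partition column names are assumed). The predicate is applied to partition columns before any data is read, so unneeded partitions are never scanned:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read from S3.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="app_events",
    push_down_predicate="year = '2019' AND month >= '09'",  # partition columns only
)
```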
It's not possible to specify the names of output files. However, it looks like an option is to rename the files after writing (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of output files. There is an option to control the number of output files using coalesce, though. Also, starting from Spark 2.2, it's possible to set the maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
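A short sketch of both knobs (the output path and record limit below are placeholder values):

```python
# Cap the number of records written per output file (Spark 2.2+).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 500000)

# Or reduce the number of output files directly: coalesce(n) shrinks the
# DataFrame to n partitions, so at most n files are written per output
# directory (keep n small only for modest result sets).
df.coalesce(1).write.mode("overwrite").parquet("s3://my-bucket/output/")
```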
I have a Spark job that reads from an Oracle table into a DataFrame. The JDBC read method seems to pull in an entire table at once, so I constructed a spark-submit job to work in batch. Whenever I have data that needs to be manipulated, I put it in a table and run the spark-submit.
However, I would like this to be more event-driven: essentially, any time data is moved into this table it should be run through Spark, so events in a UI can drive these insertions while Spark is just running. I was thinking about using a Spark streaming context to have it watching and operating on the table all the time, but with a long wait between batches. That way I can use the results (partly also written back to Oracle) to trigger a deletion from the read table and avoid processing data more than once.
Is this a bad idea? Will it work? It seems more elegant than using a cron job.
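If the streaming-context idea turns out to be awkward (JDBC is not a built-in streaming source), a plain polling loop inside a long-running driver achieves something similar. A rough sketch, with the connection details, table names, and processed-flag column all assumed rather than taken from the original setup:

```python
# Poll Oracle for new rows, process them, write results back, then sleep.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-poller").getOrCreate()

jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  # placeholder connection
props = {"user": "app_user", "password": "***", "driver": "oracle.jdbc.OracleDriver"}

while True:
    # The "table" argument can be a subquery, so only unprocessed rows are
    # pulled on each cycle instead of the entire table.
    pending = spark.read.jdbc(
        url=jdbc_url,
        table="(SELECT * FROM staging_events WHERE processed = 'N') t",
        properties=props,
    )
    if pending.head(1):
        result = pending  # ... transformations go here ...
        result.write.jdbc(url=jdbc_url, table="results", mode="append", properties=props)
        # Marking or deleting the processed rows would happen via a separate JDBC call.
    time.sleep(300)  # long wait between polls, as described above
```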