Apache Beam - Is it possible to write to an Excel file? - apache-beam

I would like to create an Apache Beam pipeline for Dataflow that gets data from a database, transforms it, and uploads the result to GCP as styled Excel files (.xls and .xlsx) that I could then share.
The Apache POI library lets me create styled Excel files, but I can't see how to integrate it into an Apache Beam pipeline, because writing a workbook isn't really a per-element transformation on a PCollection.
Does anyone have an idea how I could do this without having to write CSV files?
Thanks
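
One way to approach this (a minimal sketch, not a tested solution): collect all rows onto a single worker, for example with a GroupByKey on a constant key, then build the styled workbook with POI inside a DoFn and write it out through Beam's FileSystems API. The element type, class name, and output path below are hypothetical:

```java
// A sketch only: collects all rows on one worker and writes a styled
// workbook through Beam's FileSystems API. Element type and path are
// hypothetical.
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// Upstream: GroupByKey on a constant key so the whole result reaches one worker.
class WriteXlsxFn extends DoFn<KV<Void, Iterable<String>>, Void> {
  private final String outputPath; // e.g. "gs://my-bucket/report.xlsx" (hypothetical)

  WriteXlsxFn(String outputPath) {
    this.outputPath = outputPath;
  }

  @ProcessElement
  public void processElement(@Element KV<Void, Iterable<String>> element) throws Exception {
    try (XSSFWorkbook workbook = new XSSFWorkbook()) {
      Sheet sheet = workbook.createSheet("data");
      int rowIndex = 0;
      for (String value : element.getValue()) {
        Row row = sheet.createRow(rowIndex++);
        row.createCell(0).setCellValue(value);
        // POI styling (CellStyle, fonts, formats) can be applied here as usual.
      }
      // Write the finished workbook to GCS (or any Beam-supported filesystem).
      ResourceId resource = FileSystems.matchNewResource(outputPath, false);
      WritableByteChannel channel = FileSystems.create(resource, "application/octet-stream");
      try (OutputStream out = Channels.newOutputStream(channel)) {
        workbook.write(out);
      }
    }
  }
}
```

The trade-off is that the whole workbook must fit in one worker's memory; for .xls output the same pattern would work with HSSFWorkbook instead of XSSFWorkbook.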

Related

How to zip (archive/compress in .zip) a file inside SOA using BPEL?

I have a requirement to pick up a .dat file from a remote server and move it to a content server over SOAP, but as a zip file. What is the best way to achieve this? I am using SOA 12c. Is Java embedding a way to do so? What are the other ways?
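
If Java embedding turns out to be an option, the zipping step itself only needs plain java.util.zip. A minimal sketch, with hypothetical class name and paths, and the BPEL/SOA wiring omitted:

```java
// A sketch only: zips a single file with plain java.util.zip. Paths and
// class name are hypothetical; the BPEL/SOA wiring is omitted.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipUtil {
  public static void zip(String sourceFile, String targetZip) throws Exception {
    try (FileInputStream in = new FileInputStream(sourceFile);
         ZipOutputStream out = new ZipOutputStream(new FileOutputStream(targetZip))) {
      // Entry name = file name without its directory part.
      out.putNextEntry(new ZipEntry(sourceFile.substring(sourceFile.lastIndexOf('/') + 1)));
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) > 0) {
        out.write(buffer, 0, read);
      }
      out.closeEntry();
    }
  }
}
```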

Talend issue while copying local files to HDFS

Hi, I want to know how to copy files from the source file system (local file system) to HDFS. If a source file has already been copied to HDFS, how do I skip it so it isn't copied again, using Talend?
Thanks
Venkat
To copy files from the local file system to HDFS, you need to use the tHDFSPut component if you have Talend for Big Data. If you use Talend for Data Integration, you can easily use the tSystem component with the right command.
To avoid duplicated files, you need to create a table in an RDBMS and keep track of all copied files. Each time the job starts copying a file, it should check whether the file already exists in the table.
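
A minimal sketch of the bookkeeping this answer describes, for example called from a tJavaRow step; the table copied_files(file_name VARCHAR PRIMARY KEY) and the class name are hypothetical:

```java
// A sketch only: record each copied file in an RDBMS table and skip files
// already recorded. Table and class names are hypothetical.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CopyTracker {
  // Returns true if the file was recorded now (i.e. not copied before).
  public static boolean markIfNew(Connection conn, String fileName) throws Exception {
    try (PreparedStatement check =
             conn.prepareStatement("SELECT 1 FROM copied_files WHERE file_name = ?")) {
      check.setString(1, fileName);
      try (ResultSet rs = check.executeQuery()) {
        if (rs.next()) {
          return false; // already copied, skip
        }
      }
    }
    try (PreparedStatement insert =
             conn.prepareStatement("INSERT INTO copied_files (file_name) VALUES (?)")) {
      insert.setString(1, fileName);
      insert.executeUpdate();
    }
    return true;
  }
}
```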

Analytics for Apache Hadoop - what files are uploaded for Analyzing data with Oozie?

The Analytics for Apache Hadoop documentation lists the following steps for analysing data with Oozie:
Analyzing data with Oozie
1. Install required drivers.
2. Use webHDFS to upload the workflow related files to HDFS. For example, upload the files to /user/biblumix/apps/oozie
...
Source: https://www.ng.bluemix.net/docs/services/AnalyticsforHadoop/index.html
Question: What files are typically uploaded in step 2? The wording suggests that the files are Oozie files (e.g. XML files). However, the link takes you to the section Upload your data.
I performed some testing, and I had to upload a workflow.xml in addition to the data files that my Oozie job processes.
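
For reference, a minimal sketch of the kind of workflow file step 2 refers to; the workflow name, action, and paths here are hypothetical:

```xml
<!-- A sketch only: a minimal Oozie workflow.xml. The action and paths
     are hypothetical. -->
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="my-action"/>
  <action name="my-action">
    <fs>
      <mkdir path="${nameNode}/user/biblumix/output"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```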

How to check whether the content of a file has changed before processing it in a job using the Spring Batch framework

How can I check whether the content of a file has changed before processing it in a job using the Spring Batch framework? My idea is to compare it against the existing database where I wrote the file's previous content, to avoid processing the file again if its content has not changed. I am new to the Spring Batch framework. Can you give me some ideas or sample code to do this?
See the Spring Integration Documentation.
You can use a file inbound channel adapter, configured with a FileSystemPersistentAcceptOnceFileListFilter. If the modified time on the file changes, the file will be resent to the message channel.
Then use the Spring Batch Integration components (e.g. JobLaunchingGateway) to launch your batch job to process the file.
You need to be careful, though, to not pick up the file while it is in the process of being modified. It's generally better to remove or rename the file after processing and have the writer create a temporary file and rename it to the final file name after writing. This will avoid the problem of the adapter "seeing" a partially updated file.
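
A minimal sketch of that adapter/filter wiring in Spring Integration Java configuration; the directory, channel name, and poll interval are hypothetical, and the downstream channel and JobLaunchingGateway wiring are omitted:

```java
// A sketch only: file inbound adapter with the accept-once filter described
// above. Directory, channel name, and poll interval are hypothetical.
import java.io.File;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.core.MessageSource;
import org.springframework.integration.file.FileReadingMessageSource;
import org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter;
import org.springframework.integration.metadata.SimpleMetadataStore;

@Configuration
public class FilePollingConfig {

  @Bean
  @InboundChannelAdapter(value = "fileInputChannel", poller = @Poller(fixedDelay = "5000"))
  public MessageSource<File> fileSource() {
    FileReadingMessageSource source = new FileReadingMessageSource();
    source.setDirectory(new File("/data/inbox")); // hypothetical directory
    // Accepts each file once per (name, lastModified); a modified file is re-emitted.
    // Use a persistent MetadataStore instead of SimpleMetadataStore to survive restarts.
    source.setFilter(new FileSystemPersistentAcceptOnceFileListFilter(
        new SimpleMetadataStore(), "batch-"));
    return source;
  }
}
```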

Spring batch FlatFileItemReader read multiple files

As per the Spring Batch docs, they don't recommend using MultiResourceItemReader because of restart issues and recommend using one file in each folder.
"It should be noted that, as with any ItemReader, adding extra input
(in this case a file) could cause potential issues when restarting. It
is recommended that batch jobs work with their own individual
directories until completed successfully."
If I have a folder with the following structure: dest/<timestamp>/file1.txt, file2.txt
How do I configure FlatFileItemReader to read files matching a pattern in each folder in the path?
I would prefer the Spring Integration project for reading files from a directory, since polling a directory is not the Spring Batch framework's business.
In the most basic scenario, Spring Integration polls the files in the directory and, for each file, runs a job with the filename as a parameter. This keeps the file-polling logic out of your batch jobs.
I would suggest this excellent article by Dave Syer for the basic concepts of integrating these two technologies. Take a close look at the sections dealing with FileToJobLaunchRequestAdapter.
The source code of this adapter will also help in understanding the internals.
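
For illustration, a minimal sketch of the transformer pattern such an adapter implements: turning each polled file message into a JobLaunchRequest that carries the filename as a job parameter. The class and parameter names are hypothetical:

```java
// A sketch only: transform a file message into a JobLaunchRequest. Class
// and parameter names are hypothetical.
import java.io.File;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.integration.launch.JobLaunchRequest;
import org.springframework.integration.annotation.Transformer;
import org.springframework.messaging.Message;

public class FileMessageToJobRequest {

  private final Job job;

  public FileMessageToJobRequest(Job job) {
    this.job = job;
  }

  @Transformer
  public JobLaunchRequest toRequest(Message<File> message) {
    // Each file becomes one job execution, identified by its path.
    return new JobLaunchRequest(
        job,
        new JobParametersBuilder()
            .addString("input.file.name", message.getPayload().getAbsolutePath())
            .toJobParameters());
  }
}
```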
I also had a similar requirement to read multiple text/CSV files, and achieved it using org.springframework.batch.item.file.MultiResourceItemReader.
The detailed implementation is provided in the below link.
http://parameshk.blogspot.in/2013/11/spring-batch-flat-file-reader-reads.html
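
For completeness, a minimal sketch of the MultiResourceItemReader approach from this answer, assuming simple line-per-record text files; the path pattern is hypothetical:

```java
// A sketch only: a MultiResourceItemReader over a FlatFileItemReader
// delegate. The path pattern is hypothetical.
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.MultiResourceItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

public class MultiFileReaderFactory {

  public static MultiResourceItemReader<String> reader() throws Exception {
    // Matches dest/<timestamp>/fileN.txt across all timestamp folders.
    Resource[] resources = new PathMatchingResourcePatternResolver()
        .getResources("file:dest/*/file*.txt");

    FlatFileItemReader<String> delegate = new FlatFileItemReader<>();
    delegate.setLineMapper(new PassThroughLineMapper());

    MultiResourceItemReader<String> reader = new MultiResourceItemReader<>();
    reader.setResources(resources);
    reader.setDelegate(delegate);
    return reader;
  }
}
```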