FileWatcher on a directory - scala

I have a Spark/Scala application, and my requirement is to look for a file in a directory,
process it, and finally clean up that directory.
Isn't it possible to do this within the Spark application itself, i.e.:
- Watch for a file in a directory
- When it finds the file, continue the process
- Clean up the directory before ending the app
- Repeat the above for the next run, and so on...
We currently do this file watching with an external application,
so in order to remove the dependency on that third-party application
we would like to do it within our Spark/Scala application itself.
Is there a feasible way to implement a file watcher using just Scala/Spark?
Please guide me.

What about file streams in Spark Streaming?
https://spark.apache.org/docs/latest/streaming-programming-guide.html#file-streams
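A minimal sketch of that approach (the directory path and 30-second batch interval are placeholders): textFileStream only picks up files that are created in, or atomically moved into, the watched directory after the stream starts, and cleaning up the processed files is still your job, e.g. via the Hadoop FileSystem API.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectoryWatcher {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectoryWatcher")
    // Check the watched directory for new files every 30 seconds.
    val ssc = new StreamingContext(conf, Seconds(30))

    // Only files created in (or atomically moved into) this directory
    // after the stream starts are picked up.
    val lines = ssc.textFileStream("hdfs:///landing/incoming")

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Process the new file's contents here.
        rdd.foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```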

Related

How to log to a file in a specific folder

Some macOS apps write their logs into a folder like /Library/Logs or ~/Library/Logs. How can this be achieved?
I tried creating a folder in ~/Library/Logs using FileManager.createDirectory, but I think creating a file and writing to it every time using FileManager functions will make the app more complex.
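The real implementation in a macOS app would use FileManager/FileHandle in Swift, but the underlying pattern is small: create the directory once, then open the log file in append mode instead of rewriting it. A language-neutral sketch of that idea, shown here on the JVM with a hypothetical app name:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

object SimpleFileLogger {
  // Hypothetical per-app log folder under the user's home directory.
  private val logDir  = Paths.get(System.getProperty("user.home"), "Library", "Logs", "MyApp")
  private val logFile = logDir.resolve("app.log")

  def log(message: String): Unit = {
    Files.createDirectories(logDir) // no-op if the folder already exists
    // CREATE makes the file on first use; APPEND adds to it afterwards,
    // so there is no manual open/seek/close bookkeeping.
    Files.write(logFile, (message + System.lineSeparator).getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }
}
```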

drag and drop ear file on wildfly to deploy project

I'm trying to deploy my project on WildFly by drag and drop.
When I drag and drop the EAR project onto the WildFly server, I end up with myProject-ear.ear.dodeploy in wildfly-10.0.0.Final\standalone\deployments.
I want to end up with myProject-ear.ear.deployed instead of myProject-ear.ear.dodeploy after dragging and dropping the EAR project onto the server.
Do you have any idea how to solve this issue? Thanks a lot.
Whether drag & drop (or actually creating the war/jar/ear/... file in the deployments directory) is sufficient can be configured in the WildFly configuration file (standalone.xml in your case). But the fact that you create that file and then see a ...dodeploy file popping up tells you the deployment scanner has found your file and is acting.
Once the deployment has finished, you should instead see a file named .deployed or .failed. In case of failure, a log snippet inside the file may hint at the reason.
But be aware of something: a drag & drop usually triggers a copy operation. Depending on the size of your file, that copy may take some time. WildFly's deployment scanner checks the directory every XXX seconds (configurable). So if your copy has started but the deployment scanner picks the file up before the copy is complete, WildFly tries to deploy an incomplete archive. This should result in an error message, but it may be what you are experiencing.
So it may be better to first copy the file to another directory (on the same disk), then just move/rename the file into the deployments folder - this operation is atomic, and the deployment scanner will immediately see the full file.
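For example (paths are illustrative; the key point is that Files.move within one filesystem is atomic, while Files.copy is not):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

object Deploy {
  def main(args: Array[String]): Unit = {
    val base        = Paths.get("C:/wildfly-10.0.0.Final")
    val staging     = base.resolve("staging/myProject-ear.ear")
    val deployments = base.resolve("standalone/deployments/myProject-ear.ear")
    Files.createDirectories(staging.getParent)

    // The slow, non-atomic copy happens outside the scanned folder...
    Files.copy(Paths.get("myProject-ear.ear"), staging, StandardCopyOption.REPLACE_EXISTING)

    // ...then the rename into deployments is atomic on the same disk,
    // so the scanner never sees a half-written archive.
    Files.move(staging, deployments, StandardCopyOption.ATOMIC_MOVE)
  }
}
```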
Another way would be to stop the deployment scanner completely and stop/start JBoss after every change to the deployments directory. This is advisable anyway if you run short on PermGen space.

Analytics for Apache Hadoop - what files are uploaded for Analyzing data with Oozie?

The Analytics for Apache Hadoop documentation lists the following steps for analysing data with Oozie:
Analyzing data with Oozie
1. Install required drivers.
2. Use webHDFS to upload the workflow related files to HDFS. For example, upload the files to /user/biblumix/apps/oozie
...
Source: https://www.ng.bluemix.net/docs/services/AnalyticsforHadoop/index.html
Question: What files are typically uploaded in step 2? The wording suggests that the files are Oozie files (e.g. XML files). However, the link takes you to the section "Upload your data".
I performed some testing, and I had to upload a workflow.xml in addition to the data files that my Oozie job processes.
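For reference, a sketch of that upload using the Hadoop FileSystem API rather than raw webHDFS calls (the local file names are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object UploadOozieApp {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    val fs = FileSystem.get(new Configuration())

    val appDir = new Path("/user/biblumix/apps/oozie")
    fs.mkdirs(appDir)

    // The workflow definition that Oozie reads...
    fs.copyFromLocalFile(new Path("workflow.xml"), new Path(appDir, "workflow.xml"))
    // ...and the input data the workflow's actions process.
    fs.copyFromLocalFile(new Path("data/input.csv"), new Path(appDir, "input.csv"))

    fs.close()
  }
}
```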

How to check whether the content of a file has changed before processing it in a job using the Spring Batch framework

How can I check whether the content of a file has changed before processing it in a job using the Spring Batch framework? My idea is to compare it against the existing database where I wrote that file's content (i.e. the previous content of the file), to avoid processing it again if the content has not changed. I am new to the Spring Batch framework. Can you give me some ideas or sample code to do that?
See the Spring Integration Documentation.
You can use a file inbound channel adapter, configured with a FileSystemPersistentAcceptOnceFileListFilter. If the modified time on the file changes, the file will be resent to the message channel.
Then use the Spring Batch Integration components (e.g. the JobLaunchingGateway) to launch your batch job to process the file.
You need to be careful, though, to not pick up the file while it is in the process of being modified. It's generally better to remove or rename the file after processing and have the writer create a temporary file and rename it to the final file name after writing. This will avoid the problem of the adapter "seeing" a partially updated file.
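For intuition, a minimal plain-Scala sketch of what FileSystemPersistentAcceptOnceFileListFilter does (the real filter keeps this map in a MetadataStore, so it survives restarts):

```scala
import java.io.File
import scala.collection.mutable

// Accept a file the first time it is seen, and again only if its
// last-modified timestamp has changed since the previous acceptance.
class AcceptOnceUnlessModified {
  private val seen = mutable.Map.empty[String, Long]

  def accept(file: File): Boolean = synchronized {
    val unchanged = seen.get(file.getAbsolutePath).contains(file.lastModified())
    if (!unchanged) seen(file.getAbsolutePath) = file.lastModified()
    !unchanged
  }
}
```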

Spring batch FlatFileItemReader read multiple files

As per the Spring Batch docs, they don't recommend using MultiResourceItemReader because of restart issues, and recommend working with one file in each folder:
"It should be noted that, as with any ItemReader, adding extra input
(in this case a file) could cause potential issues when restarting. It
is recommended that batch jobs work with their own individual
directories until completed successfully."
If I have a folder with the following structure: dest/<timestamp>/file1.txt, file2.txt
how do I configure FlatFileItemReader to read files matching a pattern in each folder on the path?
I would prefer the Spring Integration project for reading files from a directory, since it is not the Spring Batch framework's business to poll a directory.
In the most basic scenario, Spring Integration will poll the directory, and for each file it will run a job with the filename as a parameter. This keeps the file-polling logic out of your batch jobs.
I would suggest the excellent article by Dave Syer on the basic concepts of integrating these two technologies; take a close look at the sections dealing with FileToJobLaunchRequestAdapter.
The source code of this adapter will also help in understanding the internals.
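Conceptually, the adapter turns each polled file into a job launch with the file name as a job parameter - roughly the following sketch, where jobLauncher and job would be Spring-managed beans and the parameter name is illustrative:

```scala
import java.io.File
import org.springframework.batch.core.{Job, JobParametersBuilder}
import org.springframework.batch.core.launch.JobLauncher

class LaunchJobPerFile(jobLauncher: JobLauncher, job: Job) {
  def onFile(file: File): Unit = {
    val params = new JobParametersBuilder()
      .addString("input.file.name", file.getAbsolutePath)
      .toJobParameters
    // The same file name yields the same JobInstance, so Spring Batch's
    // restart semantics still apply per file.
    jobLauncher.run(job, params)
  }
}
```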
I also had a similar set of requirements, reading multiple text/CSV files, and achieved it by using org.springframework.batch.item.file.MultiResourceItemReader.
A detailed implementation is provided at the link below:
http://parameshk.blogspot.in/2013/11/spring-batch-flat-file-reader-reads.html
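If you do go the MultiResourceItemReader route despite the restart caveat quoted above, wiring it over a wildcard pattern could look roughly like this (the pattern and the pass-through line mapper are illustrative):

```scala
import org.springframework.batch.item.file.{FlatFileItemReader, MultiResourceItemReader}
import org.springframework.batch.item.file.mapping.PassThroughLineMapper
import org.springframework.core.io.support.PathMatchingResourcePatternResolver

object MultiFileReader {
  def reader(): MultiResourceItemReader[String] = {
    // Delegate that knows how to read a single flat file.
    val delegate = new FlatFileItemReader[String]()
    delegate.setLineMapper(new PassThroughLineMapper())

    // One reader over every file matching the pattern,
    // across all dest/<timestamp>/ folders.
    val multi = new MultiResourceItemReader[String]()
    multi.setResources(
      new PathMatchingResourcePatternResolver().getResources("file:dest/*/file*.txt"))
    multi.setDelegate(delegate)
    multi
  }
}
```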