StreamSets Ignoring File in HDFS - streamsets

I am getting the below error while running my pipeline. The pipeline takes files from HDFS, merges them, and stores the merged files back on HDFS.
Error:
WARN Ignoring file 'Filename and Location' in spool directory as is lesser than offset file.
Category : SpoolDirRunnable.
Kindly let me know the possible fix.

This file was probably already processed by this pipeline: the Spool Directory origin keeps an offset that records the last file it read, and files whose names sort before that offset are skipped with this warning.
You can try running the pipeline with the "Reset Origin" option.

Related

Scala Spark - overwrite parquet file failed to delete file or dir

I'm trying to create Parquet files for several days locally. The first time I run the code, everything works fine. The second time it fails to delete a file. The third time it fails to delete another file. It's totally random which file cannot be deleted.
The reason I need this to work is that I want to create Parquet files every day for the last seven days, so the files that are already there should be overwritten with the updated data.
I use Project SDK 1.8, Scala version 2.11.8 and Spark version 2.0.2.
After running this line of code the second time:
newDF.repartition(1).write.mode(SaveMode.Overwrite)
  .parquet(OutputFilePath + "/day=" + DateOfData)
this error occurs:
WARN FileUtil:
Failed to delete file or dir [C:\Users\...\day=2018-07-15\._SUCCESS.crc]:
it still exists.
Exception in thread "main" java.io.IOException:
Unable to clear output directory file:/C:/Users/.../day=2018-07-15
prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
After the third time:
WARN FileUtil: Failed to delete file or dir
[C:\Users\day=2018-07-20\part-r-00000-8d1a2bde-c39a-47b2-81bb-decdef8ea2f9.snappy.parquet]: it still exists.
Exception in thread "main" java.io.IOException: Unable to clear output directory file:/C:/Users/day=2018-07-20 prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
As you can see, it's a different file than in the second run of the code.
And so on. After deleting the files manually, all Parquet files can be created.
Does somebody know this issue and how to fix it?
Edit: It's always a .crc file that can't be deleted.
Thanks for your answers. :)
The solution is not to write in the Users directory. There seems to be a permission problem, so I created a new folder in the C: directory and it works perfectly.
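A minimal sketch of that fix, assuming a hypothetical output root created directly under C:\ (the other variable names are taken from the question):
// Hypothetical output root outside C:\Users, where Overwrite can delete its old files
val OutputFilePath = "C:/spark-output"
newDF.repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(OutputFilePath + "/day=" + DateOfData)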
This problem occurs when you have the destination directory open in Windows. You just need to close the directory.
Perhaps another Windows process has a lock on the file so it can't be deleted.

pyspark - capture malformed JSON file name after load fails with FAILFAST option

To detect malformed/corrupt/incomplete JSON files, I have used the FAILFAST option so that the process fails. How do I capture the corrupted file's name out of hundreds of files? I need to remove that file from the path and copy a good version of the file from the S3 bucket.
df = spark_session.read.json(table.load_path, mode='FAILFAST').cache()
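One way to narrow this down (a sketch, here in Scala against the same DataFrame reader API; inputPaths and spark are assumptions, not from the question) is to read each file separately under FAILFAST and record which ones throw:
import scala.util.Try
// Assumed: inputPaths holds the individual JSON file paths under the load path.
val corruptFiles = inputPaths.filterNot { path =>
  Try(spark.read.option("mode", "FAILFAST").json(path).count()).isSuccess
}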

Informatica Session Failing

I created a mapping that pulls data from a flat file that shows me usage data for specific SSRS reports. The file is overwritten each day with the previous day's usage data. My issue is that sometimes the report doesn't have any usage for that day, and my ETL sends me a "Failed" email because there wasn't any data in the source. Is there a way to keep the job from running if there is no data in the source, or to prevent it from failing?
--Thanks
A simple way to solve this is to create a "Passthrough" mapping that only contains a flat file source, source qualifier, and a flat file target.
You would create a session that runs this mapping at the beginning of your workflow and have it read your flat file source. The target can just be a dummy flat file that you keep overwriting. Then you would have this condition in the link to your next session that would actually process the file:
$s_Passthrough.SrcSuccessRows > 0
Yes, there are several ways you can do this.
You can provide an empty file to the ETL job when there is no source data. To do this, use a pre-session command like touch <filename> in the Informatica workflow. This will create an empty file named <filename> if it is not present, and the workflow will run successfully with 0 rows.
If you have a script that triggers the Informatica job, then you can put a check there as well like this:
if [ -e <filename> ]
then
    pmcmd ...
fi
This will keep the job from executing when the file is not present.
Have another session before the actual data load. Read the file, use a FALSE filter and some dummy target. Link this one to the session you already have and set the following link condition:
$yourDummySessionName.SrcSuccessRows > 0

Spark Streaming - textFileStream: Tmp file error

I have a Spark Streaming application that scans a declared directory via the textFileStream method. I have the batch interval set to 10 seconds, and I'm processing new files and concatenating their content into bigger Parquet files. Files are downloaded/created directly in the declared directory, and that causes an issue I cannot bypass.
Let's take the following scenario as an example:
The directory is empty most of the time, then around 100 small files arrive at the same time as the streaming batch triggers. The Spark stream finds 70 files with the .json extension and 30 files with a _tmp suffix (still under creation/download). This situation obviously crashes my app, since the process tries to work with a _tmp file which in the meantime has changed into a fully created/downloaded .json file.
I've tried to filter the RDD with the following method:
val jsonData = ssc.textFileStream("C:\\foo").filter(!_.endsWith("tmp"))
But it still causes
jsonData.foreachRDD{ rdd=>
/.../
}
to process _tmp files and to throw a "no such file" exception for *_tmp.
Question
I've read about using a staging directory from which I should move files after the create/download process finishes, but neither a copy nor a move operation (why is that? a copy operation creates a new file, so...) triggers textFileStream, and the process ignores those files. Is there any other method to filter out those files and wait until they are complete before processing?
Env spec
Directory on a Windows 8.1 file system; eventually this will be a Unix-like machine
Spark 1.5 for Hadoop 2.6
Actually, .filter(!_.endsWith("tmp")) filters the content of the files inside c:\foo, i.e. the data, by default line by line. If it finds a line (not a file name) ending with tmp, it removes that line from the stream.
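If the intent is to filter by file name rather than by line content, one possible approach (a sketch using Spark Streaming's fileStream variant, which accepts a path filter; the _tmp suffix is taken from the question) is:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// Only admit files whose names do not end with _tmp (i.e. skip files still being written).
val jsonData = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "C:\\foo",
    (path: Path) => !path.getName.endsWith("_tmp"),
    newFilesOnly = true
  ).map(_._2.toString)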

Spark coalesce loses file when program is aborted

In Scala/Spark I am using a DataFrame and writing it into a single file using:
val dataFrame = rdd.toDF()
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(filePath)
This works fine. But I figured out, using the console and Hadoop's ls command, that while it is being coalesced, the file and folder are not on the Hadoop file system.
When you type hadoop fs -ls hdfs://path, there is no such file or directory. After the coalesce is done, the path is there again, along with the file which was coalesced.
This might happen because the coalesce needs to delete the file and create a new one?!
Here is the problem: when I kill the process/app while the file is not on the file system, the complete file is deleted. So a failure of the system would destroy the file.
Do you have an idea how to prevent the file loss? I thought Spark/Hadoop would take care of this.
Thanks, Alex
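One common way to guard against this (a sketch, not from the thread; spark and the sibling temporary path are assumptions) is to write to a temporary location first and only replace the original path once the write has finished:
import org.apache.hadoop.fs.{FileSystem, Path}
// Write the coalesced output to an assumed temporary path next to the real one.
val tmpPath = filePath + "_tmp"
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(tmpPath)
// Only now replace the old output; if the job is killed before this point,
// the original data at filePath is still intact.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(filePath), true)
fs.rename(new Path(tmpPath), new Path(filePath))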