pyspark - capture malformed JSON file name after load fails with FAILFAST option

To detect malformed/corrupt/incomplete JSON files, I have used the FAILFAST option so that the process fails. How do I capture the name of the corrupted file out of hundreds of files? I need to remove that file from the path and copy a good version of the file from the S3 bucket.
df = spark_session.read.json(table.load_path, mode='FAILFAST').cache()
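One possible way to get at the file names, sketched here in PySpark under the assumption that spark_session and table.load_path are as in the snippet above and that the schema is inferred: read with PERMISSIVE mode and a corrupt-record column instead of failing fast, tag every row with input_file_name(), and list the files that produced corrupt records.

from pyspark.sql.functions import col, input_file_name

# Read permissively so malformed rows land in a corrupt-record column
# instead of failing the whole job, and remember which file each row came from.
df = (spark_session.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(table.load_path)
      .withColumn("source_file", input_file_name())
      .cache())  # caching avoids the restriction on querying only the corrupt-record column

# The _corrupt_record column only shows up in the inferred schema if corrupt rows exist.
if "_corrupt_record" in df.columns:
    bad_files = (df.filter(col("_corrupt_record").isNotNull())
                   .select("source_file")
                   .distinct()
                   .collect())
    print([row.source_file for row in bad_files])

The listed files could then be removed and re-copied from the S3 bucket before re-running the FAILFAST load.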

Related

How to copy CSV file from blob container to another blob container with Azure Data Factory?

I would like to copy any file in a Blob container to another Blob container. No transformation is needed. How do I do it?
However, I get a validation error:
Copy data1: Dataset yellow_tripdata_2020_1 location is a folder, the wildcard file name is required for Copy data1
As the error states: the wildcard file name is required for Copy data1.
On your data source, in the file field, you should enter a pattern that matches the files you want to copy. So *.* if you want to copy all the files, and something like *.csv if you only want to copy over CSV files.

Unzip gzip files in Azure Data factory

I am wondering if it is possible to set up a source and sink in ADF that will unzip a gzip file and show the extracted txt file. What happened is that the sink was incorrectly defined: both the source and the sink had gzip compression.
So what ended up happening is that "file1.gz" is now "file1.gz.gz".
This is how the file looks in Azure Blob Storage (screenshot omitted).
This is how the file looks in the S3 bucket (the end of the name is cut off in the screenshot, but it ends with "txt.gz").
I saw that the Copy activity offers ZipDeflate and Deflate compression, but I get an error that it does not support this type of activity.
I created a sink in an ADF pipeline where I am trying to unzip it. In the data source screen I used ZipDeflate, but it writes the file name with a "deflate" extension, not with "txt".
Thank you
Create a "Copy data" activity.
Source:
As your extension is gz, you should choose GZip as the compression type and tick binary copy.
Target:
Blob Storage, Binary,
compression: none.
Such a copy pipeline will unzip your text file(s).

Reading a text file from local machine

I am trying to read a text file from a local path using Spark, but it's throwing an exception (error screenshot omitted).
The code I used to read file is this:
val assetFile = sc.textFile(assetFilePath)
assetFilePath is a variable which represents a path somewhere on my local machine.
val adFile = sc.textFile(adFilePath)
adFilePath is a variable which represents a path somewhere on my local machine.
By default, sc.textFile reads from HDFS, not from the local file system, but Spark supports multiple file systems apart from HDFS, such as the local file system, Amazon S3, Azure storage, and Swift FS.
So, in order to read from the local file system, you need to specify the protocol in the file path.
For example:
sc.textFile("file:///tmp/myfile.txt")
This will read a file named myfile.txt from the /tmp directory of the local file system on the machine where the Spark driver code is running.
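Since the main question on this page is PySpark, here is the same idea as a minimal PySpark sketch (the path is just an example):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# The explicit file:// scheme forces Spark to read from the local file system.
asset_file = sc.textFile("file:///tmp/myfile.txt")
print(asset_file.count())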

Spark Streaming - textFileStream: Tmp file error

I have a Spark Streaming application that scans a declared directory via the textFileStream method. The batch interval is set to 10 seconds, and I'm processing new files and concatenating their content into bigger Parquet files. Files are downloaded/created directly in the declared directory, and that causes an issue I cannot bypass.
Let's take the following scenario as an example:
The directory is empty most of the time; then around 100 small files arrive at the same time the streaming batch triggers. The stream finds 70 files with the .json extension and 30 files with a _tmp suffix (still being created/downloaded). This obviously crashes my app, since the process tries to work with a _tmp file which in the meantime has changed into a fully created/downloaded .json file.
I've tried to filter the RDD with the following method:
val jsonData = ssc.textFileStream("C:\\foo").filter(!_.endsWith("tmp"))
But it still causes
jsonData.foreachRDD{ rdd=>
/.../
}
to process _tmp files and throw an exception that there is no such file as *_tmp.
Question
I've read about using a staging directory from which files should be moved once the creation/download process finishes, but neither a copy nor a move operation (why is that? a copy operation creates a new file, after all) triggers textFileStream, and the process ignores those files. Is there any other method to filter out those files and wait until they are complete before processing?
Env spec
The directory is on a Windows 8.1 file system; eventually this will be a Unix-like machine.
Spark 1.5 for Hadoop 2.6
Actually, .filter(!_.endsWith("tmp")) filters the content of the files inside C:\foo, i.e. the data, by default line by line. If it finds a line (not a file name) ending with tmp, it removes that line from the stream; it does not skip the *_tmp files themselves.

Spark coalesce looses file when program is aborted

In Scala/Spark I am using a DataFrame and writing it into a single file using:
val dataFrame = rdd.toDF()
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(filePath)
This works fine. But I noticed, using the console and Hadoop's ls command, that while the coalesce is running, the file and folder are not on the Hadoop file system.
When you type hadoop fs -ls hdfs://path, there is no such file or directory. After the coalesce is done, the path is there again, along with the file that was coalesced.
This might happen because the coalesce needs to delete the file and create a new one?!
Here is the problem: when I kill the process/app while the file is not on the file system, the complete file is deleted. So a failure of the system would destroy the file.
Do you have an idea how to prevent the file loss? I thought Spark/Hadoop would take care of this.
Thanks, Alex
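One possible mitigation, sketched in PySpark to match the main question (df, spark and file_path are placeholders; spark._jvm and spark._jsc are internal but commonly used handles to the Hadoop FileSystem API): write the coalesced output to a temporary path first, and only swap it into place once the write has finished, so that killing the job mid-write leaves the previous output intact.

# Write the new output next to the target, not over it.
tmp_path = file_path + "_tmp"
df.coalesce(1).write.mode("overwrite").parquet(tmp_path)

# Swap the finished output into place via the Hadoop FileSystem API.
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
target = jvm.org.apache.hadoop.fs.Path(file_path)
if fs.exists(target):
    fs.delete(target, True)  # remove the old output only after the new one is fully written
fs.rename(jvm.org.apache.hadoop.fs.Path(tmp_path), target)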