Prevent Glue from reading all files in an S3 folder - pyspark

We have one S3 folder that is used to store different files for separate ETL processing. The ETL processing of one file ends up reading all the other files placed in the same S3 folder, and I don't see an option to read only a single file from the folder. The Location property of the table is set at the folder level.
Code:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
gluedb = "srcgluedb"
gluetbl = "gluesrctable"
# from_catalog reads every object under the table's Location, i.e. the whole S3 folder
dfRead = glue_context.create_dynamic_frame.from_catalog(database=gluedb, table_name=gluetbl)
df = dfRead.toDF()
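One possible workaround, sketched below, is to bypass the catalog table and point Glue at the single object directly with create_dynamic_frame.from_options. The bucket/key and the csv format here are assumptions for illustration only, not taken from the question:

# Hypothetical path and format - substitute the real object key and file format
dfRead = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/etl-folder/file-to-process.csv"]},
    format="csv",
    format_options={"withHeader": True},
)
df = dfRead.toDF()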

Related

Azure Data Factory Copy Tar gzip activity issue

We are copying data from a source that delivers a tar.gz archive, e.g.
https://api.crunchbase.com/bulk/v4/bulk_export.tar.gz?user_key=user_key
The data is compressed as tar.gz.
I would like the copy activity to write the individual CSVs contained in the archive to the sink, so that the destination folder ends up holding the extracted CSV files rather than the archive itself.
So, basically, I would like to copy a source file such as:
bulk_export_sample.tar.gz
and have it extracted / unzipped during the copy activity into the individual CSVs.
You have not set a Compression type in your Source settings.
You need to select the Compression type TarGZip (.tgz/.tar.gz) in the Source connection settings. This will unzip the files from the archive during the copy.
Reference - https://learn.microsoft.com/en-us/answers/questions/92973/extract-files-from-targz-files-store-in-blob-conta.html
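For intuition only, this is the effect that setting has, sketched in plain Python with the standard tarfile module (the file and folder names are placeholders, not ADF code):

import tarfile

# Extract every member of the .tar.gz into individual files, which is what the
# copy activity does once the compression type is set on the source.
with tarfile.open("bulk_export_sample.tar.gz", "r:gz") as archive:
    archive.extractall("destination_folder")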

Iterate through folders in azure data factory

I have a requirement like this: I have three folders in an Azure blob container, each folder contains a zip file, and each zip file contains a source file (*.csv) with the same structure. I want to loop through the folders, extract each of the zip files into an output folder, and then load all three CSV files into a target SQL table. How can I achieve this using Azure Data Factory?
Azure storage account
productblob (blob container)
Folder1 >> product1.zip >> product1.csv
Folder2 >> product2.zip >> product2.csv
Folder3 >> product3.zip >> product3.csv
I have already tried looping through the folders and get the folder names from the ForEach iterator activity, but I am unable to extract the zip files.
After feeding the folder list into the ForEach activity, you could follow these steps:
Use a Binary dataset as the source and pass the ForEach output as its file path (create a parameter on the dataset and set its value in the Source settings). Select the compression type ZipDeflate (a plain-Python sketch of this extraction step follows below).
In the sink, select the path where you want to save the unzipped files. (Select Flatten hierarchy in the sink if you want only the files.)
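For intuition only, here is what the ZipDeflate extraction amounts to, sketched in plain Python with the standard zipfile module (the paths are placeholders taken from the folder layout above, not ADF code):

import zipfile

# Extract one archive into the output folder; the ForEach + Copy activity does this per folder.
with zipfile.ZipFile("Folder1/product1.zip") as archive:
    archive.extractall("output_folder")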

Copy activity with simultaneous renaming of a file. From blob to blob

I have a "copy data" activity in Azure Data Factory. I want to copy .csv files from blob container X to Blob container Y. I don't need to change the content of the files in any way, but I want to add a timestamp to the name, e.g. rename it. However, I get the following error "Binary copy does not support copying from folder to file". Both the source and the sink are set up as binary.
If you want to copy the files and rename them, your pipeline should look like this:
Create a Get Metadata activity to get the file list (dataset Binary1).
Create a ForEach activity to copy each file, with Items set to @activity('Get Metadata1').output.childItems.
Inside the ForEach activity, create a Copy activity whose source dataset Binary2 (same location as Binary1) has a dataset parameter to specify the source file.
In the Copy activity sink settings, use a sink dataset Binary3, also with a parameter, to rename the files:
@concat(split(item().name,'.')[0],utcnow(),'.',split(item().name,'.')[1])
Run the pipeline and check the output.
Note: this example just copies the files to the same container, but with a new name.
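As an illustration only (this is Python, not ADF expression language), the sink parameter above builds a name like this; ADF's utcnow() uses its own timestamp format, so the exact output will differ:

from datetime import datetime, timezone

def renamed(name: str) -> str:
    # Mirrors concat(split(item().name,'.')[0], utcnow(), '.', split(item().name,'.')[1])
    parts = name.split('.')
    return parts[0] + datetime.now(timezone.utc).isoformat() + '.' + parts[1]

print(renamed("sales.csv"))  # e.g. sales2024-05-01T12:00:00.123456+00:00.csv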

SciSpark: Reading from local folder instead of HDFS folder

I use an enterprise cluster that has both local and HDFS filesystems. The files I need to process are in NetCDF format, and hence I use SciSpark to load them. On a workstation that has no HDFS, the code reads from a local folder. However, when HDFS folders are present, it attempts to read from HDFS only. As the files in the folder are huge (cumulatively hundreds of GBs to TBs), I have to copy them to HDFS first, which is inefficient and inconvenient. The (Scala) code I use for loading the files is shown below:
val ncFilesRDD = sc.netcdfDFSFiles(ncDirectoryPath, List("x1", "x2", "x3"))
val ncFileCRDArrayRDD = ncFilesRDD.map(x => (
  x.variables.get("x1").get.data.toArray,
  x.variables.get("x2").get.data.toArray,
  x.variables.get("x3").get.data.toArray
))
I would very much appreciate any help in modifying the code so that I can use a local directory instead of HDFS.
The source-code doc comment for netcdfDFSFiles notes that the files are read from HDFS, so it is not clear that netcdfDFSFiles can read from a local filesystem at all.
But there is another function, netcdfFileList, whose doc says: "The URI could be an OpenDapURL or a filesystem path."
Since it can take a filesystem path, you can use it like this:
val ncFilesRDD = sc.netcdfFileList("file://your/path", List("x1", "x2", "x3"))
The file:// prefix makes it look in the local directory only.

pyspark - capture malformed JSON file name after load fails with FAILFAST option

To detect malformed/corrupt/incomplete JSON files, I have used the FAILFAST option so that the process fails. How do I capture the name of the corrupted file out of hundreds of files, given that I need to remove that file from the path and copy a good version of it from the S3 bucket?
df = spark_session.read.json(table.load_path, mode='FAILFAST').cache()
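A minimal sketch of one possible approach, assuming the files can be re-read in PERMISSIVE mode and that spark_session and table.load_path are the same objects as in the question: tag each row with input_file_name() and keep the corrupt-record column, then collect the distinct file names that produced corrupt rows.

from pyspark.sql.functions import col, input_file_name

# Re-read permissively; rows that fail to parse land in _corrupt_record.
# Note: with an inferred schema the _corrupt_record column only appears if corrupt rows exist.
df_checked = (spark_session.read
              .option("mode", "PERMISSIVE")
              .option("columnNameOfCorruptRecord", "_corrupt_record")
              .json(table.load_path)
              .withColumn("source_file", input_file_name())
              .cache())

corrupt_files = (df_checked
                 .filter(col("_corrupt_record").isNotNull())
                 .select("source_file")
                 .distinct()
                 .collect())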