Scala data reading from Amazon S3 - scala

I have been struggling to read nested folders stored in one of my S3 buckets, using Scala.
I wrote a script with my credentials. In the bucket there are many folders; say one folder is named "folder1". Inside it there are many subfolders, and so on. I want to get the names of every subfolder (and each one inside them) under folder1.
val yourAWSCredentials = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
val amazonS3Client = new AmazonS3Client(yourAWSCredentials)
print(amazonS3Client.listObjects(bucketName,"folder1").getObjectSummaries())
But this does not return the structure I need. Maybe there is an easier way to get the paths?

Amazon S3 is not a regular hierarchical file system. It does not actually have folders.
You need to understand S3 prefixes and delimiters. See Listing Keys Hierarchically Using a Prefix and Delimiter.
Also see Max files per directory in S3.
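For example, here is a minimal sketch built on the same AmazonS3Client and bucketName as in the question: with a prefix and a "/" delimiter, the immediate "subfolders" of folder1 come back as common prefixes rather than object keys (to go deeper, repeat the call with each returned prefix, and page through truncated listings if there are many keys).
import com.amazonaws.services.s3.model.ListObjectsRequest
import scala.collection.JavaConverters._

// list only keys under folder1/ and collapse anything below the next "/" into a prefix
val request = new ListObjectsRequest()
  .withBucketName(bucketName)
  .withPrefix("folder1/")
  .withDelimiter("/")

val listing = amazonS3Client.listObjects(request)
// e.g. List("folder1/sub1/", "folder1/sub2/")
val subfolders = listing.getCommonPrefixes.asScala.toList
println(subfolders)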

Related

Specify parquet file name when saving in Databricks to Azure Data Lake

Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:
append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')
a folder called Covid_Cases gets created and there are parquet files with random names inside of it.
What I would like to do is to use the saved parquet file in Data Factory copy activity. In order to do that, I need to specify the parquet file's name, otherwise I can't point to a specific file.
Since Spark executes in distributed mode and files, or their derivatives such as DataFrames, are processed in parallel, the processed data will be stored as multiple files in the same folder. You can give the folder-level name to the Data Factory copy activity. But if you really want a single file, you can use the approach below:
save_location = "/mnt/adls/covid/base/Covid19_Cases" + year
parquet_location = save_location + "/temp_folder"
file_location = save_location + "/export.parquet"

# write a single part file into a temporary folder (the header option does not apply to parquet)
df.repartition(1).write.parquet(path=parquet_location, mode="append")

# pick the data file (skipping _SUCCESS and other metadata files), copy it to the final name, clean up
part_file = [f.path for f in dbutils.fs.ls(parquet_location) if f.name.startswith("part-")][-1]
dbutils.fs.cp(part_file, file_location)
dbutils.fs.rm(parquet_location, recurse=True)

Is there any pyspark method to read multiple file with different header

I have to migrate multiple files (around 2000) in the same folder in Azure Blob Storage. I want to read each file with its own header (as the header is different for every file),
and write it to a destination folder.
Is there any way I can do it in parallel via PySpark?
I am using the code below, but it only picks up the header from the first file, which produces wrong output.
df = spark.read.option("header", "true").parquet("directory/*.parquet")
df.write.option("header", "true").csv("directory")
Please help me if you know how I can read all the files with their own source headers.
Thanks!
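One possible approach, sketched below in Scala Spark (matching the opening question of this thread; the PySpark calls are analogous), is to list the files and read each one separately so that every file keeps its own header, then write each result to the destination folder. The container, the paths, and the assumption that the files are CSVs with headers are illustrative only.
import org.apache.hadoop.fs.{FileSystem, Path}

// hypothetical source and destination locations
val srcDir = "wasbs://container@account.blob.core.windows.net/source"
val dstDir = "wasbs://container@account.blob.core.windows.net/dest"

val fs = FileSystem.get(new java.net.URI(srcDir), spark.sparkContext.hadoopConfiguration)
val files = fs.listStatus(new Path(srcDir)).filter(_.isFile).map(_.getPath.toString)

// one read/write per file, so each file is parsed with its own header;
// the rows of each individual file are still processed in parallel by the executors
files.foreach { path =>
  val name = path.split("/").last.stripSuffix(".csv")
  spark.read.option("header", "true").csv(path)
    .write.mode("overwrite").option("header", "true").csv(s"$dstDir/$name")
}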

Google Data Fusion: reading files from multiple subfolders in a bucket and placing them in another folder inside each subfolder

Example
sameer/student/land/compressed files
sameer/student/pro/uncompressed files
sameer/employee/land/compressed files
sameer/employee/pro/uncompressed files
In the above example I need to read the files from all the LAND folders present in the different subdirectories, process them, and place them in the PRO folders within the same subdirectories.
For this I have used two GCS nodes, one as source and another as sink.
In the GCS source I have provided the path gs://sameer/; it reads files from all subfolders and merges them into one file, which is placed in the sink path.
Expected output: all the files should be placed in the subdirectories they were fetched from.
I can achieve the expected output by running the pipeline separately for each folder,
but I am asking whether this is possible in a single pipeline run.
It seems like your use case is simply moving files. In that case, I would suggest using the Action plugin GCS Move or GCS Copy.
It seems like the task you are trying to carry out is not possible to do in one single Data Fusion pipeline, at least at the time of writing this.
In a pipeline, all the sources and sinks have to be connected. Otherwise you will get the following error:
'Invalid DAG. There is an island made up of stages ...'
This means it is not possible to parallelise several uncompression tasks, one for each folder of files, inside the same pipeline.
At the same time, if you were to wire several sources and sinks together in one pipeline, the outputs would be aggregated and replicated over all of the sinks.
Finally, I would say that the only case in which you can parallelise a task between several sources and several sinks is when using multiple database tables. By means of the plug-ins referenced in (2) and (3) you can process data from multiple table inputs and export the output to multiple tables. If you would like to see all the available plugins for Data Fusion, please check the link in (4).

Iterate each folder in Azure Data Factory

In our Data Lake storage, we receive an unspecified number of folders every day. Each of these folders contains at least one file.
Example of folders:
FolderA
|_/2020
|_/03
|_/12
|_fileA.json
|_/04
|_/13
|_fileB.json
FolderB
|_/2020
|_/03
|_/12
|_fileC.json
Folder C/...
Folder D/...
So on..
Now:
1. How do I iterate over every folder and get the file(s) inside it?
2. I would also like to do a 'Copy Data' from each of these files and make a single .csv file out of them. What would be the best approach to achieve this?
This can be done with a single copy activity using wildcard filtering in the source dataset, as seen here: https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
Then, in the sink tab of the copy activity, select Merge Files as the Copy behavior.
If you have extra requirements, another way to do this is by using Mapping Dataflows. Mark Kromer explains a similar scenario here: https://kromerbigdata.com/2019/07/05/adf-mapping-data-flows-iterate-multiple-files-with-source-transformation/
Hope this helped!

PySpark Reading Multiple Files in Parallel

I have the below requirement in my project and we are attempting to use PySpark for data processing.
We receive sensor data in the form of Parquet files for each vehicle, one file per vehicle. A file contains a lot of sensors, but it is structured data in Parquet format. The average file size is 200 MB per file.
Assume I received the files below in one batch, ready for processing.
Train FileSize Date
X1 210MB 05-Sep-18 12:10 AM
X1 280MB 05-Sep-18 05:10 PM
Y1 220MB 05-Sep-18 04:10 AM
Y1 241MB 05-Sep-18 06:10 PM
At the end of the processing, I need to receive one aggregated .csv file from every source file, or one master file with the aggregated data for all these vehicles.
I am aware that the HDFS default block size is 128 MB, so each file will be split into 2 blocks. May I know how I can accomplish this requirement using PySpark? Is it possible to process all these files in parallel?
Please let me know your thoughts.
I had a similar problem, and it seems that I found a way:
1. Get a list of the files.
2. Parallelize this list (distribute it among all nodes).
3. Write a function that reads the content of all the files from the portion of the big list that was distributed to the node.
4. Run it with mapPartitions, then collect the result as a list, where each element is the collected content of one file.
For files stored on AWS S3 as JSON files:
import subprocess as sp

def read_files_from_list(file_list):
    # reads the files in file_list and returns their content as a list of strings,
    # 1 json document per string: ['{}', '{}', ...]
    out = []
    for x in file_list:
        # x is a full path such as 's3://bucket/folder/1.json'
        content = sp.check_output(['aws', 's3', 'cp', x, '-'])
        out.append(content)
    return out

file_list = ['f1.json', 'f2.json', ...]
ps3 = "s3://bucket/folder/"
full_path_chunk = [ps3 + f for f in file_list]  # full path for each file
n_parts = 100
rdd1 = sc.parallelize(full_path_chunk, n_parts)  # distribute the file paths among the nodes
list_of_json_strings = rdd1.mapPartitions(read_files_from_list).collect()
Then, if necessary, you can create a Spark dataframe like this:
rdd2 = sc.parallelize(list_of_json_strings)  # this is a trick! via http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
df_spark = sqlContext.read.json(rdd2)
The function read_files_from_list is just an example; it should be changed to read files from HDFS using Python tools.
Hope this helps :)
You can put all the input files in the same directory and then pass the path of that directory to Spark. You can also use globbing, like /data_dir/*.csv.
I encountered a similar situation recently.
You can pass a list of files with their paths to the Spark read API, like spark.read.json(input_file_paths) (source). This will load all the files into a single dataframe, and any transformations eventually performed will be run in parallel by multiple executors, depending on your Spark config.
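For example, a minimal sketch in Scala Spark (hypothetical paths; the PySpark calls are analogous) showing both variants, a directory glob and an explicit list of paths, each loading everything into a single DataFrame that Spark reads in parallel:
// glob over one directory
val viaGlob = spark.read.option("header", "true").csv("/data_dir/*.csv")

// explicit list of file paths
val inputFilePaths = Seq("/data_dir/a.csv", "/data_dir/b.csv")
val viaList = spark.read.option("header", "true").csv(inputFilePaths: _*)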