Spark: read files recursively from all subfolders with the same name - scala

I have a process that pushes a bunch of data to the Blob store every hour, creating the following folder structure inside my storage container:
/year=16/Month=03/Day=17/Hour=16/mydata.csv
/year=16/Month=03/Day=17/Hour=17/mydata.csv
and so on
From inside my Spark context I want to access all the mydata.csv files and process them. I figured out that I needed to set sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true") so that I can use a recursive search like below:
val csvFile2 = sc.textFile("wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv")
but when I execute the following command to see how many files I have received, it gives me a really large number, as shown below:
csvFile2.count
res41: Long = 106715282
Ideally it should be returning 24*16=384. I also verified on the container that it only has 384 mydata.csv files, but for some reason it returns 106715282.
Can someone please help me understand where I went wrong?
Regards
Kiran

SparkContext has two similar methods: textFile and wholeTextFiles.
textFile loads each line of each file as a record in the RDD. So count() will return the total number of lines across all of the files (which in most cases, such as yours, will be a large number).
wholeTextFiles loads each entire file as a record in the RDD. So count() will return the total number of files (384 in your case).
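SparkContext exposes the same two methods in Scala and PySpark; here is a quick sketch of the difference in PySpark, reusing the wildcard path from the question:

path = "wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv"

lines = sc.textFile(path)        # one RDD record per line of every matched file
files = sc.wholeTextFiles(path)  # one RDD record per file: (path, entire file content)

lines.count()   # total line count across all files -- the huge number you saw
files.count()   # number of files matched -- 384 in your case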

Related

Azure Data Factory For Each Loop is importing all my CSV files per iteration instead of just the file name I *think* I've told it to

I could really do with some help with ADF; I've recently started trying to use it thinking it would be similar to SSIS, but wow am I having a hard time! I've built up a fairly complicated pipeline over the last few weeks which reads a list of files from a folder and, from within a For Each loop, is supposed to check where the data starts in each file and import it into a SQL table. I won't bore you with all the issues I've had so far, but at the moment it seems to be working apart from the For Each part: it's importing all the files in the folder on every iteration. It seems to be the dataset configuration that isn't recognising the filename per iteration, because when I look through the debugging I can see it pick up the list of files and set the DSFileName variable to the first of them, but the output of the data flow task is both files. So it seems like I've missed a step somewhere; I've just spent the last 5 hours looking and could really do with some help :(
I reckon I've followed the instructions here: https://www.sqlshack.com/how-to-use-iterations-and-conditions-activities-in-azure-data-factory/
Some pictures to show the debugging I've done:
Here it shows it's picking up 2 files (after I filtered out folders and stuff)
Here shows the first file name only being passed into the first data flow
Here shows the output from it, where it has picked up both files somehow and displays the count of 2 files
Here shows the Data Set set up where I believe to have correctly set the variable as the file name to be used
I just don't know where to start now, tbh. I reckon I've checked everything I can see and I'm not using any wildcards or anything. I can see it passing one file name per iteration into that variable, but on each iteration I can see two counts of files going into the table, and the output of each data flow task shows both file counts.
Does anybody have any ideas or know what I've missed?
EDIT 23/07/22: Pics of the source as requested:
Data Source Settings
Data Source Options
So it turns out that adding .name to item() in the dataset parameter means it uses just the current file instead of all of them... I'm confused by this, as all the documentation I've read states that item() references the CURRENT item within the For Each. Did I misunderstand?
Adding .name to the dataset here is now importing just the current file per loop iteration

Is there a way to add literals as columns to a spark dataframe when reading the multiple files at once if the column values depend on the filepath?

I'm trying to read a lot of avro files into a spark dataframe. They all share the same s3 filepath prefix, so initially I was running something like:
path = "s3a://bucketname/data-files"
df = spark.read.format("avro").load(path)
which was successfully identifying all the files.
The individual files are something like:
"s3a://bucketname/data-files/timestamp=20201007123000/id=update_account/0324345431234.avro"
Upon attempting to manipulate the data, the code kept erroring out with a message that one of the files was not an Avro data file. The actual error message received is: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62476 in stage 44102.0 failed 4 times, most recent failure: Lost task 62476.3 in stage 44102.0 (TID 267428, 10.96.134.227, executor 9): java.io.IOException: Not an Avro data file.
To circumvent the problem, I was able to get the explicit filepaths of the avro files I'm interested in. After putting them in a list (file_list), I was successfully able to run spark.read.format("avro").load(file_list).
The issue now is this: I'm interested in adding a number of fields to the dataframe that are part of the filepath (i.e. the timestamp and the id from the example above).
While using just the bucket and prefix filepath to find the files (approach #1), these fields were automatically appended to the resulting dataframe. With the explicit filepaths, I don't get that advantage.
I'm wondering if there's a way to include these columns while using spark to read the files.
Sequentially processing the files would look something like:
from pyspark.sql.functions import lit

for file in file_list:
    df = spark.read.format("avro").load(file)
    id, timestamp = parse_filename(file)  # helper that pulls id and timestamp out of the path
    df = df.withColumn("id", lit(id)) \
           .withColumn("timestamp", lit(timestamp))
but there are over 500k files and this would take an eternity.
I'm new to Spark, so any help would be much appreciated, thanks!
Two separate things to tackle here:
Specifying Files
Spark has built-in handling for reading all files of a particular type in a given path. As @Sri_Karthik suggested, try supplying a path like "s3a://bucketname/data-files/*.avro" (if that doesn't work, maybe try "s3a://bucketname/data-files/**/*.avro"; I can't remember the exact pattern-matching syntax Spark uses). That should grab only the avro files and get rid of the error you're seeing from non-avro files in those paths. In my opinion this is more elegant than manually fetching the file paths and explicitly specifying them.
As an aside, the reason you are seeing this is likely because folders typically get marked with metadata files like .SUCCESS or .COMPLETED to indicate they are ready for consumption.
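For instance, a minimal sketch of that suggestion (the glob below just mirrors the two directory levels in the example path from the question):

# Read only .avro files under the two partition-style directory levels,
# skipping any marker/metadata files sitting alongside them
df = spark.read.format("avro").load("s3a://bucketname/data-files/*/*/*.avro")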
Extracting metadata from filepaths
If you check out this Stack Overflow question, it shows how you can add the filename as a new column (both for Scala and PySpark). You could then use the regexp_extract function to parse out the desired elements from that filename string. I've never used Scala in Spark so can't help you there, but it should be similar to the PySpark version.
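A rough PySpark sketch of that idea, building on the glob above (the regex patterns assume the timestamp=.../id=... layout shown in the question):

from pyspark.sql.functions import input_file_name, regexp_extract

df = (spark.read.format("avro")
      .load("s3a://bucketname/data-files/*/*/*.avro")
      .withColumn("filepath", input_file_name()))

# Pull the partition-style values back out of each file's path
df = (df
      .withColumn("timestamp", regexp_extract("filepath", r"timestamp=([^/]+)", 1))
      .withColumn("id", regexp_extract("filepath", r"id=([^/]+)", 1)))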
Why don't you try reading the files first using the wholeTextFiles method and adding the path name into the data itself at the beginning? Then you can filter out the file names from the data and add them as a column while creating the dataframe. I agree it's a two-step process, but it should work. To get a file's timestamp you will need a filesystem object, which is not serializable, i.e. it can't be used in Spark's parallelized operations. So you will have to create a local collection of (file, timestamp) pairs and join it somehow with the RDD you created with wholeTextFiles.

PySpark Reading Multiple Files in Parallel

I have the below requirement in my project and we are attempting to use PySpark for data processing.
We receive sensor data in the form of Parquet files for each vehicle, one file per vehicle. The file has a lot of sensors, but it's structured data in Parquet format. The average file size is 200MB per file.
Assume I received the files below in one batch, ready for processing.
Train  FileSize  Date
X1     210MB     05-Sep-18 12:10 AM
X1     280MB     05-Sep-18 05:10 PM
Y1     220MB     05-Sep-18 04:10 AM
Y1     241MB     05-Sep-18 06:10 PM
At the end of the processing, I need to produce either one aggregated .csv file for every source file, or one master file with aggregated data for all these vehicles.
I am aware that the HDFS default block size is 128MB, so each file will be split into 2 blocks. May I know how I can accomplish this requirement using PySpark? Is it possible to process all these files in parallel?
Please let me know your thoughts
I had a similar problem, and it seems that I found a way:
1. Get a list of files.
2. Parallelize this list (distribute it among all nodes).
3. Write a function that reads the content of all files from the portion of the big list that was distributed to the node.
4. Run it with mapPartitions, then collect the result as a list, where each element is the collected content of one file.
For files stored on AWS S3, and JSON files:
import subprocess as sp

def read_files_from_list(file_list):
    # reads the files whose paths are in file_list
    # returns their content as a list of strings, 1 json per string ['{}','{}',...]
    out = []
    for x in file_list:
        # content of the file; x here is a full path: 's3://bucket/folder/1.json'
        content = sp.check_output(['aws', 's3', 'cp', x, '-'])
        out.append(content)
    return out  # content of all files from file_list as a list of strings, 1 json per string

file_list = ['f1.json', 'f2.json', ...]
ps3 = "s3://bucket/folder/"
full_path_chunk = [ps3 + f for f in file_list]  # list of strings with the full path for each file
n_parts = 100
rdd1 = sc.parallelize(full_path_chunk, n_parts)  # distribute the file paths among nodes
list_of_json_strings = rdd1.mapPartitions(read_files_from_list).collect()
Then, if necessary, you can create spark dataframe like this:
rdd2 = sc.parallelize(list_of_json_strings)  # this is a trick! via http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
df_spark = sqlContext.read.json(rdd2)
The function read_files_from_list is just an example; it should be changed to read files from HDFS using Python tools.
Hope this helps :)
You can put all the input files in the same directory and then pass the path of the directory to Spark. You can also use globbing, like /data_dir/*.csv.
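For example (the directory name here is just for illustration):

# Point Spark at the whole directory of Parquet files...
df = spark.read.parquet("/data_dir/")
# ...or use a glob to pick out specific files by name
df_csv = spark.read.csv("/data_dir/*.csv", header=True)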
I encountered a similar situation recently.
You can pass a list of files with their paths to the Spark read API, like spark.read.json(input_file_paths) (source). This will load all the files into a single dataframe, and all the transformations eventually performed will be done in parallel by multiple executors, depending on your Spark config.
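A hedged sketch of that approach for this question's Parquet files (the file names, column names, and aggregation are made up for the example):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vehicle-aggregation").getOrCreate()

# Hypothetical list of per-vehicle Parquet files received in one batch
input_file_paths = [
    "hdfs:///data/batch1/X1_0510_am.parquet",
    "hdfs:///data/batch1/X1_0510_pm.parquet",
    "hdfs:///data/batch1/Y1_0510_am.parquet",
    "hdfs:///data/batch1/Y1_0510_pm.parquet",
]

# All files are loaded into one dataframe; Spark reads them in parallel,
# roughly one task per HDFS block (so about 2 tasks per 200MB+ file)
df = spark.read.parquet(*input_file_paths)

# Example aggregation, assuming columns "vehicle_id" and "sensor_value" exist
agg = (df.groupBy("vehicle_id")
         .agg(F.avg("sensor_value").alias("avg_sensor_value")))

# One master CSV with aggregated data for all vehicles
agg.coalesce(1).write.mode("overwrite").csv("hdfs:///data/output/master_agg", header=True)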

Using Talend Open Studio DI to extract a value from a unique 1st row before continuing to process columns

I have a number of Excel files where there is a line of text (and a blank row) above the header row for the table.
What would be the best way to process the file so I can extract the text from that row AND include it as a column when appending multiple files? Is it possible without having to process each file twice?
Example
This file was created on machine A on 01/02/2013
Task|Quantity|ErrorRate
0102|4550|6 per minute
0103|4004|5 per minute
And end up with the data from multiple similar files
Task|Quantity|ErrorRate|Machine|Date
0102|4550|6 per minute|machine A|01/02/2013
0103|4004|5 per minute|machine A|01/02/2013
0467|1264|2 per minute|machine D|02/02/2013
I put together a small, crude sample of how it can be done. I call it crude because (a) it is not dynamic: you can add more files to process, but you need to know how many files in advance of building your job, and (b) it shows the basic concept but would require more work to suit your needs. For example, in my test files I simply have "MachineA" or "MachineB" in the first line. You will need to parse that data out to obtain the machine name and the date.
But here is how my sample works. Each Excel file is set up as two inputs. For the header, the tFileInput_Excel is configured to read only the first line, while the body tFileInput_Excel is configured to start reading at line 4.
In the tMap they are combined (not joined) into the output schema. This is done for the Machine A and Machine B Excel files, then those tMaps are combined with a tUnite for the final output.
As you can see in the log row the data is combined and includes the header info.

tFileList catches only one of the 6 files

I tried to display some results from several files in a directory. I use a tFileList and 2 tFileInputDelimited components, which are both linked to the tFileList. I don't know why, but at the end of the processing my results come from just one of the 6 files I want; it appears that the results are only from the last file in the directory.
Each tFileInputDelimited has ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) as the name of the flow.
Here is my TMap:
Your job is set up so that your lookup is iterative, which causes some issues, as Talend only seems to use the last iteration rather than doing what you might expect and iterating through every step for everything it needs (although this might be more complicated than you first think).
One option is to rework the job so you use your iterate part of the job as the main input to the tMap rather than the lookup.
Alternatively, you could iterate the data into a tBufferOutput component, and then OnSubjobOk link the job as before but replace the iterative part with a tBufferInput component, since it will hold all of the data from all of the files iterated through.