I have a large number of csv files that I need to import and compare some of the fields to a list. The files were/are generated every hour into the directory. The analysis I want to perform needs to be applied to each file in the directory individually, not all of them at once. Is there a way to take in each file one at a time?
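A minimal sketch of one way to do this, assuming Python with pandas (the directory path, column name, and analysis routine below are placeholders):

import glob
import pandas as pd

reference_list = ["value1", "value2"]  # the list to compare against (placeholder)

def analyse(df, reference):
    # placeholder for the per-file comparison logic
    return df[df["some_field"].isin(reference)]

# Process each hourly CSV on its own rather than concatenating them all
for path in sorted(glob.glob("/path/to/hourly_csvs/*.csv")):
    df = pd.read_csv(path)
    result = analyse(df, reference_list)
    # ...store or report `result` for this file before moving on to the next one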
I have an ADF where I am executing a stored procedure in a ForEach and using Copy Data to load the output into a CSV file
On each iteration of the ForEach the CSV is being cleared down and reloaded with that iteration's data.
I need it to preserve the already loaded data and append the output of each iteration,
so that the CSV ends up with the full dataset from all iterations.
How can I achieve this? I tried the "Merge Files" option in the sink Copy Behavior, but it doesn't work for SQL to CSV.
As @All About BI mentioned, the append behavior you are looking for is not currently supported.
You can raise a feature request from the ADF portal.
Alternatively, you can check the below process to append data in CSV.
In my repro, I am generating the loop items using a Set Variable activity and passing them to the ForEach activity.
Inside the ForEach activity, I use a Copy Data activity that executes the stored procedure in the source and copies its output to a CSV file.
In the Copy Data activity sink, I generate the file name from the current item of the ForEach loop, so each iteration writes its data to a different file. I also add a constant prefix to the file name so these files can be identified and deleted at the end, after merging.
File_name: @concat('1_sp_sql_data_', string(item()), '.csv')
Add another Copy Data activity after the ForEach activity to combine the data from all the iteration files into a single file. Here I am using a wildcard path (*) to get all files from the folder.
In the sink, set the destination file name and the copy behavior to Merge files, so all source data is copied to a single sink file.
After the merge, the data is copied to a single file, but the per-iteration files are not deleted. So when you run the pipeline the next time, there is a chance the old files will be merged with the new files again.
• To avoid this, add a Delete activity to remove the files generated in the ForEach activity.
• As I added a constant prefix when generating these files, it is easy to delete them based on the file name (deleting all files that start with “1_”); the expressions are summarized below.
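For reference, a summary of the dynamic content used in each step (a sketch; the 1_*.csv pattern is just one way of expressing "files starting with 1_"):

File name in the ForEach sink:      @concat('1_sp_sql_data_', string(item()), '.csv')
Wildcard path in the merge source:  *  (all files in the folder)
Wildcard in the Delete activity:    1_*.csv  (only the per-iteration files)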
You could try this -
Load each iteration's data into a separate CSV file.
Later, union or merge them all.
Right now we don't have the ability to append rows to a CSV.
When I use Spark SQL, I want to save the output to different folders according to a field, but there are more than 200 to enumerate. How can I do this gracefully? The way I currently do it is inefficient.
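If the goal is one output folder per distinct value of a column, one common approach is to let Spark create the folders itself with partitionBy instead of looping over the values. A minimal pyspark sketch, where df, field_name, and the output path are placeholders:

# Writes one sub-folder per distinct value, e.g. /output/path/field_name=value1/
df.write.partitionBy("field_name").mode("overwrite").csv("/output/path", header=True)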
I have multiple files stored in an HDFS location as follows:
/user/project/202005/part-01798
/user/project/202005/part-01799
There are 2000 such part files. Each file is of the format
{'Name':'abc','Age':28,'Marks':[20,25,30]}
{'Name':...}
and so on. I have 2 questions:
1) How to check whether these are multiple files or multiple partitions of the same file
2) How to read these in a data frame using pyspark
As these files are in one directory and are named part-xxxxx, you can safely assume they are multiple part files of the same dataset. If they were partitions, they would be saved like this: /user/project/date=202005/*
You can specify the directory "/user/project/202005" as input for Spark like below, assuming these are CSV files:
df = spark.read.csv('/user/project/202005/*',header=True, inferSchema=True)
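Note that the sample records in the question look like JSON lines rather than CSV; if that is the case (an assumption about the data format), a JSON read may fit better:

# Spark's JSON reader treats each line as one JSON record
df = spark.read.json('/user/project/202005/part-*')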
I have multiple files stored in HDFS, and I need to merge them into one file using Spark. Because this operation is done frequently (every hour), I need to append those multiple files to the source file.
I found that FileUtil provides the 'copyMerge' function, but it doesn't allow appending two files.
Thank you for your help
You can do this with two methods:
sc.textFile("path/source", "path/file1", "path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or, as @Pushkr has proposed:
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"),..)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source but instead overwrite the same source every hour, you can use a DataFrame with save mode overwrite (see How to overwrite the output directory in Spark).
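A minimal pyspark sketch of the first approach (the paths are placeholders matching the answer's layout; it writes to a new location because overwriting a directory while reading from it in the same job is unsafe):

# Read the current source together with the new hourly files,
# collapse to a single partition, and write the combined result out
merged = spark.read.text(["path/source", "path/file1", "path/file2"])
merged.coalesce(1).write.mode("overwrite").text("path/newSource")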
I want to remove all the PDF files that reside in the reports output folder in one go. There are about 1000 files, so deleting them one by one is not practical.
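If the reports output folder is an ordinary directory reachable from a script, a minimal sketch (the folder path is a placeholder):

import glob
import os

# Remove every PDF in the reports output folder in one pass
for path in glob.glob("/path/to/reports_output/*.pdf"):
    os.remove(path)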