Better way to read multiple CSV files - pyspark

I have 200 CSV files organised by date, but I need data from only 50 of them. Should I read all 200 files and then filter them based on dates, or should I read only the 50 files I need? Which would be the better option performance-wise?

The csv method of DataFrameReader accepts a list of paths to create the DataFrame.
If you know exactly which paths you want, just generate the list and use it.
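A minimal sketch of that, assuming an existing SparkSession named spark; the base path, date format, and file naming below are made up for illustration:

import datetime

base_path = "/data/daily"                         # hypothetical location of the 200 files
start = datetime.date(2023, 1, 1)                 # hypothetical start of the 50-day window
wanted = [start + datetime.timedelta(days=i) for i in range(50)]

# Build paths for only the 50 files you need; DataFrameReader.csv accepts a list of paths.
paths = [f"{base_path}/{d:%Y-%m-%d}.csv" for d in wanted]
df = spark.read.csv(paths, header=True, inferSchema=True)

Reading only the paths you need means Spark never lists or scans the other 150 files, which is generally the better option performance-wise.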

Related

Reading a batch of JSON files into a dataframe

All -
I have millions of single JSON files, and I want to ingest them all into a Spark DataFrame. However, I didn't see an append call where I could append JSON files as additions. Instead, the only way I can make it work is:
df = spark.createDataFrame([], schema=my_schema)   # start with an empty df
for json_file in all_json_files:
    df_tmp = spark.read.json(json_file, schema=my_schema)
    df = df.union(df_tmp)
df is the final aggregated DataFrame. This approach works for a few hundred files, but as it approaches thousands it gets slower and slower. I suspect the cost of creating and merging the DataFrames is significant, and it feels awkward as well. Is there a better approach? TIA
You can just pass the path to the folder instead of individual files, and it will read all the files in it.
For example, if your files are in a folder called JsonFiles, you can write:
df = spark.read.json("/path/to/JsonFiles/")
df.show()
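If the schema is already known (like my_schema in the question), it can also help to pass it explicitly so Spark skips schema inference across millions of files; a hedged sketch of both variants:

# Read the whole folder in one call; the explicit schema avoids an inference pass.
df = spark.read.json("/path/to/JsonFiles/", schema=my_schema)

# read.json also accepts a list of paths, so a subset can be read without any union loop
# (the file names below are made up for illustration).
df_subset = spark.read.json(
    ["/path/to/JsonFiles/a.json", "/path/to/JsonFiles/b.json"],
    schema=my_schema,
)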

Reading multiple folders in parallel

I have multiple part folders, each containing Parquet files (example given below). Across part folders the schema can differ (either the number of columns or the datatype of certain columns). My requirement is to read all the part folders and finally create a single df according to a predefined schema that is passed in.
/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231
Since I am not sure what type of changes are in which part folders, I am reading each part folder individually, comparing its schema with the predefined schema, and making the necessary changes, i.e., adding/dropping columns or typecasting column datatypes. Once done, I write the result into a temp location and then move on to the next part folder and repeat the same operation. Once all the part folders have been read, I read the temp location in one go to get the final output.
Now I want to do this operation in parallel, i.e., there would be parallel threads/processes (?) which read part folders in parallel, execute the logic of schema comparison and any necessary changes, and write into a temp location. Is this possible?
I searched here for parallel processing of multiple directories, but in the majority of those scenarios the schema is the same across directories, so they can use a wildcard on the input path to create the df; that is not going to work in my case. The problem statement in the link below is similar to mine, but in my case the number of part folders to be read is random and sometimes over 1000. Moreover, there are operations involved in comparing and fixing the column types as well.
Any help will be appreciated.
Reading multiple directories into multiple spark dataframes
Divide your existing ETL into two phases. The first one transforms the existing data into the appropriate schema, and the second one reads the transformed data in a convenient way (with * wildcards). Use Airflow (or Oozie) to start one data-transformer application per directory, and after all instances of the data transformer have finished successfully, run the union app.
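A rough PySpark sketch of that two-phase idea, assuming a hypothetical target schema and temp location (Airflow would launch one transform run per part folder, and the final wildcard read happens only after they have all succeeded):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical target schema -- replace with your predefined schema.
target_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("amount", StringType()),
])

def conform(df, schema):
    # Add missing columns as nulls, drop extras, and cast everything to the target types.
    cols = []
    for field in schema.fields:
        if field.name in df.columns:
            cols.append(F.col(field.name).cast(field.dataType))
        else:
            cols.append(F.lit(None).cast(field.dataType).alias(field.name))
    return df.select(cols)

# Phase 1: one transformer run per part folder (scheduled by Airflow/Oozie).
def transform_partition(date):
    src = f"/feed=abc/date={date}"                  # layout from the question
    dst = f"/tmp/feed_abc_conformed/date={date}"    # hypothetical temp location
    conform(spark.read.parquet(src), target_schema).write.mode("overwrite").parquet(dst)

# Phase 2: after every partition is conformed, a single wildcard read unions them.
final_df = spark.read.parquet("/tmp/feed_abc_conformed/date=*")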

Bulk load causing schema drift in files (adf pipeline or mapping dataflows)

Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like yyyy/mm/dd (e.g. 2021/08/19); at the dd level I have 24 files (one for each hour).
The columns in each day's files are pretty wide, anything up to 250 columns. The column count has increased and decreased over time, hence the schema drift I run into when trying to load into SQL using mapping data flows, which made the files larger.
I require around 200 of those columns and I know what they are; I even have them in a schema template. The rest are legacy or unwanted.
I'd like to retain the original files in blob as they are, but load only those 200 columns per file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping data flows to pick up all files in blob, so I don't have to iterate with a ForEach or spin up multiple clusters.
I'm not even sure how to handle this, or whether it should be a copy activity or a mapping data flow;
both have their benefits, but I think I can only use a mapping data flow if I need to transform parts of these files in depth.
Should I be combining the months or even years into a single file and then reading from that file, so I can exclude the extra columns from the ones I want to take into SQL Server?
Ideally this is a bulk load that needs some refinement when it lands.
Thanks in advance
Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. You can create pattern-based rules in the Select transformation to pick the columns you want from each file's schema.

Dataprep "create dataset with parameters" does not take all files

I am struggling to organise my files in the way that Dataprep expects when importing a parameterized dataset from Google Cloud Storage (GCS).
To be specific:
I store .csv files on GCS in a location:
/20190807/file1.csv
/20190807/file2.csv
/20190807/file3.csv
/20190808/file1.csv
/20190808/file2.csv
/20190808/file3.csv
/20190809/file1.csv
/20190809/file2.csv
/20190809/file3.csv
...
Then I create a Dataset with Parameters on this location using a *wildcard.
In addition:
Encoding: I apply automatic structure detection, and I select UTF-8 (as I store all my files with this encoding).
Columns: I also make sure that all files have the same columns.
Problem:
However, for some reason or another, depending on how a file has been saved I guess, Dataprep does not pick up all the files when importing them. When I compare two such files, I cannot identify what is different about them: both have been saved with the type application/octet-stream, and I apply UTF-8 encoding.
As a consequence, when I export my output after the dataset is wrangled in a flow, I am missing some dates (e.g. 20190808).
Is there a tool I can use to compare these two files and see what is different about them, so that I can prevent this from happening? Storing them in different locations is not an option, as I do not know in advance which files will be different.
I am really surprised by this shortcoming, and it would be great to have a way to check only the columns of each file instead of also checking other "hidden" differences.
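One low-tech way to surface "hidden" differences is to compare the raw bytes of each file's first line, e.g. checking for a UTF-8 BOM, the line-ending style, and the header columns. A rough sketch, where the two file names are placeholders for one file that is picked up and one that is skipped:

paths = ["file_ok.csv", "file_skipped.csv"]   # hypothetical local copies of the two files

for path in paths:
    with open(path, "rb") as f:
        first_line = f.readline()
    has_bom = first_line.startswith(b"\xef\xbb\xbf")
    line_ending = "CRLF" if first_line.endswith(b"\r\n") else "LF"
    if has_bom:
        first_line = first_line[3:]           # drop the BOM before decoding
    header = first_line.decode("utf-8").strip().split(",")
    print(path, "BOM:", has_bom, "line ending:", line_ending, "columns:", header)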

"sqlContext.read.json" takes very long time to read 30,000 small JSON files (400 Kb) from S3

I am stuck on the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. The files are very small, only 400-500 KB each, but there are a lot of them.
I want to create DataFrame based on all these files. I am reading JSON files using wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I waited for 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or at least, is it possible to read the last 1000 files instead of all of them? I found out that one can pass options via sqlContext.read.format("json").options, but I cannot figure out how to read only the N newest files.
If you can get the last 1000 modified file names into a simple list, you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are usually used to configure how the files are parsed (including schema-related options), not to select which files are read.
Unfortunately, I haven't used S3 before, but I think you can use the same logic in the answer to this question to get the last modified file names:
How do I find the last modified file in a directory in Java?
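In PySpark terms, a hedged way to get that list is to enumerate the bucket with boto3, sort by LastModified, and pass only the newest keys to the reader (the bucket name and prefix below are placeholders, and a SparkSession named spark is assumed):

import boto3

s3 = boto3.client("s3")
objects = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="path_to_bucket/"):
    objects.extend(page.get("Contents", []))

# Keep only the 1000 most recently modified keys.
newest = sorted(objects, key=lambda o: o["LastModified"], reverse=True)[:1000]
paths = [f"s3a://my-bucket/{o['Key']}" for o in newest]   # s3a is the usual scheme on newer Hadoop

df = spark.read.json(paths)   # add schema=... if you have one, to skip inference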
You are loading something like 13 GB of information. Are you sure it is just the creation of the df that takes so long? Maybe the rest of the application is running and the UI just shows it under that stage.
Try just loading and printing the first row of the df.
Anyway, what is the configuration of the cluster?
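For the "load and print the first row" suggestion, a minimal PySpark check (reusing the wildcard path from the question) might look like:

df = spark.read.json("s3n://path_to_bucket/*.json")
df.show(1)   # materialise a single row to see where the time actually goes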