There is an HDFS directory:
/home/path/date=2022-12-02, where date=2022-12-02 is a partition.
A Parquet file with the partition "date=2022-12-02" has been written to this directory.
To read the file together with its partition, I use:
spark
.read
.option("basePath", "/home/path")
.parquet("/home/path/date=2022-12-02")
The file is read successfully with all partition fields.
However, the partition folder ("date=2022-12-02") is dropped from the directory.
I can't work out what the reason is or how to avoid it.
There are two ways to have the date as part of your table:
Read the path like this: .parquet("/home/path/")
Add a new column using the input_file_name() function, then manipulate the string until you get the date column (should be fairly easy: take the path segment that contains the equals sign, split on it and take index 1).
I don't think there is another way to do what you want directly.
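The string manipulation in the second option can be sketched in plain Python. Note that a path returned by input_file_name() ends with the file name, so the partition directory is a middle segment rather than the last one. The helper name partition_value and the sample file name are illustrative assumptions, not part of the original answer.

```python
from typing import Optional


def partition_value(file_path: str, key: str) -> Optional[str]:
    """Return the value of the `key=value` directory segment, or None if absent."""
    prefix = key + "="
    for segment in file_path.split("/"):
        if segment.startswith(prefix):
            return segment[len(prefix):]
    return None


# Hypothetical file name; only the date=... directory comes from the question.
path = "/home/path/date=2022-12-02/part-00000.parquet"
print(partition_value(path, "date"))  # 2022-12-02
```

In PySpark the same extraction could probably be written as a column expression, for example regexp_extract(input_file_name(), "date=([^/]+)", 1) from pyspark.sql.functions.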
Related
I have multiple part folders, each containing Parquet files (example given below). The schema can differ from one part folder to another (either the number of columns or the datatype of certain columns). My requirement is to read all the part folders and finally create a single DataFrame according to a predefined schema that is passed in.
/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231
Since I am not sure what kind of changes exist in which part folders, I read each part folder individually, compare its schema with the predefined schema and make the necessary changes, i.e. adding or dropping columns or casting column datatypes. Once done, I write the result to a temp location, move on to the next part folder and repeat the same operation. Once all the part folders have been read, I read the temp location in one go to get the final output.
Now I want to do this in parallel, i.e. there would be parallel threads or processes (?) that read part folders in parallel, execute the schema comparison logic, apply any necessary changes and write to a temp location. Is this possible?
I searched here for parallel processing of multiple directories, but in most scenarios the schema is the same across directories, so people use a wildcard in the input path to create the DataFrame. That is not going to work in my case. The problem statement in the link below is similar to mine, but in my case the number of part folders to be read varies and is sometimes over 1000. Moreover, there are operations involved in comparing and fixing the column types as well.
Any help will be appreciated.
Reading multiple directories into multiple spark dataframes
Divide your existing ETL into two phases. The first one transforms the existing data into the appropriate schema, and the second one reads the transformed data in the convenient way (with * wildcards in the path). Use Airflow (or Oozie) to start one data transformer application per directory. After all instances of the data transformer have finished successfully, run the union app.
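The per-folder schema comparison that the first phase needs can be sketched in plain Python. This is only an illustration under assumptions: schemas are modelled as dicts of column name to type name, and the function name schema_plan and both sample schemas are made up for the example.

```python
from typing import Dict, List, Tuple


def schema_plan(
    actual: Dict[str, str], target: Dict[str, str]
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Compare one folder's schema with the target schema.

    Returns the columns to add, the columns to drop, and
    (column, target_type) pairs that need a cast.
    """
    to_add = [c for c in target if c not in actual]
    to_drop = [c for c in actual if c not in target]
    to_cast = [(c, t) for c, t in target.items() if c in actual and actual[c] != t]
    return to_add, to_drop, to_cast


# Hypothetical schemas for illustration only.
target = {"id": "bigint", "amount": "double", "feed": "string"}
folder = {"id": "int", "amount": "double", "legacy_flag": "boolean"}

to_add, to_drop, to_cast = schema_plan(folder, target)
print(to_add)   # ['feed']
print(to_drop)  # ['legacy_flag']
print(to_cast)  # [('id', 'bigint')]
```

In PySpark, dict(df.dtypes) yields such a name-to-type mapping for a DataFrame, and the plan could then be applied with withColumn, cast and drop inside each per-directory transformer.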
my_data.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
It usually works fine, but when the amount of data is very small, some of the Avro files are empty.
The number of files differs from run to run. When there are fewer rows than files, some files are empty and contain only the column information.
Is there a way to control the number of output Avro files based on the number of rows? Or to avoid creating an output file when there is no data?
The number of files depends on how many partitions your dataframe has: each partition creates its own file. If you know that there is not much data to write, you can repartition the dataframe before writing it.
my_data.repartition(1)
.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
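If a fixed repartition(1) is too coarse, the partition count could be derived from the row count instead. The following is a minimal sketch in plain Python; the function name and the rows_per_file and max_partitions parameters are assumptions introduced here, not Spark settings.

```python
import math


def target_partitions(row_count: int, rows_per_file: int, max_partitions: int) -> int:
    """Aim for roughly rows_per_file rows per output file, within [1, max_partitions]."""
    if rows_per_file <= 0 or max_partitions <= 0:
        raise ValueError("rows_per_file and max_partitions must be positive")
    wanted = math.ceil(row_count / rows_per_file)
    return max(1, min(wanted, max_partitions))


print(target_partitions(10, 1000, 200))                # 1
print(target_partitions(2_500_000, 1_000_000, 200))    # 3
```

The result would be passed to repartition in place of the literal 1. Obtaining the row count requires an extra count() job, so this only pays off when that pass is cheap compared with the cost of many tiny files.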
I have a data source that consists of a huge number of small files. I would like to save it, partitioned by the column user_id, to another storage:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is I want another system to be able to delete only select users' data upon request.
This works, but, I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally simply to one (the process will run each day, so having an output file per user per day would work well).
How do I obtain this with pyspark?
You can use repartition to ensure that each partition gets one file
sdf.repartition('user_id').write.partitionBy("user_id").json("...")
This makes sure that one file is created for each partition. With coalesce, by contrast, you can run into trouble if more than one partition remains.
Just add coalesce and the number of files you want.
sdf.coalesce(1).write.partitionBy("user_id").json("...")
I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append") but I find that Spark sometimes fails midway through, so I end up losing track of how much of my data has been written and how much I still need to write.
I am using parquet because the partitioning substantially speeds up my future queries. Also, I must write the data in some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking to parquet or another format) are greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
If your data is not already partitioned by the output columns, the rows for each output partition are spread across all of your DataFrame partitions. That means every task will generate and write data for each of your output partitions.
Consider repartitioning your data by your partition columns before writing, so that all the data for each output file sits in the same partition:
data
.filter(validPartnerIds($"partnerID"))
.repartition($"partnerID", $"year", $"month", $"day") // optionally pass a partition count as the first argument
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition
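The effect on the number of output files can be illustrated with a small plain-Python model. This is not Spark code: tasks are modelled as lists of rows, each task is assumed to write one file per distinct partition value it holds, and the function names and sample data are made up for the example.

```python
from typing import List, Tuple

Row = Tuple[str, int]  # (partition column value, payload)


def count_output_files(tasks: List[List[Row]]) -> int:
    """Each task writes one file per distinct partition value it holds."""
    return sum(len({key for key, _ in rows}) for rows in tasks)


def repartition_by_key(tasks: List[List[Row]], num_tasks: int) -> List[List[Row]]:
    """Send every row to the task chosen by its key, like a hash repartition."""
    shuffled: List[List[Row]] = [[] for _ in range(num_tasks)]
    for rows in tasks:
        for key, payload in rows:
            shuffled[hash(key) % num_tasks].append((key, payload))
    return shuffled


# Hypothetical data: three partner IDs spread over three tasks.
tasks = [
    [("a", 1), ("b", 2), ("c", 3)],
    [("a", 4), ("b", 5), ("c", 6)],
    [("a", 7), ("b", 8), ("c", 9)],
]

print(count_output_files(tasks))                         # 9
print(count_output_files(repartition_by_key(tasks, 3)))  # 3
```

Once all rows with the same key sit in one task, the number of files drops to the number of distinct keys, which is the point of repartitioning by the partition columns before partitionBy.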
I want to create a Spark Streaming application coded in Scala.
I want my application to:
read from a HDFS Text File line by line
analyze every line as String and if needed modify it and:
keep state that is needed for the analysis in some kind of data structures (Hashes probably)
output everything to text files (any kind)
I've had no problems with the first step:
val lines = ssc.textFileStream("hdfs://localhost:9000/path/")
My analysis consists of searching the Hashes for a match on some fields of the String being analyzed, which is why I need to maintain state and run the process iteratively.
The data in those Hashes is also extracted from the strings being analyzed.
What can I do for next steps?
Since you just have to read one HDFS text file line by line, you probably do not need Spark Streaming for that. You can just use Spark.
val lines = sparkContext.textFile("...")
Then you can use mapPartitions to process the whole partitioned file in a distributed way.
val processedLines = lines.mapPartitions { partitionAsIterator =>
processPartitionAndReturnNewIterator(partitionAsIterator)
}
In that function, you can walk through the lines in the partition, store state in a hashmap, etc., and finally return another iterator of output records corresponding to that partition.
If you want to share state across partitions, you will probably have to do some further aggregation such as groupByKey() or reduceByKey() on the processedLines dataset.
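A per-partition function of that shape can be sketched in plain Python. The line format ("key,value") and the rule of annotating repeated keys are assumptions chosen purely for illustration; the question does not say what the real analysis looks like.

```python
from typing import Dict, Iterable, Iterator


def process_partition(lines: Iterable[str]) -> Iterator[str]:
    """Walk one partition's lines, keeping per-partition state in a dict."""
    first_seen: Dict[str, str] = {}  # state: key -> first value seen for it
    for line in lines:
        key, _, value = line.partition(",")
        if key in first_seen:
            # Hypothetical rule: annotate repeats with the first value for that key.
            yield f"{key},{value},first={first_seen[key]}"
        else:
            first_seen[key] = value
            yield line


sample = ["u1,login", "u2,login", "u1,logout"]
print(list(process_partition(sample)))
# ['u1,login', 'u2,login', 'u1,logout,first=login']
```

With PySpark the same function could be passed to rdd.mapPartitions(process_partition); the Scala version in the answer has the same iterator-in, iterator-out shape. The dict is local to each partition, so a key that appears in two partitions is treated as new in both.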