In PySpark I would like to change the number of partitions when I load the data.
df_sp = spark.read\
.format('csv')\
.option("header", "true")\
.option("mode", "FAILFAST")\
.option("inferSchema", "true")\
.option("sep", ",")\
.load(os.path.join(dirPath, nameFile))
Using PySpark, is it possible to tune the number of partitions at load time?
Yes, change spark.sql.files.maxPartitionBytes. It defaults to 134217728 bytes (128 MB), which is the maximum amount of file data packed into a single partition when reading.
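For instance, a minimal sketch of lowering that setting before reading (shown in Scala; the same configuration key can be set from PySpark with spark.conf.set, and the path is just a placeholder):
// Lower the maximum bytes packed into one file-based partition so the CSV
// is split into more, smaller partitions at load time (default: 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", "33554432")  // 32 MB, illustrative value

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/file.csv")  // placeholder path

println(df.rdd.getNumPartitions)  // should be larger than with the default setting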
You can also call repartition(numPartitions) after loading, but bear in mind that it involves a full shuffle, so use it with care. Alternatively, if you only want to decrease the number of partitions, you can use coalesce.
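For instance, a small sketch of the difference, assuming df is the DataFrame loaded above:
// repartition performs a full shuffle and can increase or decrease the count;
// coalesce avoids a full shuffle but can only decrease it.
val more  = df.repartition(200)
val fewer = df.coalesce(10)

println(more.rdd.getNumPartitions)   // 200
println(fewer.rdd.getNumPartitions)  // 10 (assuming df had more than 10 partitions)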
I have a Spark job that writes data as Parquet to S3.
val partitionCols = Seq("date", "src")
df
.coalesce(10)
.write
.mode(SaveMode.Overwrite)
.partitionBy(partitionCols: _*)
.parquet(params.outputPathParquet)
When I run the job on EMR, it overwrites all the partitions and writes the data to S3, e.g. the data looks like this:
s3://foo/date=2021-01-01/src=X
s3://foo/date=2021-11-01/src=X
s3://foo/date=2021-10-01/src=X
where
params.outputPathParquet = s3://foo
When I run the job for another day, e.g. 2021-01-02, it replaces all existing partitions and the data looks like the following:
s3://foo/date=2021-01-02/src=X
Any ideas what might be happening?
If you just need to append data, you can change the SaveMode:
.mode(SaveMode.Append)
If you need to overwrite only specific partitions, take a look at this question: Overwrite specific partitions in spark dataframe write method.
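One approach from that question (available since Spark 2.3) is dynamic partition overwrite, which makes SaveMode.Overwrite replace only the partitions present in the DataFrame being written. A rough sketch, reusing the names from your snippet:
import org.apache.spark.sql.SaveMode

// With "dynamic" mode, Overwrite only replaces the date=/src= directories
// that actually appear in df, leaving the other partitions under s3://foo intact.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.coalesce(10)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy(partitionCols: _*)
  .parquet(params.outputPathParquet)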
I am trying to partition my DataFrame and write it to a Parquet file. It seems to me that repartitioning works on the DataFrame in memory but does not affect the Parquet partitioning. What is even stranger is that coalesce works. Let's say I have a DataFrame df:
df.rdd.partitions.size
4000
var df_new = df.repartition(20)
df_new.rdd.partitions.size
20
However, when I try to write a parquet file I get the following:
df_new.write.parquet("test.parquet")
[Stage 0:> (0 + 8) / 4000]
which would create 4000 files. However, if I do this instead, I get the following:
var df_new = df.coalesce(20)
df_new.write.parquet("test.parquet")
[Stage 0:> (0 + 8) / 20]
So I can reduce the number of partitions as I want. The problem is that when I need to increase the number of partitions, I cannot do it. For example, if I have 8 partitions and I try to increase them to 100, it always writes only 8.
Does somebody know how to fix this?
First of all, you should not provide a file path to the parquet() method, but a folder instead. Spark will handle the parquet filenames on its own.
Then, you must be aware that coalesce only reduces the number of partitions (without shuffle) while repartition lets you re-partition (with shuffle) your DataFrame in any number of partitions you need (more or less). Check out this SO question for more details on repartition vs. coalesce.
In your case, you want to increase the number of partitions, so you need to use repartition:
df.repartition(20).write.parquet("/path/to/folder")
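As a quick sanity check (a sketch assuming a SparkSession named spark), coalesce silently caps at the current partition count, while repartition does not:
val df8 = spark.range(1000).toDF("id").repartition(8)

println(df8.coalesce(100).rdd.getNumPartitions)     // still 8: coalesce cannot increase
println(df8.repartition(100).rdd.getNumPartitions)  // 100: repartition shuffles into more partitions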
I want to understand how Spark determines the number of CSV files it creates when saving a DataFrame as CSV. Does the number of partitions affect this number? And why are some empty files created? My code looks like this:
dataframe.coalesce(numPartitions).write
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.option("header", "true")
.mode("overwrite")
.save("outputpath")
You get multiple files when you save as CSV (or any other format) because your DataFrame has multiple partitions. If your DataFrame has n partitions, then you get n files in the output.
Does the number of partitions affect this number?
Yes, the number of partitions is equal to the number of files. When saving the DataFrame/RDD, each partition is written as a single file.
why are some empty files created?
Not all partitions necessarily contain data; an empty partition still results in an (empty) output file.
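For example, a small sketch (assuming the built-in CSV writer of Spark 2.x+ and placeholder output paths) showing how the partition count drives the file count:
// 100 rows spread over 10 partitions -> roughly 10 part-*.csv files;
// an empty partition would still yield an (empty) part file.
val df = spark.range(100).toDF("value")

df.repartition(10)
  .write
  .option("delimiter", "|")
  .option("header", "true")
  .mode("overwrite")
  .csv("/tmp/out_ten_files")   // placeholder path

df.coalesce(1)                 // single partition -> a single part file
  .write
  .option("delimiter", "|")
  .option("header", "true")
  .mode("overwrite")
  .csv("/tmp/out_one_file")    // placeholder path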
Hope this helps!
According to the docs of Spark 1.6.3, repartition(partitionExprs: Column*) should preserve the number of partitions in the resulting dataframe:
Returns a new DataFrame partitioned by the given partitioning
expressions preserving the existing number of partitions
(taken from https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.sql.DataFrame)
But the following example seems to show something else (note that spark-master is local[4] in my case):
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x")
myDF.rdd.getNumPartitions // 4
myDF.repartition($"x").rdd.getNumPartitions // 200 !
How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally in IntelliJ IDEA)
Edit: This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but rather why the docs say something different from what I observe in my example.
It is related to the Tungsten project, which is enabled in Spark and applies hardware-aware optimizations. Repartitioning by a column uses hash partitioning, which triggers a shuffle operation, and by default spark.sql.shuffle.partitions is set to 200. You can verify this by calling explain on your DataFrame before and after repartitioning:
myDF.explain
val repartitionedDF = myDF.repartition($"x")
repartitionedDF.explain
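To make the example line up with the quoted docs, you can either shrink the shuffle setting or pass an explicit partition count together with the column; a sketch against the 1.6-style API from the question:
// Option 1: lower the shuffle partition count before repartitioning by column
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
myDF.repartition($"x").rdd.getNumPartitions      // 4

// Option 2: keep the default and state the target count explicitly
myDF.repartition(4, $"x").rdd.getNumPartitions   // 4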
I am new to Apache Spark and am using the Scala API. I have 2 questions regarding RDDs.
1. How can I persist some partitions of an RDD, instead of the entire RDD, in Apache Spark? (The core RDD implementation provides the rdd.persist() and rdd.cache() methods, but I do not want to persist the entire RDD; I am interested in persisting only some of its partitions.)
2. How can I create one empty partition while creating each RDD? (I am using the repartition and textFile transformations. In these cases I can get the expected number of partitions, but I also want one empty partition for each RDD.)
Any help is appreciated.
Thanks in advance