How to force Glue to create smaller files in S3 [duplicate] - scala

Suppose within an AWS Glue job, one sees the following output in the logs:
21/07/27 18:25:36 INFO DAGScheduler: Got job 1 (toPandas at /tmp/test.py:742) with 100000 output partitions
Does Spark dynamically set the number of output partitions? Is there any way to set the number of output partitions in advance for a particular job?

You can try the following methods on your dataframe:
repartition() - when you want to increase the number of partitions
coalesce() - when you want to decrease the number of partitions.
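For example, a minimal sketch in Scala, assuming the Glue DynamicFrame has already been converted to a Spark DataFrame; the partition count and S3 paths are illustrative:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("glue-output-partitions").getOrCreate()
val df = spark.read.parquet("s3://my-bucket/input/")  // illustrative source
// coalesce: merge down to 10 partitions without a full shuffle
df.coalesce(10).write.mode("overwrite").parquet("s3://my-bucket/output/")
// repartition: force exactly 10 partitions, at the cost of a shuffle
// df.repartition(10).write.mode("overwrite").parquet("s3://my-bucket/output/")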

Related

How to partition a table in Databricks by data-size/row count not by column

I've seen Databricks examples that use the partitionBy method. But partitions are recommended to be around 128 MB, so I'd think there is a way to get close to that: take the total size, divide it by 128 MB, and partition by a number of partitions rather than by a dimension.
Any suggestions for how this is achieved would be appreciated.
The setting spark.sql.files.maxPartitionBytes does indeed affect the maximum size of the partitions when reading data into the Spark cluster. By using this configuration you can control partitioning based on the size of the data.
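Two sketches of that idea; the paths are illustrative, and measuring the total input size via the Hadoop FileSystem API is an assumption about how you might obtain it:
import org.apache.hadoop.fs.{FileSystem, Path}
// Option 1: cap the partition size at read time (value in bytes; 128 MB here).
spark.conf.set("spark.sql.files.maxPartitionBytes", (128L * 1024 * 1024).toString)
val df = spark.read.parquet("/path/to/table")
// Option 2: derive a partition count from the on-disk size / 128 MB and repartition explicitly.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path("/path/to/table")).getLength
val numPartitions = math.max(1, (totalBytes / (128L * 1024 * 1024)).toInt)
df.repartition(numPartitions).write.parquet("/path/to/output")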

PySpark - Does Coalesce(1) Retain the Order of Range Partitioning?

Looking into the Spark UI and physical plan, I found that orderBy is accomplished by Exchange rangepartitioning(col#0000 ASC NULLS FIRST, 200) and then Sort [col#0000 ASC NULLS FIRST], true, 0.
From what I understand, rangepartitioning would define minimum and maximum values for each partition and order the data with column value within the min and max into that partition so as to achieve global ordering.
But now I have 200 partitions and I want to output a single CSV file. If I do a repartition(1), Spark will trigger a shuffle and the ordering will be gone. However, I tried coalesce(1) and it retained the global ordering. Yet I don't know whether that was merely luck, since coalesce does not necessarily decrease the number of partitions while keeping their order. Does anyone know how to repartition while keeping the ordering after rangepartitioning? Thanks a lot.
As you state yourself, maintaining order is not part of the coalesce API contract. You have to choose:
collect the ordered dataframe as a list of Row instances and write the CSV outside Spark (a sketch of this option follows below), or
write the partitions to individual CSV files with Spark and concatenate them with some other tool, e.g. "hadoop fs -getmerge" on the command line.
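A minimal sketch of the first option in Scala, assuming the result is small enough to collect on the driver; the column name and paths are illustrative:
import java.io.PrintWriter
val ordered = spark.read.parquet("/path/to/input").orderBy("col")
val writer = new PrintWriter("/tmp/ordered_output.csv")  // local file on the driver
try {
  // Rows arrive on the driver in the globally sorted order.
  ordered.collect().foreach(row => writer.println(row.mkString(",")))
} finally {
  writer.close()
}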

Number of Parallel Tasks in Spark Streaming and Kafka Integration

I am very new to Spark Streaming. I have some basic doubts; can someone please help me clarify this:
My message size is standard, 1 KB each.
The number of topic partitions is 30, and I am using the DStream approach to consume messages from Kafka.
Number of cores given to the Spark job:
(spark.max.cores=6 | spark.executor.cores=2)
As I understand it, the number of Kafka partitions = the number of RDD partitions. In this case, with the DStream approach:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // process the records of this partition
  }
}
Question: Will this foreachPartition loop execute 30 times, since there are 30 Kafka partitions?
Also, since I have given 6 cores, how many partitions will be consumed in parallel from Kafka?
Questions: Is it 6 partitions at a time,
or
30/6 = 5 partitions at a time?
Can someone please give a little detail on how exactly this works in the DStream approach?
"Is it 6 partitions at a time or
30/6 =5 partitions at a time?"
As you said already, the resulting RDDs within the Direct Stream will match the number of partitions of the Kafka topic.
On each micro-batch Spark will create 30 tasks, one to read each partition. As you have set the maximum number of cores to 6, the job is able to read 6 partitions in parallel; as soon as one of those tasks finishes, the next partition can be consumed.
Remember, even if there is no new data in one of the partitions, the resulting RDD still has 30 partitions, so yes, the foreachPartition loop will iterate 30 times within each micro-batch.
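For reference, a minimal Scala sketch of the direct-stream setup described above (spark-streaming-kafka-0-10), with hypothetical broker, topic, and group names; the core settings mirror the question:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val conf = new SparkConf()
  .setAppName("kafka-dstream-example")
  .set("spark.cores.max", "6")        // cluster-wide core cap
  .set("spark.executor.cores", "2")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",            // hypothetical broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                   // hypothetical group id
  "auto.offset.reset" -> "latest")
// 30 topic partitions => 30 RDD partitions => 30 tasks per micro-batch,
// of which at most 6 run concurrently with the core settings above.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("my-topic"), kafkaParams))
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach(record => println(record.value())) // process each record here
  }
}
ssc.start()
ssc.awaitTermination()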

Does Spark maintain parquet partitioning on read?

I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet, and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Now later on I would like to read the parquet file so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the dataframe partitioned by "DATE"? In other words, if a parquet file is partitioned, does Spark maintain that partitioning when reading it into a dataframe, or is it randomly partitioned?
An explanation of why or why not would also be helpful.
The number of partitions acquired when reading data stored as parquet follows many of the same rules as reading partitioned text:
1. If SparkContext.minPartitions >= the partition count in the data, SparkContext.minPartitions will be returned.
2. If the partition count in the data >= SparkContext.parallelism, SparkContext.parallelism will be returned, though in some very-small-partition cases rule 3 may apply instead.
3. Finally, if the partition count in the data is somewhere between SparkContext.minPartitions and SparkContext.parallelism, generally you'll see the partitions reflected in the dataset partitioning.
Note that it's rare for a partitioned parquet file to have full data locality for a partition, meaning that, even when the partitions count in data matches the read partition count, there is a strong likelihood that the dataset should be repartitioned in memory if you're trying to achieve partition data locality for performance.
Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding minPartitions and parallelism settings apply here as well.
val df = spark.read.parquet("/path/to/parquet/file")
df.repartition(col("DATE"))
You would get a number of partitions based on the Spark config spark.sql.files.maxPartitionBytes, which defaults to 128 MB, and the data would not be partitioned in memory by the column that was used with partitionBy when writing.
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html
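A quick way to observe this in spark-shell; the path is illustrative, and the partition counts are only approximate:
val df1 = spark.read.parquet("/path/to/parquet/file")
println(df1.rdd.getNumPartitions)  // roughly total data size / 128 MB with the default setting
// Halve the target partition size; re-reading yields roughly twice as many partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", (64L * 1024 * 1024).toString)
println(spark.read.parquet("/path/to/parquet/file").rdd.getNumPartitions)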
In your question, there are two ways we could say the data are being "partitioned", which are:
via repartition, which uses a hash partitioner to distribute the data into a specific number of partitions. If, as in your question, you don't specify a number, the value in spark.sql.shuffle.partitions is used, which has default value 200. A call to .repartition will usually trigger a shuffle, which means the partitions are now spread across your pool of executors.
via partitionBy, which is a method specific to a DataFrameWriter that tells it to partition the data on disk according to a key. This means the data written are split across subdirectories named according to your partition column, e.g. /path/to/parquet/file/DATE=<individual DATE value>. In this example, only rows with a particular DATE value are stored in each DATE= subdirectory.
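A minimal sketch contrasting the two uses just described, reusing the DATE column and path from the question (df stands for the question's dataframe):
import org.apache.spark.sql.functions.col
// In-memory partitioning: hash-distributes rows across executors; with no explicit
// number, spark.sql.shuffle.partitions (default 200) determines the partition count.
val redistributed = df.repartition(col("DATE"))
// On-disk partitioning: writes one subdirectory per DATE value, e.g. .../DATE=2021-01-01/
redistributed.write.partitionBy("DATE").parquet("/path/to/parquet/file")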
Given these two uses of the term "partitioning," there are subtle aspects to answering your question. Since you used partitionBy and asked whether Spark "maintains the partitioning", I suspect what you're really curious about is whether Spark will do partition pruning, a technique used to drastically improve the performance of queries that have filters on a partition column. If Spark knows the values you seek cannot be in certain subdirectories, it won't waste any time reading those files, and hence your query completes much quicker.
If the way you're reading the data isn't partition-aware, you'll get a number of partitions something like what's in bsplosion's answer. Spark won't employ partition pruning, and hence you won't get the benefit of Spark automatically skipping the reading of certain files to speed things up1.
Fortunately, reading parquet files in Spark that were written with partitionBy is a partition-aware read. Even without a metastore like Hive that tells Spark the files are partitioned on disk, Spark will discover the partitioning automatically. Please see partition discovery in Spark for how this works in parquet.
I recommend testing reading your dataset in spark-shell so that you can easily see the output of .explain, which will let you verify that Spark correctly finds the partitions and can prune out the ones that don't contain data of interest in your query. A nice writeup on this can be found here. In short, if you see PartitionFilters: [], it means that Spark isn't doing any partition pruning. But if you see something like PartitionFilters: [isnotnull(date#3), (date#3 = 2021-01-01)], Spark is only reading in a specific set of DATE partitions, and hence the query execution is usually a lot faster.
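For instance, a quick check in spark-shell; the path and date value are illustrative:
import org.apache.spark.sql.functions.col
val df = spark.read.parquet("/path/to/parquet/file")
// A non-empty PartitionFilters entry in the physical plan means pruning is happening.
df.filter(col("DATE") === "2021-01-01").explain()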
1A separate detail is that parquet stores statistics about data in its columns inside of the files themselves. If these statistics can be used to eliminate chunks of data that can't match whatever filtering you're doing, e.g. on DATE, then you'll see some speedup even if the way you read the data isn't partition-aware. This is called predicate pushdown. It works because the files on disk will still contain only specific values of DATE when using .partitionBy. More info can be found here.

How to determine the number of partitions of an RDD in Spark given the number of cores and executors?

What will be the number of partitions for 10 nodes cluster with 20 executors and code reading a folder with 100 files?
It differs depending on the mode in which you are running, and you can tune it using the spark.default.parallelism setting. From the Spark documentation:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Link to related Documentation:
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
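For instance, a small sketch; the value 40 is arbitrary:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("parallelism-example")
  .config("spark.default.parallelism", "40") // overrides the cluster-manager default
  .getOrCreate()
// With no parent RDD and no explicit numSlices, parallelize uses spark.default.parallelism.
val rdd = spark.sparkContext.parallelize(1 to 1000)
println(rdd.getNumPartitions) // 40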
You can also change the number of partitions yourself depending on the data you are reading; some of the Spark APIs provide an additional parameter for the number of partitions.
Furthermore, to check how many partitions are getting created, do as Sandeep Purohit says:
rdd.getNumPartitions
and it will return the number of partitions that were created.
You can also change the number of partitions after the RDD is created by using two APIs, namely coalesce and repartition.
Link to Coalesce and Repartition: Spark - repartition() vs coalesce()
From the Spark doc:
By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
Number of partitions also depends upon the size of the file. If the file size is too big, you may choose to have more partitions.
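As a small sketch, asking for a higher minimum number of partitions when reading a large file; the path and count are illustrative:
// One partition per HDFS block by default; request at least 100 splits explicitly.
val lines = spark.sparkContext.textFile("hdfs:///data/big-file.txt", minPartitions = 100)
println(lines.getNumPartitions) // at least ~100 for a sufficiently large, splittable file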
The number of partitions for an RDD of Scala/Java objects will depend on the cores of the machines, and if you are creating the RDD from Hadoop input files, it will depend on the HDFS block size (version dependent). You can find the number of partitions of an RDD as follows:
rdd.getNumPartitions