Data distribution in Spark DataFrames while reading files from S3 - Scala

I am trying to read 1 TB of Parquet data from S3 into Spark DataFrames and have assigned 80 executors, each with 30 GB of memory and 5 cores, to process and ETL the data.
However, I am seeing that the data is not distributed equally among the executors, so not all cores are used while reading. My understanding is that the input is divided into chunks that are then distributed equally among the executors for processing. I am not using any shuffles or joins, and the explain plan does not show any hash partitioning or aggregations. Please suggest whether this is expected and how we can redistribute the data to make use of all the cores.

Related

spark repartition issue for filesize

I need to merge small Parquet files.
I have multiple small Parquet files in HDFS.
1. I would like to combine them into files of roughly 128 MB each.
2. So I read all the files using spark.read(), called repartition() on the result, and wrote it back to an HDFS location.
My issue is that I have approximately 7.9 GB of data, but after repartitioning and saving to HDFS it grows to nearly 22 GB.
I have tried repartition, range, and coalesce, but none of them solved the problem.
I think it may be connected with your repartition operation. You are using .repartition(10), so Spark uses round-robin partitioning to redistribute your data, which probably changes the row order. The order of the data matters for compression: Parquet's encodings work best when similar values sit next to each other. You can read more in this question.
You may try adding a sort, or repartitioning by an expression instead of only by a number of partitions, to optimize the file size.
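Why row order affects file size can be seen with a toy run-length count. This is only an illustration in plain Scala, not Parquet's actual encoder; the object name RunLengthDemo and the sample values are made up for the example.

```scala
object RunLengthDemo {
  // Count runs of equal adjacent values; a run-length encoder
  // would store roughly one entry per run.
  def countRuns[A](values: Seq[A]): Int =
    if (values.isEmpty) 0
    else 1 + values.zip(values.tail).count { case (a, b) => a != b }

  def main(args: Array[String]): Unit = {
    // Hypothetical column values after a round-robin style redistribution.
    val interleaved = Seq("DE", "FR", "DE", "FR", "DE", "FR")
    println(countRuns(interleaved))        // many short runs
    println(countRuns(interleaved.sorted)) // few long runs
  }
}
```

The same values produce far fewer runs once sorted, which is the effect a sort or a repartition by expression before the write is meant to restore.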

Can I use Kafka for multiple independent consumers sequential reads?

I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting up a Kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting with "earliest")?
An alternative is to set up HDFS-style storage (we use Azure, so it's called a Storage Account) and simply supply a mount point. However, I do not have control of the throughput in this case.
Throughput calculation:
let's say 25 consumers run concurrently, each reading at 30MB/s -> 750MB/s .
Assuming data is read from disk, and disk rate is 50MB/s, I need to read concurrently from 750/50 = 15 disks.
Does it mean I need to have 15 brokers? I did not see how one broker can allocate partitions to several disks attached to it.
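The estimate above can be written down as a few lines of plain Scala so the inputs are easy to vary. The object and method names are made up for this sketch, and the disk figure assumes every byte is read from disk, which is pessimistic.

```scala
object ThroughputEstimate {
  // Aggregate read rate in MB/s for a number of concurrent consumers.
  def aggregateMBps(consumers: Int, perConsumerMBps: Double): Double =
    consumers * perConsumerMBps

  // Disks needed if every byte comes from disk at diskMBps
  // (ignores any caching, so this is an upper bound).
  def disksNeeded(totalMBps: Double, diskMBps: Double): Int =
    math.ceil(totalMBps / diskMBps).toInt

  // Dataset size in GB (decimal units) from message count and size in KB.
  def datasetGB(messages: Long, messageKB: Double): Double =
    messages * messageKB / 1e6
}
```

With the numbers from the question, aggregateMBps(25, 30.0) gives 750 MB/s, disksNeeded(750.0, 50.0) gives 15, and datasetGB(600000000L, 1.3) comes to roughly 780 GB.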
similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
You wrote that "processing will probably be in Spark, but not mandatory" and that "an alternative is to set an HDFS storage (we use Azure)".
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
If you want to use Kafka, don't base the consumption rate on disk speed alone, especially since Kafka can do zero-copy transfers. Use the kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than the timestamp that you can order by, then use that.
It's not really clear whether each of the 50 students does the same processing on the dataset, or whether some precomputation can be done. If it can, Kafka Streams KTables can be set up to aggregate some static statistics of the data as it is streamed through a topic. That way you can distribute the load for those queries and do not need 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, Timescale, or Influx, or maybe Druid, any of which could also be used with Spark or queried directly.
If you are using Apache Spark 3.0+, there are ways around the one-consumer-per-partition bound, since it can use more executor threads than there are partitions, so it's mostly about how fast your network and disks are.
Kafka serves recently written data from the operating system's page cache, so for your use case many of the reads will probably come from memory.
The Spark documentation describes the minPartitions option as follows:
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html
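One way to picture why the resulting task count is only approximately minPartitions is a proportional split with rounding. The following is a hypothetical model in plain Scala, not Spark's actual implementation; the object name, the method name, and the input sizes are all assumptions made for illustration. In Spark itself the value is passed to the Kafka source with .option("minPartitions", ...).

```scala
object MinPartitionsSketch {
  // Hypothetical model: give each Kafka partition a share of minPartitions
  // proportional to its pending offsets, with at least one Spark partition each.
  def sparkPartitionsPer(pendingOffsets: Seq[Long], minPartitions: Int): Seq[Int] = {
    val total = pendingOffsets.sum.toDouble
    pendingOffsets.map { size =>
      if (total == 0) 1
      else math.max(math.round(size / total * minPartitions).toInt, 1)
    }
  }
}
```

In this model, two equally sized Kafka partitions with minPartitions = 4 are split into two pieces each, while a partition with no new data still gets one piece, so the total can exceed the requested value.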

Spark DataFrame cache keeps growing

How does spark decide how many times to replicate a cached partition?
The storage level in the Storage tab of the Spark UI says "Disk Serialized 1x Replicated", but it looks like partitions get replicated onto multiple executors. We have noticed this happening with the DISK_ONLY storage level using Spark 2.3. We are caching a dataset with 101 partitions (size on disk is 468.4 GB). Data is distributed initially on 101 executors (we have 600 executors in total). As we run queries on this dataset, the size on disk grows, as does the number of executors the data is distributed on. We also noticed that one block/partition is commonly replicated on multiple executors on the same node. If it is stored on disk, why is it not shared between executors on the same node?
import org.apache.spark.storage.StorageLevel
val persistedDs = dataset.repartition(101).persist(StorageLevel.DISK_ONLY)
[Screenshot: initial load]
[Screenshot: after running queries on the cached dataset]
One executor can have 2 partitions cached in it. Also, note that the RDD is cached multiple times in the attached screenshot.
[Screenshot: data distribution on 101 executors]

Does Spark maintain parquet partitioning on read?

I am having a lot of trouble finding the answer to this question. Let's say I write a DataFrame to Parquet and use repartition combined with partitionBy to get a nicely partitioned Parquet file, as shown below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Now later on I would like to read the parquet file so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the DataFrame partitioned by "DATE"? In other words, if a Parquet file is partitioned, does Spark maintain that partitioning when reading it into a DataFrame, or is it randomly partitioned?
An explanation of why or why not would be helpful as well.
The number of partitions acquired when reading data stored as Parquet follows many of the same rules as reading partitioned text:
1. If SparkContext.minPartitions >= the partition count in the data, SparkContext.minPartitions will be returned.
2. If the partition count in the data >= SparkContext.parallelism, SparkContext.parallelism will be returned, though in some very small partition cases, #3 may be true instead.
3. Finally, if the partition count in the data is somewhere between SparkContext.minPartitions and SparkContext.parallelism, generally you'll see the partitions reflected in the dataset partitioning.
Note that it's rare for a partitioned parquet file to have full data locality for a partition, meaning that, even when the partitions count in data matches the read partition count, there is a strong likelihood that the dataset should be repartitioned in memory if you're trying to achieve partition data locality for performance.
Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding minPartitions and parallelism settings apply here as well.
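import org.apache.spark.sql.functions.col
val df = spark.read.parquet("/path/to/parquet/file")
// repartition returns a new DataFrame, so keep the result
val byDate = df.repartition(col("DATE"))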
The number of partitions you get is based on the Spark config spark.sql.files.maxPartitionBytes, which defaults to 128 MB. The data will not be partitioned by the partition column that was used while writing.
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html
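How a size-based setting turns into a partition count can be sketched with a simplified model. This is plain Scala written for illustration and is an assumption about the general shape of the calculation, not a copy of Spark's code: it uses spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB) as defaults, and the object and method names are made up.

```scala
object FileSplitSketch {
  val MiB: Long = 1024L * 1024L

  // Simplified model of choosing a target split size for file-based sources:
  // cap at maxPartitionBytes, but shrink splits when there is little data per core.
  def maxSplitBytes(totalBytes: Long, numFiles: Long, parallelism: Int,
                    maxPartitionBytes: Long = 128 * MiB,
                    openCostInBytes: Long = 4 * MiB): Long = {
    val bytesPerCore = (totalBytes + numFiles * openCostInBytes) / parallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Rough estimate of the partition count: total bytes over split size, rounded up.
  def approxPartitions(totalBytes: Long, splitBytes: Long): Long =
    (totalBytes + splitBytes - 1) / splitBytes
}
```

Under this model, 1 TiB of input read with 400 cores is capped at 128 MiB splits, which gives on the order of 8192 read partitions regardless of how the files were partitioned on write.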
In your question, there are two ways we could say the data are being "partitioned", which are:
via repartition, which uses a hash partitioner to distribute the data into a specific number of partitions. If, as in your question, you don't specify a number, the value in spark.sql.shuffle.partitions is used, which has default value 200. A call to .repartition will usually trigger a shuffle, which means the partitions are now spread across your pool of executors.
via partitionBy, which is a method specific to a DataFrameWriter that tells it to partition the data on disk according to a key. This means the data written are split across subdirectories named according to your partition column, e.g. /path/to/parquet/file/DATE=<individual DATE value>. In this example, only rows with a particular DATE value are stored in each DATE= subdirectory.
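The first meaning, hash repartitioning, can be modelled in a few lines of plain Scala. This sketch uses hashCode as a stand-in for Spark's own hash function, so the actual partition ids will differ from what Spark assigns; the object and method names are hypothetical.

```scala
object HashPartitionSketch {
  // Map a key to a partition id in [0, numPartitions) using a non-negative modulo.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group keys by the partition they hash to; equal keys always land together.
  def distribute[K](keys: Seq[K], numPartitions: Int): Map[Int, Seq[K]] =
    keys.groupBy(k => partitionFor(k, numPartitions))
}
```

The point of the model is that all rows with the same DATE end up in the same in-memory partition, which is independent of how partitionBy later lays the files out on disk.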
Given these two uses of the term "partitioning," there are subtle aspects in answering your question. Since you used partitionBy and asked if Spark "maintains the partitioning", I suspect what you're really curious about is whether Spark will do partition pruning, which is a technique used to drastically improve the performance of queries that have filters on a partition column. If Spark knows the values you seek cannot be in specific subdirectories, it won't waste any time reading those files, and hence your query completes much quicker.
If the way you're reading the data isn't partition-aware, you'll get a number of partitions something like what's in bsplosion's answer. Spark won't employ partition pruning, and hence you won't get the benefit of Spark automatically skipping certain files to speed things up [1].
Fortunately, reading parquet files in Spark that were written with partitionBy is a partition-aware read. Even without a metastore like Hive that tells Spark the files are partitioned on disk, Spark will discover the partitioning automatically. Please see partition discovery in Spark for how this works in parquet.
I recommend testing reading your dataset in spark-shell so that you can easily see the output of .explain, which will let you verify that Spark correctly finds the partitions and can prune out the ones that don't contain data of interest in your query. A nice writeup on this can be found here. In short, if you see PartitionFilters: [], it means that Spark isn't doing any partition pruning. But if you see something like PartitionFilters: [isnotnull(date#3), (date#3 = 2021-01-01)], Spark is only reading in a specific set of DATE partitions, and hence the query execution is usually a lot faster.
[1] A separate detail is that Parquet stores statistics about the data in its columns inside the files themselves. If these statistics can be used to eliminate chunks of data that can't match whatever filtering you're doing, e.g. on DATE, then you'll see some speedup even if the way you read the data isn't partition-aware. This is called predicate pushdown. It works because the files on disk will still contain only specific values of DATE when using .partitionBy. More info can be found here.
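The idea behind partition discovery and pruning in this answer can be illustrated without Spark. The sketch below is plain Scala and only a conceptual model: it reads key=value directory names out of paths and filters on them without opening any file. The object name, method names, and the part-0.parquet file names are hypothetical.

```scala
object PartitionPruningSketch {
  // Extract key=value directory segments from a path,
  // e.g. ".../DATE=2021-01-01/part-0.parquet" yields DATE -> 2021-01-01.
  def partitionValues(path: String): Map[String, String] =
    path.split('/').toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(k, v) if k.nonEmpty => Some(k -> v)
        case _                         => None
      }
    }.toMap

  // Keep only files whose partition column matches the filter value.
  def prune(paths: Seq[String], column: String, value: String): Seq[String] =
    paths.filter(p => partitionValues(p).get(column).contains(value))
}
```

A filter on DATE can therefore discard whole subdirectories from the file listing alone, which is what a non-empty PartitionFilters entry in the explain output indicates.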

How to determine the number of partitions of an RDD in Spark given the number of cores and executors?

What will be the number of partitions for 10 nodes cluster with 20 executors and code reading a folder with 100 files?
It depends on the mode you are running in, and you can tune it using the spark.default.parallelism setting. From the Spark documentation:
For operations like parallelize with no parent RDDs, it depends on
the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Link to related Documentation:
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
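The quoted rules for operations with no parent RDD translate directly into a small function. This is a plain Scala restatement of the documentation text, not Spark's code; the Mode type and the parameter names are invented for the sketch, and the core counts passed in are whatever your cluster actually provides.

```scala
object DefaultParallelismSketch {
  sealed trait Mode
  case object Local extends Mode
  case object MesosFineGrained extends Mode
  case object Other extends Mode

  // Default for operations like parallelize with no parent RDD, per the quoted rules.
  def defaultParallelism(mode: Mode, localCores: Int, totalExecutorCores: Int): Int =
    mode match {
      case Local            => localCores
      case MesosFineGrained => 8
      case Other            => math.max(totalExecutorCores, 2)
    }
}
```

For example, a cluster in the "Others" category whose executors add up to 100 cores would default to 100, while a single-core setup would still default to 2.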
You can change the number of partitions yourself depending on the data that you are reading. Some of the Spark APIs provide an additional setting for the number of partitions.
To check how many partitions are being created, do as @Sandeep Purohit says:
rdd.getNumPartitions
This returns the number of partitions that were created.
You can also change the number of partitions after the RDD is created by using two APIs, namely coalesce and repartition.
Link to Coalesce and Repartition : Spark - repartition() vs coalesce()
From Spark doc:
By default, Spark creates one partition for each block of the file
(blocks being 64MB by default in HDFS), but you can also ask for a
higher number of partitions by passing a larger value. Note that you
cannot have fewer partitions than blocks.
Number of partitions also depends upon the size of the file. If the file size is too big, you may choose to have more partitions.
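The "one partition per block" rule from the quote can be turned into a rough estimate. The sketch below is plain Scala and deliberately simplified: it ignores any requested minimum number of partitions and the small tolerance Hadoop applies to the last split, and the object and method names are made up.

```scala
object BlockPartitionSketch {
  // One partition per block: a file spans ceil(fileBytes / blockBytes) blocks,
  // and even a tiny file occupies at least one partition.
  def partitionsForFile(fileBytes: Long, blockBytes: Long): Long =
    math.max(1L, (fileBytes + blockBytes - 1) / blockBytes)

  // A folder is read file by file, so the per-file counts add up.
  def partitionsForFolder(fileSizes: Seq[Long], blockBytes: Long): Long =
    fileSizes.map(partitionsForFile(_, blockBytes)).sum
}
```

Under these assumptions, a folder of 100 files that are each smaller than one block yields about 100 partitions, no matter how many executors or cores the cluster has.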
For an RDD created from Scala/Java objects, the number of partitions depends on the number of cores of the machines. If you create the RDD from Hadoop input files, it depends on the HDFS block size (which is version dependent). You can find the number of partitions of an RDD as follows:
rdd.getNumPartitions