Combine DataFrames in Pyspark

I have a vendor giving me multiple zipped data files on an S3 bucket, which I need to read all together for analysis using Pyspark. How do I modify the sc.textFile() command?
Also, if I am loading 10 files, how do I reference them? Or are they all going into a single RDD?
On a broader level, how would I tweak the partitions and memory on an Amazon EMR cluster? Each zipped file is 3 MB in size, or 1.3 GB unzipped.
Thanks

You can have a script which moves all the unzipped files into a single directory, and then as part of your Spark code you can refer to that directory:
rdd = sc.textFile("s3://path/to/data/")
As you mentioned, it's 1.3 GB of data, which is not huge for Spark to process. You can let Spark choose the required number of partitions, but you can also define them while creating the RDD.
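For example, a minimal sketch (the path and partition count are placeholders) of reading the whole directory into one RDD while requesting a minimum number of partitions. Note that gzipped files are not splittable, so each .gz file lands in a single partition, which is another reason to unzip first:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read every file under the prefix into one RDD of lines; path and partition
# count are placeholders. Gzip is not splittable, so each .gz file would
# become a single partition unless unzipped first.
rdd = sc.textFile("s3://path/to/data/", minPartitions=32)
print(rdd.getNumPartitions())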
For Amazon EMR, you can spin up smaller nodes based on the type of requirement:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html
Based on the kind of processing (memory intensive / compute intensive), choose the machine type.
HTH

Related

What is Data Locality in Apache Spark?

I am a newbie to apache-spark.
I have a hard time understanding Data Locality in Apache Spark. I have tried to read this article https://data-flair.training/blogs/apache-spark-performance-tuning/
which says "PROCESS_LOCAL" and "NODE_LOCAL". Are there settings I need to configure?
Can someone take an example and explain it to me?
Thanks,
Padd
Data locality in simple terms means doing computation on the node where data resides.
Just to elaborate:
Spark is a cluster computing system. It is not a storage system like HDFS or a NoSQL database; Spark is used to process data stored in such distributed systems.
In a typical installation, Spark is installed on the same nodes as HDFS/NoSQL.
When a Spark application processes data stored in HDFS, Spark tries to place the computation tasks alongside the HDFS blocks.
With HDFS, the Spark driver contacts the NameNode for the DataNodes (ideally local) containing the various blocks of a file or directory, as well as their locations (represented as InputSplits), and then schedules the work to the Spark workers.
Note: Spark's compute nodes / workers should be running on the storage nodes.
This is how data locality is achieved in Spark.
The advantage is a performance gain, as less data is transferred over the network.
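If you do want to tune it, the main knob is spark.locality.wait, which controls how long the scheduler waits for a more local slot before falling back to a less local one. A small sketch with illustrative values (3s is the default, so you usually don't need to change anything):
from pyspark.sql import SparkSession

# Illustrative values only: how long the task scheduler waits for a slot at a
# given locality level (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY)
# before falling back to the next, less local, level.
spark = (SparkSession.builder
         .config("spark.locality.wait", "3s")       # global wait per locality level
         .config("spark.locality.wait.node", "3s")  # node-level override
         .getOrCreate())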
Try the articles below for further reading; there are plenty of other documents you can find on Google as well:
http://www.russellspitzer.com/2017/09/01/Spark-Locality/
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html

Does EMR cluster size matter when reading data from S3 using Spark?

Setup: latest (5.29) AWS EMR, Spark, 1 master and 1 node.
Step 1: I used S3Select to parse a file and collect all the file keys for pulling from S3.
Step 2: Use PySpark to iterate over the keys in a loop and do the following:
spark
.read
.format("s3selectCSV")
.load(key)
.limit(superhighvalue)
.show(superhighvalue)
It took me x number of minutes.
When I increase the cluster to 1 master and 6 nodes, I am not seeing a difference in time. It appears to me that I am not using the increased core nodes.
Everything else, config-wise, is default out of the box; I am not setting anything.
So, my question is: does cluster size matter for reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?
A few things to keep in mind.
Are you sure that the number of executors has actually increased with the added nodes? You can also specify them during spark-submit with --num-executors 6. More nodes doesn't automatically mean more executors are spun up.
Next, what is the size of the CSV file? Around 1 MB? Then you will not see much difference. Make sure it is at least 3-4 GB.
Yes, size does matter. For my use case, sc.parallelize(s3fileKeysList) turned out to be the key.
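To make the parallelize idea concrete, here is a rough sketch; the key list and the per-key function are hypothetical stand-ins for the real S3 Select reads:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

s3_file_keys = ["prefix/file1.csv", "prefix/file2.csv"]  # hypothetical key list

def inspect_key(key):
    # Stand-in for the real per-key work (e.g. an S3 Select read); returning the
    # key keeps the sketch runnable without S3 access.
    return key

# Distributing the keys lets the per-key work run on executors in parallel,
# instead of a sequential driver-side loop over spark.read...load(key).
results = sc.parallelize(s3_file_keys, len(s3_file_keys)).map(inspect_key).collect()
print(results)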

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s""" (query to dedup against prod data) """);
sqlDFProdDedup.repartition($"partition_column")
.write.partitionBy("partition_column")
.mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also increasing the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are a couple of my observations:
Not sure what the coalesce(numPartitions) number actually is, or why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, as the repartition will shuffle the data anyway.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the writing in the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, the parquet files range from approximately 5 GB to 10 GB, which is really huge; I doubt anyone will be able to open them on a local laptop or an EC2 machine unless they use EMR or similar (with huge executor memory, whether reading the whole file or spilling to disk), because the memory requirement will be too high. I recommend creating parquet files of around 1 GB to avoid any of those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can run a quick experiment by simply writing the dataframe with its default partitions.
Which brings me to the point that you really don't need the repartition(), since you already have the write.partitionBy("partition_date") call. Your repartition() call is forcing the dataframe to have at most 30-31 partitions (depending on the number of days in that month), which is what is driving the number of files being written. The write.partitionBy("partition_date") already lays the data out into S3 partitions, and if your dataframe had, say, 90 partitions it would write roughly 3 times faster (3 * 30). df.repartition() is forcing it to slow down. Do you really need 5 GB or larger files?
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case it will most likely size the number of executors for the whole program based on the repartition(number). Instead, you should try df.cache() -> df.count() and then df.write(). What this does is force Spark to use all the available executor cores. I am assuming you are reading the files in parallel; in your current implementation you are likely using only 20-30 cores. One point of caution: as you are using r4/r5 machines, feel free to bump your executor memory up to 48G with 8 cores. I have found 8 cores to be faster for my task than the standard 5-core recommendation.
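A rough PySpark-flavoured sketch of that cache-then-count-then-write ordering; the tiny dataframe and the output path are placeholders for the real deduplicated dataframe:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for the deduplicated dataframe (sqlDFProdDedup in the question).
deduped = spark.createDataFrame(
    [(1, "2020-01-01"), (2, "2020-01-02")], ["id", "partition_column"])

deduped.cache()
deduped.count()  # materializes the full plan across all available executor cores

(deduped.write
        .partitionBy("partition_column")
        .mode("append")
        .parquet("/tmp/example_output"))  # hypothetical output path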
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs as well as or better than G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach, where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found 1 GB files to be acceptable for pandas, Redshift Spectrum, Athena, etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark writes the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This will write the files directly to the final part files instead of initially writing them to temporary files and copying them over to their end-state part files.
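For example, the property can be passed as a regular Spark configuration (a sketch; it can equally go on the spark-submit command line with --conf):
from pyspark.sql import SparkSession

# Algorithm version 2 moves task output directly to the final location instead
# of staging it in a temporary directory and copying it at job commit time.
spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())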
(2) For file size, you can derive it from the average number of bytes per record. Below I am figuring out the number of bytes per record to work out the number of records that fit in 1024 MB. I would try it first with 1024 MB per partition, then move upwards.
import org.apache.spark.util.SizeEstimator
val numberBytes : Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes / 1073741824L  // 1024 MB = 1073741824 bytes
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords/reduceBytesTo1024MB).toInt + 1
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher, since you are outputting Parquet, you can set the Parquet optimized writer to TRUE.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
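A sketch of setting that flag through the session config, together with the spark.sql.files.maxRecordsPerFile cap mentioned earlier (the record count is a placeholder, and the EMR-specific property only takes effect on EMR 5.19+):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # EMR-specific property from point (3); only meaningful on EMR >= 5.19.
         .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
         # Placeholder cap on records per output file, e.g. the value derived in (2).
         .config("spark.sql.files.maxRecordsPerFile", "1000000")
         .getOrCreate())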

Read parquet file to multiple partitions [duplicate]

So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, we have also tried changing the compression of the parquet to none (from gzip). No matter what we do the first stage of the spark job only has a single partition (once a shuffle occurs it gets repartitioned into 100 and thereafter obviously things are much much faster).
Now according to a few sources (like below) parquet should be splittable (even if using gzip!), so I'm super confused and would love some advice.
https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html
I'm using spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions is 200, so it can't be that. In fact all the defaults for parallelism are much more than 1, so I don't understand what's going on.
You should write your parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting the parquet.block.size configuration in the writer.
The source of ParquetOutputFormat is here, if you want to dig into the details.
The block size is the minimum amount of data you can read out of a parquet file that is logically readable (since parquet is columnar, you can't just split by line or anything similarly trivial), so you can't have more reading threads than input blocks.
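A hedged sketch of what that looks like at write time from PySpark, going through the underlying JVM Hadoop configuration handle (the 64 MB value and the paths are just examples):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The parquet writer picks up parquet.block.size from the Hadoop configuration;
# smaller blocks mean more row groups, and therefore more possible read splits.
spark.sparkContext._jsc.hadoopConfiguration().setInt("parquet.block.size", 64 * 1024 * 1024)

df = spark.createDataFrame([(i,) for i in range(1000)], ["id"])  # toy dataframe
df.write.mode("overwrite").parquet("/tmp/small_block_parquet")   # example path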
The new way of doing it (Spark 2.x) is setting
spark.sql.files.maxPartitionBytes
Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet, misses the .sql)
In my experience, the Hadoop settings no longer have any effect.
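For example (a sketch; the 32 MB value is arbitrary, pick whatever gives you the parallelism you want):
from pyspark.sql import SparkSession

# Caps how many bytes of a file go into one input partition for file-based
# data sources in Spark 2.x+, so one large parquet file can be split across
# many read tasks.
spark = (SparkSession.builder
         .config("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
         .getOrCreate())

df = spark.read.parquet("/path/to/the-big-table.parquet")  # example path
print(df.rdd.getNumPartitions())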
Maybe your parquet file only takes one HDFS block. Create a big parquet file that has many HDFS blocks and load it
val k = sc.parquetFile("the-big-table.parquet")
k.partitions.length
You'll see the same number of partitions as HDFS blocks. This worked fine for me (spark-1.1.0).
You have mentioned that you want to control the distribution during the write to parquet. When you create parquet from an RDD, parquet preserves the partitions of the RDD. So, if you create an RDD with 100 partitions and then write it as a dataframe in parquet format, it will write 100 separate parquet files to the filesystem.
For the read, you could specify the spark.sql.shuffle.partitions parameter.
To achieve that, you should use the SparkContext to set the Hadoop configuration (sc.hadoopConfiguration) property mapreduce.input.fileinputformat.split.maxsize.
By setting this property to a lower value than hdfs.blockSize, you will get as many partitions as there are splits (a PySpark sketch of setting this follows the example below).
For example:
When hdfs.blockSize = 134217728 (128MB),
and one file is read which contains exactly one full block,
and mapreduce.input.fileinputformat.split.maxsize = 67108864 (64MB)
Then there will be two partitions into which those splits will be read.
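A sketch of setting that property from PySpark; note (as the previous answer points out) that on Spark 2.x the built-in file sources use spark.sql.files.maxPartitionBytes instead, so this mainly applies to Hadoop-InputFormat-based reads:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Cap each input split at 64 MB; with a 128 MB block, a Hadoop-InputFormat-based
# read should then produce two splits, and hence two partitions.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.input.fileinputformat.split.maxsize", str(64 * 1024 * 1024))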

How to save an RDD in Spark that lives on different slave machines to different S3 locations

I want to divide a large file which exists in S3 into multiple chunks.
So I am trying to solve this problem by reading the file into Spark running on Amazon EMR with a number of instances (say 10, for example).
This would create approximately 10 RDD partitions, which are stored on individual slave nodes.
While saving the files (rdd.saveAsTextFile(s3-path)), how can I specify a separate path for each RDD partition stored on different nodes?