Remove temporary files on spark driver and executors - scala

I save an RDD with saveAsObjectFile, so the temporary files are distributed across the driver and executors. At the end of the program I want to remove all of these files. How do I remove them?

There's no built-in support for deleting that data via Spark. However, you can use foreachPartition on the original RDD to run an arbitrary piece of code on each partition, meaning it will run at least once on each of the executors that actually saved some data.
So, if you run code that deletes the folder you saved into (making sure it won't fail if it runs more than once on the same executor, since a single executor can hold multiple partitions), you'll get what you need.
For example, using Apache Commons:
// save
rdd.saveAsObjectFile("/my/path")
// use data...
// before shutting down - iterate over saved RDD's partitions and delete folder:
import java.io.File
import org.apache.commons.io.FileUtils

rdd.foreachPartition { _ =>
  // deleteDirectory doesn't fail if the directory does not exist
  FileUtils.deleteDirectory(new File("/my/path"))
}
EDIT: note that this is a bit hacky and might not be 100% bullet-proof: for example, if one of the executors crashes during the application's execution, its partitions might be recalculated on other executors, and the data already written on the crashed executor won't be deleted.

Related

Spark/Synapse Optimal handling huge gzipped xml file (+600mb compressed size)

For a task we need to process huge transactional xml files which are gz(ipped). Each line in the uncompressed file can be interpreted as its own xml record.
When working with small files like 100 MiB this works fine. The moment a collect() is performed on the huge input file, it tends to fail with OOM and the JVM crashes.
As this is a compressed (gz) file, it cannot be processed in parallel (AFAIK).
I was thinking about:
using toLocalIterator() to first split it up into smaller batches of 200K XML entries, which are distributed to the other nodes for them to process. Apparently toLocalIterator() also does the collect() first (still to be tested).
Another option is to use some kind of index value, filter on it ("index > 5000"), and set limit(5000) to simulate paging through the 2 million or more entries.
But I have no clue what I should pay attention to in order to parallelize this. Any tips are welcome.
Settings to pay attention to, and how to apply them in Azure Synapse etc.
How to push the XML read out over the nodes so it is processed in their executors/tasks.
Could streaming a single file be an option?
Any tips are welcome.
Currently my code is written in Scala, due to the fact that the Java libraries for converting the XML to JSON and extracting the values I need are easily accessible.
Many thanks in advance (also for reading this)
TL;DR suggestion:
Step 1. Increase driver memory and test
Step 2. Increase executor memory and test if first step fails
Slightly longer version:
The fact that it gives an OOM on the collect() operation doesn't tell you whether the OOM is on the spark.read operation or on df.collect().
The Spark scheduler runs the DAG when it encounters an Action, not when it encounters a Transformation.
So if collect is your first action, that is the point at which the DAG actually runs, and the OOM may even be on the read but manifest as an OOM on collect.
Spark UI will provide insights on where the OOM happens
You are right that decompressing gzip won't be parallelised. On the read operation it will use a single executor, and even a single core. So I would increase executor memory until there is sufficient memory to hold the whole un-gzipped file in memory - not just the exact file size; leave the usual 400MB / 0.7% buffer.
If the error is indeed happening on the collect() operation, then you need to sufficiently increase driver memory.
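As a rough illustration of those two steps on a plain Spark setup (in Azure Synapse the equivalent knobs live in the Spark pool / session configuration rather than in code), something like the following - the sizes are placeholders to tune, not recommendations:
import org.apache.spark.sql.SparkSession

// Note: driver memory generally has to be set before the driver JVM starts
// (spark-submit / pool configuration); setting it on an already-running session has no effect.
val spark = SparkSession.builder()
  .appName("xml-gz-ingest")                       // hypothetical app name
  .config("spark.driver.memory", "16g")           // step 1: raise driver memory, then test
  .config("spark.executor.memory", "24g")         // step 2: raise executor memory if step 1 fails
  .config("spark.executor.memoryOverhead", "2g")  // keep some overhead buffer as noted above
  .getOrCreate()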
Your app will not be parallelised at read. Your app will not be parallelised at collect().
Your app can be parallelised during the transformations in between, and you can force the parallelisation, to the extent you want to tune it, by repartitioning your dataframe / dataset / RDD further.
Finally, I would consider again whether you do need the collect or whether you can store the output as a number of partitioned files?
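For the last point, a rough sketch of that shape, assuming a SparkSession named spark; the path and parseXmlRecordToJson are made-up placeholders for the real inputs:
import spark.implicits._

// placeholder for the real XML-to-JSON conversion done with the Java libraries
def parseXmlRecordToJson(xmlLine: String): String = xmlLine

// gzip is not splittable, so this read is a single task
val raw = spark.read.textFile("/data/huge_transactions.xml.gz")

val parsed = raw
  .repartition(200)                        // fan out before the expensive per-record parsing
  .map(line => parseXmlRecordToJson(line))

parsed.write.mode("overwrite").text("/data/parsed/") // partitioned output, no collect() needed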
I think the zipped file is always going to be a bottleneck, so one alternative is to unzip it and see if that helps. I would also consider loading the XML up into a table in Synapse (which can deal with gzipped files). This would have the effect of unzipping it; you could then pass it into a Synapse Notebook with the synapsesql method, e.g. in Scala:
// Get the table with the XML column from the database and expose as temp view
val df = spark.read.synapsesql("yourPool.dbo.someXMLTable")
df.createOrReplaceTempView("someXMLTable")
You could process the XML as I have done here and then write it back to the Synapse dedicated SQL pool as an internal table:
val df2 = spark.sql("""
SELECT
colA,
colB,
xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
FROM someXMLTable
""")
// Write that dataframe back to the dedicated SQL pool
df2.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
This would ensure you are keeping things in parallel, with no collect required. NB there are a couple of assumptions in there - that the gzipped files can be uploaded to a dedicated SQL pool and that xpath_string does what you need - which need to be checked and confirmed.

Combine DataFrames in Pyspark

I have a vendor giving me multiple zipped data files on an S3 bucket which I need to read all together for analysis using Pyspark. How do I modify the sc.textFile() command?
Also, if I am loading 10 files, how do I reference them? Or do they all go into a single RDD?
On a broader level, how would I tweak the partitions and memory on an Amazon EMR cluster? Each zipped file is 3MB in size, or 1.3GB unzipped.
Thanks
You can have a script which moves all the unzipped files into a directory, and then in your Spark code you can refer to that directory:
rdd = sc.textFile("s3://path/to/data/")
As you mentioned, it's 1.3 GB of data, which is not huge for Spark to process. You can leave it to Spark to pick the required partitions, although you can also define them while creating the RDD.
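The question is PySpark, but the call has the same shape there (a minPartitions argument on textFile); in Scala, for example, with an arbitrary count of 20:
val rdd = sc.textFile("s3://path/to/data/", minPartitions = 20)
println(rdd.getNumPartitions)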
For Amazon EMR, you can spin up smaller nodes based on the type of requirement:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html
Based on the kind of processing (memory intensive / compute intensive), choose the machine type.
HTH

How spark loads the data into memory

I am totally confused about the Spark execution process. I have referred to many articles and tutorials, but nobody discusses it in detail. I might be understanding Spark wrongly. Please correct me.
I have my file of 40GB distributed across 4 nodes (10GB each node) of the 10 node cluster.
When I say spark.read.textFile("test.txt") in my code, will it load the data (40GB) from all 4 nodes into the driver program (master node)?
Or will this RDD be loaded on all 4 nodes separately? In that case, should each node's RDD hold 10GB of physical data?
And does the whole RDD hold 10GB of data, perform tasks for each partition, i.e. 128MB in Spark 2.0, and finally shuffle the output to the driver program (master node)?
And I read somewhere that "number of cores in the cluster = no. of partitions" - does it mean Spark will move the partitions of one node to all 10 nodes for processing?
Spark doesn't have to read the whole file into memory at once. That 40GB file is split into many 128MB (or whatever your partition size is) partitions. Each of those partitions is a processing task. Each core will only work on one task at a time, with a preference for tasks whose data partition is stored on the same node. Only the 128MB partition that is being worked on needs to be read; the rest of the file is not read. Once the task completes (and produces some output), the 128MB for the next task can be read in, and the data read in for the first task can be freed from memory. Because of this, only the small amount of data being processed at a time needs to be loaded into memory, and not the entire file at once.
Also, strictly speaking, spark.read.textFile("test.txt") does nothing: it reads no data and does no processing. It creates an RDD, but an RDD doesn't contain any data - an RDD is just an execution plan. spark.read.textFile("test.txt") declares that the file test.txt will be read and used as a source of data if and when the RDD is evaluated, but doesn't do anything on its own.
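A small sketch of that lazy-evaluation point, assuming a SparkSession named spark as in the question:
// builds an execution plan only - no data is read yet
val lines = spark.read.textFile("test.txt")

// still nothing read: filter is a transformation
val errors = lines.filter(_.contains("ERROR"))

// count() is an action, so only now does Spark schedule tasks; each task reads
// one partition of the file, processes it and frees it, rather than loading all 40GB at once
println(errors.count())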

Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk

Checkpoint version:
val savePath = "/some/path"
spark.sparkContext.setCheckpointDir(savePath)
val checkpointedDf = df.checkpoint() // checkpoint() returns a new, checkpointed Dataset; use the returned value
Write to disk version:
df.write.parquet(savePath)
val df = spark.read.parquet(savePath)
I think both break the lineage in the same way.
In my experiments, the checkpoint is almost 30 times bigger on disk than the parquet (689GB vs. 24GB). In terms of running time, the checkpoint takes 1.5 times longer (10.5 min vs. 7.5 min).
Considering all this, what would be the point of using checkpoint instead of saving to file? Am I missing something?
Checkpointing is a process of truncating the RDD lineage graph and saving it to a reliable distributed (HDFS) or local file system. If you have a large RDD lineage graph and you want to freeze the content of the current RDD, i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD can then be used for some other purpose.
When you checkpoint, the RDD is serialized and stored on disk. It is not stored in parquet format, so the data is not optimally storage-optimized on disk. Contrary to that, parquet provides various compaction and encoding schemes to store the data efficiently. This would explain the difference in size.
You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users which compete for resources and there are not enough resources to run all the jobs simultaneously.
You must think about checkpointing if your computations are really expensive and take a long time to finish, because it could be faster to write an RDD to HDFS and read it back in parallel than to recompute it from scratch.
And there's a slight inconvenience prior to the spark 2.1 release:
there is no way to checkpoint a dataframe, so you have to checkpoint the underlying RDD. This issue has been resolved in spark 2.1 and above.
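A hedged sketch of that pre-2.1 workaround (df and the checkpoint directory are assumed): checkpoint the underlying RDD, force it with an action, then rebuild the DataFrame from it:
spark.sparkContext.setCheckpointDir("/some/path")

val rowRdd = df.rdd          // the Row RDD behind the DataFrame
rowRdd.checkpoint()          // mark it for checkpointing
rowRdd.count()               // an action is needed to actually materialize the checkpoint

val dfTruncated = spark.createDataFrame(rowRdd, df.schema) // lineage now starts at the checkpoint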
The problem with saving to disk as parquet and reading it back is that:
It can be inconvenient in coding - you need to save and read multiple times.
It can be slower for the overall performance of the job, because when you save as parquet and read it back, the Dataframe needs to be reconstructed again.
This wiki could be useful for further investigation
As presented in the dataset checkpointing wiki
Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming - the now-obsolete Spark module for stream processing based on RDD API.
Checkpointing truncates the lineage of a RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.
Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.
One difference is that if your spark job needs a certain in memory partitioning scheme, eg if you use a window function, then checkpoint will persist that to disk, whereas writing to parquet will not.
I'm not aware of a way with the current versions of spark to write parquet files and then read them in again, with a particular in memory partitioning strategy. Folder level partitioning doesn't help with this.
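As a minimal sketch of that difference (column names and paths are made up, and whether the extra shuffle is actually avoided should be confirmed with explain() on your own Spark version):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("userId", "event")
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

val byUser = df.repartition($"userId")               // establish an in-memory partitioning
val cp = byUser.checkpoint()                          // keeps that partitioning

byUser.write.mode("overwrite").parquet("/tmp/roundtrip")
val reloaded = spark.read.parquet("/tmp/roundtrip")   // partitioning information is lost

val w = Window.partitionBy($"userId").orderBy($"event")
cp.withColumn("rn", row_number().over(w)).explain()       // expected: no extra Exchange
reloaded.withColumn("rn", row_number().over(w)).explain() // expected: plans a fresh shuffle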

Read parquet file to multiple partitions [duplicate]

So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100; we have also tried changing the compression of the parquet to none (from gzip). No matter what we do, the first stage of the spark job only has a single partition (once a shuffle occurs it gets repartitioned into 100 and thereafter obviously things are much much faster).
Now according to a few sources (like below) parquet should be splittable (even if using gzip!), so I'm super confused and would love some advice.
https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html
I'm using spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions is 200, so it can't be that. In fact all the defaults for parallelism are much more than 1, so I don't understand what's going on.
You should write your parquet files with a smaller block size. The default is 128MB per block, but it's configurable by setting the parquet.block.size configuration in the writer.
The source of ParquetOutputFormat is here, if you want to dig into the details.
The block size is the minimum amount of data you can read out of a parquet file which is logically readable (since parquet is columnar, you can't just split by line or something trivial like that), so you can't have more reading threads than input blocks.
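A hedged sketch of that writer knob (the 32MB figure and the path are arbitrary, and df is assumed): the key can be set on the Hadoop configuration used by the writer, or passed as a write option with the DataFrame API:
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

df.write
  .option("parquet.block.size", 32L * 1024 * 1024)
  .parquet("/data/small-blocks.parquet")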
The new way of doing it (Spark 2.x) is setting
spark.sql.files.maxPartitionBytes
Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet, misses the .sql)
From my experience, Hadoop settings no longer have effect.
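For example (the 32MB value and the path are arbitrary, and whether it helps still depends on the file's row-group layout):
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

val df = spark.read.parquet("/data/the-big-table.parquet")
println(df.rdd.getNumPartitions) // roughly fileSize / 32MB partitions expected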
Maybe your parquet file only takes up one HDFS block. Create a big parquet file that has many HDFS blocks and load it:
val k = sqlContext.parquetFile("the-big-table.parquet")
k.partitions.length
You'll see the same number of partitions as HDFS blocks. This worked fine for me (spark-1.1.0)
You mentioned that you want to control the distribution during the write to parquet. When you create parquet files from RDDs, parquet preserves the partitions of the RDD. So, if you create an RDD with 100 partitions and then write it out as a dataframe in parquet format, it will write 100 separate parquet files to the filesystem.
For the read side you could specify the spark.sql.shuffle.partitions parameter.
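A short sketch of that write-side control (df and the path are assumed):
val repartitioned = df.repartition(100)          // 100 partitions in memory
repartitioned.write.mode("overwrite").parquet("/data/wide-table.parquet") // ~100 part-files on the filesystem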
To achieve that, you should use the SparkContext to set the Hadoop configuration (sc.hadoopConfiguration) property mapreduce.input.fileinputformat.split.maxsize.
By setting this property to a lower value than hdfs.blockSize, you will get as many partitions as the number of splits.
For example:
When hdfs.blockSize = 134217728 (128MB),
and one file is read which contains exactly one full block,
and mapreduce.input.fileinputformat.split.maxsize = 67108864 (64MB)
Then there will be two partitions into which those splits will be read.
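The worked example above, sketched in code (the values are copied from the example, the path is made up, sqlContext is the Spark 1.x entry point used elsewhere in this thread, and whether the property is honoured depends on the read path and Spark version, so treat it as an illustration):
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") // 64MB

val table = sqlContext.read.parquet("/data/one-block-table.parquet") // file spans one 128MB HDFS block
println(table.rdd.partitions.length)                                 // expect 2 partitions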