Parallelize an RDD loaded from an external file in Spark - Scala

Based on the materials I have read and some online posts, I thought Spark would automatically distribute an RDD loaded from an external file via sc.textFile, for example:
val rdd = sc.textFile(file_path)
However, when my colleague read my code he asked me to use sc.parallelize. I am confused by this, because I think sc.parallelize is redundant here. When I asked my colleague again, he gave me this answer:
In my experience so far, Spark is not good at dividing an external file across multiple nodes and workers, so you need to set the partitions yourself, forcing the job to be spread across multiple workers.
So, based on my colleague's suggestion, what is the easiest way to set the number of partitions when reading a large file, if sc.textFile cannot do it by itself? One possible way is to collect first and then call sc.parallelize, but I think that wastes too much time and is redundant.

You can call rdd.repartition(...). Collecting and re-parallelizing is not the right way to achieve what you describe.
The behaviour your colleague observed is probably due to small files, as partitioning is driven by the HDFS blocks when reading. So if your files are smaller than the block size, all your data will end up in the same executor.
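For example, a minimal sketch of both options - the repartition suggested above, and textFile's own minPartitions argument (the value 16 is just illustrative):
// option 1: ask textFile for a minimum number of partitions up front
val rdd = sc.textFile(file_path, minPartitions = 16)
// option 2: repartition after the read (this triggers a shuffle)
val repartitioned = sc.textFile(file_path).repartition(16)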

Related

Spark/Synapse Optimal handling huge gzipped xml file (+600mb compressed size)

For a task we need to process huge transactional xml files which are gz(ipped). Each line in the uncompressed file can be interpreted as its own xml record.
When working with small files of around 100 MiB this works fine. The moment a collect() is performed on the huge input file, it tends to fail with OOM and the JVM crashes.
As this is a compressed (gz) file it can not be processed in parallel (AFAIK).
I was thinking about
using toLocalIterator() to first split it up into smaller packets of 200K XML entries, which are distributed to the other nodes for them to process. Apparently toLocalIterator() also does a collect() first (still to be tested)
Another option is to use some kind of index value, filter on it ("index > 5000") and set limit(5000) to simulate paging through the 2 million or more entries.
But I have no clue what I should pay attention to in order to parallelize this. In particular:
Settings to pay attention to, and how to apply them in Azure Synapse etc.
How to push the XML read out over the nodes so it is processed in their executors/tasks.
Could streaming a single file be an option?
Any tips are welcome.
Currently my code is written in Scala, because the Java libraries for converting the XML to JSON and extracting the values I need are easily accessible.
Many thanks in advance (also for reading this)
TL;DR suggestion:
Step 1. Increase driver memory and test
Step 2. Increase executor memory and test if first step fails
Slightly longer version:
The fact that it gives an OOM on the collect() operation doesn't tell you whether the OOM happens in the spark.read operation or in df.collect().
The Spark scheduler only runs the DAG when it encounters an action, not a transformation.
So if collect() is your first action, that is the point at which the whole DAG actually runs, and the OOM may even be on the read but manifest as an OOM on collect().
The Spark UI will provide insight into where the OOM happens.
You are right that decompressing the gzip won't be parallelised: the read operation will use a single executor and even a single core. So I would increase executor memory until there is enough memory to hold the whole decompressed file in memory - not just the exact file size, leave the usual 400MB / 0.7% buffer.
If the error is indeed happening on the collect() operation, then you need to sufficiently increase driver memory.
Your app will not be parallelised at read, and it will not be parallelised at collect().
Your app can be parallelised during the transformations in between, and you can force that parallelisation, to whatever extent you want to tune it, by repartitioning your dataframe / dataset / RDD further.
Finally, I would consider again whether you really need the collect, or whether you can store the output as a number of partitioned files instead.
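A minimal sketch of that shape of job (parseXmlLine is a hypothetical placeholder for your existing XML-to-JSON/extraction logic, and the paths and partition count are illustrative):
import spark.implicits._
// single-task read: gzip is not splittable, so this runs on one core
val raw = spark.read.textFile("/path/to/huge_file.xml.gz")
val parsed = raw
  .repartition(200)                  // spread the lines across the cluster; tune to your core count
  .map(line => parseXmlLine(line))   // parseXmlLine: String => String, your own parsing/extraction code
// write partitioned output instead of collecting to the driver
parsed.write.mode("overwrite").parquet("/path/to/output")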
I think the zipped file is always going to be a bottleneck, so one alternative is to unzip it and see if that helps. I would also consider loading the XML into a table in Synapse (which can deal with gzipped files). This would have the effect of unzipping it; you could then read it in a Synapse notebook with the synapsesql method, e.g. in Scala:
// Get the table with the XML column from the database and expose as temp view
val df = spark.read.synapsesql("yourPool.dbo.someXMLTable")
df.createOrReplaceTempView("someXMLTable")
You could process the XML as I have done here and then write it back to the Synapse dedicated SQL pool as an internal table:
val df2 = spark.sql("""
  SELECT
    colA,
    colB,
    xpath_string(pkData, '/DataSet/EnumObject[name="Inpatient"]/value') xvalue
  FROM someXMLTable
""")
// Write that dataframe back to the dedicated SQL pool
df2.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
This would ensure you are keeping things in parallel, with no collect required. NB there are a couple of assumptions in there - that you can upload the gzipped files to a dedicated SQL pool, and that xpath_string does what you need - which should be checked and confirmed. The proposed pattern: gzipped XML file -> Synapse dedicated SQL pool -> Synapse notebook (synapsesql + Spark SQL) -> processed internal table.

How to do simple file caching in Flink (Scala)?

I am new to Flink. I am really confused about how to cache a file and load it into a dataset; I can't find a simple example. I am also confused about why we need to create a dataset first in order to use a RichMapFunction. How do I cache a file that has nothing to do with any other dataset? In the samples I found, the cached data is always joined with another dataset. Thank you.
For the case of joining two datasets where one of them is small, use broadcast to avoid a shuffle. Without broadcasting, shuffling a large dataset is painful.
E.g. one dataset has 1 billion records, the other has 100 records. With broadcast, the small dataset is distributed to all task managers processing those 1 billion records - no need to move 1 billion records for the join. Without broadcast, the typical behaviour of a join operation is to shuffle both the 1 billion records and the 100 records so that records with the same key end up on the same machine, which is much more expensive than broadcasting.
RichMapFunction provides the open() method and access to the RuntimeContext. In open(), the Flink job can get the broadcast dataset through getRuntimeContext().getBroadcastVariable(). open() is called only once per parallel instance of the operator, so the broadcast dataset is initialised once and can then be applied to all incoming records. That is the reason to use RichMapFunction instead of MapFunction.
Note - broadcast applies to the case where the dataset to broadcast is small. You need to create a dataset first and then broadcast it to all operators, as in the sketch below. Please refer to the Flink documentation for the usage of the API.
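A minimal sketch of that pattern in the Scala DataSet API (the dataset contents and the "smallSet" name are illustrative):
import org.apache.flink.api.scala._
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

val env = ExecutionEnvironment.getExecutionEnvironment

// the small dataset that will be broadcast to every task manager
val small: DataSet[(String, String)] = env.fromElements(("k1", "v1"), ("k2", "v2"))
// the large dataset being processed
val large: DataSet[String] = env.fromElements("k1", "k2", "k3")

val joined = large
  .map(new RichMapFunction[String, String] {
    private var lookup: Map[String, String] = _

    // open() runs once per parallel task, so the broadcast set is materialised only once
    override def open(parameters: Configuration): Unit = {
      lookup = getRuntimeContext
        .getBroadcastVariable[(String, String)]("smallSet")
        .asScala
        .toMap
    }

    override def map(key: String): String = lookup.getOrElse(key, "no-match")
  })
  .withBroadcastSet(small, "smallSet")

joined.print()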
Distributed file caching is for the case where an operation (e.g. a map operation) needs to load an external file once and use it inside the operation.
E.g. a trained model is saved on HDFS, and the Flink job needs to load the model and apply it to each record. For this case, the Flink job can use the distributed file cache API. The model file is pulled from HDFS to the local machine, and all tasks running on that machine can share the pulled file locally, which saves network traffic and time.
You do not need to create a dataset for the file to be distributed; instead you use registerCachedFile(). For the same reason as with broadcast datasets, using RichMapFunction allows the Flink job to load/initialise the distributed file only once.
Please refer to this document for the usage.
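A minimal sketch of the distributed file cache pattern, using a stopword file as a stand-in for the model file mentioned above (the HDFS path and names are illustrative):
import org.apache.flink.api.scala._
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import scala.io.Source

val env = ExecutionEnvironment.getExecutionEnvironment

// register the external file once; Flink ships it to every task manager's local disk
env.registerCachedFile("hdfs:///data/stopwords.txt", "stopwords")

val cleaned = env.fromElements("the", "flink", "a", "spark")
  .map(new RichMapFunction[String, String] {
    private var stopwords: Set[String] = _

    // open() runs once per task: load the locally cached copy of the file
    override def open(parameters: Configuration): Unit = {
      val localFile = getRuntimeContext.getDistributedCache.getFile("stopwords")
      stopwords = Source.fromFile(localFile).getLines().toSet
    }

    override def map(word: String): String =
      if (stopwords.contains(word)) "" else word
  })

cleaned.print()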

Parallel Writes Of Multiple DataFrames in Spark

I have a scenario where I have to write multiple data frames in parquet format.
I have used this:
df.write
.format("parquet")
.mode(<write-mode>)
.option("compression", "gzip")
.save(<file-path>)
Now I have around 15 data frames that will be writing data to parquet.
I see that only one task is executed at a time (so only one data frame is being written at a time). Also, when I checked the number of active executors in the Spark UI, I saw that only one executor was being used.
My questions are :
Can I use parallel writes to the parquet store for a single data frame?
Can I do parallel writes of multiple data frames to parquet (instead of the writes happening sequentially)?
Certainly possible. https://georgheiler.com/2019/12/05/parallel-aggregation-of-dataframes/ is a nice example I created recently.
The main thing is that you need to issue the actions / writes in parallel on the driver. This means you somehow need to manually manage threads and concurrency.
But it is equally important to set Spark's scheduler to FAIR mode. Otherwise the jobs will be submitted in parallel but will only run in parallel if additional slots happen to be free - and in your setting that doesn't seem to be the case. Only when the scheduler is set to FAIR will it give every running action an equal share of the slots.
You potentially could create multiple pools of resources.
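A minimal sketch of the threaded-writes idea, assuming spark.scheduler.mode is set to FAIR in your configuration (framesWithPaths is a hypothetical collection of your ~15 data frames paired with their target paths):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

def writeAllInParallel(framesWithPaths: Seq[(DataFrame, String)]): Unit = {
  val jobs: Seq[Future[Unit]] = framesWithPaths.map { case (df, path) =>
    Future {
      // each write is an action issued from its own driver-side thread
      df.write
        .format("parquet")
        .mode("overwrite")
        .option("compression", "gzip")
        .save(path)
    }
  }
  // block until every write has finished
  Await.result(Future.sequence(jobs), Duration.Inf)
}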

Spark save(write) parquet only one file

If I write
dataFrame.write.format("parquet").mode("append").save("temp.parquet")
then in the temp.parquet folder
I get as many files as there are rows.
I think I don't fully understand parquet, but is this natural?
Use coalesce before the write operation:
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
EDIT-1
Upon a closer look, the docs do warn about coalesce:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)
Therefore, as suggested by @Amar, it's better to use repartition.
You can set the number of partitions to 1 to save a single file:
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")
Although the previous answers are correct, you have to understand the repercussions of repartitioning or coalescing to a single partition: all your data will have to be transferred to a single worker just to be immediately written to a single file.
As is repeatedly mentioned across the internet, you should use repartition in this scenario, despite the shuffle step that gets added to the execution plan. This step lets you use your cluster's power instead of sequentially merging files.
There is at least one alternative worth mentioning: you can write a simple script that merges all the files into a single one. That way you avoid generating massive network traffic to a single node of your cluster.

How to transform RDD, Dataframe or Dataset straight to a Broadcast variable without collect?

Is there any way (or any plans) to be able to turn Spark distributed collections (RDDs, Dataframes or Datasets) directly into broadcast variables without the need for a collect? The public API doesn't seem to have anything "out of the box", but can something be done at a lower level?
I can imagine there is some 2x speedup potential (or more?) for these kind of operations. To explain what I mean in detail let's work through an example:
val myUberMap: Broadcast[Map[String, String]] =
sc.broadcast(myStringPairRdd.collect().toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
This causes all the data to be collected to the driver, then the data is broadcasted. This means the data is sent over the network essentially twice.
What would be nice is something like this:
val myUberMap: Broadcast[Map[String, String]] =
myStringPairRdd.toBroadcast((a: Array[(String, String)]) => a.toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
Here Spark could bypass collecting the data altogether and just move the data between the nodes.
BONUS
Furthermore, there could be a Monoid-like API (a bit like combineByKey) for situations where the .toMap or whatever operation on Array[T] is expensive, but can possibly be done in parallel. E.g. constructing certain Trie structures can be expensive; this kind of functionality could open up awesome scope for algorithm design. This CPU activity could also run while the IO is running - whereas the current broadcast mechanism is blocking (i.e. all IO, then all CPU, then all IO again).
CLARIFICATION
Joining is not the (main) use case here; it can be assumed that I use the broadcast data structure sparsely. For example, the keys in someOtherRdd by no means cover the keys in myUberMap, but I don't know which keys I need until I traverse someOtherRdd, AND suppose I use myUberMap multiple times.
I know that all sounds a bit vague, but the point is for more general machine learning algorithm design.
While this is an interesting idea, I will argue that although it is theoretically possible, it has very limited practical applications. Obviously I cannot speak for the PMC, so I cannot say if there are any plans to implement this type of broadcasting mechanism at all.
Possible implementation:
Since Spark already provides a torrent broadcasting mechanism, whose behaviour is described as follows:
The driver divides the serialized object into small chunks and stores those chunks in the BlockManager of the driver.
On each executor, the executor first attempts to fetch the object from its BlockManager. If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or other executors if available.
Once it gets the chunks, it puts the chunks in its own BlockManager, ready for other executors to fetch from.
it should be possible to reuse the same mechanism for direct node-to-node broadcasting.
It is worth noting that this approach cannot completely eliminate driver communication. Even though blocks could be created locally you still need a single source of truth to advertise a set of blocks to fetch.
Limited applications
One problem with broadcast variables is that they are quite expensive. Even if you can eliminate the driver bottleneck, two problems remain:
Memory required to store deserialized object on each executor.
Cost of transferring broadcasted data to every executor.
The first problem should be relatively obvious. It is not only about direct memory usage but also about GC cost and its effect on overall latency. The second one is rather subtle. I partially covered this in my answer to Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark, but let's discuss this further.
From a network traffic perspective, broadcasting a whole dataset is pretty much equivalent to creating a Cartesian product. So if the dataset is large enough for the driver to become a bottleneck, it is unlikely to be a good candidate for broadcasting, and a targeted approach like a hash join may be preferable in practice.
Alternatives:
There are some methods which can be used to achieve results similar to a direct broadcast and address the issues enumerated above, including:
Passing data via distributed file system.
Using replicated database collocated with worker nodes.
I don't know if we can do it for an RDD, but you can do it for a DataFrame:
import org.apache.spark.sql.{DataFrame, functions}

val df: DataFrame = your_data_frame
val broadcasted_df = functions.broadcast(df)
Now you can use the variable broadcasted_df and it will be broadcast to the executors.
Make sure the broadcasted_df dataframe is not too big and can be sent to the executors.
broadcasted_df will be broadcast in operations such as, for example,
other_df.join(broadcasted_df)
and in this case the join() operation executes faster, because every executor has one partition of other_df and the whole of broadcasted_df.
As for your question, I am not sure you can do what you want. You cannot use one RDD inside the map() method of another RDD, because Spark doesn't allow transformations inside transformations. In your case you need to call collect() to build a Map from your RDD, because you can only use an ordinary Map object inside map(); you cannot use an RDD there.
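A minimal sketch of that pattern, using the names from the question (and assuming someOtherRdd contains the keys to look up):
// collect the pair RDD into a local Map on the driver, then broadcast it once
val lookup: Map[String, String] = myStringPairRdd.collectAsMap().toMap
val bLookup = sc.broadcast(lookup)

// inside map() the broadcast value is a plain local Map, so using it here is allowed
val result = someOtherRdd.map(key => bLookup.value.getOrElse(key, "missing"))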